feat(klausur-service): Add Tesseract OCR, DSFA RAG, TrOCR, grid detection and vocab session store

New modules:
- tesseract_vocab_extractor.py: Bounding-box OCR with multi-PSM pipeline
- grid_detection_service.py: CV-based grid/table detection for worksheets
- vocab_session_store.py: PostgreSQL persistence for vocab sessions
- trocr_api.py: TrOCR handwriting recognition endpoint
- dsfa_rag_api.py + dsfa_corpus_ingestion.py: DSFA RAG corpus search

Changes:
- Dockerfile: Install tesseract-ocr + deu/eng language packs
- requirements.txt: Add PyMuPDF, pytesseract, Pillow
- main.py: Register new routers, init DB pools + Qdrant collections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 00:00:19 +01:00
parent 46cb873190
commit 53219e3eaf
9 changed files with 3829 additions and 4 deletions
--- a/klausur-service/backend/requirements.txt
+++ b/klausur-service/backend/requirements.txt
@@ -9,6 +9,7 @@ python-dotenv>=1.0.0
 qdrant-client>=1.7.0
 cryptography>=41.0.0
 PyPDF2>=3.0.0
+PyMuPDF>=1.24.0

 # PyTorch CPU-only (smaller, no CUDA needed for Docker on Mac)
 --extra-index-url https://download.pytorch.org/whl/cpu
@@ -23,6 +24,10 @@ minio>=7.2.0
 # OpenCV for handwriting detection (headless = no GUI, smaller for CI)
 opencv-python-headless>=4.8.0

+# Tesseract OCR Python binding (requires system tesseract-ocr package)
+pytesseract>=0.3.10
+Pillow>=10.0.0
+
 # PostgreSQL (for metrics storage)
 psycopg2-binary>=2.9.0
 asyncpg>=0.29.0