feat(ocr): Add CV Document Reconstruction Pipeline for vocabulary extraction


Add a new OCR method based on classical computer vision: high-resolution
rendering (432 DPI), deskew, dewarp, binarization, projection-profile layout
analysis, multi-pass Tesseract OCR with a region-specific PSM, and
Y-coordinate line alignment. Also fix the convert_pdf_to_image call
(line 869) and add 39 unit tests.
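
The diff for cv_vocab_pipeline.py is not shown on this page, so the following is only a minimal sketch of two of the steps named above (projection-profile layout analysis and Y-coordinate line alignment), not the code from this commit. The names `find_row_bands`, `align_lines`, `min_ink`, and `y_tolerance`, as well as the word-record layout, are hypothetical and chosen for illustration. It uses plain Python lists in place of image arrays to stay self-contained.

```python
from typing import List, Tuple

# Hypothetical word record: (text, x position, vertical center), e.g. from OCR boxes
Word = Tuple[str, int, int]


def find_row_bands(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges that contain ink, end exclusive.

    `binary` is a 2D list where 1 marks an ink pixel and 0 marks background.
    The horizontal projection profile is the ink count per row; runs of rows
    at or above `min_ink` are treated as text bands.
    """
    profile = [sum(row) for row in binary]
    bands: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # a band begins
        elif ink < min_ink and start is not None:
            bands.append((start, y))  # a band ends at the first empty row
            start = None
    if start is not None:
        bands.append((start, len(profile)))  # band runs to the bottom edge
    return bands


def align_lines(words: List[Word], y_tolerance: int = 10) -> List[List[str]]:
    """Group words into lines by vertical center, then sort each line by x."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Join the current line if this word is close to the line's first word
        if lines and abs(word[2] - lines[-1][0][2]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [[w[0] for w in sorted(line, key=lambda w: w[1])] for line in lines]


if __name__ == "__main__":
    page = [[0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
    print(find_row_bands(page))

    boxes = [("Hund", 300, 102), ("dog", 50, 100), ("cat", 50, 140), ("Katze", 300, 143)]
    print(align_lines(boxes))
```

Comparing each word against the first word of its line, instead of the previous word, keeps a slowly drifting baseline from merging two adjacent lines. A real pipeline would likely derive the tolerance from the measured line height.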

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


BreakPilot Dev

2026-02-09 23:52:35 +01:00

parent 916ecef476

commit fa958d31f6

4 changed files with 2096 additions and 50 deletions

1019 klausur-service/backend/cv_vocab_pipeline.py Normal file

File diff suppressed because it is too large
