feat(ocr): Add CV Document Reconstruction Pipeline for vocabulary extraction

New OCR method using classical Computer Vision: high-res rendering (432 DPI),
deskew, dewarp, binarization, projection-profile layout analysis, multi-pass
Tesseract OCR with region-specific PSM, and Y-coordinate line alignment.
Includes a bugfix for the convert_pdf_to_image call (line 869) and 39 unit tests.
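
For readers unfamiliar with two of the steps named above, the following is a minimal sketch of projection-profile row detection and Y-coordinate line alignment. It is not the code from this commit: the function names `row_bands` and `align_lines`, the `(text, x, y, width, height)` word tuple, and the `min_ink` and `y_tol` parameters are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical word-box layout: (text, x, y, width, height) in pixels.
Word = Tuple[str, int, int, int, int]


def row_bands(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Horizontal projection profile over a binarized image (1 = ink).

    Returns half-open (start, end) row ranges whose ink count is >= min_ink.
    """
    profile = [sum(row) for row in binary]  # ink pixels per row
    bands: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a band of text rows begins
        elif ink < min_ink and start is not None:
            bands.append((start, y))       # band ends at the first empty row
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def align_lines(words: List[Word], y_tol: int = 10) -> List[str]:
    """Group OCR word boxes into text lines by vertical center, then order by x."""
    lines: List[List[Word]] = []
    centers: List[float] = []
    for w in sorted(words, key=lambda w: w[2] + w[4] / 2):
        cy = w[2] + w[4] / 2
        if lines and abs(cy - centers[-1]) <= y_tol:
            lines[-1].append(w)
            # Update the running mean of this line's vertical center.
            centers[-1] += (cy - centers[-1]) / len(lines[-1])
        else:
            lines.append([w])
            centers.append(cy)
    return [" ".join(t[0] for t in sorted(line, key=lambda t: t[1])) for line in lines]
```

In a vocabulary table, words from the left and right columns that share a row end up in the same output line, so `align_lines` applied to boxes from separate OCR passes would pair each term with its translation. The tolerance `y_tol` would need tuning to the rendering resolution.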

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BreakPilot Dev
2026-02-09 23:52:35 +01:00
parent 916ecef476
commit fa958d31f6
4 changed files with 2096 additions and 50 deletions
