Sprachbasiertes Scoring (classify_column_types) verursachte vertauschte
Spalten auf Seite 3 bei Beispielsaetzen mit vielen englischen Funktionswoertern.
Neue _positional_column_regions() ordnet Spalten rein geometrisch (links→rechts)
zu. OCR Pipeline Admin bleibt unveraendert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After iterative projection (pass 1) and word-alignment (pass 2), a third
pass uses Tesseract word positions + linear regression per text line to
measure and correct residual rotation. This catches cases where passes 1-2
leave significant slope (e.g. 1.7° residual on heavily skewed scans).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase iterative deskew coarse_range from ±2° to ±5° to handle
heavily skewed scans
- New deskew_two_pass(): runs iterative projection first, then
word-alignment on the corrected image to detect/fix residual skew
(applied when residual ≥ 0.3°)
- OCR pipeline API auto_deskew now uses deskew_two_pass by default
- Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass
- Deskew result now includes angle_residual and two_pass_debug
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
process-single-page now runs the full CV pipeline (deskew → dewarp → columns →
rows → cell-first OCR v2 → LLM review) for much better extraction quality.
Falls back to LLM vision if pipeline imports are unavailable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>