breakpilot-lehrer/admin-lehrer/components/ocr-pipeline/PipelineStepper.tsx at 4e8ea77140db1cdb5e2b3e7d72046e0ff4820fca

Benjamin Admin 29c74a9962 feat: cell-first OCR + document type detection + dynamic pipeline steps

Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.
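
The cell-first approach described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `crop`, `ocr_cells`, `fake_ocr`, the `Box` layout, and the list-of-rows image type are all hypothetical names chosen for the example. The OCR call is passed in as a function so the sketch runs without Tesseract; in a real setup one might pass something like `pytesseract.image_to_string` instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # grayscale rows; stand-in for a real image array (assumption)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive (assumption)
CellKey = Tuple[int, int]        # (row index, column index)


def crop(image: Image, box: Box) -> Image:
    """Cut one cell out of the page so OCR never sees neighbouring cells."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def ocr_cells(
    image: Image,
    cells: Dict[CellKey, Box],
    ocr_fn: Callable[[Image], str],
    max_workers: int = 4,
) -> Dict[CellKey, str]:
    """Crop every cell first, then OCR the crops in parallel threads."""
    keys = list(cells)
    crops = [crop(image, cells[key]) for key in keys]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so texts line up with keys.
        texts = list(pool.map(ocr_fn, crops))
    return dict(zip(keys, texts))


def fake_ocr(cell: Image) -> str:
    """Hypothetical stand-in for a Tesseract call: reports the crop size."""
    width = len(cell[0]) if cell else 0
    return f"{width}x{len(cell)}"
```

Because each worker only receives its own crop, text from an adjacent column cannot leak into a cell's result. Threads are a reasonable fit here on the assumption that the real OCR call spends its time in an external process or native code rather than in Python bytecode.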

Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.
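
As a rough sketch of how projection profiles can separate page types without running OCR: sum the ink pixels per column and count the empty vertical gutters between text blocks. Everything below is an assumption made for illustration, including the function names, the `min_gap` threshold, the mapping from gutter count to label, and the step names other than `columns` and `rows`. The repository's actual rules may differ.

```python
from typing import List, Sequence

Binary = List[List[int]]  # 1 = ink, 0 = background (assumed representation)


def column_profile(binary: Binary) -> List[int]:
    """Vertical projection profile: number of ink pixels in each column."""
    return [sum(col) for col in zip(*binary)]


def count_interior_gaps(profile: Sequence[int], min_gap: int = 2) -> int:
    """Count runs of at least min_gap empty columns between inked columns."""
    inked = [i for i, value in enumerate(profile) if value > 0]
    if not inked:
        return 0
    gaps = 0
    run = 0
    # Only look between the first and last inked column, ignoring page margins.
    for value in profile[inked[0]:inked[-1] + 1]:
        if value == 0:
            run += 1
        else:
            if run >= min_gap:
                gaps += 1
            run = 0
    return gaps


def classify_page(binary: Binary, min_gap: int = 2) -> str:
    """Illustrative heuristic: no gutters, one gutter, or several gutters."""
    gaps = count_interior_gaps(column_profile(binary), min_gap)
    if gaps == 0:
        return "full_text"
    if gaps >= 2:
        return "vocab_table"
    return "generic_table"


def steps_for(doc_type: str, all_steps: Sequence[str]) -> List[str]:
    """Drop the columns/rows steps for full-text pages, keep them otherwise."""
    if doc_type == "full_text":
        return [step for step in all_steps if step not in ("columns", "rows")]
    return list(all_steps)
```

For example, `steps_for("full_text", ["deskew", "columns", "rows", "words"])` keeps only the hypothetical `deskew` and `words` steps, which mirrors the step skipping the commit describes for the frontend.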

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-04 13:52:38 +01:00

4.9 KiB
