Benjamin_Boenisch/breakpilot-lehrer

Commit Graph

Author	SHA1	Message	Date

Benjamin Admin

29c74a9962

feat: cell-first OCR + document type detection + dynamic pipeline steps

Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.
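
A minimal sketch of the cell-first idea described above, not the repository's actual code: each cell is cropped to its own bounding box before recognition, and the crops are handed to a thread pool. The names `crop`, `ocr_cells`, the `Box` layout and the injectable `ocr_fn` are assumptions made for illustration; in a real pipeline `ocr_fn` would wrap a Tesseract call and the image would be a real image array rather than nested lists.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# Hypothetical cell geometry: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]
# Stand-in for a real image: rows of pixel values.
Image = Sequence[Sequence[int]]


def crop(image: Image, box: Box) -> List[List[int]]:
    """Return only the pixels inside the box, so neighbouring cells cannot bleed in."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def ocr_cells(
    image: Image,
    boxes: Sequence[Box],
    ocr_fn: Callable[[List[List[int]]], str],
    max_workers: int = 4,
) -> List[str]:
    """Run ocr_fn on every cell crop in parallel; results keep the order of boxes."""
    crops = [crop(image, box) for box in boxes]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so result i belongs to boxes[i].
        return list(pool.map(ocr_fn, crops))
```

Passing the recognizer in as `ocr_fn` keeps the sketch runnable without Tesseract installed and makes the cropping logic easy to test with a fake recognizer.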

Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.
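
To illustrate how a projection profile can separate page types without OCR, here is a rough sketch. The commit message only states that projection profiles are used; the decision rule below (counting wide empty column gutters, with 0 meaning `full_text`, 1 to 2 meaning `vocab_table`, more meaning `generic_table`) and the `min_gap` threshold are invented for illustration and are not taken from the repository.

```python
from typing import List, Sequence


def vertical_profile(binary: Sequence[Sequence[int]]) -> List[int]:
    """Ink count per column; binary[y][x] is 1 for ink and 0 for background."""
    return [sum(column) for column in zip(*binary)]


def count_gutters(profile: Sequence[int], min_gap: int) -> int:
    """Count interior runs of at least min_gap empty columns (page margins ignored)."""
    inked = [i for i, value in enumerate(profile) if value > 0]
    if not inked:
        return 0
    gutters = 0
    run = 0
    # The slice starts and ends on an inked column, so every run gets closed.
    for value in profile[inked[0]: inked[-1] + 1]:
        if value == 0:
            run += 1
        else:
            if run >= min_gap:
                gutters += 1
            run = 0
    return gutters


def classify_page(binary: Sequence[Sequence[int]], min_gap: int = 3) -> str:
    """Hypothetical rule: the number of column gutters decides the page type."""
    gutters = count_gutters(vertical_profile(binary), min_gap)
    if gutters == 0:
        return "full_text"
    if gutters <= 2:
        return "vocab_table"
    return "generic_table"
```

Because the profile is just a column sum over a binarized page, this kind of check stays cheap compared with running OCR, which fits the "no OCR needed" remark above.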

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-04 13:52:38 +01:00

Benjamin Admin

e718353d9f

feat(ocr-pipeline): 6 systematic improvements for robustness, performance & UX

Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s

1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
   cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
   (lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
   column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
   highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
   any step, with reprocess button on completed pipeline steps
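
The continuation-row merge in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: rows are assumed to be dicts with `en` and `de` keys, and "lowercase EN" is interpreted as "the EN text starts with a lowercase letter". The function name `merge_continuation_rows` is hypothetical.

```python
from typing import Dict, List


def merge_continuation_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Fold wrapped lines (lowercase EN, empty DE) into the previous vocab entry."""
    merged: List[Dict[str, str]] = []
    for row in rows:
        en = row["en"].strip()
        de = row["de"].strip()
        # Assumed rule: a continuation starts lowercase, has no DE text,
        # and needs a previous entry to attach to.
        is_continuation = bool(merged) and en[:1].islower() and not de
        if is_continuation:
            previous = merged[-1]
            previous["en"] = f'{previous["en"]} {en}'.strip()
        else:
            merged.append({"en": en, "de": de})
    return merged
```

A row such as `{"en": "to something", "de": ""}` following `{"en": "to look forward", "de": "sich freuen auf"}` would be folded into the earlier entry, while a capitalized row with an empty DE cell stays separate.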

Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.
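
For context, one plausible shape of such an unpacking bug is a caller that treats the return value as a bare image while the function returns a pair. The sketch below only borrows the names `dewarp_image` and `run_cv_pipeline` and the `(image, info)` return signature from the commit message; the function bodies and the `info` contents are hypothetical stand-ins.

```python
from typing import Any, Dict, List, Tuple

Image = List[List[int]]  # stand-in for a real image array


def dewarp_image(image: Image) -> Tuple[Image, Dict[str, Any]]:
    """Stand-in that mirrors the (image, info) return signature."""
    info = {"applied": False}  # hypothetical metadata
    return image, info


def run_cv_pipeline(image: Image) -> Dict[str, Any]:
    # Unpack both values. Assigning the whole result to a single name would
    # pass a tuple on to later steps instead of the image itself.
    dewarped, dewarp_info = dewarp_image(image)
    return {"image": dewarped, "dewarp": dewarp_info}
```

Tests written against the older single-value return would need the same unpacking, which matches the note that the dewarp tests were updated.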

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-02 14:46:38 +01:00

Benjamin Admin

d552fd8b6b

feat: OCR pipeline with 6-step wizard for page reconstruction

All checks were successful
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 38s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Successful in 1m46s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 22s

New route /ai/ocr-pipeline with step-by-step straightening (deskew),
grid overlay and ground truth. Steps 2-6 are placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-26 15:38:08 +01:00

3 Commits