Commit Graph

15 Commits

Author SHA1 Message Date
Benjamin Admin
29c74a9962 feat: cell-first OCR + document type detection + dynamic pipeline steps
Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.

Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 13:52:38 +01:00
Benjamin Admin
4d428980c1 refactor(word-step): make table fully generic and fix marker-only row filter
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m43s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic
table driven by columns_used from backend. Labeling, confirmation, counts,
and summary badges are now all cell-based instead of branching on isVocab.

Backend: Change _cells_to_vocab_entries() entry filter from checking only
english/german/example to checking ANY mapped field. This preserves rows
with only marker or source_page content, fixing the issue where marker
sub-columns disappeared at the end of OCR processing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:45:24 +01:00
Benjamin Admin
6527beae03 fix(sub-columns): exclude header/footer words from alignment clustering
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Header/footer words (page numbers, chapter titles) could pollute the
left-edge alignment bins and trigger false sub-column splits. Now
_detect_header_footer_gaps() runs early and its boundaries are passed
to _detect_sub_columns() to filter those words from clustering and
the split threshold check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 07:33:54 +01:00
Benjamin Admin
3904ddb493 fix(sub-columns): convert relative word positions to absolute coords for split
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 17s
Word 'left' values in ColumnGeometry.words are relative to the content
ROI (left_x), but geo.x is in absolute image coordinates. The split
position was computed from relative word positions and then compared
against absolute geo.x, resulting in negative widths and no splits on
real data. Pass left_x through to _detect_sub_columns to bridge the
two coordinate systems.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 19:16:13 +01:00
Benjamin Admin
6e1a349eed fix(tests): adjust word counts so 10% threshold works correctly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 19:00:14 +01:00
Benjamin Admin
7252f9a956 refactor(ocr-pipeline): use left-edge alignment approach for sub-column detection
Replace gap-based splitting with alignment-bin approach: cluster word
left-edges within 8px tolerance, find the leftmost bin with >= 10% of
words as the true column start, split off any words to its left as a
sub-column. This correctly handles both page references ("p.59") and
misread exclamation marks ("!" → "I") even when the pixel gap is small.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:56:38 +01:00
Benjamin Admin
f13116345b fix(tests): use correct bbox_pct dict format in _cells_to_vocab_entries tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:26:24 +01:00
Benjamin Admin
991984d9c3 fix(tests): pass columns_meta arg to _cells_to_vocab_entries tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:23:55 +01:00
Benjamin Admin
1a246eb059 feat(ocr-pipeline): generic sub-column detection via left-edge clustering
Detects hidden sub-columns (e.g. page references like "p.59") within
already-recognized columns by clustering word left-edge positions and
splitting when a clear minority cluster exists. The sub-column is then
classified as page_ref and mapped to VocabRow.source_page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:18:02 +01:00
Benjamin Admin
0532b2a797 fix(ocr-pipeline): skip edge-touching gaps in header/footer detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Gaps that extend to the image boundary (top/bottom edge) are not valid
content separators — they typically represent dewarp padding. Only gaps
with content on both sides qualify as header/footer boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 17:54:49 +01:00
Benjamin Admin
c8981423d4 feat(ocr-pipeline): distinguish header/footer vs margin_top/margin_bottom
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Check for actual ink content in detected top/bottom regions:
- 'header'/'footer' when text is present (e.g. title, page number)
- 'margin_top'/'margin_bottom' when the region is empty page margin

Also update all skip-type sets and color maps for the new types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:55:41 +01:00
Benjamin Admin
f615c5f66d feat(ocr-pipeline): generic header/footer detection via projection gap analysis
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
Replace the trivial top_y/bottom_y threshold check with horizontal
projection gap analysis that finds large whitespace gaps separating
header/footer content from the main body. This correctly detects
headers (e.g. "VOCABULARY" banners) and footers (page numbers) even
when _find_content_bounds includes them in the content area.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:13:48 +01:00
Benjamin Admin
34ccdd5fd1 feat(ocr-pipeline): filter scan artifacts in content bounds and add margin regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Thin black lines (1-5px) at page edges from scanning were incorrectly
detected as content, shifting content bounds and creating spurious
IGNORE columns. This filters narrow projection runs (<1% of image
dimension) and introduces explicit margin_left/margin_right regions
for downstream page reconstruction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 15:29:18 +01:00
Benjamin Admin
e718353d9f feat(ocr-pipeline): 6 systematic improvements for robustness, performance & UX
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
   cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
   (lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
   column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
   highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
   any step, with reprocess button on completed pipeline steps

Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:46:38 +01:00
Benjamin Boenisch
5a31f52310 Initial commit: breakpilot-lehrer - Lehrer KI Platform
Services: Admin-Lehrer, Backend-Lehrer, Studio v2, Website,
Klausur-Service, School-Service, Voice-Service, Geo-Service,
BreakPilot Drive, Agent-Core

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 23:47:26 +01:00