Commit Graph

46 Commits

Author SHA1 Message Date
Benjamin Admin
4290f70885 Fix unbracketed IPA continuations: detect garbled IPA in single-cell rows
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m42s
CI / test-python-agent-core (push) Successful in 13s
CI / test-nodejs-website (push) Successful in 14s
Step 5d now also processes IPA continuations without brackets (e.g.
"ska:f – ska:vz", "'sekandarr sku:l") when the row has only 1 content
cell and the text is pure-ASCII garbled IPA (no real IPA Unicode symbols).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 08:30:44 +01:00
Benjamin Admin
5c935eec23 Refine garbled IPA filter: skip only pure-ASCII garbled text, not text with real IPA
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
"Theme [θˈiːm]" contains real IPA symbols (θ, ˈ) and should NOT be filtered.
Only filter text that has garbled IPA markers (:, ') but no real Unicode IPA chars.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 08:15:51 +01:00
Benjamin Admin
c4a5cd2d8a Skip garbled IPA text in single-cell heading detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m47s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 15s
Unbracketed IPA continuations like "ska:f – ska:vz" were falsely detected
as headings. Now _text_has_garbled_ipa() filters them out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 08:11:02 +01:00
Benjamin Admin
bc5ab29c06 Fix false positive: exclude first/last rows from single-cell heading detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 15s
Page numbers like "two hundred and twelve" in the last row were falsely
detected as headings. Now first and last non-header rows are excluded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 08:06:05 +01:00
Benjamin Admin
7c5d95b858 Fix heading col_index + detect black single-cell headings like "Theme"
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
- Color headings now preserve actual starting col_index instead of hardcoded 0
- New _detect_heading_rows_by_single_cell: detects rows with only 1 content
  cell (excl. page_ref) as headings — catches black headings like "Theme"
  that have normal color/height but are alone in their row
- Runs after Step 5d (IPA continuation) to avoid false positives
- 5 new tests (32 total)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 08:00:06 +01:00
Benjamin Admin
58c9565ba5 Fix en_col_type detection: use bracket IPA count instead of longest avg text
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 40s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
The previous heuristic picked the column with the longest average text as
the English headword column. In layouts with long example sentences, this
picked the wrong column (examples instead of headwords). Now counts cells
with bracket patterns per column — the column with the most brackets is
the headword column where IPA needs fixing.

Fixes garbled OCR-IPA like "change [tfeind3]" → "change [tʃˈeɪndʒ]".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 06:50:47 +01:00
Benjamin Admin
92a7b85c2d Fix IPA continuation: only process fully-bracketed cells, keep phrasal verb particles
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Two fixes:
1. Step 5d now only treats cells as continuation when text is entirely
   inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets
   (e.g. "employee [im'ploi:]") are no longer overwritten.
2. fix_ipa_continuation_cell no longer skips grammar words like "down" —
   they are part of the headword in phrasal verbs like "close sth. down".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 00:43:51 +01:00
Benjamin Admin
5f89913a9a Fix IPA continuation to check all columns, not just en_col_type
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 21s
The en_col_type heuristic (longest avg text) picks the example column,
missing IPA continuation cells in the actual headword column. Now Step 5d
checks all column_* cells for garbled IPA patterns independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 23:34:41 +01:00
Benjamin Admin
6bfa9eed86 Fix garbled IPA detection for bracket-notation like [n, nn] and [1uedtX,1]
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
- Detect bracketed text without real IPA symbols as garbled OCR phonetics
- Allow IPA continuation fix even when other columns have content (for rows
  where EN cell is clearly garbled bracketed IPA)
- Strip parenthetical grammar annotations like (no pl) from headword before
  IPA lookup in fix_ipa_continuation_cell

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 23:28:00 +01:00
Benjamin Admin
7750b2a05f Fix ghost filter for borderless boxes + remove oversized graphic artifacts
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics
   have no border lines to produce OCR artifacts like |, I)
2. Remove individual word_boxes with height > 3x zone median (OCR from
   graphics like a huge "N" from a map image below text)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 23:04:00 +01:00
Benjamin Admin
e3395ae8cf Fix overlay word leak, ghost filter false positive, merged zone header
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 41s
1. Filter words inside image_overlays (removes OCR from images)
2. Ghost filter: only remove single-char border artifacts, not multi-char
   like (= which is real content
3. Skip first-row header detection for zones with image_overlays
   (merged geometry creates artificial gaps)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 13:56:04 +01:00
Benjamin Admin
df30d4eae3 Add zone merging across images + heading detection by color/height
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 20s
Zone merging: content zones separated by box zones (images) are merged
into a single zone with image_overlays, so split tables reconnect.
Heading detection: after color annotation, rows where all words are
non-black and taller than 1.2x median are merged into spanning heading cells.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 12:22:11 +01:00
Benjamin Admin
fc0ab84e40 Fix garbled IPA in continuation rows using headword lookup
IPA continuation rows (phonetic transcription that wraps below the
headword) now get proper IPA by looking up headwords from the row
above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 10:28:14 +01:00
Benjamin Admin
050d410ba0 Preserve IPA continuation rows in grid output
Stop removing rows that contain only phonetic transcription below
the headword. These rows are valid content that users need to see.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 10:22:58 +01:00
Benjamin Admin
432eee3694 Auto-filter decorative margin strips and header junk
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m45s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 15s
- _filter_decorative_margin: Phase 2 now also removes short words (<=3
  chars) in the same narrow x-range as the detected single-char strip,
  catching multi-char OCR artifacts like "Vv" from alphabet graphics.
- _filter_header_junk: New filter detects the content start (first row
  with 3+ high-confidence words) and removes low-conf short fragments
  above it that are OCR artifacts from header illustrations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 09:38:24 +01:00
Benjamin Admin
f9d71d50d1 Add exclude region marking in Structure step
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m47s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Users can now draw rectangles on the document image in the Structure
Detection step to mark areas (e.g. header graphics, alphabet strips)
that should be excluded from OCR results during grid building.

- Backend: PUT/DELETE endpoints for exclude regions stored in structure_result
- Backend: _build_grid_core() filters all words inside user-defined exclude regions
- Frontend: Interactive rectangle drawing with visual overlay and delete buttons
- Preserve exclude regions when re-running structure detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 09:08:30 +01:00
Benjamin Admin
f655db30e4 Add Ground Truth regression test system for OCR pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m47s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 22s
Extract _build_grid_core() from build_grid() endpoint for reuse.
New ocr_pipeline_regression.py with endpoints to mark sessions as
ground truth, list them, and run regression comparisons after code
changes. Frontend button in StepGroundTruth.tsx to mark/update GT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 13:46:48 +01:00
Benjamin Admin
c894a0feeb Improve IPA continuation row detection with phonetic heuristics
Strip IPA brackets that fix_cell_phonetics may have added for short
dictionary words (e.g. "si" → "[si]") before checking if the row is
a garbled phonetic continuation. Detect phonetic text by presence of
':' (length marks), leading apostrophe (stress marks), or absence of
any word with ≥3 letters.

Fixes Row 39 ("si: [si] — So: - si:n") not being removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 12:08:21 +01:00
Benjamin Admin
8ef4c089cf Remove IPA continuation rows and support hyphenated word lookup
- grid_editor_api: After IPA correction, detect rows containing only
  garbled phonetics in the English column (no German translation, no
  IPA brackets inserted). These are wrap-around lines where printed
  IPA extends to the line below the headword. Remove them since the
  headword row already has correct IPA.
- cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form
  as fallback (e.g. "second-hand" → "secondhand") for dictionary
  lookup, fixing IPA insertion for compound words.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 12:05:38 +01:00
Benjamin Admin
821e5481c2 Only apply IPA correction on vocabulary tables (≥3 columns)
Single-column German text pages were getting IPA inserted for words
that happen to exist in the English dictionary ("die" → [dˈaɪ],
"Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3
columns, which is the minimum for a vocabulary table layout
(English | article | German).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 11:50:03 +01:00
Benjamin Admin
f139d0903e Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid
- _merge_inline_marker_columns: skip merge when ≥50% of words are
  alphabetic (preserves "to", "in", "der" columns)
- Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &")
- IPA phonetics: map longest-avg-text column to column_en so
  fix_cell_phonetics runs in the grid editor
- ocr_pipeline_overlays: add missing split_page_into_zones import

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 11:08:23 +01:00
Benjamin Admin
962bbbe9f6 Remove scattered debris rows and disable spanning header detection
- Add Rule 3 to junk-row filter: rows where no word is longer than
  2 chars are removed as scattered OCR debris from illustrations
- Fully disable spanning-header detection which falsely flagged IPA
  transcriptions and vocabulary entries as spanning headers
- First-row heuristic remains for genuine header detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 10:47:17 +01:00
Benjamin Admin
9da45c2a59 Fix false header detection and add decorative margin/footer filters
- Remove all_colored spanning header heuristic that falsely flagged
  colored vocabulary entries (Scotland, secondary school) as headers
- Add _filter_decorative_margin: removes vertical A-Z alphabet strips
  along page margins (single-char words in a compact vertical strip)
- Add _filter_footer_words: removes page numbers in bottom 5% of page
- Tighten spanning header rule: require ≥3 columns spanned + ≤3 words

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 10:38:20 +01:00
Benjamin Admin
00cbf266cb Add oversized-stub filter for large page numbers/marks in grid rows
Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median
are removed as non-content elements (e.g. red page number "( 9").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 09:05:07 +01:00
Benjamin Admin
f9bad7beaa Filter phantom rows from recovered color artifacts and low-conf OCR noise
- Apply recovered-artifact filter to ALL zones (was box-zones only)
- Filter any recovered word with text ≤ 2 chars (not just !?•·)
- Add post-grid junk-row removal: rows where all word_boxes have
  conf < 50 and text ≤ 3 chars are dropped as OCR noise

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 09:00:43 +01:00
Benjamin Admin
19b93f7762 fix: conservative column detection + smart graphic word filter
Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
  flowing text where random gaps < 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)

Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 18:19:25 +01:00
Benjamin Admin
b1cdb2531c feat: CSS Grid editor with OCR-measured column widths and row heights
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.

Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).

Font: add Noto Sans (supports IPA characters) via next/font/google.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:48:47 +01:00
Benjamin Admin
ab30e8b17a feat: apply IPA phonetic correction in build-grid combo mode
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 18s
fix_cell_phonetics was only called in the OCR pipeline endpoints
(/words, /cells) but not in the combo mode (build-grid / ocr-overlay).
Garbled IPA like [teist] is now corrected to [teɪst] using the
IPA dictionary, same as in the pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 12:53:58 +01:00
Benjamin Admin
b0e1fbc8d6 feat: box zone artifact filter, spanning headers, parenthesis fix
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 19s
1. Filter recovered single-char artifacts (!, ?, •) from box zones
   where they are decorative noise, not real text markers

2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that
   stretch across multiple columns with colored text. Merge their
   cells into a single spanning cell in column 0.

3. Fix missing opening parentheses: when cell text has ")" but no
   matching "(", prepend "(" to the text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 11:31:55 +01:00
Benjamin Admin
872b47f691 fix: filter words and color recoveries inside graphic/image regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 21s
- Load structure_result from session to get detected graphic bounds
- Exclude OCR words whose center falls inside a graphic region
- Exclude recovered colored text inside graphic regions
- Reject color recovery regions wider than 4x median word height

Fixes garbage characters (!, ?, •) in box zones and false OCR
detections (N, ?) in image areas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 11:20:07 +01:00
Benjamin Admin
324f39a9cc fix: merge inline marker columns + improve ghost edge detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
   avg word length <=2 chars (bullets, numbering) are merged into
   the adjacent text column. Fixes box zones getting 2 columns when
   bullet points are just indentation markers.

2. Improve ghost filter: check word edges (left/right/top/bottom)
   against border bands instead of center-only. Catches = at x=947
   whose left edge touches the box border.

3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:10:07 +01:00
Benjamin Admin
febd0a2f84 fix: border ghost filter + row overlap fix for box zones
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts
   like | sitting on box borders before row/column clustering.
   The tall | (h=55) was inflating row 0's y_max, causing row overlap.

2. Fix _assign_word_to_row() to prefer closest y_center when rows
   overlap, instead of always returning the first matching row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 09:54:50 +01:00
Benjamin Admin
43b1f8be58 diag: increase zone logging threshold to 60 words
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 17s
Box zones have 40-60 words, need to capture their diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 09:49:19 +01:00
Benjamin Admin
43dec5dd91 diag: add row-clustering logging for small/box zones
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Logs word positions, median height, Y tolerance, and resulting
rows for zones with <= 30 words to diagnose row merging issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 09:45:29 +01:00
Benjamin Admin
92a52a3199 fix: apply column union when total_cols >= max (not just >)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Zone 4 found 4 columns incl. page_ref, union also yields 4.
The strict > check prevented union from applying to Zone 0.
Changed to >= so all content zones get the merged column set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 00:14:59 +01:00
Benjamin Admin
427fecdce0 fix: union column detection across all content zones
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 23:02:33 +01:00
Benjamin Admin
9fb3229270 fix: lower tertiary gap threshold for narrow margin column detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:56:03 +01:00
Benjamin Admin
91625a2646 fix: add tertiary tier for narrow margin columns (page refs, markers)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
Page references (p.55, p.57) and marker columns (!) appear in very few
rows (< 12% coverage) but sit at the far left/right margin with a clear
gap to the main content.  Add a third detection tier that catches these
narrow margin columns when they have >= 2 distinct rows and are within
15% of the content edge with >= 40px gap to the nearest main column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:40:40 +01:00
Benjamin Admin
02ae6249ca fix: propagate columns from largest content zone instead of global detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 21s
Global column detection diluted narrow sub-columns (page refs, markers)
because they appeared in too few rows relative to the total.  Instead,
detect columns per zone independently, then propagate the best columns
(from the content zone with the most words) to smaller content zones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:30:15 +01:00
Benjamin Admin
cf995f2d52 fix: global column detection across content zones in Kombi grid builder
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 26s
Content zones (above/between/below boxes) now share the same column
structure: columns are detected once from ALL content-zone words, then
applied to each content zone.  Box zones still detect columns independently.

This fixes the issue where narrow columns (page refs like p.55) were not
detected in small content zones above boxes, even though the same column
existed in the larger content zone below the box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:04:17 +01:00
Benjamin Admin
bcd55e12d7 fix: run color annotation on final cell word_boxes, not pre-grid words
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 16s
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:53:04 +01:00
Benjamin Admin
2bd63ec402 feat: add color detection for OCR word boxes
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
  (e.g. red ! markers) using HSV masks + contour detection

Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:50:09 +01:00
Benjamin Admin
39a4d8564c chore: add per-cluster debug logging for column alignment detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:18:28 +01:00
Benjamin Admin
1162eac7b4 fix: use group-start positions for column detection, not all word left-edges
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:10:29 +01:00
Benjamin Admin
28352f5bab feat: replace gap-based column detection with left-edge alignment algorithm
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 17s
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:03:58 +01:00
Benjamin Admin
c3f1547e32 feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.

Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:41:03 +01:00