Cell text was rebuilt using naive (top, left) sorting after removing
word_boxes in Steps 4c/4d/5i. This produced wrong word order when
words on the same visual line had slightly different top values (1-6px).
Now uses _words_to_reading_order_text() which groups words into visual
lines by y-tolerance before sorting by x within each line, matching
the initial cell text construction in _build_cells.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
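The y-tolerance grouping described above could look roughly like the sketch below. It is not the actual _words_to_reading_order_text(); the function name, the dict keys ("text", "top", "left", "height") and the default tolerance of half the median word height are assumptions for illustration.

```python
from statistics import median


def words_to_reading_order_text(words, y_tolerance=None):
    """Join word boxes into text: group into visual lines, then sort by x."""
    if not words:
        return ""
    if y_tolerance is None:
        # Assumption: half the median word height is a usable tolerance.
        y_tolerance = 0.5 * median(w["height"] for w in words)

    lines = []  # each entry is a list of words on one visual line
    for w in sorted(words, key=lambda w: w["top"]):
        # Compare against the first (topmost) word of the current line.
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])

    return " ".join(
        w["text"]
        for line in lines
        for w in sorted(line, key=lambda w: w["left"])
    )


words = [
    {"text": "close", "top": 103, "left": 10, "height": 20},
    {"text": "sth.", "top": 100, "left": 60, "height": 20},
    {"text": "down", "top": 105, "left": 100, "height": 20},
    {"text": "shut", "top": 130, "left": 10, "height": 20},
]
naive = " ".join(w["text"] for w in sorted(words, key=lambda w: (w["top"], w["left"])))
grouped = words_to_reading_order_text(words)
```

With these sample boxes the naive (top, left) sort puts "sth." before "close", while the grouped version keeps the visual order.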
Step 5i: For word_boxes with >90% x-overlap and different text, use IPA
dictionary to decide which to keep (e.g. "tightly" in dict, "fighily" not).
Red threshold raised from 80 to 90 to catch remaining scanner artifacts
like "tight" and "5" that were still misclassified as red.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pages with two side-by-side vocabulary columns separated by a vertical
black line are now split into independent sub-zones before row/column
detection. Each sub-zone gets its own rows, preventing misalignment from
different heading rhythms.
- _detect_vertical_dividers(): finds pipe word_boxes at consistent x
positions spanning >50% of zone height
- _split_zone_at_vertical_dividers(): creates left/right PageZone objects
with layout_hint and vsplit_group metadata
- Column union skips vsplit zones (independent column sets)
- Frontend renders vsplit zones side by side via flex layout
- PageZone gets layout_hint + vsplit_group fields
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
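A minimal sketch of the divider search described above, assuming word boxes are dicts with "text", "left", "top" and "height" keys. The function name, the 10px x-tolerance and the clustering strategy are assumptions, not the code of _detect_vertical_dividers().

```python
def detect_vertical_dividers(words, zone_height, x_tolerance=10, min_span=0.5):
    """Return x positions of pipe clusters spanning > min_span of the zone height."""
    pipes = [w for w in words if w["text"].strip() in ("|", "||")]

    # Cluster pipes whose left edges are within x_tolerance of each other.
    clusters = []
    for w in sorted(pipes, key=lambda w: w["left"]):
        if clusters and abs(w["left"] - clusters[-1][-1]["left"]) <= x_tolerance:
            clusters[-1].append(w)
        else:
            clusters.append([w])

    dividers = []
    for cluster in clusters:
        top = min(w["top"] for w in cluster)
        bottom = max(w["top"] + w["height"] for w in cluster)
        if (bottom - top) > min_span * zone_height:
            dividers.append(sum(w["left"] for w in cluster) / len(cluster))
    return dividers


sample = [
    {"text": "|", "left": 500, "top": 0, "height": 50},
    {"text": "|", "left": 502, "top": 300, "height": 50},
    {"text": "||", "left": 498, "top": 600, "height": 50},
    {"text": "|", "left": 100, "top": 10, "height": 20},  # stray pipe, too short
    {"text": "house", "left": 120, "top": 10, "height": 20},
]
dividers = detect_vertical_dividers(sample, zone_height=1000)
```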
Prevents false narrow columns from text overflow at page edges.
Session 355f3c84 had a 3-row/4% tertiary cluster creating a spurious
third column from right-column text overflow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reject /.../ matches containing spaces, parens, or commas (e.g. sb/sth up)
- Second pass converts trailing /ipa2/ after [ipa1] (double pronunciation)
- Validate standalone /ipa/ at start against same reject pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
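The reject rule for /.../ matches can be illustrated as follows. This is a sketch only: the regexes, the function name and the `lookup` dict standing in for the IPA dictionary are assumptions, and the second pass for trailing /ipa2/ is not covered.

```python
import re

SLASH_IPA = re.compile(r"/([^/]+)/")
# Spaces, parentheses or commas inside the slashes indicate ordinary text.
REJECT = re.compile(r"[\s(),]")


def slash_ipa_to_brackets(text, lookup=None):
    """Convert /ipa/ to [ipa], preferring a dictionary entry for the headword."""
    lookup = lookup or {}

    def repl(match):
        body = match.group(1)
        if REJECT.search(body):
            return match.group(0)  # leave e.g. "sb/sth up/..." untouched
        before = text[: match.start()].split()
        headword = before[-1].lower() if before else ""
        # Fall back to the OCR text when the headword is not in the dictionary.
        return "[" + lookup.get(headword, body) + "]"

    return SLASH_IPA.sub(repl, text)


converted = slash_ipa_to_brackets("tiger /'taiga/", {"tiger": "ˈtaɪɡə"})
fallback = slash_ipa_to_brackets("zebra /'zebra/")
untouched = slash_ipa_to_brackets("pick sb/sth up/down")
```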
Dictionary-style pages print IPA between slashes (e.g. tiger /'taiga/).
Step 5h detects these patterns, looks up the headword in the IPA dictionary
for proper Unicode IPA, and falls back to OCR text when not found.
Converts /ipa/ to [ipa] bracket notation matching the rest of the pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 4d removes "|" and "||" word_boxes that OCR produces when reading
physical vertical divider lines between columns. Also strips stray pipe
chars from cell text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
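As a rough illustration of the pipe cleanup, assuming each cell is a dict with a "text" string and a "word_boxes" list (both assumptions about the data shape, as is the function name):

```python
def remove_pipe_artifacts(cells):
    """Drop "|" / "||" word boxes and strip stray pipe characters from cell text."""
    for cell in cells:
        cell["word_boxes"] = [
            wb for wb in cell["word_boxes"] if wb["text"].strip() not in {"|", "||"}
        ]
        # Replace pipes with spaces, then collapse repeated whitespace.
        cell["text"] = " ".join(cell["text"].replace("|", " ").split())
    return cells


cells = remove_pipe_artifacts([
    {
        "text": "house | Haus",
        "word_boxes": [{"text": "house"}, {"text": "|"}, {"text": "Haus"}],
    }
])
```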
Three fixes:
1. fix_ipa_continuation_cell: when headword has inline IPA like
"beat [bˈiːt] , beat, beaten", only generate IPA for uncovered
words (beaten), not words already shown (beat). When bracket is
at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly.
2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied
the cell text (e.g. "[n, nn]" → "").
3. Added 2 tests for inline IPA behavior (35 total).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Footer rows like "two hundred and twelve" are no longer removed from
the grid. Instead they stay in cells/rows and get tagged so the
frontend can render them differently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column_1 cells like "to" (infinitive markers) were incorrectly extracted
as page_refs. Now only cells matching "p.70", ",.65", or bare digits are
treated as page references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
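One way to express the stricter page_ref check is a single regex. The pattern below is an assumption that merely covers the three forms named above; the real matcher may differ.

```python
import re

# Optional "p." prefix (or a "," standing in for the "p"), then digits only.
PAGE_REF = re.compile(r"^(?:[pP,]\s*\.\s*)?\d+$")


def is_page_ref(cell_text):
    """Return True for texts like "p.70", ",.65" or "212"."""
    return bool(PAGE_REF.match(cell_text.strip()))


results = {t: is_page_ref(t) for t in ["p.70", ",.65", "212", "to", "p. 55"]}
```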
Step 5g now extracts column_1 cells individually as page_refs (instead of
requiring the whole row to be column_1-only), and footer detection skips
rows containing real IPA Unicode symbols to avoid false positives on
IPA continuation rows like [sˈiː] – [sˈɔː] – [sˈiːn].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Step 5f: Remove dictionary IPA from headings detected after IPA
correction (e.g. "Theme [θˈiːm]" → "Theme")
- Step 5g: Extract page_ref rows (column_1 only, e.g. "p.70") and
footer rows (last single-cell row, e.g. page number "212") from
the vocabulary table into zone-level metadata (page_refs, footer)
so the frontend can render them separately
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5d now also processes IPA continuations without brackets (e.g.
"ska:f – ska:vz", "'sekandarr sku:l") when the row has only 1 content
cell and the text is pure-ASCII garbled IPA (no real IPA Unicode symbols).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Theme [θˈiːm]" contains real IPA symbols (θ, ˈ) and should NOT be filtered.
Only filter text that has garbled IPA markers (:, ') but no real Unicode IPA chars.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
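The distinction between real and garbled IPA can be sketched like this. The character set and the function name are assumptions for illustration and do not reproduce _text_has_garbled_ipa(); the set is deliberately small and would need extending for full IPA coverage.

```python
# Assumed sample of "real" IPA code points (stress marks, length mark, vowels, consonants).
REAL_IPA_CHARS = set("ˈˌːɪʊəɛæɑɔʌθðʃʒŋɜɒɡ")


def looks_like_garbled_ipa(text):
    """True if text has ASCII IPA markers (: or leading ') but no real IPA chars."""
    if any(ch in REAL_IPA_CHARS for ch in text):
        return False
    return any(":" in tok or tok.startswith("'") for tok in text.split())


checks = {
    t: looks_like_garbled_ipa(t)
    for t in ["Theme [θˈiːm]", "ska:f – ska:vz", "'sekandarr sku:l", "Theme"]
}
```

Note that a plain colon after a normal word would also trigger this simple check, so a production version would need a tighter rule.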
Unbracketed IPA continuations like "ska:f – ska:vz" were falsely detected
as headings. Now _text_has_garbled_ipa() filters them out.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page numbers like "two hundred and twelve" in the last row were falsely
detected as headings. Now first and last non-header rows are excluded.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Color headings now preserve actual starting col_index instead of hardcoded 0
- New _detect_heading_rows_by_single_cell: detects rows with only 1 content
cell (excl. page_ref) as headings — catches black headings like "Theme"
that have normal color/height but are alone in their row
- Runs after Step 5d (IPA continuation) to avoid false positives
- 5 new tests (32 total)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous heuristic picked the column with the longest average text as
the English headword column. In layouts with long example sentences, this
picked the wrong column (examples instead of headwords). Now counts cells
with bracket patterns per column — the column with the most brackets is
the headword column where IPA needs fixing.
Fixes garbled OCR-IPA like "change [tfeind3]" → "change [tʃˈeɪndʒ]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
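The bracket-count heuristic might be sketched as follows, assuming cells are dicts with "col_type" and "text" keys (an assumption about the data shape; the function name is hypothetical too).

```python
import re
from collections import Counter

BRACKET = re.compile(r"\[[^\]]*\]")


def pick_headword_column(cells):
    """Return the col_type whose cells most often contain a [...] pattern."""
    counts = Counter(c["col_type"] for c in cells if BRACKET.search(c["text"]))
    return counts.most_common(1)[0][0] if counts else None


cells = [
    {"col_type": "column_1", "text": "change [tfeind3]"},
    {"col_type": "column_1", "text": "taste [teist]"},
    {"col_type": "column_2", "text": "He had to change his plans before the long trip."},
    {"col_type": "column_2", "text": "This soup has a very strange taste, don't you think?"},
]
headword_col = pick_headword_column(cells)
```

Here column_2 has the longer average text, but column_1 is chosen because it holds the brackets.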
Two fixes:
1. Step 5d now only treats cells as continuation when text is entirely
inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets
(e.g. "employee [im'ploi:]") are no longer overwritten.
2. fix_ipa_continuation_cell no longer skips grammar words like "down" —
they are part of the headword in phrasal verbs like "close sth. down".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The en_col_type heuristic (longest avg text) picks the example column,
missing IPA continuation cells in the actual headword column. Now Step 5d
checks all column_* cells for garbled IPA patterns independently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect bracketed text without real IPA symbols as garbled OCR phonetics
- Allow IPA continuation fix even when other columns have content (for rows
where EN cell is clearly garbled bracketed IPA)
- Strip parenthetical grammar annotations like (no pl) from headword before
IPA lookup in fix_ipa_continuation_cell
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics
have no border lines that could produce OCR artifacts like "|" or "I")
2. Remove individual word_boxes with height > 3x zone median (OCR from
graphics like a huge "N" from a map image below text)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
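The height filter from fix 2 could be sketched like this, assuming word boxes carry a "height" key; the function name is hypothetical.

```python
from statistics import median


def drop_oversized_words(words, factor=3.0):
    """Remove word boxes taller than factor x the median height of the zone."""
    if not words:
        return []
    med = median(w["height"] for w in words)
    return [w for w in words if w["height"] <= factor * med]


zone_words = [
    {"text": "North", "height": 20},
    {"text": "Sea", "height": 22},
    {"text": "coast", "height": 21},
    {"text": "N", "height": 90},  # e.g. a large letter from a map image
]
kept = [w["text"] for w in drop_oversized_words(zone_words)]
```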
1. Filter words inside image_overlays (removes OCR from images)
2. Ghost filter: only remove single-char border artifacts, not multi-char
tokens like "(=", which are real content
3. Skip first-row header detection for zones with image_overlays
(merged geometry creates artificial gaps)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone merging: content zones separated by box zones (images) are merged
into a single zone with image_overlays, so split tables reconnect.
Heading detection: after color annotation, rows where all words are
non-black and taller than 1.2x median are merged into spanning heading cells.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
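The heading rule (all words non-black and taller than 1.2x median) reduces to a small predicate. The "color" and "height" keys and the literal "black" value are assumptions about how color annotation is stored.

```python
def is_color_heading_row(row_words, median_height, factor=1.2):
    """True if every word in the row is non-black and taller than factor x median."""
    return bool(row_words) and all(
        w["color"] != "black" and w["height"] > factor * median_height
        for w in row_words
    )


heading = [
    {"text": "Bonnie", "color": "red", "height": 30},
    {"text": "Scotland", "color": "red", "height": 31},
]
mixed = [
    {"text": "castle", "color": "black", "height": 30},
    {"text": "Burg", "color": "red", "height": 30},
]
flags = (is_color_heading_row(heading, 20), is_color_heading_row(mixed, 20))
```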
IPA continuation rows (phonetic transcription that wraps below the
headword) now get proper IPA by looking up headwords from the row
above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stop removing rows that contain only phonetic transcription below
the headword. These rows are valid content that users need to see.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _filter_decorative_margin: Phase 2 now also removes short words (<=3
chars) in the same narrow x-range as the detected single-char strip,
catching multi-char OCR artifacts like "Vv" from alphabet graphics.
- _filter_header_junk: New filter detects the content start (first row
with 3+ high-confidence words) and removes low-conf short fragments
above it that are OCR artifacts from header illustrations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Users can now draw rectangles on the document image in the Structure
Detection step to mark areas (e.g. header graphics, alphabet strips)
that should be excluded from OCR results during grid building.
- Backend: PUT/DELETE endpoints for exclude regions stored in structure_result
- Backend: _build_grid_core() filters all words inside user-defined exclude regions
- Frontend: Interactive rectangle drawing with visual overlay and delete buttons
- Preserve exclude regions when re-running structure detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract _build_grid_core() from build_grid() endpoint for reuse.
New ocr_pipeline_regression.py with endpoints to mark sessions as
ground truth, list them, and run regression comparisons after code
changes. Frontend button in StepGroundTruth.tsx to mark/update GT.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip IPA brackets that fix_cell_phonetics may have added for short
dictionary words (e.g. "si" → "[si]") before checking if the row is
a garbled phonetic continuation. Detect phonetic text by presence of
':' (length marks), leading apostrophe (stress marks), or absence of
any word with ≥3 letters.
Fixes Row 39 ("si: [si] — So: - si:n") not being removed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- grid_editor_api: After IPA correction, detect rows containing only
garbled phonetics in the English column (no German translation, no
IPA brackets inserted). These are wrap-around lines where printed
IPA extends to the line below the headword. Remove them since the
headword row already has correct IPA.
- cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form
as fallback (e.g. "second-hand" → "secondhand") for dictionary
lookup, fixing IPA insertion for compound words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
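The dehyphenated fallback can be sketched as below. The function name and the plain dict used as the IPA dictionary are assumptions; _insert_missing_ipa itself is not reproduced here.

```python
def lookup_ipa(word, ipa_dict):
    """Look up a word, retrying without hyphens if the exact form is missing."""
    key = word.lower()
    if key in ipa_dict:
        return ipa_dict[key]
    if "-" in key:
        # Fallback: "second-hand" -> "secondhand"
        return ipa_dict.get(key.replace("-", ""))
    return None


ipa_dict = {"secondhand": "sˈɛkəndhˌænd"}
found = lookup_ipa("second-hand", ipa_dict)
missing = lookup_ipa("well-known", ipa_dict)
```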
Single-column German text pages were getting IPA inserted for words
that happen to exist in the English dictionary ("die" → [dˈaɪ],
"Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3
columns, which is the minimum for a vocabulary table layout
(English | article | German).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _merge_inline_marker_columns: skip merge when ≥50% of words are
alphabetic (preserves "to", "in", "der" columns)
- Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &")
- IPA phonetics: map longest-avg-text column to column_en so
fix_cell_phonetics runs in the grid editor
- ocr_pipeline_overlays: add missing split_page_into_zones import
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Rule 3 to junk-row filter: rows where no word is longer than
2 chars are removed as scattered OCR debris from illustrations
- Fully disable spanning-header detection which falsely flagged IPA
transcriptions and vocabulary entries as spanning headers
- First-row heuristic remains for genuine header detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median
are removed as non-content elements (e.g. red page number "( 9").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
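A sketch of this rule, assuming the character count ignores spaces and that the tallest word in the row is the one compared against the median (both are assumptions; the commit does not spell them out):

```python
def is_oversized_stub_row(row_words, median_height):
    """True for rows with <=2 words, <=3 chars in total, and height > 1.8x median."""
    if not row_words or len(row_words) > 2:
        return False
    total_chars = sum(len(w["text"]) for w in row_words)
    tallest = max(w["height"] for w in row_words)
    return total_chars <= 3 and tallest > 1.8 * median_height


page_number = [{"text": "(", "height": 40}, {"text": "9", "height": 42}]
normal = [{"text": "to", "height": 20}]
stub_flags = (
    is_oversized_stub_row(page_number, median_height=20),
    is_oversized_stub_row(normal, median_height=20),
)
```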
- Apply recovered-artifact filter to ALL zones (was box-zones only)
- Filter any recovered word with text ≤ 2 chars (not just !?•·)
- Add post-grid junk-row removal: rows where all word_boxes have
conf < 50 and text ≤ 3 chars are dropped as OCR noise
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
flowing text, where random gaps line up in < 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)
Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.
Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).
Font: add Noto Sans (supports IPA characters) via next/font/google.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix_cell_phonetics was only called in the OCR pipeline endpoints
(/words, /cells) but not in the combo mode (build-grid / ocr-overlay).
Garbled IPA like [teist] is now corrected to [teɪst] using the
IPA dictionary, same as in the pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter recovered single-char artifacts (!, ?, •) from box zones
where they are decorative noise, not real text markers
2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that
stretch across multiple columns with colored text. Merge their
cells into a single spanning cell in column 0.
3. Fix missing opening parentheses: when cell text has ")" but no
matching "(", prepend "(" to the text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
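Fix 3 amounts to a one-line check; a minimal sketch with a hypothetical function name:

```python
def fix_missing_open_paren(text):
    """Prepend "(" when the text contains ")" but no "(" at all."""
    if ")" in text and "(" not in text:
        return "(" + text
    return text


fixed = fix_missing_open_paren("no pl)")
unchanged = fix_missing_open_paren("(no pl)")
```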
- Load structure_result from session to get detected graphic bounds
- Exclude OCR words whose center falls inside a graphic region
- Exclude recovered colored text inside graphic regions
- Reject color recovery regions wider than 4x median word height
Fixes garbage characters (!, ?, •) in box zones and false OCR
detections (N, ?) in image areas.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
avg word length <=2 chars (bullets, numbering) are merged into
the adjacent text column. Fixes box zones getting 2 columns when
bullet points are just indentation markers.
2. Improve ghost filter: check word edges (left/right/top/bottom)
against border bands instead of center-only. Catches = at x=947
whose left edge touches the box border.
3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
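The marker-column test from fix 1 could be expressed as a predicate like the one below. It takes the column width and the word texts as plain arguments, which is an assumption about the interface of _merge_inline_marker_columns().

```python
def is_inline_marker_column(col_width_px, word_texts, max_width=80, max_avg_len=2):
    """True for narrow columns whose words are on average very short."""
    if not word_texts:
        return False
    avg_len = sum(len(t) for t in word_texts) / len(word_texts)
    return col_width_px < max_width and avg_len <= max_avg_len


marker_flags = (
    is_inline_marker_column(40, ["•", "•", "1."]),       # bullets and numbering
    is_inline_marker_column(40, ["house", "garden"]),    # narrow but real words
    is_inline_marker_column(200, ["•"]),                 # too wide
)
```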
1. Add _filter_border_ghosts() to grid editor - removes OCR artifacts
like | sitting on box borders before row/column clustering.
The tall | (h=55) was inflating row 0's y_max, causing row overlap.
2. Fix _assign_word_to_row() to prefer closest y_center when rows
overlap, instead of always returning the first matching row.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
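The closest-center rule from fix 2 might look like this. Rows are assumed to be dicts with "index", "y_min" and "y_max"; the signature is hypothetical and not that of _assign_word_to_row().

```python
def assign_word_to_row(word_y_center, rows):
    """Pick the row containing the word; on overlap, prefer the closest row center."""
    candidates = [r for r in rows if r["y_min"] <= word_y_center <= r["y_max"]]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda r: abs((r["y_min"] + r["y_max"]) / 2 - word_y_center),
    )
    return best["index"]


rows = [
    {"index": 0, "y_min": 0, "y_max": 55},   # y_max inflated by a tall artifact
    {"index": 1, "y_min": 30, "y_max": 50},
]
assigned = [assign_word_to_row(y, rows) for y in (10, 42, 100)]
```

A word at y=42 falls into both rows; returning the first match would give row 0, while the closest-center rule gives row 1.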
Logs word positions, median height, Y tolerance, and resulting
rows for zones with <= 30 words to diagnose row merging issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone 4 found 4 columns incl. page_ref, and the union also yields 4.
The strict > check therefore prevented the union from being applied to Zone 0.
Changed to >= so all content zones get the merged column set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
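Merging split points across zones can be sketched as a sort followed by deduplication within a tolerance. The 20px tolerance, the function name and the list-of-lists input are assumptions.

```python
def merge_split_points(zone_splits, tolerance=20):
    """Union the column split x-positions of all zones, merging near-duplicates."""
    merged = []
    for x in sorted(x for splits in zone_splits for x in splits):
        if merged and x - merged[-1] <= tolerance:
            continue  # same column boundary seen in another zone
        merged.append(x)
    return merged


# The second zone contributes a page_ref boundary at x=132 that the first lacks.
union = merge_split_points([[300, 600], [132, 305, 598]])
```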
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>