Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
flowing text, where random gaps line up in fewer than 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)
Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded
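A minimal sketch of the confidence-gated graphic filter described above. The word and region dict layout (x, y, w, h, conf) and the function names are assumptions for illustration, not the actual code; only the threshold of 50 comes from this message.

```python
from typing import Dict, List

MIN_GRAPHIC_CONF = 50  # threshold quoted in the commit message


def center_inside(word: Dict, region: Dict) -> bool:
    """True when the word's center point lies within the region bounds."""
    cx = word["x"] + word["w"] / 2
    cy = word["y"] + word["h"] / 2
    return (region["x"] <= cx <= region["x"] + region["w"]
            and region["y"] <= cy <= region["y"] + region["h"])


def filter_graphic_words(words: List[Dict], graphics: List[Dict]) -> List[Dict]:
    """Drop only low-confidence words that sit inside a graphic region."""
    kept = []
    for word in words:
        in_graphic = any(center_inside(word, g) for g in graphics)
        # Assumption: a word without a conf value is treated as confidence 0.
        if in_graphic and word.get("conf", 0) < MIN_GRAPHIC_CONF:
            continue  # likely an image artifact
        kept.append(word)  # high-confidence or outside any graphic
    return kept
```

High-confidence words inside a graphic survive, so legitimate colored text is kept, while words outside any graphic are never touched by this filter.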
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.
Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).
Font: add Noto Sans (supports IPA characters) via next/font/google.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix_cell_phonetics was only called in the OCR pipeline endpoints
(/words, /cells) but not in the combo mode (build-grid / ocr-overlay).
Garbled IPA like [teist] is now corrected to [teɪst] using the
IPA dictionary, same as in the pipeline.
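A rough sketch of a dictionary-based phonetics fix, assuming the IPA dictionary maps a headword to its transcription and that the headword directly precedes the bracket. IPA_DICT and fix_phonetics are hypothetical names; the real fix_cell_phonetics may work differently.

```python
import re

# Hypothetical headword -> IPA lookup; the real dictionary is not shown here.
IPA_DICT = {"taste": "teɪst"}

# A headword followed by a bracketed transcription, e.g. "taste [teist]".
_WORD_WITH_IPA = re.compile(r"\b([A-Za-z]+)\s*\[([^\]]*)\]")


def fix_phonetics(text: str) -> str:
    """Replace a bracketed transcription with the dictionary entry, if any."""
    def repl(match: re.Match) -> str:
        word = match.group(1)
        correct = IPA_DICT.get(word.lower())
        if correct is None:
            return match.group(0)  # unknown headword: leave untouched
        return f"{word} [{correct}]"

    return _WORD_WITH_IPA.sub(repl, text)
```

Calling the same helper from both the pipeline endpoints and the combo mode is what keeps the two paths consistent.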
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter recovered single-char artifacts (!, ?, •) from box zones
where they are decorative noise, not real text markers
2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that
stretch across multiple columns with colored text. Merge their
cells into a single spanning cell in column 0.
3. Fix missing opening parentheses: when cell text has ")" but no
matching "(", prepend "(" to the text.
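Item 3 can be sketched as below. The function name is hypothetical, and comparing counts of ")" and "(" is an assumption about how "no matching" is decided.

```python
def fix_missing_open_paren(text: str) -> str:
    """Prepend "(" when the text closes a parenthesis it never opened."""
    # Assumption: an unmatched ")" is detected by comparing counts.
    if text.count(")") > text.count("("):
        return "(" + text
    return text
```

For example, a cell reading `pl.)` becomes `(pl.)`, while balanced or parenthesis-free text is returned unchanged.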
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Load structure_result from session to get detected graphic bounds
- Exclude OCR words whose center falls inside a graphic region
- Exclude recovered colored text inside graphic regions
- Reject color recovery regions wider than 4x median word height
Fixes garbage characters (!, ?, •) in box zones and false OCR
detections (N, ?) in image areas.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
avg word length <=2 chars (bullets, numbering) are merged into
the adjacent text column. Fixes box zones getting 2 columns when
bullet points are just indentation markers.
2. Improve ghost filter: check word edges (left/right/top/bottom)
against border bands instead of center-only. Catches = at x=947
whose left edge touches the box border.
3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.
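A sketch of the edge-based ghost check from item 2, under stated assumptions: the band width of 6px, the dict layout, and the helper names are made up for illustration, and GHOST_CHARS only lists characters named in these messages.

```python
GHOST_CHARS = {"|", "=", "+"}  # illustrative subset of _GRID_GHOST_CHARS
BAND = 6  # hypothetical border band half-width in px


def touches_border(word: dict, box: dict, band: int = BAND) -> bool:
    """True when any word edge falls within a band around a box border."""
    left, right = word["x"], word["x"] + word["w"]
    top, bottom = word["y"], word["y"] + word["h"]
    bx0, bx1 = box["x"], box["x"] + box["w"]
    by0, by1 = box["y"], box["y"] + box["h"]
    near_vertical = any(abs(e - b) <= band
                        for e in (left, right) for b in (bx0, bx1))
    near_horizontal = any(abs(e - b) <= band
                          for e in (top, bottom) for b in (by0, by1))
    overlaps_y = top <= by1 and bottom >= by0
    overlaps_x = left <= bx1 and right >= bx0
    return (near_vertical and overlaps_y) or (near_horizontal and overlaps_x)


def is_border_ghost(word: dict, box: dict) -> bool:
    """A ghost is a known artifact character whose edge touches a border."""
    return word["text"] in GHOST_CHARS and touches_border(word, box)
```

With a box whose left border is at x=950, a 20px-wide `=` starting at x=947 has its center at 957, outside the 6px band, so a center-only test misses it; the left edge at 947 is within the band and is caught.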
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _filter_border_ghosts() to the grid editor. It removes OCR artifacts
like | sitting on box borders before row/column clustering.
The tall | (h=55) was inflating row 0's y_max, causing row overlap.
2. Fix _assign_word_to_row() to prefer closest y_center when rows
overlap, instead of always returning the first matching row.
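The row assignment change in item 2 can be sketched as follows. The row dict fields (index, y_min, y_max, y_center) and the None return for unmatched words are assumptions, not the actual signature of _assign_word_to_row().

```python
from typing import List, Optional


def assign_word_to_row(word_y_center: float, rows: List[dict]) -> Optional[int]:
    """Pick the overlapping row whose y_center is closest to the word."""
    candidates = [r for r in rows
                  if r["y_min"] <= word_y_center <= r["y_max"]]
    if not candidates:
        return None  # assumption: unmatched words are handled by the caller
    best = min(candidates, key=lambda r: abs(r["y_center"] - word_y_center))
    return best["index"]
```

When row 0 spans 0..55 (inflated by a tall artifact) and row 1 spans 40..70, a word at y=50 matches both; first-match would return row 0, while the closest-center rule returns row 1.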
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs word positions, median height, Y tolerance, and resulting
rows for zones with <= 30 words to diagnose row merging issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone 4 found 4 columns incl. page_ref, union also yields 4.
The strict > check prevented union from applying to Zone 0.
Changed to >= so all content zones get the merged column set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.
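A minimal sketch of merging split points across zones, assuming each zone reports its column split points as x positions in px. The 20px tolerance and the function name are hypothetical.

```python
from typing import List


def merge_split_points(zone_splits: List[List[int]], tol: int = 20) -> List[int]:
    """Union split points from all zones, fusing near-duplicates."""
    points = sorted(x for splits in zone_splits for x in splits)
    groups: List[List[int]] = []
    for x in points:
        if groups and x - groups[-1][-1] <= tol:
            groups[-1].append(x)  # same column seen in another zone
        else:
            groups.append([x])    # a column not seen so far
    # Represent each group by its integer mean position.
    return [sum(g) // len(g) for g in groups]
```

A split point that only one zone found, such as x=132 for page_ref, survives the merge and becomes available to every content zone.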
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.
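The arithmetic behind the threshold change, assuming the percentage is taken of the content width in px (the message does not say what the percentage refers to). The 1400px width is an illustrative value.

```python
def old_gap_threshold(content_width_px: float) -> float:
    """Previous rule: max(40px, 5% of content width)."""
    return max(40, 0.05 * content_width_px)


def new_gap_threshold(content_width_px: float) -> float:
    """New rule: max(30px, 2% of content width)."""
    return max(30, 0.02 * content_width_px)


width = 1400          # hypothetical content width in px
page_ref_gap = 56     # gap quoted in the commit message

detected_before = page_ref_gap >= old_gap_threshold(width)
detected_after = page_ref_gap >= new_gap_threshold(width)
```

At this width the old threshold is about 70px, so the ~56px gap was rejected; the new threshold is 30px, so it passes.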
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page references (p.55, p.57) and marker columns (!) appear in very few
rows (< 12% coverage) but sit at the far left/right margin with a clear
gap to the main content. Add a third detection tier that catches these
narrow margin columns when they have >= 2 distinct rows and are within
15% of the content edge with >= 40px gap to the nearest main column.
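A sketch of the third-tier test using the numbers above. The function name and parameters are hypothetical, and the gap is simplified to the distance between column x positions rather than edge-to-edge.

```python
def is_margin_column(cluster_x: float, distinct_rows: int,
                     content_left: float, content_right: float,
                     nearest_main_x: float,
                     min_rows: int = 2, edge_frac: float = 0.15,
                     min_gap: float = 40) -> bool:
    """Tertiary tier: sparse column near a margin with a clear gap."""
    width = content_right - content_left
    near_left = cluster_x - content_left <= edge_frac * width
    near_right = content_right - cluster_x <= edge_frac * width
    # Simplification: gap measured between x positions, not box edges.
    gap = abs(cluster_x - nearest_main_x)
    return (distinct_rows >= min_rows
            and (near_left or near_right)
            and gap >= min_gap)
```

This tier does not look at row coverage at all, which is why columns below 12% coverage can still qualify.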
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Global column detection diluted narrow sub-columns (page refs, markers)
because they appeared in too few rows relative to the total. Instead,
detect columns per zone independently, then propagate the best columns
(from the content zone with the most words) to smaller content zones.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content zones (above/between/below boxes) now share the same column
structure: columns are detected once from ALL content-zone words, then
applied to each content zone. Box zones still detect columns independently.
This fixes the issue where narrow columns (page refs like p.55) were not
detected in small content zones above boxes, even though the same column
existed in the larger content zone below the box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
(e.g. red ! markers) using HSV masks + contour detection
Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.
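A rough illustration of HSV-based color naming using only the standard library. The hue ranges, the saturation/value cutoffs, and the color_name helper are assumptions; the real detect_word_colors() may use different thresholds and an image library.

```python
import colorsys


def color_name(r: int, g: int, b: int) -> str:
    """Map an RGB text color to a coarse name via its HSV hue."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    # Assumption: dark or unsaturated pixels count as ordinary black text.
    if v < 0.25 or s < 0.25:
        return "black"
    deg = h * 360
    if deg < 20 or deg >= 340:
        return "red"
    if deg < 70:
        return "yellow"
    if deg < 170:
        return "green"
    if deg < 270:
        return "blue"
    return "purple"
```

A word's color field could then be derived from a representative pixel value of its ink, for example the median RGB of the non-background pixels.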
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.
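A sketch of selecting group-start left edges within one row. The 30px gap and the word dict layout (x, w) are illustrative assumptions.

```python
from typing import List


def group_start_edges(row_words: List[dict], min_gap: int = 30) -> List[int]:
    """Return left edges of words that start a new group in this row."""
    words = sorted(row_words, key=lambda w: w["x"])
    edges: List[int] = []
    prev_right = None
    for word in words:
        # First word in the row, or preceded by a large horizontal gap.
        if prev_right is None or word["x"] - prev_right >= min_gap:
            edges.append(word["x"])
        prev_right = word["x"] + word["w"]
    return edges
```

In a row like `taste [teɪst]   schmecken`, the IPA transcription follows its headword closely and contributes no edge, so only the two real column starts are clustered.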
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.
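A minimal sketch of X-proximity clustering with a row coverage filter. The 15px tolerance, the default coverage of 0.2, and the input shape (row index mapped to left-edge x values) are assumptions for illustration.

```python
from typing import Dict, List


def detect_columns(edges_by_row: Dict[int, List[int]],
                   x_tol: int = 15, min_coverage: float = 0.2) -> List[int]:
    """Cluster left edges by x and keep clusters seen in enough rows."""
    points = sorted((x, row) for row, xs in edges_by_row.items() for x in xs)
    clusters: List[dict] = []
    for x, row in points:
        if clusters and x - clusters[-1]["xs"][-1] <= x_tol:
            clusters[-1]["xs"].append(x)
            clusters[-1]["rows"].add(row)
        else:
            clusters.append({"xs": [x], "rows": {row}})
    total_rows = len(edges_by_row)
    # Coverage = share of rows in which the cluster appears at least once.
    return [min(c["xs"]) for c in clusters
            if len(c["rows"]) / total_rows >= min_coverage]
```

An edge position that shows up in a single row forms its own cluster but fails the coverage filter once the threshold is above its share of rows.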
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.
Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>