- Enrich column geometries with original full-page words (box-filtered)
so _detect_sub_columns() finds narrow sub-columns across box boundaries
- Add inline marker guard: bullet points (1., 2., •) are not split into
sub-columns (minimum gap check: 1.2× word height or 20px)
- Add box_rects parameter to build_grid_from_words() — words inside boxes
are excluded from X-gap column clustering
- Pass box rects from zones to words_first grid builder
- Add 9 tests for box-aware column detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When PaddleOCR returns "!Betonung" as a single word box, the overlay
positions text starting at the "!" instead of the actual word. Split
such boxes into ["!", "Betonung"] with proportional position splitting,
matching the existing IPA bracket splitting logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer
produces "badge [bˈædʒ]" with space, creating a token count mismatch
between cell.text and word_boxes. Now also split at "[" boundaries
so each IPA bracket gets its own sub-box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns phrase-level bounding boxes (e.g. "competition
[kompa'tifn]" as one box) but the overlay slide mechanism expects
one box per word for accurate positioning. Multi-word boxes are now
split proportionally by character count with small gaps between words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
words_first was storing word_boxes in percent coordinates while
cv_cell_grid.py uses absolute pixel coordinates. The overlay slide
mechanism divides by imgW to get percentages, so percent-in-percent
caused positions near zero. Now both grid builders use the same format.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>