The border strip filter (Step 4e) split at the LARGEST x-gap, which incorrectly
removed base words along with edge artifacts. It now uses a two-stage approach:
1. _filter_border_strip_words() pre-filters raw words BEFORE column detection,
scanning from the page edge inward to find the FIRST significant gap (>30px)
2. Step 4e runs as fallback only when pre-filter didn't apply
Session 4233 now correctly detects 3 columns (base word | oder | synonyms)
instead of 2. Border-cluster size threshold raised from 15% to 20% of
word_boxes to handle pages with many edge artifacts. All 4 ground-truth
sessions pass regression.
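
A minimal sketch of the first-gap scan, for the left edge only. It is not the
actual _filter_border_strip_words(); the function name, the "x"/"w" box fields
and the parameter names are assumptions made for illustration.

```python
def filter_left_border_strip(words, min_gap=30, max_fraction=0.20):
    """Drop a small left-edge cluster that sits before the FIRST wide x-gap.

    `words` is a list of dicts with hypothetical keys "x" (left) and "w" (width).
    """
    if len(words) < 2:
        return list(words)
    ordered = sorted(words, key=lambda w: w["x"])
    right_edge = ordered[0]["x"] + ordered[0]["w"]
    for i in range(1, len(ordered)):
        if ordered[i]["x"] - right_edge > min_gap:
            # First significant gap from the edge: decide here and stop,
            # so larger gaps further inside the content are never considered.
            if i / len(ordered) < max_fraction:
                cutoff = ordered[i]["x"]
                return [w for w in words if w["x"] >= cutoff]
            return list(words)
        right_edge = max(right_edge, ordered[i]["x"] + ordered[i]["w"])
    return list(words)
```

Stopping at the first gap means a wide gap between content columns cannot
be mistaken for the border-to-content gap.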
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content word_boxes in the test used x-spacing (i%3)*100 which created
internal gaps larger than the border-to-content gap. Changed to
(i%2)*51 so content words overlap and the border gap remains dominant.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Textbooks with decorative alphabet strips along page edges produce
OCR artifacts (scattered colored letters at x<150 while real content
starts at x>=179). Step 4e detects a significant x-gap (>30px) between
a small cluster (<15% of total word_boxes) near the page edge and the
main content, then removes the border-strip word_boxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The frontend renders colored cells from the word_boxes array order,
not from cell.text. After post-processing steps (Step 5i bullet removal, etc.),
word_boxes could remain in their original insertion order instead of
left-to-right reading order. Step 5j now explicitly sorts word_boxes
using _group_words_into_lines before the result is built.
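
A rough sketch of the reading-order sort. It does not reproduce
_group_words_into_lines; sort_reading_order, the "x"/"y"/"h" box fields and
the y_tolerance parameter are hypothetical.

```python
def sort_reading_order(word_boxes, y_tolerance=0.5):
    """Group boxes into lines by vertical center, then sort each line by x."""
    lines = []
    for box in sorted(word_boxes, key=lambda b: b["y"] + b["h"] / 2):
        center = box["y"] + box["h"] / 2
        # Same line if the center is within a fraction of the box height
        # of the center of the box that opened the current line.
        if lines and abs(center - lines[-1]["center"]) <= y_tolerance * box["h"]:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"center": center, "boxes": [box]})
    return [b for line in lines for b in sorted(line["boxes"], key=lambda b: b["x"])]
```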
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5i: For word_boxes with >90% x-overlap and different text, use IPA
dictionary to decide which to keep (e.g. "tightly" in dict, "fighily" not).
Red saturation threshold (median_sat) raised from 80 to 90 to catch remaining scanner artifacts
like "tight" and "5" that were still misclassified as red.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scanner artifacts on black text produce a slight warm tint (hue ~0, sat ~60)
that was misclassified as red. Now requires median_sat >= 80 specifically
for red classification, since genuine red text always has high saturation.
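
A sketch of the saturation gate, assuming an OpenCV-style HSV scale (hue
0..179, saturation 0..255). is_red, hue_window and the pixel tuple layout are
assumptions; only the median_sat >= 80 rule comes from this change.

```python
from statistics import median

def is_red(hsv_pixels, min_red_sat=80, hue_window=10, hue_max=180):
    """Classify a word as red only if hue is near 0 AND saturation is high.

    `hsv_pixels` is a list of (hue, sat, val) tuples sampled from the word.
    """
    if not hsv_pixels:
        return False
    median_hue = median(h for h, _, _ in hsv_pixels)
    median_sat = median(s for _, s, _ in hsv_pixels)
    # Red wraps around the hue circle, so accept both ends of the range.
    near_red = median_hue <= hue_window or median_hue >= hue_max - hue_window
    return near_red and median_sat >= min_red_sat
```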
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reject /.../ matches containing spaces, parens, or commas (e.g. sb/sth up)
- Second pass converts trailing /ipa2/ after [ipa1] (double pronunciation)
- Validate standalone /ipa/ at start against same reject pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dictionary-style pages print IPA between slashes (e.g. tiger /'taiga/).
Step 5h detects these patterns, looks up the headword in the IPA dictionary
for proper Unicode IPA, and falls back to OCR text when not found.
Converts /ipa/ to [ipa] bracket notation matching the rest of the pipeline.
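
A sketch of the lookup-with-fallback, assuming the IPA dictionary behaves like
a plain dict keyed by lowercase headword. slashes_to_brackets and the regex
are illustrative, not the Step 5h code.

```python
import re

# headword, whitespace, then /ipa/ with no slash or space inside
_HEAD_SLASH_IPA = re.compile(r"(?P<head>[A-Za-z][A-Za-z'-]*)\s+/(?P<ipa>[^/\s]+)/")

def slashes_to_brackets(text, ipa_dict):
    """Replace "word /ocr-ipa/" with "word [ipa]", preferring the dictionary."""
    def repl(match):
        head = match.group("head")
        # Fall back to the OCR reading when the headword is unknown.
        ipa = ipa_dict.get(head.lower(), match.group("ipa"))
        return f"{head} [{ipa}]"
    return _HEAD_SLASH_IPA.sub(repl, text)
```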
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 4d removes "|" and "||" word_boxes that OCR produces when reading
physical vertical divider lines between columns. Also strips stray pipe
chars from cell text.
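
Roughly, assuming each word_box is a dict with a "text" key (the helper names
are hypothetical):

```python
import re

_PIPE_TOKENS = {"|", "||"}

def drop_pipe_boxes(word_boxes):
    """Remove word_boxes whose text is only a divider-line artifact."""
    return [b for b in word_boxes if b["text"].strip() not in _PIPE_TOKENS]

def strip_pipes(cell_text):
    """Remove stray pipe characters and collapse the whitespace left behind."""
    return re.sub(r"\s{2,}", " ", cell_text.replace("|", " ")).strip()
```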
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Add pl, sg, no, also, ae, be etc. to _GRAMMAR_BRACKET_WORDS so
annotations like (pl) and (no pl) are not replaced with IPA.
2. Skip articles (the, a, an) in fix_ipa_continuation_cell — they
never get IPA in vocabulary books.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes:
1. fix_ipa_continuation_cell: when headword has inline IPA like
"beat [bˈiːt] , beat, beaten", only generate IPA for uncovered
words (beaten), not words already shown (beat). When bracket is
at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly.
2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied
the cell text (e.g. "[n, nn]" → "").
3. Added 2 tests for inline IPA behavior (35 total).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page numbers like "two hundred and twelve" in the last row were falsely
detected as headings. Now first and last non-header rows are excluded.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Color headings now preserve actual starting col_index instead of hardcoded 0
- New _detect_heading_rows_by_single_cell: detects rows with only 1 content
cell (excl. page_ref) as headings — catches black headings like "Theme"
that have normal color/height but are alone in their row
- Runs after Step 5d (IPA continuation) to avoid false positives
- 5 new tests (32 total)
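
A sketch of the single-cell rule. It is not the actual
_detect_heading_rows_by_single_cell; the row layout and the "col_type"/"text"
cell fields are assumptions.

```python
def single_cell_heading_rows(rows):
    """Return indices of rows with exactly one non-empty, non-page_ref cell.

    `rows` is a list of rows, each a list of cell dicts with hypothetical
    keys "col_type" and "text".
    """
    headings = []
    for index, cells in enumerate(rows):
        content = [
            c for c in cells
            if c["col_type"] != "page_ref" and c["text"].strip()
        ]
        if len(content) == 1:
            headings.append(index)
    return headings
```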
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Step 5d now only treats cells as continuation when text is entirely
inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets
(e.g. "employee [im'ploi:]") are no longer overwritten.
2. fix_ipa_continuation_cell no longer skips grammar words like "down" —
they are part of the headword in phrasal verbs like "close sth. down".
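
The "entirely inside brackets" check from fix 1 could look like this; the
function name and regex are illustrative assumptions.

```python
import re

# One bracket pair spanning the whole cell, nothing outside it.
_BRACKET_ONLY = re.compile(r"\s*\[[^\[\]]*\]\s*")

def is_bracket_only(cell_text):
    """True when the cell holds only bracketed text, e.g. "[n, nn]"."""
    return _BRACKET_ONLY.fullmatch(cell_text) is not None
```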
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect bracketed text without real IPA symbols as garbled OCR phonetics
- Allow IPA continuation fix even when other columns have content (for rows
where EN cell is clearly garbled bracketed IPA)
- Strip parenthetical grammar annotations like (no pl) from headword before
IPA lookup in fix_ipa_continuation_cell
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics
have no border lines to produce OCR artifacts like |, I)
2. Remove individual word_boxes with height > 3x zone median (OCR from
graphics like a huge "N" from a map image below text)
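
A sketch of the height filter from item 2, assuming each word_box carries its
height under a hypothetical "h" key:

```python
from statistics import median

def drop_oversized_boxes(word_boxes, factor=3.0):
    """Remove boxes taller than `factor` times the zone's median height."""
    if not word_boxes:
        return []
    median_height = median(b["h"] for b in word_boxes)
    return [b for b in word_boxes if b["h"] <= factor * median_height]
```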
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter words inside image_overlays (removes OCR from images)
2. Ghost filter: only remove single-char border artifacts, not multi-char
tokens like "(=", which are real content
3. Skip first-row header detection for zones with image_overlays
(merged geometry creates artificial gaps)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone merging: content zones separated by box zones (images) are merged
into a single zone with image_overlays, so split tables reconnect.
Heading detection: after color annotation, rows where all words are
non-black and taller than 1.2x median are merged into spanning heading cells.
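
A sketch of the heading test only (the merge into spanning cells is not
shown). The row layout and the "h"/"color" word fields are assumptions.

```python
from statistics import median

def find_heading_rows(rows, height_factor=1.2):
    """Return indices of rows where every word is non-black and tall.

    `rows` is a list of rows, each a list of word dicts with hypothetical
    keys "h" (height) and "color" (name assigned by color annotation).
    """
    heights = [w["h"] for row in rows for w in row]
    if not heights:
        return []
    threshold = height_factor * median(heights)
    return [
        i for i, row in enumerate(rows)
        if row and all(w["color"] != "black" and w["h"] > threshold for w in row)
    ]
```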
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>