The previous heuristic picked the column with the longest average text as
the English headword column. In layouts with long example sentences, this
picked the wrong column (examples instead of headwords). Now counts cells
with bracket patterns per column — the column with the most brackets is
the headword column where IPA needs fixing.
Fixes garbled OCR-IPA like "change [tfeind3]" → "change [tʃˈeɪndʒ]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Step 5d now only treats cells as continuation when text is entirely
inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets
(e.g. "employee [im'ploi:]") are no longer overwritten.
2. fix_ipa_continuation_cell no longer skips grammar words like "down" —
they are part of the headword in phrasal verbs like "close sth. down".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The en_col_type heuristic (longest avg text) picks the example column,
missing IPA continuation cells in the actual headword column. Now Step 5d
checks all column_* cells for garbled IPA patterns independently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect bracketed text without real IPA symbols as garbled OCR phonetics
- Allow IPA continuation fix even when other columns have content (for rows
where EN cell is clearly garbled bracketed IPA)
- Strip parenthetical grammar annotations like (no pl) from headword before
IPA lookup in fix_ipa_continuation_cell
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics
have no border lines to produce OCR artifacts like |, I)
2. Remove individual word_boxes with height > 3x zone median (OCR from
graphics like a huge "N" from a map image below text)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter words inside image_overlays (removes OCR from images)
2. Ghost filter: only remove single-char border artifacts, not multi-char
like (= which is real content
3. Skip first-row header detection for zones with image_overlays
(merged geometry creates artificial gaps)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone merging: content zones separated by box zones (images) are merged
into a single zone with image_overlays, so split tables reconnect.
Heading detection: after color annotation, rows where all words are
non-black and taller than 1.2x median are merged into spanning heading cells.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The walk-back was going 4 chars, eating the last letter of the
headword: "schoolbag" → "schoolba". Limiting to 3 gives correct
split: "schoolbag" + "[sku:lbæg]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For merged tokens like "schoolbagsku:lbæg", split at IPA marker
boundary instead of prefix-matching to a shorter dictionary word.
Result: "schoolbag [sku:lbæg]" instead of "school [skˈuːl]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents false positives where punctuation (apostrophes) in merged
tokens caused wrong dictionary matches (e.g. "'se" from "'sekandarr"
matching as a word, breaking IPA continuation row fix).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IPA continuation rows (phonetic transcription that wraps below the
headword) now get proper IPA by looking up headwords from the row
above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stop removing rows that contain only phonetic transcription below
the headword. These rows are valid content that users need to see.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_insert_missing_ipa was adding dictionary IPA to cells that had NO
phonetic transcription on the original page (e.g. "scissors" heading,
"scarf - scarves" without IPA). Now guarded by _text_has_garbled_ipa()
which checks for OCR-mangled phonetic markers (stress marks, length
marks, IPA special chars) before allowing insertion.
Rule: if a line has no phonetics, don't add any. Where garbled IPA
exists, replace it with correct IPA notation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _filter_decorative_margin: Phase 2 now also removes short words (<=3
chars) in the same narrow x-range as the detected single-char strip,
catching multi-char OCR artifacts like "Vv" from alphabet graphics.
- _filter_header_junk: New filter detects the content start (first row
with 3+ high-confidence words) and removes low-conf short fragments
above it that are OCR artifacts from header illustrations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When exclude regions are saved or deleted, the cached grid result is
cleared so the grid rebuilds with updated exclusions on the next step.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Users can now draw rectangles on the document image in the Structure
Detection step to mark areas (e.g. header graphics, alphabet strips)
that should be excluded from OCR results during grid building.
- Backend: PUT/DELETE endpoints for exclude regions stored in structure_result
- Backend: _build_grid_core() filters all words inside user-defined exclude regions
- Frontend: Interactive rectangle drawing with visual overlay and delete buttons
- Preserve exclude regions when re-running structure detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _detect_spine_shadow function was triggering on normal text content
because shadow_range > 20 was too low and convolution edge artifacts
created artificially low values. Now requires: range > 40, darkest < 180,
narrow valley (not text plateau), and brightness rise toward page content.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor left/right shadow detection into shared _detect_spine_shadow()
that finds the darkest column (= book spine center) via argmin of
smoothed brightness. Both sides now cut at the spine center, ensuring
equal page sizes in double-page scans regardless of shadow position.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirror the left-edge shadow detection for the right side: analyze
brightness gradient in the right 25% to find scanner gray strips
from book spines. Cuts at the last bright column before the shadow
dip. Fixes cropping of book scans where the next page bleeds in.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Ground Truth button on last step of Pipeline/Kombi modes in ocr-overlay
- Prominent category picker in active session info bar (pulses when unset)
- GT badge shown when session has ground truth reference
- Backend: auto-detect pipeline from ocr_engine, store in GT snapshot
- Pipeline info shown in GT session list and regression reports
- Also pass pipeline param from ocr-pipeline StepGroundTruth
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract _build_grid_core() from build_grid() endpoint for reuse.
New ocr_pipeline_regression.py with endpoints to mark sessions as
ground truth, list them, and run regression comparisons after code
changes. Frontend button in StepGroundTruth.tsx to mark/update GT.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip IPA brackets that fix_cell_phonetics may have added for short
dictionary words (e.g. "si" → "[si]") before checking if the row is
a garbled phonetic continuation. Detect phonetic text by presence of
':' (length marks), leading apostrophe (stress marks), or absence of
any word with ≥3 letters.
Fixes Row 39 ("si: [si] — So: - si:n") not being removed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- grid_editor_api: After IPA correction, detect rows containing only
garbled phonetics in the English column (no German translation, no
IPA brackets inserted). These are wrap-around lines where printed
IPA extends to the line below the headword. Remove them since the
headword row already has correct IPA.
- cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form
as fallback (e.g. "second-hand" → "secondhand") for dictionary
lookup, fixing IPA insertion for compound words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Single-column German text pages were getting IPA inserted for words
that happen to exist in the English dictionary ("die" → [dˈaɪ],
"Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3
columns, which is the minimum for a vocabulary table layout
(English | article | German).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_insert_missing_ipa now removes garbled phonetic text (e.g. "skea",
"sku:l", "'sizaz") that follows the inserted IPA bracket. Keeps
delimiters (–, -), uppercase words (German), and known English words.
Fixes: "scare [skˈɛə] skea" → "scare [skˈɛə]"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _merge_inline_marker_columns: skip merge when ≥50% of words are
alphabetic (preserves "to", "in", "der" columns)
- Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &")
- IPA phonetics: map longest-avg-text column to column_en so
fix_cell_phonetics runs in the grid editor
- ocr_pipeline_overlays: add missing split_page_into_zones import
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Rule 3 to junk-row filter: rows where no word is longer than
2 chars are removed as scattered OCR debris from illustrations
- Fully disable spanning-header detection which falsely flagged IPA
transcriptions and vocabulary entries as spanning headers
- First-row heuristic remains for genuine header detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Black text has median_sat ~6-7, green text ~63-65. At threshold 50,
scanner blue tints (median_sat ~50-54) on words like "Wasser" were
falsely classified as blue. Threshold 55 has good margin on both sides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median
are removed as non-content elements (e.g. red page number "( 9").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Apply recovered-artifact filter to ALL zones (was box-zones only)
- Filter any recovered word with text ≤ 2 chars (not just !?•·)
- Add post-grid junk-row removal: rows where all word_boxes have
conf < 50 and text ≤ 3 chars are dropped as OCR noise
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extracted 4 overlay functions (_get_structure_overlay, _get_columns_overlay,
_get_rows_overlay, _get_words_overlay) that were missing from the initial
split. Provides render_overlay() dispatcher used by sessions module.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit added `cached["word_result"]` but `cached` was
not defined in these functions. Changed to safely check `_cache` dict
first. Also includes sat_threshold fix (70→50) for green text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Green text words like "Insel" and "Internet" had median_sat=65, just
below the threshold of 70, causing them to be classified as black.
Black text has median_sat=6-7, so threshold=50 provides clear
separation (6-7 vs 63-65) without false positives.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The frontend was checking for an existing structure_result and reusing
it, which meant the backend fix (passing word_boxes to graphic detection)
never had a chance to run on existing sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both kombi OCR functions wrote word_result to DB but not to the
in-memory cache. When detect-structure ran next, it found no words
and passed an empty list to graphic detection, making all word-overlap
heuristics ineffective. This caused green text words to be wrongly
classified as graphic regions.
Also adds a fallback in detect-structure to use raw OCR word lists
if cell word_boxes are empty.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two issues in paddle-kombi word merge:
1. Overlap threshold too strict: PaddleOCR "Stick" and Tesseract
"Stück" overlap at 48.6%, just below the 50% threshold. Both words
ended up in the result, overlapping on the same position.
Fix: lower threshold from 50% to 40%.
2. Text selection blind to confidence: always took PaddleOCR text
even when Tesseract had higher confidence and correct text.
Fix: when texts differ due to spatial-only match, prefer the
engine with higher confidence.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When union columns from multiple content zones are applied, column
boundaries can span wider than any single zone's bbox. Using
zone.bbox_px.w as the scale reference caused the total scaled width
to exceed the container, pushing the table off-screen.
Now uses the actual total column width sum as the scale reference,
guaranteeing columns always fit within the container.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
flowing text where random gaps < 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)
Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 25x25 dilation kernel merges nearby green words into large regions,
so pixel-overlap with OCR word boxes drops below 50%. Previous density
checks alone weren't sufficient.
New multi-layered approach:
- Count OCR word CENTROIDS inside each colored region
- ≥2 centroids → definitely text (images don't produce multiple words)
- 1 centroid + 10%+ pixel overlap → likely text
- Lower pixel overlap threshold from 50% to 40%
- Raise density+height thresholds for text-line detection
- Use INFO logging to diagnose remaining false positives
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add color pixel density checks to cv_graphic_detect.py Pass 1:
- density < 20% → skip (text strokes are thin, images are filled)
- density < 30% + height < 4% page → skip (colored text line)
This fixes green headings (Insel, Internet, Inuit) being removed
as graphic regions, which also caused word reordering in lines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous algorithm used binary ink projection and found false
splits at normal text column gaps. The spine of a book on a scanner
has a characteristic DARK gray strip (scanner bed) flanked by bright
white paper on both sides.
New approach: column-mean brightness with heavy smoothing, looking for
a dark valley (< 88% of paper brightness) in the center region that
has bright paper on both sides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: merge gaps within 5% of image width — the spine area may have
thin ink strips splitting one physical gap into multiple detected gaps.
Only use gaps >= 2% width as split points.
Frontend: StepCrop now handles multi_page crop responses without
crashing on missing original_size/cropped_size fields.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OSD 'rotate' returns the clockwise correction needed,
but the code was applying counterclockwise for 90° and clockwise
for 270° — exactly reversed. This caused pages scanned sideways
to be flipped upside down instead of corrected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a book scan (double-page spread) is detected during the crop step,
the system automatically:
1. Detects vertical center gaps (spine area) via ink density projection
2. Splits into N page sub-sessions (reusing existing sub-session mechanism)
3. Individually crops each page (removing its own borders)
4. Returns sub-session IDs for downstream pipeline processing
Detection: landscape images (w > h * 1.15), vertical gap < 15% peak
density in center region (25-75%), gap width >= 0.8% of image width.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.
Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).
Font: add Noto Sans (supports IPA characters) via next/font/google.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>