Extracted 4 overlay functions (_get_structure_overlay, _get_columns_overlay,
_get_rows_overlay, _get_words_overlay) that were missing from the initial
split. Also provides the render_overlay() dispatcher used by the sessions module.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit added `cached["word_result"]` but `cached` was
not defined in these functions. Changed to safely check `_cache` dict
first. Also includes sat_threshold fix (70→50) for green text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Green text words like "Insel" and "Internet" had median_sat=65, just
below the threshold of 70, causing them to be classified as black.
Black text has median_sat=6-7, so threshold=50 provides clear
separation (6-7 vs 63-65) without false positives.
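
A minimal sketch of the threshold decision described above. The function name and the strict `>` comparison are assumptions for illustration; only the values 70, 50, 65 and 6-7 come from this message.

```python
def classify_text_color(median_sat: float, sat_threshold: float = 50.0) -> str:
    """Label a word 'colored' when its median saturation exceeds the threshold.

    Hypothetical helper: the real classifier may also inspect hue or value.
    """
    return "colored" if median_sat > sat_threshold else "black"


green_word_sat = 65.0   # e.g. "Insel", "Internet"
black_word_sat = 7.0    # typical black text

old_result = classify_text_color(green_word_sat, sat_threshold=70.0)
new_result = classify_text_color(green_word_sat)
black_result = classify_text_color(black_word_sat)
```

With the old threshold the green word falls on the black side; with 50 it is classified as colored while black text stays far below the cut.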
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The frontend was checking for an existing structure_result and reusing
it, which meant the backend fix (passing word_boxes to graphic detection)
never had a chance to run on existing sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both kombi OCR functions wrote word_result to DB but not to the
in-memory cache. When detect-structure ran next, it found no words
and passed an empty list to graphic detection, making all word-overlap
heuristics ineffective. This caused green text words to be wrongly
classified as graphic regions.
Also adds a fallback in detect-structure to use raw OCR word lists
if cell word_boxes are empty.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two issues in paddle-kombi word merge:
1. Overlap threshold too strict: PaddleOCR "Stick" and Tesseract
"Stück" overlap at 48.6%, just below the 50% threshold. Both words
ended up in the result, overlapping on the same position.
Fix: lower threshold from 50% to 40%.
2. Text selection blind to confidence: always took PaddleOCR text
even when Tesseract had higher confidence and correct text.
Fix: when texts differ due to spatial-only match, prefer the
engine with higher confidence.
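
The two fixes can be sketched roughly as follows. The `Word` type, the helper names, and the overlap definition (intersection area over the smaller box) are assumptions; the actual merge code may measure overlap differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    left: float
    top: float
    width: float
    height: float
    conf: float


def overlap_ratio(a: Word, b: Word) -> float:
    """Intersection area divided by the smaller box area (assumed metric)."""
    ix = max(0.0, min(a.left + a.width, b.left + b.width) - max(a.left, b.left))
    iy = max(0.0, min(a.top + a.height, b.top + b.height) - max(a.top, b.top))
    smaller = min(a.width * a.height, b.width * b.height)
    return (ix * iy) / smaller if smaller > 0 else 0.0


def merge_pair(paddle: Word, tess: Word, min_overlap: float = 0.40) -> List[Word]:
    """Merge one PaddleOCR word with one Tesseract word at the same position."""
    if overlap_ratio(paddle, tess) < min_overlap:
        return [paddle, tess]          # treated as two distinct words
    if paddle.text != tess.text and tess.conf > paddle.conf:
        return [tess]                  # texts differ: trust the more confident engine
    return [paddle]                    # default: keep the PaddleOCR reading


paddle = Word("Stick", 100, 50, 60, 20, conf=70)
tess = Word("Stück", 131, 50, 62, 20, conf=95)
```

In this made-up example the boxes overlap by a little under 50%, so a 50% threshold keeps both words while 40% merges them and picks the Tesseract text.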
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When union columns from multiple content zones are applied, column
boundaries can span wider than any single zone's bbox. Using
zone.bbox_px.w as the scale reference caused the total scaled width
to exceed the container, pushing the table off-screen.
Now uses the actual total column width sum as the scale reference,
guaranteeing columns always fit within the container.
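
The scaling idea, sketched in Python for consistency with the backend examples (the actual change lives in the frontend). The function name and the plain list of column widths are assumptions.

```python
from typing import List


def scale_columns(col_widths_px: List[float], container_px: float) -> List[float]:
    """Scale column widths so their sum equals the container width.

    Uses the total column width as the reference, not a single zone's bbox.
    """
    total = sum(col_widths_px)
    if total <= 0:
        return [0.0 for _ in col_widths_px]
    scale = container_px / total
    return [w * scale for w in col_widths_px]


scaled = scale_columns([100, 300, 600], container_px=500)
```

Because the reference is the sum itself, the scaled widths always add up to the container width, regardless of how wide the union columns are.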
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
flowing text, where random gaps appear in < 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)
Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 25x25 dilation kernel merges nearby green words into large regions,
so pixel-overlap with OCR word boxes drops below 50%. Previous density
checks alone weren't sufficient.
New multi-layered approach:
- Count OCR word CENTROIDS inside each colored region
- ≥2 centroids → definitely text (images don't produce multiple words)
- 1 centroid + 10%+ pixel overlap → likely text
- Lower pixel overlap threshold from 50% to 40%
- Raise density+height thresholds for text-line detection
- Use INFO logging to diagnose remaining false positives
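
A rough sketch of the centroid-based decision above. The function name, the `(x, y, w, h)` box format, and the order of the checks are assumptions; the thresholds (2 centroids, 10%, 40%) come from this message.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), assumed format


def region_is_text(region: Box, word_boxes: List[Box], pixel_overlap: float) -> bool:
    """Decide whether a dilated colored region is text rather than a graphic."""
    rx, ry, rw, rh = region
    centroids = 0
    for x, y, w, h in word_boxes:
        cx, cy = x + w / 2, y + h / 2
        if rx <= cx <= rx + rw and ry <= cy <= ry + rh:
            centroids += 1
    if centroids >= 2:
        return True                      # images do not produce multiple words
    if centroids == 1 and pixel_overlap >= 0.10:
        return True                      # one word plus some pixel overlap
    return pixel_overlap > 0.40          # fall back to the pixel-overlap check


region = (0, 0, 200, 40)
two_words = [(10, 10, 50, 20), (80, 10, 60, 20)]
```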
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add color pixel density checks to cv_graphic_detect.py Pass 1:
- density < 20% → skip (text strokes are thin, images are filled)
- density < 30% + height < 4% page → skip (colored text line)
This fixes green headings (Insel, Internet, Inuit) being removed
as graphic regions, which also caused word reordering in lines.
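
The two density rules could look roughly like this. The function name and parameters are hypothetical; density is assumed to be colored pixels divided by bounding-box area.

```python
def skip_as_colored_text(color_pixels: int, bbox_w: int, bbox_h: int,
                         page_h: int) -> bool:
    """Return True when a colored region looks like text and should be skipped."""
    area = bbox_w * bbox_h
    if area <= 0:
        return True
    density = color_pixels / area
    if density < 0.20:
        return True                      # thin strokes: text, not a filled image
    if density < 0.30 and bbox_h < 0.04 * page_h:
        return True                      # low, fairly sparse band: colored text line
    return False
```

A green heading typically has a sparse, short bounding box and is skipped, while a filled balloon has high density and is kept.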
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous algorithm used binary ink projection and found false
splits at normal text column gaps. The spine of a book on a scanner
has a characteristic DARK gray strip (scanner bed) flanked by bright
white paper on both sides.
New approach: column-mean brightness with heavy smoothing, looking for
a dark valley (< 88% of paper brightness) in the center region that
has bright paper on both sides.
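
A simplified sketch of the brightness-valley search using NumPy. The moving-average smoothing, the 90th-percentile estimate of paper brightness, the 25-75% center window, and the function name are assumptions; only the 88% valley criterion is taken from this message.

```python
from typing import Optional

import numpy as np


def find_spine_x(gray: np.ndarray, valley_ratio: float = 0.88,
                 smooth_frac: float = 0.02) -> Optional[int]:
    """Return the x position of a dark spine valley, or None if there is none."""
    h, w = gray.shape
    col_mean = gray.mean(axis=0)                     # mean brightness per column
    k = max(3, int(w * smooth_frac) | 1)             # odd smoothing window
    smooth = np.convolve(col_mean, np.ones(k) / k, mode="same")
    paper = float(np.percentile(smooth, 90))         # assumed paper brightness
    lo, hi = int(w * 0.25), int(w * 0.75)            # assumed center region
    if hi - lo < 1:
        return None
    x = lo + int(np.argmin(smooth[lo:hi]))
    limit = valley_ratio * paper
    if smooth[x] >= limit:
        return None                                  # no sufficiently dark valley
    # Require bright paper somewhere on both sides of the valley.
    if smooth[:x].max() < limit or smooth[x + 1:].max() < limit:
        return None
    return x


page = np.full((100, 400), 240, dtype=np.uint8)      # bright paper
page[:, 195:205] = 120                               # dark strip at the spine
spine_x = find_spine_x(page)
```

Unlike a binary ink projection, column means barely dip at ordinary text column gaps, so a uniform page yields no split.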
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: merge gaps within 5% of image width — the spine area may have
thin ink strips splitting one physical gap into multiple detected gaps.
Only use gaps >= 2% width as split points.
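
The backend gap handling, sketched under the assumption that gaps are `(start_x, end_x)` pixel intervals and that "within 5%" refers to the distance between neighboring gaps. The function name is hypothetical.

```python
from typing import List, Tuple

Gap = Tuple[int, int]  # (start_x, end_x) in pixels, assumed format


def merge_and_filter_gaps(gaps: List[Gap], image_w: int,
                          merge_frac: float = 0.05,
                          min_frac: float = 0.02) -> List[Gap]:
    """Merge nearby gaps, then keep only gaps wide enough to split on."""
    merged: List[Gap] = []
    for start, end in sorted(gaps):
        if merged and start - merged[-1][1] <= merge_frac * image_w:
            # Thin ink strip between two gaps: treat them as one physical gap.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if (e - s) >= min_frac * image_w]


splits = merge_and_filter_gaps([(480, 490), (495, 510), (100, 105)], image_w=1000)
```

Here the two fragments around x=490 merge into one 30px gap that passes the 2% filter, while the isolated 5px gap is dropped.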
Frontend: StepCrop now handles multi_page crop responses without
crashing on missing original_size/cropped_size fields.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OSD 'rotate' returns the clockwise correction needed,
but the code was applying counterclockwise for 90° and clockwise
for 270° — exactly reversed. This caused pages scanned sideways
to be flipped upside down instead of corrected.
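
A small sketch of the corrected mapping using NumPy, assuming the OSD value is one of 0, 90, 180 or 270 and is interpreted as a clockwise correction as stated above. The function name is hypothetical; `np.rot90` rotates counterclockwise for positive `k`, so a negative `k` is used.

```python
import numpy as np


def apply_osd_rotation(img: np.ndarray, rotate_deg: int) -> np.ndarray:
    """Rotate the image clockwise by the angle reported by OSD."""
    quarter_turns = (rotate_deg // 90) % 4
    return np.rot90(img, k=-quarter_turns)   # negative k means clockwise


img = np.array([[1, 2],
                [3, 4]])
cw90 = apply_osd_rotation(img, 90)
cw270 = apply_osd_rotation(img, 270)
```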
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a book scan (double-page spread) is detected during the crop step,
the system automatically:
1. Detects vertical center gaps (spine area) via ink density projection
2. Splits into N page sub-sessions (reusing existing sub-session mechanism)
3. Individually crops each page (removing its own borders)
4. Returns sub-session IDs for downstream pipeline processing
Detection: landscape images (w > h * 1.15), vertical gap below 15% of peak
density in the center region (25-75%), gap width >= 0.8% of image width.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.
Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).
Font: add Noto Sans (supports IPA characters) via next/font/google.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix_cell_phonetics was only called in the OCR pipeline endpoints
(/words, /cells) but not in the combo mode (build-grid / ocr-overlay).
Garbled IPA like [teist] is now corrected to [teɪst] using the
IPA dictionary, same as in the pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter recovered single-char artifacts (!, ?, •) from box zones
where they are decorative noise, not real text markers
2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that
stretch across multiple columns with colored text. Merge their
cells into a single spanning cell in column 0.
3. Fix missing opening parentheses: when cell text has ")" but no
matching "(", prepend "(" to the text.
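
Item 3 is simple enough to sketch directly. The function name is hypothetical, and the real fix may handle more cases, such as multiple unmatched parentheses.

```python
def fix_missing_open_paren(text: str) -> str:
    """Prepend '(' when the text contains ')' but no '(' at all."""
    if ")" in text and "(" not in text:
        return "(" + text
    return text


fixed = fix_missing_open_paren("sb.)")
```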
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Load structure_result from session to get detected graphic bounds
- Exclude OCR words whose center falls inside a graphic region
- Exclude recovered colored text inside graphic regions
- Reject color recovery regions wider than 4x median word height
Fixes garbage characters (!, ?, •) in box zones and false OCR
detections (N, ?) in image areas.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous version only checked X overlap, causing false positives for
short words like "=" and "I" that appear at similar X positions in
different rows. Now requires >=50% overlap in both dimensions.
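
A sketch of the two-dimensional check. The helper names, the `(left, top, width, height)` box format, and measuring overlap relative to the shorter extent are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height), assumed


def overlap_1d(a_start: float, a_len: float, b_start: float, b_len: float) -> float:
    """Overlap of two intervals as a fraction of the shorter one."""
    inter = min(a_start + a_len, b_start + b_len) - max(a_start, b_start)
    shorter = min(a_len, b_len)
    return max(0.0, inter) / shorter if shorter > 0 else 0.0


def same_position(a: Box, b: Box, min_overlap: float = 0.5) -> bool:
    """True only when the boxes overlap enough in both X and Y."""
    x_ok = overlap_1d(a[0], a[2], b[0], b[2]) >= min_overlap
    y_ok = overlap_1d(a[1], a[3], b[1], b[3]) >= min_overlap
    return x_ok and y_ok


equals_row1 = (100, 50, 10, 20)
equals_row2 = (100, 120, 10, 20)   # same X, different row
```

The two "=" boxes share their X range completely but do not overlap in Y, so they are no longer treated as the same word.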
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR can return overlapping phrases (e.g. "von jm." and "jm. =")
that produce duplicate words after splitting. Added _deduplicate_words()
post-merge pass that removes words with same text at overlapping positions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Words on the same visual line can have slightly different top values
(1-6px). Sorting by (top, left) produced wrong word order in the
frontend display. Now uses _group_words_into_lines to group by Y
proximity first, then sort by X within each line.
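
The grouping idea can be sketched as below. This is not the project's `_group_words_into_lines`; the function name, the dict-based word format, and the tolerance of half the median word height are assumptions.

```python
from typing import Dict, List, Optional


def group_into_lines(words: List[Dict], y_tol: Optional[float] = None) -> List[List[Dict]]:
    """Group words by Y proximity, then sort each line left to right."""
    if not words:
        return []
    if y_tol is None:
        heights = sorted(w["height"] for w in words)
        y_tol = heights[len(heights) // 2] * 0.5     # half the median height
    lines: List[List[Dict]] = []
    for w in sorted(words, key=lambda w: w["top"]):
        # Compare against the first (topmost) word of the current line.
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w["left"]) for line in lines]


words = [
    {"text": "to", "top": 98, "left": 150, "height": 20},
    {"text": "lend", "top": 100, "left": 100, "height": 20},
    {"text": "next", "top": 140, "left": 100, "height": 20},
]
lines = [[w["text"] for w in line] for line in group_into_lines(words)]
```

A plain `(top, left)` sort would put "to" before "lend" here because its top is 2px higher; grouping first restores the reading order.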
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Colored-pixel fragments in narrow inter-word gaps were being recovered
as false characters (e.g., "!" between "lend" and "sb."), disrupting
word order. Use adaptive padding based on median word height instead
of fixed 4px.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
avg word length <=2 chars (bullets, numbering) are merged into
the adjacent text column. Fixes box zones getting 2 columns when
bullet points are just indentation markers.
2. Improve ghost filter: check word edges (left/right/top/bottom)
against border bands instead of center-only. Catches = at x=947
whose left edge touches the box border.
3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _filter_border_ghosts() to grid editor: removes OCR artifacts
like | sitting on box borders before row/column clustering.
The tall | (h=55) was inflating row 0's y_max, causing row overlap.
2. Fix _assign_word_to_row() to prefer closest y_center when rows
overlap, instead of always returning the first matching row.
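
A sketch of the closest-center rule from item 2. This is not the actual `_assign_word_to_row`; the `(y_min, y_max)` row format and the fallback to the nearest row when no row contains the word are assumptions.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float]  # (y_min, y_max), assumed format


def assign_to_row(word_y_center: float, rows: List[Row]) -> Optional[int]:
    """Return the index of the matching row whose center is closest to the word."""
    if not rows:
        return None
    candidates = [i for i, (y_min, y_max) in enumerate(rows)
                  if y_min <= word_y_center <= y_max]
    if not candidates:
        candidates = list(range(len(rows)))          # assumed fallback: nearest row
    return min(candidates,
               key=lambda i: abs((rows[i][0] + rows[i][1]) / 2 - word_y_center))


rows = [(0, 60), (40, 80)]          # row 0 inflated so that it overlaps row 1
row_index = assign_to_row(55, rows)
```

A first-match rule would return row 0 for a word at y=55; the closest-center rule picks row 1.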
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs word positions, median height, Y tolerance, and resulting
rows for zones with <= 30 words to diagnose row merging issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a cell has colored words (red !, blue phonetics), render each
word as a separate span with its own color instead of coloring the
entire input text with the first non-black color found.
Switches to editable input on cell selection (click).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone 4 found 4 columns incl. page_ref, union also yields 4.
The strict > check prevented union from applying to Zone 0.
Changed to >= so all content zones get the merged column set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.
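
One way to picture the union of split points. The function name, the clustering tolerance of 20px, and the averaging of nearby split points are assumptions made for illustration.

```python
from typing import List


def union_split_points(zone_splits: List[List[float]], tol_px: float = 20) -> List[int]:
    """Collect split X positions from all zones and merge near-duplicates."""
    all_x = sorted(x for splits in zone_splits for x in splits)
    clusters: List[List[float]] = []
    for x in all_x:
        if clusters and x - clusters[-1][-1] <= tol_px:
            clusters[-1].append(x)       # same column seen in another zone
        else:
            clusters.append([x])
    return [round(sum(c) / len(c)) for c in clusters]


# Zone above the box misses page_ref; the zone below finds it at x=132.
merged = union_split_points([[300, 620], [132, 304, 618]])
```

The page_ref split at x=132 appears in only one zone but survives in the merged set that is applied to every content zone.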
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page references (p.55, p.57) and marker columns (!) appear in very few
rows (< 12% coverage) but sit at the far left/right margin with a clear
gap to the main content. Add a third detection tier that catches these
narrow margin columns when they have >= 2 distinct rows and are within
15% of the content edge with >= 40px gap to the nearest main column.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Global column detection diluted narrow sub-columns (page refs, markers)
because they appeared in too few rows relative to the total. Instead,
detect columns per zone independently, then propagate the best columns
(from the content zone with the most words) to smaller content zones.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content zones (above/between/below boxes) now share the same column
structure: columns are detected once from ALL content-zone words, then
applied to each content zone. Box zones still detect columns independently.
This fixes the issue where narrow columns (page refs like p.55) were not
detected in small content zones above boxes, even though the same column
existed in the larger content zone below the box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Enrich column geometries with original full-page words (box-filtered)
so _detect_sub_columns() finds narrow sub-columns across box boundaries
- Add inline marker guard: bullet points (1., 2., •) are not split into
sub-columns (minimum gap check: 1.2× word height or 20px)
- Add box_rects parameter to build_grid_from_words() — words inside boxes
are excluded from X-gap column clustering
- Pass box rects from zones to words_first grid builder
- Add 9 tests for box-aware column detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New approach: dilate color mask heavily (25x25) to merge nearby colored
pixels into regions, then check word overlap:
- >50% overlap with OCR word boxes → colored text → skip
- <50% overlap → colored image/graphic → keep
This detects balloon clusters as one "image" region instead of trying
to classify individual shapes. Red words like "borrow/lend" are filtered
because they overlap with their word boxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 5x5 MORPH_CLOSE was connecting scattered color pixels into one
page-spanning contour that swallowed individual balloons. Fix:
- Remove MORPH_CLOSE, keep only MORPH_OPEN for speckle removal
- Lower sat threshold 50→40 to catch more colored elements
- Filter contours spanning >50% of width OR height (was AND)
- Filter contours >10% of image area
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass 1 (color): Detect colored graphics on HSV saturation channel.
Black text is invisible on this channel, so no word exclusion needed.
Catches colored balloons, arrows, icons reliably.
Pass 2 (ink): Detect large black illustrations on dark ink mask
minus word exclusion. Only keeps area > 5000 to avoid text fragments.
Fixes: all 5 balloons now detectable (previously word exclusion zones
were eating colored graphics that overlapped with nearby OCR words).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Text fragments after word exclusion are indistinguishable from arrows
and icons via contour metrics. Since the goal is detecting graphics,
images, boxes and colors (not arrows/icons), simplify to only:
- circle/balloon (circularity > 0.55 — very reliable)
- illustration (area > 3000 — clearly non-text)
Boxes and colors are handled by cv_box_detect and cv_color_detect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Lower min_area from 200 to 80 (small balloons ~100-300px²)
- Lower word_pad from 10 to 5 (10px was eating nearby graphics)
- Relax circle detection: circularity>0.55, min_dim>15 (was 0.70/25)
- Text fragments still filtered by _classify_shape noise threshold
- Add ACCEPT logging for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise min_area from 30 to 200 (text fragments are small)
- Raise word_pad from 3 to 10px (OCR bboxes are tight)
- Reduce morph close kernel from 5x5 to 3x3 (avoid reconnecting text)
- Tighten arrow detection: min 20px, circularity<0.35, >=2 defects
- Add 'noise' category for too-small elements, filter them out
- Raise min dimension from 4 to 8px
- Add debug logging for word count and exclusion coverage
- Raise max_area_ratio to 0.25 (allow larger illustrations)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graphic detection needs word positions to exclude text from the ink mask.
Previously Struktur ran before OCR, causing every word to be detected as
a graphic element. Now:
- Pipeline: Struktur at index 7 (after Wörter)
- Kombi: Struktur at index 5 (after PP-OCRv5+Tesseract, before Tabelle)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cv_graphic_detect.py for detecting non-text visual elements (arrows,
circles, lines, exclamation marks, icons, illustrations). Draw detected
graphics on structure overlay image and display them in the frontend
StepStructureDetection component with shape counts and individual listings.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Insert the Struktur detection step between Zuschneiden and
PP-OCRv5+Tesseract in the Kombi pipeline on /ai/ocr-overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline step between Crop and Columns that visualizes detected
document structure: boxes (line-based + shading), page zones, and
color regions. Shows original image on the left, annotated overlay
on the right.
Backend: POST /detect-structure endpoint + /image/structure-overlay
Frontend: StepStructureDetection component with zone/box/color details
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously color/shading detection only ran as fallback when no line-based
boxes were found. Now both methods run in parallel with result merging,
so smaller shaded boxes (like "German leihen") get detected even when
larger bordered boxes are already found. Uses median-blur background
analysis that works for both colored and grayscale/B&W scans.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>