Words on the same visual line can have slightly different top values
(1-6px). Sorting by (top, left) produced wrong word order in the
frontend display. Now uses _group_words_into_lines to group by Y
proximity first, then sort by X within each line.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Colored-pixel fragments in narrow inter-word gaps were being recovered
as false characters (e.g., "!" between "lend" and "sb."), disrupting
word order. Use adaptive padding based on median word height instead
of fixed 4px.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
avg word length <=2 chars (bullets, numbering) are merged into
the adjacent text column. Fixes box zones getting 2 columns when
bullet points are just indentation markers.
2. Improve ghost filter: check word edges (left/right/top/bottom)
against border bands instead of center-only. Catches = at x=947
whose left edge touches the box border.
3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts
like | sitting on box borders before row/column clustering.
The tall | (h=55) was inflating row 0's y_max, causing row overlap.
2. Fix _assign_word_to_row() to prefer closest y_center when rows
overlap, instead of always returning the first matching row.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs word positions, median height, Y tolerance, and resulting
rows for zones with <= 30 words to diagnose row merging issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone 4 found 4 columns incl. page_ref, union also yields 4.
The strict > check prevented union from applying to Zone 0.
Changed to >= so all content zones get the merged column set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page references (p.55, p.57) and marker columns (!) appear in very few
rows (< 12% coverage) but sit at the far left/right margin with a clear
gap to the main content. Add a third detection tier that catches these
narrow margin columns when they have >= 2 distinct rows and are within
15% of the content edge with >= 40px gap to the nearest main column.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Global column detection diluted narrow sub-columns (page refs, markers)
because they appeared in too few rows relative to the total. Instead,
detect columns per zone independently, then propagate the best columns
(from the content zone with the most words) to smaller content zones.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content zones (above/between/below boxes) now share the same column
structure: columns are detected once from ALL content-zone words, then
applied to each content zone. Box zones still detect columns independently.
This fixes the issue where narrow columns (page refs like p.55) were not
detected in small content zones above boxes, even though the same column
existed in the larger content zone below the box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Enrich column geometries with original full-page words (box-filtered)
so _detect_sub_columns() finds narrow sub-columns across box boundaries
- Add inline marker guard: bullet points (1., 2., •) are not split into
sub-columns (minimum gap check: 1.2× word height or 20px)
- Add box_rects parameter to build_grid_from_words() — words inside boxes
are excluded from X-gap column clustering
- Pass box rects from zones to words_first grid builder
- Add 9 tests for box-aware column detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New approach: dilate color mask heavily (25x25) to merge nearby colored
pixels into regions, then check word overlap:
- >50% overlap with OCR word boxes → colored text → skip
- <50% overlap → colored image/graphic → keep
This detects balloon clusters as one "image" region instead of trying
to classify individual shapes. Red words like "borrow/lend" are filtered
because they overlap with their word boxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 5x5 MORPH_CLOSE was connecting scattered color pixels into one
page-spanning contour that swallowed individual balloons. Fix:
- Remove MORPH_CLOSE, keep only MORPH_OPEN for speckle removal
- Lower sat threshold 50→40 to catch more colored elements
- Filter contours spanning >50% of width OR height (was AND)
- Filter contours >10% of image area
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass 1 (color): Detect colored graphics on HSV saturation channel.
Black text is invisible on this channel, so no word exclusion needed.
Catches colored balloons, arrows, icons reliably.
Pass 2 (ink): Detect large black illustrations on dark ink mask
minus word exclusion. Only keeps area > 5000 to avoid text fragments.
Fixes: all 5 balloons now detectable (previously word exclusion zones
were eating colored graphics that overlapped with nearby OCR words).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Text fragments after word exclusion are indistinguishable from arrows
and icons via contour metrics. Since the goal is detecting graphics,
images, boxes and colors (not arrows/icons), simplify to only:
- circle/balloon (circularity > 0.55 — very reliable)
- illustration (area > 3000 — clearly non-text)
Boxes and colors are handled by cv_box_detect and cv_color_detect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Lower min_area from 200 to 80 (small balloons ~100-300px²)
- Lower word_pad from 10 to 5 (10px was eating nearby graphics)
- Relax circle detection: circularity>0.55, min_dim>15 (was 0.70/25)
- Text fragments still filtered by _classify_shape noise threshold
- Add ACCEPT logging for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise min_area from 30 to 200 (text fragments are small)
- Raise word_pad from 3 to 10px (OCR bboxes are tight)
- Reduce morph close kernel from 5x5 to 3x3 (avoid reconnecting text)
- Tighten arrow detection: min 20px, circularity<0.35, >=2 defects
- Add 'noise' category for too-small elements, filter them out
- Raise min dimension from 4 to 8px
- Add debug logging for word count and exclusion coverage
- Raise max_area_ratio to 0.25 (allow larger illustrations)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cv_graphic_detect.py for detecting non-text visual elements (arrows,
circles, lines, exclamation marks, icons, illustrations). Draw detected
graphics on structure overlay image and display them in the frontend
StepStructureDetection component with shape counts and individual listings.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline step between Crop and Columns that visualizes detected
document structure: boxes (line-based + shading), page zones, and
color regions. Shows original image on the left, annotated overlay
on the right.
Backend: POST /detect-structure endpoint + /image/structure-overlay
Frontend: StepStructureDetection component with zone/box/color details
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously color/shading detection only ran as fallback when no line-based
boxes were found. Now both methods run in parallel with result merging,
so smaller shaded boxes (like "German leihen") get detected even when
larger bordered boxes are already found. Uses median-blur background
analysis that works for both colored and grayscale/B&W scans.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Median hue instead of mean (robust to background contamination)
- Otsu threshold instead of fixed 180 (adapts to colored backgrounds)
- Background sampling from border pixels with hue-distance filter
- Higher sat_threshold (70) + min_sat_ratio (25%) to reduce false positives
- Classify using saturated pixels only for cleaner hue signal
Fixes: borrow/lend misdetected as orange (actually red, median_H=5)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
(e.g. red ! markers) using HSV masks + contour detection
Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.
Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts
from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge,
KombiCompareStep component for parallel execution and side-by-side overlay,
and wordResultOverride prop on OverlayReconstruction for direct data injection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge
so words at the same position are deduplicated even when OCR text differs.
Frontend: Add yPct/hPct to WordPosition so each word renders at its actual
vertical position instead of all words collapsing to the cell center Y.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns entire phrases as single boxes (e.g. "More than 200
singers took part in the"). The merge algorithm compared word-by-word
but Paddle had multi-word boxes vs Tesseract's individual words, so
nothing matched and all Tesseract words were added as "extras" causing
duplicates. Now splits Paddle boxes into individual words before merge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces position-based word matching with row-based sequence alignment
to fix doubled words and cross-line averaging in Kombi-Modus.
New algorithm:
1. Group words into rows by Y-position clustering
2. Match rows between engines by vertical center proximity
3. Within each row: walk both sequences left-to-right, deduplicating
4. Unmatched rows kept as-is
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Even after multi-criteria matching, near-duplicate words can slip through
(same text, centers within 30px horizontal / 15px vertical). The new
_deduplicate_words() removes these, keeping the higher-confidence copy.
Regression test with real session data (row 2 with 145 near-dupes)
confirms no duplicates remain after merge + deduplication.
Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The merge algorithm now uses 3 criteria instead of just IoU > 0.3:
1. IoU > 0.15 (relaxed threshold)
2. Center proximity < word height AND same row
3. Text similarity > 0.7 AND same row
This prevents doubled overlapping words when both PaddleOCR and
Tesseract find the same word at similar positions. Unique words
from either engine (e.g. bullets from Tesseract) are still added.
Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity,
_words_match tests + deduplication regression test).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs both OCR engines on the preprocessed image and merges results:
word boxes matched by IoU, coordinates averaged by confidence weight.
Unmatched Tesseract words (bullets, symbols) are added for better coverage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When PaddleOCR returns "!Betonung" as a single word box, the overlay
positions text starting at the "!" instead of the actual word. Split
such boxes into ["!", "Betonung"] with proportional position splitting,
matching the existing IPA bracket splitting logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces custom _paddle_words_to_grid_cells with the proven
build_grid_from_words from cv_words_first.py — same function the
regular pipeline uses with PaddleOCR. Handles phrase splitting,
column clustering, and produces cells with word_boxes that the
slide/cluster positioning hooks expect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
One cell per row with all words as word_boxes instead of one cell per
word. Gives OverlayReconstruction a row-spanning bbox_pct for correct
font sizing and per-word positions for slide/cluster placement.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Uses the cropped/dewarped image instead of the original so the overlay
shows the correctly oriented page. 5 steps instead of 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing
7-step pipeline. Backend endpoint runs PaddleOCR on the original image
and clusters words into rows/cells directly. Frontend adds a mode
toggle and PaddleDirectStep component.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer
produces "badge [bˈædʒ]" with space, creating a token count mismatch
between cell.text and word_boxes. Now also split at "[" boundaries
so each IPA bracket gets its own sub-box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns phrase-level bounding boxes (e.g. "competition
[kompa'tifn]" as one box) but the overlay slide mechanism expects
one box per word for accurate positioning. Multi-word boxes are now
split proportionally by character count with small gaps between words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
words_first was storing word_boxes in percent coordinates while
cv_cell_grid.py uses absolute pixel coordinates. The overlay slide
mechanism divides by imgW to get percentages, so percent-in-percent
caused positions near zero. Now both grid builders use the same format.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>