New approach: dilate color mask heavily (25x25) to merge nearby colored
pixels into regions, then check word overlap:
- >50% overlap with OCR word boxes → colored text → skip
- <50% overlap → colored image/graphic → keep
This detects balloon clusters as one "image" region instead of trying
to classify individual shapes. Red words like "borrow/lend" are filtered
because they overlap with their word boxes.
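
A minimal sketch of the overlap rule above, assuming regions and word boxes are (x, y, width, height) tuples in pixels and that word boxes do not overlap each other. The helper names and the handling of exactly 50% overlap are illustrative assumptions, not the code in this commit:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def intersection_area(a: Box, b: Box) -> int:
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(0, w) * max(0, h)


def word_overlap_ratio(region: Box, word_boxes: List[Box]) -> float:
    """Fraction of the region covered by OCR word boxes.

    Assumes the word boxes are disjoint, so summing intersections is safe.
    """
    area = region[2] * region[3]
    if area == 0:
        return 0.0
    covered = sum(intersection_area(region, w) for w in word_boxes)
    return min(1.0, covered / area)


def is_colored_graphic(region: Box, word_boxes: List[Box],
                       text_overlap: float = 0.5) -> bool:
    """Keep the region as a graphic when less than half of it is text.

    Exactly 50% is treated as text here; the commit does not say which
    side that boundary falls on.
    """
    return word_overlap_ratio(region, word_boxes) < text_overlap
```

A red word whose dilated region sits inside its own word box gets a ratio near 1.0 and is skipped, while a balloon cluster that only touches a neighboring word stays well below 0.5 and is kept.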
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 5x5 MORPH_CLOSE was connecting scattered color pixels into one
page-spanning contour that swallowed individual balloons. Fix:
- Remove MORPH_CLOSE, keep only MORPH_OPEN for speckle removal
- Lower saturation threshold 50→40 to catch more colored elements
- Filter contours spanning >50% of width OR height (was AND)
- Filter contours >10% of image area
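
The two size filters above can be sketched as follows. This is an illustration only: it works on bounding boxes rather than true contour areas, and the function and parameter names are assumptions:

```python
from typing import Tuple


def keep_contour(bbox: Tuple[int, int, int, int],
                 image_size: Tuple[int, int],
                 max_span: float = 0.5,
                 max_area_ratio: float = 0.10) -> bool:
    """Return False for contours that look like page-spanning artifacts.

    bbox is (x, y, width, height); image_size is (width, height).
    The bounding-box area stands in for the contour area (an assumption).
    """
    _, _, w, h = bbox
    img_w, img_h = image_size
    # OR, not AND: a contour that is too wide or too tall is rejected.
    if w > max_span * img_w or h > max_span * img_h:
        return False
    # Reject anything covering more than 10% of the page.
    if w * h > max_area_ratio * img_w * img_h:
        return False
    return True
```

With AND, a thin horizontal streak spanning the page width would pass because its height is small; OR rejects it.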
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass 1 (color): Detect colored graphics on HSV saturation channel.
Black text is invisible on this channel, so no word exclusion needed.
Catches colored balloons, arrows, icons reliably.
Pass 2 (ink): Detect large black illustrations on dark ink mask
minus word exclusion. Keeps only contours with area > 5000 px² to avoid text fragments.
Fixes: all 5 balloons now detectable (previously word exclusion zones
were eating colored graphics that overlapped with nearby OCR words).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Text fragments after word exclusion are indistinguishable from arrows
and icons via contour metrics. Since the goal is detecting graphics,
images, boxes and colors (not arrows/icons), simplify to only:
- circle/balloon (circularity > 0.55 — very reliable)
- illustration (area > 3000 — clearly non-text)
Boxes and colors are handled by cv_box_detect and cv_color_detect.
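
For reference, circularity is 4·π·area / perimeter², which is 1.0 for a perfect circle and drops toward 0 for elongated shapes. A minimal sketch of the simplified two-way classification, assuming area and perimeter are already measured; the function names and the order of the checks are assumptions:

```python
import math
from typing import Optional


def circularity(area: float, perimeter: float) -> float:
    """4*pi*A / P^2: 1.0 for a circle, smaller for elongated shapes."""
    if perimeter <= 0:
        return 0.0
    return 4.0 * math.pi * area / (perimeter * perimeter)


def classify_shape_sketch(area: float, perimeter: float) -> Optional[str]:
    """Return 'circle', 'illustration', or None (rejected as a fragment)."""
    if circularity(area, perimeter) > 0.55:
        return "circle"
    if area > 3000:
        return "illustration"
    return None
```

A circle of radius 20 px has circularity 1.0, a 200x20 bar about 0.26, so the bar is only kept when its area passes the illustration threshold.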
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Lower min_area from 200 to 80 (small balloons ~100-300px²)
- Lower word_pad from 10 to 5 (10px was eating nearby graphics)
- Relax circle detection: circularity>0.55, min_dim>15 (was 0.70/25)
- Text fragments still filtered by _classify_shape noise threshold
- Add ACCEPT logging for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise min_area from 30 to 200 (text fragments are small)
- Raise word_pad from 3 to 10px (OCR bboxes are tight)
- Reduce morph close kernel from 5x5 to 3x3 (avoid reconnecting text)
- Tighten arrow detection: min 20px, circularity<0.35, >=2 defects
- Add 'noise' category for too-small elements, filter them out
- Raise min dimension from 4 to 8px
- Add debug logging for word count and exclusion coverage
- Raise max_area_ratio to 0.25 (allow larger illustrations)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graphic detection needs word positions to exclude text from the ink mask.
Previously Struktur ran before OCR, causing every word to be detected as
a graphic element. Now:
- Pipeline: Struktur at index 7 (after Wörter)
- Kombi: Struktur at index 5 (after PP-OCRv5+Tesseract, before Tabelle)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cv_graphic_detect.py for detecting non-text visual elements (arrows,
circles, lines, exclamation marks, icons, illustrations). Draw detected
graphics on structure overlay image and display them in the frontend
StepStructureDetection component with shape counts and individual listings.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Insert the Struktur detection step between Zuschneiden and
PP-OCRv5+Tesseract in the Kombi pipeline on /ai/ocr-overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline step between Crop and Columns that visualizes detected
document structure: boxes (line-based + shading), page zones, and
color regions. Shows original image on the left, annotated overlay
on the right.
Backend: POST /detect-structure endpoint + /image/structure-overlay
Frontend: StepStructureDetection component with zone/box/color details
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously color/shading detection only ran as fallback when no line-based
boxes were found. Now both methods run in parallel with result merging,
so smaller shaded boxes (like "German leihen") get detected even when
larger bordered boxes are already found. Uses median-blur background
analysis that works for both colored and grayscale/B&W scans.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Median hue instead of mean (robust to background contamination)
- Otsu threshold instead of fixed 180 (adapts to colored backgrounds)
- Background sampling from border pixels with hue-distance filter
- Higher sat_threshold (70) + min_sat_ratio (25%) to reduce false positives
- Classify using saturated pixels only for cleaner hue signal
Fixes: borrow/lend misdetected as orange (actually red, median_H=5)
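
A small illustration of why the median is the safer statistic here, using OpenCV's 8-bit hue scale (0-179). The hue bins and the sample values are made up for the example and are not the thresholds used in the module:

```python
from statistics import mean, median


def hue_name(h: float) -> str:
    """Coarse, hypothetical hue bins on OpenCV's 0-179 scale."""
    if h < 10 or h >= 170:
        return "red"
    if h < 25:
        return "orange"
    return "other"


# Seven saturated pixels of red text plus three contaminating pixels
# (for example from a yellowish background bleeding into the word box).
hues = [5] * 7 + [28] * 3

mean_hue = mean(hues)      # pulled toward the contamination
median_hue = median(hues)  # stays on the dominant text color
```

Here the mean lands at 11.9 and is binned as orange, while the median stays at 5 and is binned as red.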
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add color/color_name/recovered fields to OcrWordBox type
- GridTable: show colored text + left-edge color indicator strip
- GridEditor: show color stats and recovered count in summary bar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
(e.g. red ! markers) using HSV masks + contour detection
Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.
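
A minimal sketch of the filtering idea, assuming each row is a list of (left, right) word extents in pixels. The gap threshold, the tolerance, and the function names are illustrative assumptions:

```python
from typing import List, Tuple

Word = Tuple[int, int]  # (left, right) in pixels


def group_start_edges(row: List[Word], gap_threshold: int) -> List[int]:
    """Left edges of words that open a new group within one row.

    A word opens a group if it is the first word in the row or if the gap
    to the previous word exceeds gap_threshold.
    """
    starts = []
    prev_right = None
    for left, right in sorted(row):
        if prev_right is None or left - prev_right > gap_threshold:
            starts.append(left)
        prev_right = right
    return starts


def cluster_edges(edges: List[int], tolerance: int) -> List[List[int]]:
    """Greedy 1-D clustering: an edge joins the current cluster if it is
    within tolerance of the cluster's last (largest) member."""
    clusters: List[List[int]] = []
    for edge in sorted(edges):
        if clusters and edge - clusters[-1][-1] <= tolerance:
            clusters[-1].append(edge)
        else:
            clusters.append([edge])
    return clusters


rows = [
    [(50, 120), (130, 200), (400, 470), (480, 560)],
    [(52, 110), (398, 450), (460, 520)],
]
edges = [e for row in rows for e in group_start_edges(row, gap_threshold=40)]
columns = cluster_edges(edges, tolerance=15)
```

Only the edges near 50 and 400 survive, so two column clusters remain; without the group-start filter the mid-phrase edges at 130, 460, and 480 would add false columns.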
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.
Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Since ocr_region_paddle() now runs RapidOCR locally (same PP-OCRv5 models),
the "PaddleOCR (Hetzner)" labels were misleading. Renamed to "PP-OCRv5 (lokal)".
Removed the Kombi-Vergleich tab since both sides would produce identical results.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts
from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass
the local engine and use remote PaddleOCR.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge,
KombiCompareStep component for parallel execution and side-by-side overlay,
and wordResultOverride prop on OverlayReconstruction for direct data injection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge
so words at the same position are deduplicated even when OCR text differs.
Frontend: Add yPct/hPct to WordPosition so each word renders at its actual
vertical position instead of all words collapsing to the cell center Y.
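
The backend overlap check can be sketched as a one-dimensional IoU on the horizontal extents of two boxes. The function names are illustrative, and the sketch assumes both words are already known to be on the same row:

```python
from typing import Tuple

Span = Tuple[float, float]  # (x_left, x_right) in pixels


def horizontal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two horizontal spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union


def same_position(a: Span, b: Span, min_iou: float = 0.5) -> bool:
    """Treat two words as the same word when their spans mostly coincide,
    regardless of what text each engine recognized."""
    return horizontal_iou(a, b) >= min_iou
```

Two engines reading the same word with slightly different boxes give an IoU well above 0.5, while adjacent words that merely touch stay close to 0.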
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns entire phrases as single boxes (e.g. "More than 200
singers took part in the"). The merge algorithm compared word by word,
but Paddle's multi-word boxes never matched Tesseract's individual words,
so every Tesseract word was added as an "extra", causing duplicates.
Paddle boxes are now split into individual words before the merge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR 3.4.0 removed 'latin' language support. Use 'en' with
explicit ocr_version='PP-OCRv5' instead, with fallback for older API.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces position-based word matching with row-based sequence alignment
to fix doubled words and cross-line averaging in Kombi-Modus.
New algorithm:
1. Group words into rows by Y-position clustering
2. Match rows between engines by vertical center proximity
3. Within each row: walk both sequences left-to-right, deduplicating
4. Unmatched rows kept as-is
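
Steps 1 and 2 can be sketched as follows, assuming each word is a dict with an "x" position and a vertical center "cy". The field names, tolerances, and function names are assumptions made for illustration:

```python
from typing import Dict, List, Tuple


def group_into_rows(words: List[Dict], tolerance: float = 10.0) -> List[Dict]:
    """Step 1: cluster words into rows by vertical center."""
    rows: List[Dict] = []
    for word in sorted(words, key=lambda w: w["cy"]):
        if rows and abs(word["cy"] - rows[-1]["cy"]) <= tolerance:
            row = rows[-1]
            row["words"].append(word)
            # Keep the row center as the mean of its members.
            row["cy"] = sum(w["cy"] for w in row["words"]) / len(row["words"])
        else:
            rows.append({"cy": float(word["cy"]), "words": [word]})
    for row in rows:
        row["words"].sort(key=lambda w: w["x"])  # left-to-right order
    return rows


def match_rows(rows_a: List[Dict], rows_b: List[Dict],
               max_dist: float = 10.0) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Step 2: pair each row of engine A with the nearest unused row of
    engine B. Returns the index pairs and the unmatched B row indices."""
    pairs, used = [], set()
    for i, row_a in enumerate(rows_a):
        best = None
        for j, row_b in enumerate(rows_b):
            if j in used:
                continue
            dist = abs(row_a["cy"] - row_b["cy"])
            if dist <= max_dist and (best is None or dist < best[0]):
                best = (dist, j)
        if best is not None:
            used.add(best[1])
            pairs.append((i, best[1]))
    unmatched_b = [j for j in range(len(rows_b)) if j not in used]
    return pairs, unmatched_b
```

Matched row pairs would then be walked left to right for deduplication (step 3), and rows without a partner are passed through unchanged (step 4).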
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Even after multi-criteria matching, near-duplicate words can slip through
(same text, centers within 30px horizontal / 15px vertical). The new
_deduplicate_words() removes these, keeping the higher-confidence copy.
Regression test with real session data (row 2 with 145 near-dupes)
confirms no duplicates remain after merge + deduplication.
Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression).
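
The near-duplicate rule can be sketched as below. This is not the body of _deduplicate_words(); the dict fields (text, left, top, width, height, conf) are assumed for the example:

```python
from typing import Dict, List, Tuple


def _center(word: Dict) -> Tuple[float, float]:
    """Center point of a word box."""
    return (word["left"] + word["width"] / 2.0,
            word["top"] + word["height"] / 2.0)


def deduplicate_words_sketch(words: List[Dict],
                             max_dx: float = 30.0,
                             max_dy: float = 15.0) -> List[Dict]:
    """Drop words that repeat the text of an already kept word whose
    center is within max_dx / max_dy pixels.

    Words are visited in descending confidence, so the copy that survives
    is always the higher-confidence one.
    """
    kept: List[Dict] = []
    for word in sorted(words, key=lambda w: w["conf"], reverse=True):
        cx, cy = _center(word)
        duplicate = False
        for other in kept:
            ox, oy = _center(other)
            if (other["text"] == word["text"]
                    and abs(ox - cx) < max_dx
                    and abs(oy - cy) < max_dy):
                duplicate = True
                break
        if not duplicate:
            kept.append(word)
    return kept
```

Note that the result comes back ordered by confidence, so a caller that needs reading order would have to re-sort it.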
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The merge algorithm now uses 3 criteria instead of just IoU > 0.3:
1. IoU > 0.15 (relaxed threshold)
2. Center proximity < word height AND same row
3. Text similarity > 0.7 AND same row
This prevents doubled overlapping words when both PaddleOCR and
Tesseract find the same word at similar positions. Unique words
from either engine (e.g. bullets from Tesseract) are still added.
Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity,
_words_match tests + deduplication regression test).
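
A sketch of the three criteria using only the standard library. How "same row" is decided and which similarity measure is used are not stated above; this example assumes a vertical-center test and difflib's ratio, and the function names are illustrative:

```python
import math
from difflib import SequenceMatcher
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def center(box: Box) -> Tuple[float, float]:
    return (box[0] + box[2] / 2.0, box[1] + box[3] / 2.0)


def same_row(a: Box, b: Box) -> bool:
    """Assumption: centers differ by less than half the smaller height."""
    return abs(center(a)[1] - center(b)[1]) < 0.5 * min(a[3], b[3])


def words_match_sketch(a: Dict, b: Dict) -> bool:
    box_a, box_b = a["box"], b["box"]
    if iou(box_a, box_b) > 0.15:                      # criterion 1
        return True
    if not same_row(box_a, box_b):
        return False
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    if math.hypot(ax - bx, ay - by) < min(box_a[3], box_b[3]):
        return True                                   # criterion 2
    similarity = SequenceMatcher(
        None, a["text"].lower(), b["text"].lower()).ratio()
    return similarity > 0.7                           # criterion 3
```

Criterion 3 as written matches similar text anywhere in the same row, so a word that legitimately appears twice on one line would also match under this sketch.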
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs both OCR engines on the preprocessed image and merges results:
word boxes matched by IoU, coordinates averaged by confidence weight.
Unmatched Tesseract words (bullets, symbols) are added for better coverage.
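
Confidence-weighted averaging of a matched pair can be sketched like this. The dict fields are assumed, and taking the text and confidence from the more confident engine is an assumption of the example rather than something stated above:

```python
from typing import Dict


def merge_matched_boxes(a: Dict, b: Dict) -> Dict:
    """Average the coordinates of two matched word boxes, weighting each
    engine's box by its confidence."""
    wa, wb = float(a["conf"]), float(b["conf"])
    if wa + wb == 0:
        wa = wb = 1.0  # fall back to a plain average
    total = wa + wb

    def weighted(key: str) -> int:
        return round((a[key] * wa + b[key] * wb) / total)

    best = a if a["conf"] >= b["conf"] else b
    return {
        "text": best["text"],
        "left": weighted("left"),
        "top": weighted("top"),
        "width": weighted("width"),
        "height": weighted("height"),
        "conf": best["conf"],
    }
```

With confidences 80 and 20 the merged box sits four fifths of the way toward the more confident engine's coordinates.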
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The slide positioning hook was re-matching cell.text tokens against
word_boxes via fuzzy text similarity, which broke positioning for
special characters (!, bullet points, IPA). Now uses word_box
coordinates directly — exact OCR positions without re-interpretation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When PaddleOCR returns "!Betonung" as a single word box, the overlay
positions text starting at the "!" instead of the actual word. Split
such boxes into ["!", "Betonung"] with proportional position splitting,
matching the existing IPA bracket splitting logic.
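
A minimal sketch of the proportional split, assuming a word box is a dict with text, left, top, width, and height. Only a leading run of "!" is handled here, and the function name is hypothetical:

```python
import re
from typing import Dict, List

_LEADING_MARKER = re.compile(r"^(!+)(\S+)$")


def split_leading_marker(word: Dict) -> List[Dict]:
    """Split '!Betonung' into '!' and 'Betonung', dividing the box width
    in proportion to the number of characters in each part."""
    match = _LEADING_MARKER.match(word["text"])
    if match is None:
        return [word]
    marker, rest = match.group(1), match.group(2)
    marker_width = round(word["width"] * len(marker) / len(word["text"]))
    first = dict(word, text=marker, width=marker_width)
    second = dict(word, text=rest,
                  left=word["left"] + marker_width,
                  width=word["width"] - marker_width)
    return [first, second]
```

For a 90 px wide box containing "!Betonung" (9 characters), the marker gets 10 px and the word starts 10 px further right with the remaining 80 px.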
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces custom _paddle_words_to_grid_cells with the proven
build_grid_from_words from cv_words_first.py — same function the
regular pipeline uses with PaddleOCR. Handles phrase splitting,
column clustering, and produces cells with word_boxes that the
slide/cluster positioning hooks expect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
One cell per row with all words as word_boxes instead of one cell per
word. Gives OverlayReconstruction a row-spanning bbox_pct for correct
font sizing and per-word positions for slide/cluster placement.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Uses the cropped/dewarped image instead of the original so the overlay
shows the correctly oriented page. The mode now has 5 steps instead of 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing
7-step pipeline. Backend endpoint runs PaddleOCR on the original image
and clusters words into rows/cells directly. Frontend adds a mode
toggle and PaddleDirectStep component.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace sequential 1:1 token-to-box mapping with fuzzy text matching.
Each token from cell.text finds its best matching word_box by text
similarity (normalized prefix match + substring bonus). Handles:
- Reordered boxes (different sort between text and boxes)
- IPA corrections changing token boundaries
- Token/box count mismatches
Unmatched tokens get interpolated positions from matched neighbors.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer
produces "badge [bˈædʒ]" with space, creating a token count mismatch
between cell.text and word_boxes. Now also split at "[" boundaries
so each IPA bracket gets its own sub-box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns phrase-level bounding boxes (e.g. "competition
[kompa'tifn]" as one box) but the overlay slide mechanism expects
one box per word for accurate positioning. Multi-word boxes are now
split proportionally by character count with small gaps between words.
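
The proportional split can be sketched as follows. The fixed 4 px gap, the dict fields, and the function name are assumptions for illustration:

```python
from typing import Dict, List


def split_phrase_box(box: Dict, gap: int = 4) -> List[Dict]:
    """Split a phrase-level box into one box per word.

    The width left after reserving the gaps is shared out by character
    count; vertical position and height are inherited from the phrase.
    """
    words = box["text"].split()
    if len(words) <= 1:
        return [box]
    usable = box["width"] - gap * (len(words) - 1)
    total_chars = sum(len(w) for w in words)
    result = []
    cursor = box["left"]
    for word in words:
        width = round(usable * len(word) / total_chars)
        result.append(dict(box, text=word, left=cursor, width=width))
        cursor += width + gap
    return result
```

Because widths follow character counts, a long word next to a short one gets a proportionally wider sub-box, which is only an approximation for proportional fonts.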
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
words_first was storing word_boxes in percent coordinates while
cv_cell_grid.py uses absolute pixel coordinates. The overlay slide
mechanism divides by imgW to get percentages, so percent-in-percent
caused positions near zero. Now both grid builders use the same format.
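
The effect of the mismatch is easy to reproduce with made-up numbers (image width and word position are hypothetical):

```python
def to_percent(x: float, img_w: float) -> float:
    """What the overlay does: convert an absolute pixel X to percent."""
    return x / img_w * 100.0


img_w = 2000.0

# Pixel coordinates, as produced by cv_cell_grid.py: correct result.
correct = to_percent(500.0, img_w)

# The same position already stored as percent (25.0) and converted again:
# the value collapses toward zero.
double_converted = to_percent(25.0, img_w)
```

A word a quarter of the way across the page ends up at 1.25% instead of 25%, which is why everything piled up near the left edge.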
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When engine=paddle is selected, the backend overrides grid_method to
words_first and returns plain JSON (no SSE streaming). The frontend
was not aware of this override — it sent stream=true and tried to parse
SSE events from a JSON response, resulting in "Keine Daten" ("No data").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Images larger than 1500px are downscaled before upload. Coordinates
are scaled back to the original size. JPEG instead of PNG for faster upload.
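
The scaling arithmetic, sketched in Python for consistency with the backend modules. It assumes the 1500px limit applies to the longer side; the function names are illustrative:

```python
from typing import Tuple


def downscale_size(width: int, height: int,
                   max_side: int = 1500) -> Tuple[int, int, float]:
    """Return the upload size and the scale factor that was applied.

    Images whose longer side is within max_side are left untouched.
    """
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale), scale


def to_original_coords(box: Tuple[float, float, float, float],
                       scale: float) -> Tuple[int, int, int, int]:
    """Map an (x, y, width, height) box found on the downscaled image
    back to the coordinate system of the original image."""
    x, y, w, h = box
    return (round(x / scale), round(y / scale),
            round(w / scale), round(h / scale))
```

A 3000x2000 image is uploaded at 1500x1000 with scale 0.5, and a box at (100, 50) in the response maps back to (200, 100) on the original.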
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update chunk counts for 8 successfully ingested DE laws (Phase H1)
- Add 6 new BGB-Teile entries (AGB, Fernabsatz, Kaufrecht, Widerruf, Digital)
- Add EGBGB Widerrufsbelehrung entry
- Update COLLECTION_TOTALS: gesetze 58304→63567 (+5263 Phase H chunks)
- Add Verbraucherschutz thematic group to Landkarte
- Extend ecommerce industry map with consumer protection regulations
- Update date to March 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>