Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge,
KombiCompareStep component for parallel execution and side-by-side overlay,
and wordResultOverride prop on OverlayReconstruction for direct data injection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge
so words at the same position are deduplicated even when OCR text differs.
Frontend: Add yPct/hPct to WordPosition so each word renders at its actual
vertical position instead of all words collapsing to the cell center Y.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns entire phrases as single boxes (e.g. "More than 200
singers took part in the"). The merge algorithm compared word-by-word
but Paddle had multi-word boxes vs Tesseract's individual words, so
nothing matched and all Tesseract words were added as "extras" causing
duplicates. Now splits Paddle boxes into individual words before merge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces position-based word matching with row-based sequence alignment
to fix doubled words and cross-line averaging in Kombi-Modus.
New algorithm:
1. Group words into rows by Y-position clustering
2. Match rows between engines by vertical center proximity
3. Within each row: walk both sequences left-to-right, deduplicating
4. Unmatched rows kept as-is
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Even after multi-criteria matching, near-duplicate words can slip through
(same text, centers within 30px horizontal / 15px vertical). The new
_deduplicate_words() removes these, keeping the higher-confidence copy.
Regression test with real session data (row 2 with 145 near-dupes)
confirms no duplicates remain after merge + deduplication.
Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The merge algorithm now uses 3 criteria instead of just IoU > 0.3:
1. IoU > 0.15 (relaxed threshold)
2. Center proximity < word height AND same row
3. Text similarity > 0.7 AND same row
This prevents doubled overlapping words when both PaddleOCR and
Tesseract find the same word at similar positions. Unique words
from either engine (e.g. bullets from Tesseract) are still added.
Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity,
_words_match tests + deduplication regression test).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>