Prevents the first content row from expanding into the header area (which
caused "ulary" from "VOCABULARY" to appear in the DE column) and the last
content row from expanding into the footer area (which caused page numbers
to appear as content).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old per-cell streaming timed out because sequential cell OCR was
too slow to send the first event before proxy timeout. Now uses
build_cell_grid_v2 (parallel ThreadPoolExecutor) via run_in_executor,
then streams all cells at once after batch completes.
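A minimal sketch of the batch-then-stream pattern described above. The names `ocr_cell`, `build_grid_parallel`, and `stream_cells` are hypothetical stand-ins, not the project's functions; only `ThreadPoolExecutor` and `run_in_executor` come from the text.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def ocr_cell(cell_id: int) -> dict:
    # Hypothetical stand-in for a blocking per-cell OCR call.
    return {"cell": cell_id, "text": f"text-{cell_id}"}


def build_grid_parallel(cell_ids: list) -> list:
    # Blocking batch: OCR all cells in a thread pool; map() keeps input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(ocr_cell, cell_ids))


async def stream_cells(cell_ids: list):
    loop = asyncio.get_running_loop()
    # Run the blocking batch off the event loop so the loop stays responsive.
    cells = await loop.run_in_executor(None, build_grid_parallel, cell_ids)
    # Only after the whole batch is done are the cells emitted one by one.
    for cell in cells:
        yield cell


async def collect(cell_ids: list) -> list:
    return [cell async for cell in stream_cells(cell_ids)]
```

In this sketch no event is sent until the whole batch has finished, so it only avoids a proxy timeout if the parallel batch itself completes in time.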
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.
Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The HSV-based coloured marker detection caused false positives in
nearly every marker cell. Coloured markers like red "!" are an
extreme edge case — better handled manually in reconstruction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for marker column content (e.g. red "!" marks):
1. Skip _clean_cell_text() noise filter for column_marker — it
requires 2+ consecutive letters, which drops punctuation-only
markers like "!" or "*".
2. For marker columns, detect coloured pixels via HSV saturation
check (S>80) in addition to grayscale darkness. Create a
binarized image where both dark AND saturated pixels become
black foreground, so Tesseract can see red markers that appear
near-white in standard grayscale conversion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a narrow column expands into neighbor space, the neighbor's
boundaries must be adjusted to avoid overlap. After expansion, left
neighbor's right edge and right neighbor's left edge are trimmed to
match the expanded column's new boundaries, with words re-assigned.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-column splits create adjacent columns with 0px gap between them.
The previous expansion only worked with explicit gaps. Now it looks at
where the neighbor's actual words are and claims unused space up to
MIN_WORD_MARGIN (4px) from the nearest word, even if there's no gap
in the column boundaries.
Also added debug logging for expansion input.
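A rough sketch of the word-aware expansion for the right-hand side only. The function name and parameters are hypothetical; only MIN_WORD_MARGIN and its 4px value come from the text above.

```python
MIN_WORD_MARGIN = 4  # px, value quoted in the commit message


def expand_right_edge(col_right: int, neighbor_word_lefts: list, max_right: int) -> int:
    """Return a new right edge for a narrow column growing to the right.

    col_right:           current right edge of the narrow column
    neighbor_word_lefts: left edges of the words in the right neighbor
    max_right:           hard limit (assumed here: the neighbor's right edge)
    """
    if not neighbor_word_lefts:
        return max_right
    # Stop MIN_WORD_MARGIN pixels before the neighbor's nearest word.
    limit = min(neighbor_word_lefts) - MIN_WORD_MARGIN
    # Never shrink the column, never go past the hard limit.
    return max(col_right, min(limit, max_right))
```

With a 0px gap the boundary alone offers nothing to claim, so the sketch looks only at word positions, as the message describes.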
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After deskew, horizontal text lines are already straight (~0° slope).
Method D was measuring this (always ~0°) instead of the actual vertical
shear (column edge drift). This caused it to report 0.112° with 0.96
confidence, overwhelming Method A's correct detection of negative shear.
New Method D groups words by X-position into vertical columns, then
measures how left-edge X drifts with Y position via linear regression.
dx/dy = tan(shear_angle), directly measuring column tilt.
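The regression step can be illustrated as follows. This is a sketch for a single column using plain least squares; the function name and the input layout are assumptions, and the real Method D may weight or combine columns differently.

```python
import math


def shear_angle_deg(points: list) -> float:
    """Estimate shear from (y, left_x) pairs of word left edges in one column."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_y = sum(y for y, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in points)
    if var_y == 0:
        return 0.0
    # Least-squares slope of x as a function of y, i.e. dx/dy.
    slope = sum((y - mean_y) * (x - mean_x) for y, x in points) / var_y
    # dx/dy = tan(shear_angle)  =>  shear_angle = atan(dx/dy)
    return math.degrees(math.atan(slope))
```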
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The narrow column expansion was running inside detect_column_geometry()
on the 4 main columns, but the narrowest columns (marker ~14px,
page_ref ~93px) are created AFTERWARDS by _detect_sub_columns().
Extracted expand_narrow_columns() as standalone function and call it
after sub-column splitting in the columns API endpoint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for edge case where residual shear pushes content out of
narrow columns (marker, page_ref):
1. Column expansion (Step 10): After detection, narrow columns (<10%
content width) expand into adjacent whitespace gaps, claiming up to
40% of the gap but never past the nearest word in the neighbor
column. This gives marker/page_ref columns breathing room.
2. Dewarp sensitivity: Lower minimum angle from 0.15° to 0.08°, lower
ensemble min confidence from 0.5 to 0.35, lower final threshold
from 0.5 to 0.4, and skip quality gate for small corrections
(<0.5°) where projection variance change is negligible.
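The column expansion rule from item 1 can be sketched like this for the right-hand side. The helper names are hypothetical; the 10% and 40% figures are taken from the text.

```python
NARROW_SHARE = 0.10       # a column is "narrow" below 10% of content width
GAP_CLAIM_PERCENT = 40    # a narrow column may claim up to 40% of a gap


def is_narrow(col_width: int, content_width: int) -> bool:
    return col_width < NARROW_SHARE * content_width


def expanded_right_edge(col_right: int, gap_width: int, nearest_word_left: int) -> int:
    """New right edge after claiming part of the whitespace gap to the right."""
    claimed = gap_width * GAP_CLAIM_PERCENT // 100
    # Cap at the nearest word of the neighbor column; never shrink.
    return max(col_right, min(col_right + claimed, nearest_word_left))
```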
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The detections array was empty when shear was below threshold, hiding
all 4 method results from the frontend Details panel.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DewarpDetection type with per-method results
- Expand method labels for all 4 detectors (A-D)
- Show green/amber banner: applied vs quality-gate-rejected
- Expandable "Details" panel showing all 4 methods with confidence bars
- Visual confidence bars instead of plain percentage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace setBackgroundImage() with backgroundImage property (v6 breaking change)
- Replace setWidth/setHeight with Canvas constructor options
- Fix opacity handler to use direct property access
- Update CLAUDE.md: use git -C and docker compose -f instead of cd
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Rename Step 6 label to "Korrektur" (was "OCR-Zeichenkorrektur")
2. Move _fix_character_confusion from pipeline Step 1 into
llm_review_entries_streaming so corrections are visible in the UI:
char changes (| → I, 1 → I, 8 → B) are now emitted as a batch event
right after the meta event, appearing in the corrections list
3. StepReconstruction: all cells (including empty) are now rendered as
editable inputs — removed filter that hid empty cells from the editor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
content_roi was cropped to [left_x:right_x] — the detected content boundary.
Words at the right edge of the last column (beyond right_x) were never
found in the initial scan, so they remained missing even after the column
geometry was extended to full image width (w).
Fix: crop to [left_x:w] so all words including those near the right margin
are detected and assigned correctly to the last column.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_CHAR_CONFUSION_RULES: standalone "1" → "I" now skips "1." and "1,"
Cross-language fallback rule: same lookahead (?![\d.,]) added
Fixes: "cross = 1. Kreuz" being converted to "cross = I. Kreuz" in Step 1
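A small sketch of the lookahead. Only the `(?![\d.,])` part is quoted from the message; the surrounding `\b1\b` pattern and the function name are assumptions about how a "standalone 1" might be matched.

```python
import re

# A lone "1" that is NOT followed by a digit, "." or "," is treated as "I".
STANDALONE_ONE = re.compile(r"\b1\b(?![\d.,])")


def fix_standalone_one(text: str) -> str:
    return STANDALONE_ONE.sub("I", text)
```

With this pattern, list prefixes such as "1." and decimals such as "1,5" are left alone, while "1 want" becomes "I want".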
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
right_x is the detected content boundary, which can still be several
pixels short of actual text near the page margin. Since the page margin
contains only white space, extending the last column's OCR crop to the
full image width (w) is always safe and prevents right-edge text cutoff.
Affects three locations in detect_column_geometry():
- Word count logging loop
- ColumnGeometry boundary building (Step 8)
- Phantom filter boundary adjustment (Step 9)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phantom column fix:
Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns
(< 3% of content width) with 0 words. These are scan artefacts, not
real columns. New Step 9 in detect_column_geometry():
- Filter columns where width < max(20px, 3% content_w) AND words < 3
- After filtering, extend each remaining column to close the gap with
its right neighbor, and re-assign words to correct column
Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px
eliminated; neighbors expanded to cover the gap)
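The phantom filter can be sketched as below. The `Col` dataclass is a simplified, hypothetical stand-in for ColumnGeometry, and word re-assignment is left out.

```python
from dataclasses import dataclass


@dataclass
class Col:
    x: int
    width: int
    word_count: int


def drop_phantom_columns(cols: list, content_w: int) -> list:
    """Remove very narrow, nearly empty columns and close the gaps they leave."""
    min_width = max(20, 0.03 * content_w)
    kept = [c for c in cols if not (c.width < min_width and c.word_count < 3)]
    # Extend each remaining column up to the start of its right neighbor.
    for left, right in zip(kept, kept[1:]):
        left.width = right.x - left.x
    return kept
```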
UI rename:
- 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur'
- 'LLM-Korrektur starten' → 'Zeichenkorrektur starten'
- Error message updated accordingly
(No LLM involved anymore — spell-checker is the active engine)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new functions:
- _is_artifact_row(): marks rows as artifacts if all detected tokens
are single characters (scanner shadows produce dots/dashes, not words).
A real vocabulary row always contains at least one 2+ char word.
- _heal_row_gaps(): after removing empty/artifact rows, expands each
remaining content row to the midpoint of adjacent gaps, so OCR crops
are not artificially narrow. First row extends to content top_bound;
last row to content bottom_bound.
Applied in both build_cell_grid() and build_cell_grid_streaming() after
the word_count>0 filter and before OCR.
Addresses cases like:
- Row 21: scan shadow → single-char artifacts → filtered before OCR
- Row 23: completely empty (word_count=0) → already filtered
- Row 22: real content → now expanded upward/downward to fill the space
that rows 21 and 23 occupied, giving OCR the correct full height
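A simplified sketch of both helpers, assuming rows are plain (top, bottom) pixel pairs and tokens are plain strings; the real functions operate on the pipeline's own row objects.

```python
def is_artifact_row(tokens: list) -> bool:
    """True if the row has tokens and every one is a single character."""
    return bool(tokens) and all(len(t.strip()) <= 1 for t in tokens)


def heal_row_gaps(rows: list, top_bound: int, bottom_bound: int) -> list:
    """Expand each (top, bottom) row to the midpoint of the adjacent gaps.

    rows must be sorted top to bottom. The first row is extended to
    top_bound and the last row to bottom_bound.
    """
    healed = []
    for i, (top, bottom) in enumerate(rows):
        new_top = top_bound if i == 0 else (rows[i - 1][1] + top) // 2
        if i == len(rows) - 1:
            new_bottom = bottom_bound
        else:
            new_bottom = (bottom + rows[i + 1][0]) // 2
        healed.append((new_top, new_bottom))
    return healed
```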
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously detect_column_geometry() ended the last column at the start
of the detected right-margin gap (left_x + right_boundary), which could
cut into actual text near the right edge of the Example column.
Since only the page margin lies to the right of the last column, the
rightmost column now always extends to right_x regardless of whether
a right-margin gap was detected. This prevents OCR crops from missing
words at the right edge of wide columns like column_example.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup
- New spell_review_entries_sync() + spell_review_entries_streaming():
- Dictionary-backed substitution: checks if corrected word is known
- Structural rule: digit at pos 0 + lowercase rest → most likely letter
(e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld")
- Pattern rule: "|." → "1." for numbered list prefixes
- Standalone "|" → "I" (capital I)
- IPA entries still protected via existing _entry_needs_review filter
- Headings/untranslated words (e.g. "Story") are untouched (no susp. chars)
- llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE
env var ("spell" default, "llm" to restore previous behaviour)
- docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell}
- LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env)
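The structural rule with dictionary backing might look roughly like this. The tiny `KNOWN_WORDS` set is a stand-in for the pyspellchecker dictionaries, and the digit map contains only the pairs that appear in the examples above; both are assumptions for illustration.

```python
# Digit -> letter pairs taken from the examples in the commit message.
DIGIT_TO_LETTER = {"8": "B", "5": "S", "6": "G"}

# Stand-in for a real EN+DE dictionary lookup.
KNOWN_WORDS = {"ben", "buch", "see", "geld"}


def fix_leading_digit(word: str) -> str:
    """Replace a leading digit with its likely letter if the result is a known word."""
    if len(word) < 2 or word[0] not in DIGIT_TO_LETTER:
        return word
    rest = word[1:]
    # Structural rule: digit at position 0 followed by lowercase letters only.
    if not (rest.isalpha() and rest.islower()):
        return word
    candidate = DIGIT_TO_LETTER[word[0]] + rest
    # Dictionary-backed: accept the substitution only for known words.
    return candidate if candidate.lower() in KNOWN_WORDS else word
```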
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend _OCR_CHAR_MAP to treat '|' as a possible misread of digit '1'
in addition to letters l/L/i/I. Fixes cases like 'cross = |. Kreuz'
→ 'cross = 1. Kreuz' (numbered list prefix) being rejected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The UI uses llm_review_entries_streaming, not llm_review_entries.
The streaming version had no think:false, so qwen3:0.6b spent
9 seconds thinking and left no token budget for the actual answer.
- Added think: false to the streaming version
- num_predict: 4096 → 8192 (consistent with the non-streaming version)
- Logging for batch progress, response length, and parsed entries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The log showed: "Invalid control character at: line 28 column 27"
The pipe character | in OCR text (e.g. "| want" instead of "I want")
breaks the JSON parser when it appears as a literal in the LLM response.
Fixes:
- _sanitize_for_json(): removes ASCII control chars 0x00-0x1f
(except tab/LF/CR, which JSON allows as whitespace)
- | → I as a permitted OCR correction in _is_spurious_change and the prompt
- Reverse check in _is_spurious_change (l→I etc.)
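A minimal sketch of the control-character stripping, assuming a regex-based approach; the actual _sanitize_for_json() may be implemented differently.

```python
import json
import re

# ASCII control characters 0x00-0x1f, minus tab (0x09), LF (0x0a), CR (0x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_for_json(text: str) -> str:
    """Drop control characters that would make json.loads() fail."""
    return _CONTROL_CHARS.sub("", text)
```

Tab, LF, and CR are kept because they are legal whitespace between JSON tokens; inside a string value they would still have to be escaped.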
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The digit-in-word pre-filter blocked all 41 entries (skipped=41
in the log). OCR errors cannot be detected in advance.
Back to the original approach: all non-empty entries without
IPA brackets are sent to the LLM. Protection against translations
relies solely on the strict prompt and _is_spurious_change().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- think: false in the Ollama API request (disables qwen3's CoT natively)
- <think>...</think> stripping in _parse_llm_json_array (fallback in case
think:false does not take effect)
- INFO logging: number of entries sent, response length,
number of parsed entries
- DEBUG logging: first 3 input entries, first 500 characters of the response
- Better error message when JSON parsing fails
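The stripping fallback could be sketched as follows. This is an illustrative version, not the project's _parse_llm_json_array; the bracket search and the error wording are assumptions.

```python
import json
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_llm_json_array(raw: str) -> list:
    """Strip <think> blocks, then parse the first JSON array in the response."""
    cleaned = _THINK_BLOCK.sub("", raw)
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end <= start:
        # Include a short preview so the log shows what the model returned.
        raise ValueError(f"no JSON array in response: {cleaned[:80]!r}")
    return json.loads(cleaned[start:end + 1])
```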
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
qwen3:0.6b interpreted the prompt too broadly and tried to:
- Translate English words (rewriting the EN column)
- Re-translate correct German words
- 'Correct' IPA entries in brackets
## Fixes
### 1. Stricter pre-filter (entry_needs_review)
Now sends ONLY entries to the LLM that actually contain a
digit-in-word pattern (0158 between letters).
→ Correct entries are not sent at all.
### 2. Much more restrictive prompt
- Explicit prohibition: "du übersetzt NICHTS, weder EN→DE noch DE→EN"
("you translate NOTHING, neither EN→DE nor DE→EN")
- Only the 5 digit→letter cases are permitted
- Concrete examples of permitted corrections
- No vague "Im Zweifel nicht ändern" ("when in doubt, do not change");
an explicit VERBOTEN (forbidden) instead
### 3. Stronger spurious-change filter
Discards LLM changes that are not a digit→letter substitution.
Prevents translations and rephrasings even when the prompt
does not fully suppress them.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The try/except block for the deskew step had 4 extra spaces of
indentation from a previous edit. Python rejected the file with
IndentationError at startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Word-to-column assignment now uses overlap-based matching instead of
center-point matching. This fixes narrow page_ref columns losing
their last digit (e.g. "p.59" → "p.5") when the digit's center
falls slightly past the midpoint boundary into the next column.
2. Post-OCR empty row filter: rows where ALL cells have empty text
are removed after OCR. This catches inter-row gaps that had stray
Tesseract artifacts giving word_count > 0 but no actual content.
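A sketch of overlap-based assignment from item 1, assuming columns are (x, width) pairs in the same coordinate system as the word box. The function name is hypothetical, and the exact tie-breaking and normalisation in the real code are not known from the message.

```python
def assign_column(word_left: int, word_width: int, columns: list):
    """Return the index of the column with the largest horizontal overlap.

    columns is a list of (x, width) pairs. Returns None if the word does
    not overlap any column.
    """
    word_right = word_left + word_width
    best_idx, best_overlap = None, 0
    for idx, (x, width) in enumerate(columns):
        overlap = min(word_right, x + width) - max(word_left, x)
        if overlap > best_overlap:
            best_idx, best_overlap = idx, overlap
    return best_idx
```

A box at x=92..112 has its centre at 102, outside a column spanning 0..100, yet 8px of it still overlap that column, so the sketch keeps it there.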
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add marker/bbox_marker fields to WordEntry type
- Add page_ref/column_marker colors to StepReconstruction
- Make StepLlmReview table dynamic based on columns_used metadata,
showing all detected columns (EN, DE, Example, page_ref, marker)
instead of hardcoded EN/DE/Beispiel only
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Decouple display bbox from OCR crop region. Display bbox now uses exact
col.x/row.y/col.width/row.height (no padding), so adjacent cells touch
without gaps. OCR crop keeps 4px internal padding for edge character
detection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic
table driven by columns_used from backend. Labeling, confirmation, counts,
and summary badges are now all cell-based instead of branching on isVocab.
Backend: Change _cells_to_vocab_entries() entry filter from checking only
english/german/example to checking ANY mapped field. This preserves rows
with only marker or source_page content, fixing the issue where marker
sub-columns disappeared at the end of OCR processing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes for sub-columns disappearing at end of streaming:
1. Backend: add column_marker mapping in _cells_to_vocab_entries()
so marker text is included in vocab entries (not silently dropped)
2. Frontend types: add source_page and bbox_ref to WordEntry interface
3. Frontend table: show page_ref column (Seite) in vocab table when
entries have source_page data, instead of only EN/DE/Example
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add is_sub_column flag to ColumnGeometry. Sub-columns created by
_detect_sub_columns() are now exempt from the edge-column word_count<8
rule that converts them to column_ignore.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The streaming word endpoint excluded page_ref from _skip_types,
causing sub-column splits to be lost in the meta event and final
grid_shape. Aligned _skip_types with build_cell_grid_streaming().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Header/footer words (page numbers, chapter titles) could pollute the
left-edge alignment bins and trigger false sub-column splits. Now
_detect_header_footer_gaps() runs early and its boundaries are passed
to _detect_sub_columns() to filter those words from clustering and
the split threshold check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Word 'left' values in ColumnGeometry.words are relative to the content
ROI (left_x), but geo.x is in absolute image coordinates. The split
position was computed from relative word positions and then compared
against absolute geo.x, resulting in negative widths and no splits on
real data. Pass left_x through to _detect_sub_columns to bridge the
two coordinate systems.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace gap-based splitting with alignment-bin approach: cluster word
left-edges within 8px tolerance, find the leftmost bin with >= 10% of
words as the true column start, split off any words to its left as a
sub-column. This correctly handles both page references ("p.59") and
misread exclamation marks ("!" → "I") even when the pixel gap is small.
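The alignment-bin idea can be sketched as below. The 8px tolerance and the 10% share come from the text; the function name, the greedy binning, and the return convention are assumptions.

```python
def find_column_start(lefts: list, tolerance: int = 8, min_share: float = 0.10):
    """Return the x of the true column start, or None if no split is needed.

    lefts holds the left-edge x positions of all words in the column.
    """
    if not lefts:
        return None
    # Greedy binning: a sorted edge joins the current bin if it lies within
    # `tolerance` px of that bin's first (leftmost) edge.
    bins = []
    for x in sorted(lefts):
        if bins and x - bins[-1][0] <= tolerance:
            bins[-1].append(x)
        else:
            bins.append([x])
    threshold = min_share * len(lefts)
    for i, edges in enumerate(bins):
        if len(edges) >= threshold:
            # Leftmost sufficiently populated bin = true column start.
            # If it is the very first bin, nothing lies to its left.
            return edges[0] if i > 0 else None
    return None
```

Words whose left edge lies to the left of the returned x would form the sub-column.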
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Detects hidden sub-columns (e.g. page references like "p.59") within
already-recognized columns by clustering word left-edge positions and
splitting when a clear minority cluster exists. The sub-column is then
classified as page_ref and mapped to VocabRow.source_page.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>