The tokenizer regex only matches alphabetic characters, so text
before the first word match (like "(= " in "(= I won...") was
silently dropped when reassembling the corrected text.
Now preserves text[:first_match_start] as a leading prefix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of keeping only specific symbols (_KEEP_SYMBOLS), now only
removes explicitly decorative symbols (_REMOVE_SYMBOLS: > < ~ \ ^ etc).
All other punctuation (= ( ) ; : - etc.) is preserved by default.
This is more robust: any new symbol used in textbooks will be kept
unless it's in the small block-list of known decorative artifacts.
Fixes: (= token still being removed on page 5 despite being in
the allow-list (possibly due to Unicode variants or whitespace).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rule (a2) in Step 5i removed word_boxes with no letters/digits as
"graphic OCR artifacts". This incorrectly removed = signs used as
definition markers in textbooks ("film = 1. Film; 2. filmen").
Added exception list _KEEP_SYMBOLS for meaningful punctuation:
= (= =) ; : - – — / + • · ( ) & * → ← ↔
The root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one
block, which gets split into word_boxes ["film", "=", "1.", ...].
The "=" word_box had no alphanumeric chars and was removed as artifact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
= signs are used as definition markers in textbooks ("film = 1. Film").
They were incorrectly removed by two filters:
1. grid_build_core.py Step 5j-pre: _PURE_JUNK_RE matched "=" as
artifact noise. Now exempts =, (=, ;, :, - and similar meaningful
punctuation tokens.
2. cv_ocr_engines.py _is_noise_tail_token: "pure non-alpha" check
removed trailing = tokens. Now exempts meaningful punctuation.
Fixes: "film = 1. Film; 2. filmen" losing the = sign,
"(= I won and he lost.)" losing the (=.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words like "probieren)" or "Englisch)" were incorrectly flagged as
gutter OCR errors because the closing parenthesis wasn't stripped
before dictionary lookup. The spellchecker then suggested "probierend"
(replacing ) with d, edit distance 1).
Two fixes:
1. Strip trailing/leading parentheses in _try_spell_fix before checking
if the bare word is valid — skip correction if it is
2. Add )( to the rstrip characters in the analysis phase so
"probieren)" becomes "probieren" for the known-word check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Text like "Betonung auf der 1. Silbe: profit ['profit]" was
incorrectly detected as garbled IPA and replaced with generated
IPA transcription of the previous row's example sentence.
Added guard: if the cell text contains >=3 recognizable words
(3+ letter alpha tokens), it's normal text, not garbled IPA.
Garbled IPA is typically short and has no real dictionary words.
Fixes: Row 13 C3 showing IPA instead of pronunciation hint text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend (unified_grid.py):
- build_unified_grid(): merges content + box zones into one zone
- Dominant row height from median of content row spacings
- Full-width boxes: rows integrated directly
- Partial-width boxes: extra rows inserted when box has more text
lines than standard rows fit (e.g., 7 lines in 5-row height)
- Box-origin cells tagged with source_zone_type + box_region metadata
Backend (grid_editor_api.py):
- POST /sessions/{id}/build-unified-grid → persists as unified_grid_result
- GET /sessions/{id}/unified-grid → retrieve persisted result
Frontend:
- GridEditorCell: added source_zone_type, box_region fields
- GridTable: box-origin cells get tinted background + left border
- StepAnsicht: split-view with original image (left) + editable
unified GridTable (right). Auto-builds on first load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Boxes whose vertical center falls within top/bottom 7% of image
height are filtered out (page numbers, unit headers, running footers).
At typical scan resolutions, 7% ≈ 2.5cm margin.
Fixes: "Box 1" containing just "3" from "Unit 3" page header being
incorrectly treated as an embedded box.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GridTable calculates column widths from col.x_max_px - col.x_min_px.
Flowing and header_only layouts were missing these fields, producing
NaN widths which collapsed the CSS grid layout and showed empty rows
with only row numbers visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Colspan: use original word-block text instead of split cell texts.
Prevents "euros a nd cents" from split_cross_column_words.
Box rows: add is_header field (was undefined, causing GridTable
rendering issues). Add y_min_px/y_max_px to header_only rows.
These missing fields caused empty rows with only row numbers visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_split_cross_column_words was destroying the colspan information by
cutting word-blocks at column boundaries BEFORE _detect_colspan_cells
could analyze them. Now passes original (pre-split) words to colspan
detection while using split words for cell building.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New _detect_colspan_cells() in grid_editor_helpers.py:
- Runs after _build_cells() for every zone (content + box)
- Detects word-blocks that extend across column boundaries
- Merges affected cells into spanning_header with colspan=N
- Uses column midpoints to determine which columns are covered
- Works for full-page scans and box zones equally
Also fixes box flowing/bullet_list row height fields (y_min_px/y_max_px).
Removed duplicate spanning logic from cv_box_layout.py — now uses
the generic _detect_colspan_cells from grid_editor_helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Box 3 empty rows: flowing/bullet_list rows were missing y_min_px/
y_max_px fields that GridTable uses for row height calculation.
Added _px and _pct variants.
Box 2 spanning cells: rows with fewer word-blocks than columns
(e.g., "In Britain..." spanning 2 columns) are now detected and
merged into spanning_header cells. GridTable already renders
spanning_header cells across the full row width.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PaddleOCR returns multi-word blocks (whole phrases), so ALL inter-word
gaps in small zones (boxes, ≤60 words) are column boundaries. Previous
3x-median approach produced thresholds too high to detect real columns.
New approach for small zones: gap_threshold = max(median_h * 1.0, 25).
This correctly detects 4 columns in "Pounds and euros" box where gaps
range from 50-297px and word height is ~31px.
Also includes SmartSpellChecker fixes from previous commits:
- Frequency-based scoring, IPA protection, slash→l, rare-word threshold
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major improvements:
- Frequency-based boundary repair: always tries repair, uses word
frequency product to decide (Pound sand→Pounds and: 2000x better)
- IPA bracket protection: words inside [brackets] are never modified,
even when brackets land in tokenizer separators
- Slash→l substitution: "p/" → "pl" for italic l misread as slash
- Abbreviation guard uses rare-word threshold (freq < 1e-6) instead
of binary known/unknown — prevents "Can I" → "Ca nI" while still
fixing "ats th." → "at sth."
- Tokenizer includes / character for slash-word detection
43 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, boundary repair was skipped when both words were valid
dictionary words (e.g., "Pound sand", "wit hit", "done euro").
Now uses word-frequency scoring (product of bigram frequencies) to
decide if the repair produces a more common word pair.
Threshold: repair accepted when new pair is >5x more frequent, or
when repair produces a known abbreviation.
New fixes: Pound sand→Pounds and (2000x), wit hit→with it (100000x),
done euro→one euro (7x).
43 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In small zones (boxes), intra-phrase gaps inflate the median gap,
causing gap_threshold to become too large to detect real column
boundaries. Cap at 25% of zone width to prevent this.
Example: Box "Pounds and euros" has 4 columns at x≈148,534,751,1137
but gap_threshold was 531 (larger than the column gaps themselves).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Source boxes from structure_result (Step 7) instead of grid zones
- Use raw_paddle_words (top/left/width/height) instead of grid cells
- Create new box zones from all detected boxes (not just existing zones)
- Sort zones by y-position for correct reading order
- Include box background color metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New pipeline step between Gutter Repair and Ground Truth that processes
embedded boxes (grammar tips, exercises) independently from the main grid.
Backend:
- cv_box_layout.py: classify_box_layout() detects flowing/columnar/
bullet_list/header_only layout types per box
- build_box_zone_grid(): layout-aware grid building (single-column for
flowing text, independent columns for tabular content)
- POST /sessions/{id}/build-box-grids endpoint with SmartSpellChecker
- Layout type overridable per box via request body
Frontend:
- StepBoxGridReview.tsx: shows each box with cropped image + editable
GridTable. Layout type dropdown per box. Auto-builds on first load.
- Auto-skip when no boxes detected on page
- Pipeline steps updated: 13 steps (0-12), Ground Truth moved to 12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New features:
- Boundary repair: "ats th." → "at sth." (shifted OCR word boundaries)
Tries shifting 1-2 chars between adjacent words, accepts if result
includes a known abbreviation or produces better dictionary matches
- Context split: "anew book" → "a new book" (ambiguous word merges)
Explicit allow/deny list for article+word patterns (alive, alone, etc.)
- Abbreviation awareness: 120+ known abbreviations (sth, sb, adj, etc.)
are now recognized as valid words, preventing false corrections
- Quality gate: boundary repairs only accepted when result scores
higher than original (known words + abbreviations)
40 tests passing, all edge cases covered.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SmartSpellChecker now runs during grid build (not just LLM review),
so corrections are visible immediately in the grid editor.
Language detection per column:
- EN column detected via IPA signals (existing logic)
- All other columns assumed German for vocab tables
- Auto-detection for single/two-column layouts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When ipa_mode=none, the entire IPA processing block was skipped,
including the bracket-stripping logic. Now strips ALL square brackets
from content columns BEFORE the skip, so IPA:Aus actually removes
all IPA from the display.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR text contains ASCII IPA approximations like [kompa'tifn] instead
of Unicode [kˈɒmpətɪʃən]. The strip regex required Unicode IPA chars
inside brackets and missed the ASCII ones. Now strips all [bracket]
content from excluded columns since square brackets in vocab columns
are always IPA.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
English IPA from the original OCR scan (e.g. [ˈgrænˌdæd]) was always
shown because fix_cell_phonetics only ADDS/CORRECTS but never removes.
Now strips IPA brackets containing Unicode IPA chars from the EN column
when ipa_mode is "de" or "none".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Strip IPA brackets [ipa] before attempting word split, so
"makeadecision[dɪsˈɪʒən]" is processed as "makeadecision"
2. Handle contractions: "solet's" → split "solet" → "so let" + "'s"
3. DP tiebreaker: prefer longer first word when scores are equal
("task is" over "ta skis")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"taskis" was split as "ta skis" instead of "task is" because both
have the same DP score. Changed comparison from > to >= so that
later candidates (with longer first words) win ties.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short merged words like "anew" (a new), "Imadea" (I made a),
"makeadecision" (make a decision) were missed because the split
threshold was too high. Now processes tokens >= 4 chars.
English single-letter words (a, I) are already handled by the DP
algorithm which allows them as valid split points.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Footer rows that are page numbers (digits or written-out like
"two hundred and nine") are now removed from the grid entirely
and promoted to the page_number metadata field. Non-page-number
footer content stays as a visible footer row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"two hundred and nine" (22 chars) was kept as a content row because
the footer detection only accepted text ≤20 chars. Now recognizes
written-out number words (English + German) as page numbers regardless
of length.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 5g was extracting page refs (p.55, p.70) as zone metadata and
removing them from the cell table. Users want to see them as a
separate column. Now keeps cells in place while still extracting
metadata for the frontend header display.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rows containing only a page reference (p.55, S.12) were removed as
"oversized stubs" (Rule 2) when their word-box height exceeded the
median. Now skips Rule 2 if any word matches the page-ref pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-cell rows were incorrectly detected as headings when they were
actually continuation lines. Two new guards:
1. Text starting with "(" is a continuation (e.g. "(usw.)", "(TV-Serie)")
2. Single cells beyond the first two content columns are overflow lines,
not headings. Real headings appear in the first columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_insert_missing_ipa stripped "1" from "Theme 1" because it treated
the digit as garbled OCR phonetics. Now treats pure digits/numbering
patterns (1, 2., 3)) as delimiters that stop the garble-stripping.
Also fixes _has_non_dict_trailing which incorrectly flagged "Theme 1"
as having non-dictionary trailing text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Columns with zero cells (e.g. from tertiary detection where the word
was assigned to a neighboring column by overlap) are stripped from the
final result. Remaining columns and cells are re-indexed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Step 5j-pre wrongly classified "p.43", "p.50" etc as artifacts
(mixed digits+letters, <=5 chars). Added exception for page
reference patterns (p.XX, S.XX).
2. IPA spacing regex was too narrow (only matched Unicode IPA chars).
Now matches any [bracket] content >=2 chars directly after a letter,
fixing German IPA like "Opa[oːpa]" → "Opa [oːpa]".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Ensure space before IPA brackets in cell text: "word[ipa]" → "word [ipa]"
Applied as final cleanup in grid-build finalization.
2. Add debug logging for zone-word assignment to diagnose why marker
column cells are empty despite correct column detection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words to the left of the first detected column boundary must always
form their own column, regardless of how few rows they appear in.
Previously required 4+ distinct rows for tertiary (margin) columns,
which missed page references like p.62, p.63, p.64 (only 3 rows).
Now any cluster at the left/right margin with a clear gap to the
nearest significant column qualifies as its own column.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Left-side book fold shadows have a V-shape: brightness dips from the
edge toward a peak at ~5-10% of width, then rises again. The previous
algorithm scanned from the edge inward and immediately found a low
dark fraction (0.13 at x=0), missing the gutter entirely.
Now finds the PEAK of the dark fraction profile first, then scans from
that peak toward the page center to find the transition point. Works
for both V-shaped left gutters and edge-darkening right gutters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The frontend expects session_id in the upload response, but multi-page
PDFs returned only document_group_id + pages[]. Now includes session_id
pointing to the first page for backwards compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spell review only runs on vocab entries, but the OCR pipeline's
grid-editor cells also contain merged words (e.g. "atmyschool").
Now splits merged words directly in the grid-build finalization step,
right before returning the result. Uses the same _try_split_merged_word()
dictionary-based DP algorithm.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When uploading a PDF with > 1 page to the OCR pipeline, each page
now gets its own session (grouped by document_group_id). Previously
only page 1 was processed. The response includes a pages array with
all session IDs so the frontend can navigate between them.
Single-page PDFs and images continue to work as before.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"Comeon" was split as "Com eon" instead of "Come on" because both
are 2-word splits. Now uses sum-of-squared-lengths as tiebreaker:
"come"(16) + "on"(4) = 20 > "com"(9) + "eon"(9) = 18.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR often merges adjacent words when spacing is tight, e.g.
"atmyschool" → "at my school", "goodidea" → "good idea".
New _try_split_merged_word() uses dynamic programming to find the
shortest sequence of dictionary words covering the token. Integrated
as step 5 in _spell_fix_token() after general spell correction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scanner shadow detection (range > 40, darkest < 180) fails on camera
book scans where the gutter shadow is subtle (range ~25, darkest ~214).
New _detect_gutter_continuity() detects gutters by their unique property:
the shadow runs continuously from top to bottom without interruption.
Divides the image into horizontal strips and checks what fraction of
strips are darker than the page median at each column. A gutter column
has >= 75% of strips darker. The transition point where the smoothed
dark fraction drops below 50% marks the crop boundary.
Integrated as fallback between scanner shadow and binary projection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>