Each zone becomes its own Excel sheet tab with independent column widths:
- Sheet "Vokabeln": main content zone with EN/DE/example columns
- Sheet "Pounds and euros": Box 1 with its own 4-column layout
- Sheet "German leihen": Box 2 with single column for flowing text
This solves the column-width conflict: boxes have different column
widths optimized for their content, which is impossible in a single
unified sheet (Excel limitation: column width is per-column, not per-cell).
Sheet tabs visible at bottom (showSheetTabs: true).
Box sheets get colored tab (from box_bg_hex).
First sheet active by default.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Install @fortune-sheet/react (MIT, v1.0.4) as Excel-like spreadsheet
component. New SpreadsheetView.tsx converts unified grid data to
Fortune Sheet format (celldata, merge config, column/row sizes).
StepAnsicht now has Spreadsheet/Grid toggle:
- Spreadsheet mode: full Fortune Sheet with toolbar (bold, italic,
color, borders, merge cells, text wrap, undo/redo)
- Grid mode: existing GridTable for quick editing
Box-origin cells get light tinted background in spreadsheet view.
Colspan cells converted to Fortune Sheet merge format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend (unified_grid.py):
- build_unified_grid(): merges content + box zones into one zone
- Dominant row height from median of content row spacings
- Full-width boxes: rows integrated directly
- Partial-width boxes: extra rows inserted when box has more text
lines than standard rows fit (e.g., 7 lines in 5-row height)
- Box-origin cells tagged with source_zone_type + box_region metadata
Backend (grid_editor_api.py):
- POST /sessions/{id}/build-unified-grid → persists as unified_grid_result
- GET /sessions/{id}/unified-grid → retrieve persisted result
Frontend:
- GridEditorCell: added source_zone_type, box_region fields
- GridTable: box-origin cells get tinted background + left border
- StepAnsicht: split-view with original image (left) + editable
unified GridTable (right). Auto-builds on first load.
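Rough sketch of the partial-width rule above (extra rows when a box holds more text lines than standard rows fit). The function name and the rounding choice are assumptions for illustration, not the actual build_unified_grid() code:

```python
def extra_rows_needed(box_height: float, row_height: float, text_lines: int) -> int:
    """How many rows to insert so every text line of the box gets its own row."""
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    # Assumption: round instead of floor so a pixel of jitter does not lose a row.
    rows_that_fit = max(1, round(box_height / row_height))
    return max(0, text_lines - rows_that_fit)


# The example from the message: 7 text lines in a box that is 5 standard rows high.
print(extra_rows_needed(box_height=200, row_height=40, text_lines=7))  # 2
```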
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Content sections: use dominant (median) row height from all content
rows instead of per-section average. This ensures uniform row height
above and below boxes (the standard case on textbook pages).
Box sections: distribute height proportionally by text line count
per row. In a box with 7 text lines in total, a header (1 line) gets
1/7 of the box height and a 3-line bullet gets 3/7. Fixes Box 2, where
row 3 was cut off because even distribution didn't account for
multi-line cells.
Removed overflow:hidden from box container to prevent clipping.
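Sketch of the two height rules in this commit (median row spacing for content, line-count weighting for boxes). Function names and sample numbers are illustrative and not taken from the codebase:

```python
from statistics import median
from typing import List


def dominant_row_height(row_y_mins: List[float]) -> float:
    """Median spacing between consecutive content rows (top edge to top edge)."""
    ys = sorted(row_y_mins)
    spacings = [b - a for a, b in zip(ys, ys[1:])]
    if not spacings:
        raise ValueError("need at least two rows")
    return median(spacings)


def box_row_heights(box_height: float, line_counts: List[int]) -> List[float]:
    """Split box_height across rows in proportion to their text line counts."""
    weights = [max(1, n) for n in line_counts]  # empty rows still count as one line
    total = sum(weights)
    return [box_height * w / total for w in weights]


# Rows above and below a box: the large jump across the box does not skew the median.
print(dominant_row_height([100, 140, 180, 420, 460]))  # 40.0
# Header (1 line) plus two 3-line bullets in a 350 px box.
print(box_row_heights(350.0, [1, 3, 3]))  # [50.0, 150.0, 150.0]
```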
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Content rows were incorrectly filtered out when their Y overlapped
with a box, even if the box only covered the right half of the page.
Now checks both Y AND X overlap — rows are only excluded if they
start within the box's horizontal range.
Fixes: rows next to Box 2 (lend, coconut, taste) were missing from
reconstruction because Box 2 (x=871, w=525) only covers the right
side, but left-side content rows at x≈148 were being filtered.
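Minimal sketch of the combined Y and X check, assuming simple x/y/w/h rectangles. The Rect type and function name are hypothetical. Only Box 2's x and w come from this message; all y, h and row-width values are invented:

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


def row_hidden_by_box(row: Rect, box: Rect) -> bool:
    """True only if the row overlaps the box vertically AND starts inside it horizontally."""
    y_overlap = row.y < box.y + box.h and box.y < row.y + row.h
    starts_inside_x = box.x <= row.x <= box.x + box.w
    return y_overlap and starts_inside_x


box2 = Rect(x=871, y=1200, w=525, h=300)       # y and h are made up
left_row = Rect(x=148, y=1250, w=600, h=40)    # content row left of the box
inner_row = Rect(x=900, y=1250, w=400, h=40)   # row that starts inside the box

print(row_hidden_by_box(left_row, box2))   # False: kept in the reconstruction
print(row_hidden_by_box(inner_row, box2))  # True: belongs to the box
```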
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major rewrite of reconstruction rendering:
- Page split into vertical sections (content/box) around box boundaries
- Content sections: uniform row height = (last_row - first_row) / (n-1)
- Box sections: rows evenly distributed within box height
- Content rows positioned absolutely at original y-coordinates
- Font size derived from row height (55% of row height)
- Multi-line cells (bullets) get expanded height with indentation
- Boxes render at exact bbox position with colored border
- Preparation for unified grid where boxes become part of main grid
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual word_box positioning (wild/unsnapped) with the
server-rendered words-overlay image from the OCR step endpoint.
This shows the same cleanly snapped red letters as the OCR step.
Endpoint: /sessions/{id}/image/words-overlay
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Font: use font_size_suggestion_px * scale directly (removed 0.85 factor)
- Row height: calculate from row-to-row spacing (y_min of next row
minus y_min of current row) instead of text height (y_max - y_min).
This produces correct line spacing matching the original layout.
- Multi-line cells: height multiplied by line count
Content zone should now span from y≈250 to y≈2050 px, matching the original.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Left panel: Original scan + OCR word overlay (red text at exact
word_box positions) + coordinate grid
Right panel: Reconstructed layout + same coordinate grid
Features:
- Coordinate grid toggle with 50/100/200px spacing options
- Grid lines labeled with pixel coordinates in original image space
- Both panels share the same scale for direct visual comparison
- OCR overlay shows detected text in red mono font at original positions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New pipeline step showing the reconstructed page with all zones
positioned at their original coordinates:
- Content zones with vocabulary grid cells
- Box zones with colored borders (from structure detection)
- Colspan cells rendered across multiple columns
- Multi-line cells (bullets) with pre-wrap whitespace
- Toggle to overlay original scan image at 15% opacity
- Proportionally scaled to viewport width
- Pure CSS positioning (no canvas/Fabric.js)
Pipeline: 14 steps (0-13), Ground Truth moved to Step 13.
Added colspan field to GridEditorCell type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chained ternary (colored ? div : multiline ? textarea : input) caused
webpack SWC parser issues. Replaced with IIFE {(() => { if/return })()}
which is more robust and readable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove extra curly braces around the textarea/input ternary that
caused webpack syntax error. The ternary is now a chained condition:
hasColoredWords ? <div> : text.includes('\n') ? <textarea> : <input>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cells containing \n (bullet items with continuation lines) now use
<textarea> instead of <input type=text>, making all lines visible.
Row height auto-expands based on line count in the cell.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auto-build was triggering on every grid.zones.length change, which
happens on every rebuild (zone indices increment). Now uses a ref
to ensure auto-build fires only once. Also removed boxZones.length===0
condition that could trigger unnecessary builds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Boxes whose vertical center falls within top/bottom 7% of image
height are filtered out (page numbers, unit headers, running footers).
At typical scan resolutions, 7% ≈ 2.5cm margin.
Fixes: "Box 1" containing just "3" from "Unit 3" page header being
incorrectly treated as an embedded box.
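Sketch of the margin filter, assuming boxes are given by top edge and height in image pixels. The function name and the sample coordinates are illustrative only:

```python
def is_margin_box(box_y: float, box_h: float, image_h: float,
                  margin_frac: float = 0.07) -> bool:
    """True if the box's vertical center lies in the top or bottom margin band."""
    center = box_y + box_h / 2
    return center < image_h * margin_frac or center > image_h * (1 - margin_frac)


image_h = 2000  # made-up scan height in px
print(is_margin_box(60, 40, image_h))     # True: header-like box near the top
print(is_margin_box(900, 300, image_h))   # False: real embedded box
print(is_margin_box(1900, 60, image_h))   # True: footer-like box near the bottom
```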
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GridTable calculates column widths from col.x_max_px - col.x_min_px.
Flowing and header_only layouts were missing these fields, producing
NaN widths which collapsed the CSS grid layout and showed empty rows
with only row numbers visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously GridTable only supported full-row spanning (one cell across
all columns). Now renders each spanning_header cell with its actual
colspan, positioned at the correct grid column. This allows rows like
"In Britain..." (colspan=2) + "In Germany..." (colspan=2) to render
side by side instead of only showing the first cell.
Also fix box row fields: is_header is now always set (was undefined for
flowing/bullet_list), and y_min_px/y_max_px are added for header_only rows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Colspan: use original word-block text instead of split cell texts.
Prevents "euros a nd cents" from split_cross_column_words.
Box rows: add is_header field (was undefined, causing GridTable
rendering issues). Add y_min_px/y_max_px to header_only rows.
These missing fields caused empty rows with only row numbers visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_split_cross_column_words was destroying the colspan information by
cutting word-blocks at column boundaries BEFORE _detect_colspan_cells
could analyze them. Now passes original (pre-split) words to colspan
detection while using split words for cell building.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New _detect_colspan_cells() in grid_editor_helpers.py:
- Runs after _build_cells() for every zone (content + box)
- Detects word-blocks that extend across column boundaries
- Merges affected cells into spanning_header with colspan=N
- Uses column midpoints to determine which columns are covered
- Works for full-page scans and box zones equally
Also fixes box flowing/bullet_list row height fields (y_min_px/y_max_px).
Removed duplicate spanning logic from cv_box_layout.py — now uses
the generic _detect_colspan_cells from grid_editor_helpers.
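Sketch of the midpoint idea only, not the real _detect_colspan_cells(): a word-block is assumed to cover every column whose midpoint falls inside the block's x-range. The helper names and the column/block coordinates are invented for illustration:

```python
from typing import List, Tuple


def covered_columns(block_x_min: float, block_x_max: float,
                    columns: List[Tuple[float, float]]) -> List[int]:
    """Indices of columns whose horizontal midpoint lies inside the word-block."""
    return [
        i for i, (c_min, c_max) in enumerate(columns)
        if block_x_min <= (c_min + c_max) / 2 <= block_x_max
    ]


def colspan_for_block(block_x_min: float, block_x_max: float,
                      columns: List[Tuple[float, float]]) -> int:
    """A block that covers no midpoint still occupies one column."""
    return max(1, len(covered_columns(block_x_min, block_x_max, columns)))


columns = [(148, 534), (534, 751), (751, 1137), (1137, 1396)]  # made-up bounds
print(covered_columns(148, 700, columns))    # [0, 1]
print(colspan_for_block(148, 700, columns))  # 2
print(colspan_for_block(160, 300, columns))  # 1
```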
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Box 3 empty rows: flowing/bullet_list rows were missing y_min_px/
y_max_px fields that GridTable uses for row height calculation.
Added _px and _pct variants.
Box 2 spanning cells: rows with fewer word-blocks than columns
(e.g., "In Britain..." spanning 2 columns) are now detected and
merged into spanning_header cells. GridTable already renders
spanning_header cells across the full row width.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PaddleOCR returns multi-word blocks (whole phrases), so ALL inter-word
gaps in small zones (boxes, ≤60 words) are column boundaries. Previous
3x-median approach produced thresholds too high to detect real columns.
New approach for small zones: gap_threshold = max(median_h * 1.0, 25).
This correctly detects 4 columns in "Pounds and euros" box where gaps
range from 50-297px and word height is ~31px.
Also includes SmartSpellChecker fixes from previous commits:
- Frequency-based scoring, IPA protection, slash→l, rare-word threshold
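Sketch of the small-zone threshold described above, gap_threshold = max(median_h * 1.0, 25), applied to one row. Function names and word extents are invented; only the column starts (148, 534, 751, 1137), the ~31px word height and the 50-297px gap range follow the commit messages:

```python
from statistics import median
from typing import List, Tuple


def small_zone_gap_threshold(word_heights: List[float]) -> float:
    """Threshold for small zones: one median word height, but at least 25 px."""
    return max(median(word_heights) * 1.0, 25)


def column_starts(row_words: List[Tuple[float, float]], threshold: float) -> List[float]:
    """row_words holds (x_min, x_max) per word-block; returns x_min of each column."""
    words = sorted(row_words)
    if not words:
        return []
    starts = [words[0][0]]
    for (_, prev_max), (x_min, _) in zip(words, words[1:]):
        if x_min - prev_max >= threshold:  # gap wide enough: new column begins
            starts.append(x_min)
    return starts


row = [(148, 480), (534, 700), (751, 840), (1137, 1300)]  # gaps: 54, 51, 297
threshold = small_zone_gap_threshold([31, 31, 31, 31])
print(column_starts(row, threshold))  # [148, 534, 751, 1137]
print(column_starts(row, 531))        # [148]  (a too-high threshold finds one column)
```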
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major improvements:
- Frequency-based boundary repair: always tries the repair and uses the
product of word frequencies to decide (Pound sand→Pounds and: 2000x
more frequent)
- IPA bracket protection: words inside [brackets] are never modified,
even when brackets land in tokenizer separators
- Slash→l substitution: "p/" → "pl" for italic l misread as slash
- Abbreviation guard uses rare-word threshold (freq < 1e-6) instead
of binary known/unknown — prevents "Can I" → "Ca nI" while still
fixing "ats th." → "at sth."
- Tokenizer includes / character for slash-word detection
43 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, boundary repair was skipped when both words were valid
dictionary words (e.g., "Pound sand", "wit hit", "done euro").
Now uses word-frequency scoring (product of the two word frequencies)
to decide if the repair produces a more common word pair.
Threshold: repair accepted when new pair is >5x more frequent, or
when repair produces a known abbreviation.
New fixes: Pound sand→Pounds and (2000x), wit hit→with it (100000x),
done euro→one euro (7x).
43 tests passing.
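Sketch of the frequency-product decision, not the SmartSpellChecker code itself. The frequency table is invented (picked so the Pound sand case comes out at roughly 2000x), the function names are hypothetical, and only the 1-2 character shift and the >5x rule are modeled; the abbreviation path and letter-dropping repairs such as done euro→one euro are left out:

```python
from typing import Tuple

# Hypothetical relative word frequencies, invented for this sketch.
FREQ = {"pound": 2e-5, "sand": 1e-5, "pounds": 2e-5, "and": 2e-2,
        "can": 3e-3, "i": 1e-2}


def pair_score(left: str, right: str) -> float:
    """Product of the two word frequencies; unknown words score 0."""
    return FREQ.get(left.lower(), 0.0) * FREQ.get(right.lower(), 0.0)


def repair_boundary(left: str, right: str, min_gain: float = 5.0,
                    max_shift: int = 2) -> Tuple[str, str]:
    """Try moving the word boundary by up to max_shift chars in either direction."""
    baseline = pair_score(left, right)
    best, best_score = (left, right), baseline
    joined, cut0 = left + right, len(left)
    for shift in range(-max_shift, max_shift + 1):
        cut = cut0 + shift
        if shift == 0 or cut <= 0 or cut >= len(joined):
            continue
        candidate = (joined[:cut], joined[cut:])
        score = pair_score(*candidate)
        # Accept only if clearly more frequent than the original pair.
        if score > best_score and score > baseline * min_gain:
            best, best_score = candidate, score
    return best


print(repair_boundary("Pound", "sand"))  # ('Pounds', 'and')
print(repair_boundary("Can", "I"))       # ('Can', 'I')
```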
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In small zones (boxes), intra-phrase gaps inflate the median gap,
causing gap_threshold to become too large to detect real column
boundaries. Cap at 25% of zone width to prevent this.
Example: Box "Pounds and euros" has 4 columns at x≈148, 534, 751, 1137
but gap_threshold was 531 (larger than the column gaps themselves).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use box_bg_hex for border color (from Step 7 structure detection)
- Numbered color badges per box
- Show color name in box header
- Add box_bg_color/box_bg_hex to GridZone type
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GridTable expects zone (singular), onSelectCell, onCellTextChange,
onToggleColumnBold, onToggleRowHeader, onNavigate — not the
incorrect prop names from the first version.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Source boxes from structure_result (Step 7) instead of grid zones
- Use raw_paddle_words (top/left/width/height) instead of grid cells
- Create new box zones from all detected boxes (not just existing zones)
- Sort zones by y-position for correct reading order
- Include box background color metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New pipeline step between Gutter Repair and Ground Truth that processes
embedded boxes (grammar tips, exercises) independently from the main grid.
Backend:
- cv_box_layout.py: classify_box_layout() detects flowing/columnar/
bullet_list/header_only layout types per box
- build_box_zone_grid(): layout-aware grid building (single-column for
flowing text, independent columns for tabular content)
- POST /sessions/{id}/build-box-grids endpoint with SmartSpellChecker
- Layout type overridable per box via request body
Frontend:
- StepBoxGridReview.tsx: shows each box with cropped image + editable
GridTable. Layout type dropdown per box. Auto-builds on first load.
- Auto-skip when no boxes detected on page
- Pipeline steps updated: 13 steps (0-12), Ground Truth moved to 12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New features:
- Boundary repair: "ats th." → "at sth." (shifted OCR word boundaries)
Tries shifting 1-2 chars between adjacent words, accepts if result
includes a known abbreviation or produces better dictionary matches
- Context split: "anew book" → "a new book" (ambiguous word merges)
Explicit allow/deny list for article+word patterns (alive, alone, etc.)
- Abbreviation awareness: 120+ known abbreviations (sth, sb, adj, etc.)
are now recognized as valid words, preventing false corrections
- Quality gate: boundary repairs only accepted when result scores
higher than original (known words + abbreviations)
40 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SmartSpellChecker now runs during grid build (not just LLM review),
so corrections are visible immediately in the grid editor.
Language detection per column:
- EN column detected via IPA signals (existing logic)
- All other columns assumed German for vocab tables
- Auto-detection for single/two-column layouts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dropdowns are now in the vocabulary table header (after processing),
not in the worksheet settings (before processing). Changing a mode
automatically reprocesses all successful pages with the new settings.
Same dropdown options as the OCR pipeline grid editor.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vocab worksheet now has the same IPA/syllable mode options as the
OCR pipeline grid editor: Auto, nur EN (EN only), nur DE (DE only),
Alle (all), Aus (off).
Previously it only had on/off checkboxes mapping to auto/none.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When ipa_mode=none, the entire IPA processing block was skipped,
including the bracket-stripping logic. Now strips ALL square brackets
from content columns BEFORE the skip, so IPA: Aus (off) actually
removes all IPA from the display.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
loadGrid depended on buildGrid (for 404 fallback), which depended on
ipaMode/syllableMode. Every mode change created a new loadGrid ref,
triggering StepGridReview's useEffect to load the OLD saved grid,
overwriting the freshly rebuilt one.
Now loadGrid only depends on sessionId. The 404 fallback builds inline
with current modes. Mode changes are handled exclusively by the
separate rebuild useEffect.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR text contains ASCII IPA approximations like [kompa'tifn] instead
of Unicode [kˈɒmpətɪʃən]. The strip regex required Unicode IPA chars
inside brackets and missed the ASCII ones. Now strips all [bracket]
content from excluded columns since square brackets in vocab columns
are always IPA.
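Sketch of the "strip every [bracket]" rule with Python's re module. The function name is hypothetical and the real strip regex may differ:

```python
import re

# Any [...] group, plus the whitespace directly in front of it.
_BRACKETS = re.compile(r"\s*\[[^\]]*\]")


def strip_ipa_brackets(text: str) -> str:
    """Remove all square-bracket groups, ASCII or Unicode IPA alike."""
    without = _BRACKETS.sub("", text)
    return re.sub(r"\s{2,}", " ", without).strip()


print(strip_ipa_brackets("competition [kompa'tifn]"))        # competition
print(strip_ipa_brackets("grandad [ˈgrænˌdæd] (informal)"))  # grandad (informal)
```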
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
English IPA from the original OCR scan (e.g. [ˈgrænˌdæd]) was always
shown because fix_cell_phonetics only ADDS/CORRECTS but never removes.
Now strips IPA brackets containing Unicode IPA chars from the EN column
when ipa_mode is "de" or "none".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old guard checked if grid was loaded AND set initialLoadDone in
the same pass, then returned without rebuilding. This meant the first
user-triggered mode change was always swallowed.
Simplified to a mount-skip ref: skip exactly the first useEffect trigger
(component mount), rebuild on every subsequent trigger (user changes).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The useEffect for mode changes called buildGrid() which was a
useCallback closing over stale ipaMode/syllableMode values due to
React's asynchronous state batching. The first click triggered a
rebuild with the OLD mode; only the second click used the new one.
Now inlines the API call directly in the useEffect, reading ipaMode
and syllableMode from the effect's closure which always has the
current values.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Strip IPA brackets [ipa] before attempting word split, so
"makeadecision[dɪsˈɪʒən]" is processed as "makeadecision"
2. Handle contractions: "solet's" → split "solet" → "so let" + "'s"
3. DP tiebreaker: prefer longer first word when scores are equal
("task is" over "ta skis")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"taskis" was split as "ta skis" instead of "task is" because both
have the same DP score. Changed comparison from > to >= so that
later candidates (with longer first words) win ties.
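Sketch of a DP word split with the >= tiebreak. The scoring (sum of squared word lengths) and the tiny vocabulary are assumptions chosen so that "ta skis" and "task is" tie at 20 points, as described above; the real scorer is not shown in this log:

```python
from typing import List, Optional, Set

VOCAB = {"a", "i", "is", "ta", "task", "skis", "make", "decision"}  # toy vocabulary


def split_merged_word(token: str, vocab: Set[str]) -> Optional[List[str]]:
    """Best split of token into vocabulary words, or None if no split exists."""
    n = len(token)
    best: List[Optional[int]] = [None] * (n + 1)  # best[i]: top score for token[:i]
    back = [0] * (n + 1)                          # back[i]: start of the last word
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(i):  # j ascending: later j means a shorter last word
            piece = token[j:i].lower()
            if best[j] is None or piece not in vocab:
                continue
            score = best[j] + len(piece) ** 2
            # ">=" lets the later candidate win ties, i.e. the longer first word.
            if best[i] is None or score >= best[i]:
                best[i], back[i] = score, j
    if best[n] is None:
        return None
    words, i = [], n
    while i > 0:
        words.append(token[back[i]:i])
        i = back[i]
    return words[::-1]


print(split_merged_word("taskis", VOCAB))         # ['task', 'is']
print(split_merged_word("makeadecision", VOCAB))  # ['make', 'a', 'decision']
```

With a strict > in the comparison, the same loop would keep the first candidate found and return ['ta', 'skis'].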
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short merged words like "anew" (a new), "Imadea" (I made a),
"makeadecision" (make a decision) were missed because the split
threshold was too high. Now processes tokens >= 4 chars.
English single-letter words (a, I) are already handled by the DP
algorithm which allows them as valid split points.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Footer rows that are page numbers (digits or written-out like
"two hundred and nine") are now removed from the grid entirely
and promoted to the page_number metadata field. Non-page-number
footer content stays as a visible footer row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"two hundred and nine" (22 chars) was kept as a content row because
the footer detection only accepted text ≤20 chars. Now recognizes
written-out number words (English + German) as page numbers regardless
of length.
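Sketch of a length-independent page-number check. The word lists and the regex-based handling of German compounds are assumptions for illustration; the actual detection may use a different list or approach:

```python
import re

_PARTS = [
    # English
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
    "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
    "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty", "sixty",
    "seventy", "eighty", "ninety", "hundred", "thousand", "and",
    # German (compounds such as "zweihundertneun" are built from these parts)
    "ein", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht",
    "neun", "zehn", "elf", "zwölf", "sech", "sieb", "zwanzig", "dreißig",
    "vierzig", "fünfzig", "sechzig", "siebzig", "achtzig", "neunzig",
    "hundert", "tausend", "und",
]
_NUMBER_TOKEN = re.compile(
    "(?:" + "|".join(sorted(_PARTS, key=len, reverse=True)) + ")+"
)
_CONNECTORS = {"and", "und"}


def is_page_number(text: str) -> bool:
    """True for digit-only text or text made up entirely of number words."""
    t = text.strip().lower()
    if not t:
        return False
    if t.isdigit():
        return True
    tokens = [tok for tok in re.split(r"[\s-]+", t) if tok]
    if not all(_NUMBER_TOKEN.fullmatch(tok) for tok in tokens):
        return False
    # A lone connector ("and", "und") is not a number.
    return any(tok not in _CONNECTORS for tok in tokens)


print(is_page_number("two hundred and nine"))  # True
print(is_page_number("zweihundertneun"))       # True
print(is_page_number("Unit 3"))                # False
```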
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 5g was extracting page refs (p.55, p.70) as zone metadata and
removing them from the cell table. Users want to see them as a
separate column. Now keeps cells in place while still extracting
metadata for the frontend header display.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>