"Comeon" was split as "Com eon" instead of "Come on" because both
are 2-word splits. Now uses sum-of-squared-lengths as tiebreaker:
"come"(16) + "on"(4) = 20 > "com"(9) + "eon"(9) = 18.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR often merges adjacent words when spacing is tight, e.g.
"atmyschool" → "at my school", "goodidea" → "good idea".
New _try_split_merged_word() uses dynamic programming to find the
shortest sequence of dictionary words covering the token. Integrated
as step 5 in _spell_fix_token() after general spell correction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scanner shadow detection (range > 40, darkest < 180) fails on camera
book scans where the gutter shadow is subtle (range ~25, darkest ~214).
New _detect_gutter_continuity() detects gutters by their unique property:
the shadow runs continuously from top to bottom without interruption.
Divides the image into horizontal strips and checks what fraction of
strips are darker than the page median at each column. A gutter column
has >= 75% of strips darker. The transition point where the smoothed
dark fraction drops below 50% marks the crop boundary.
Integrated as fallback between scanner shadow and binary projection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 1 is now document selection (full width).
After selecting a file, Step 2 shows a side-by-side layout with
document preview (3/5 width, scrollable, with fullscreen modal)
and session naming (2/5 width, with start button).
Also adds PDF preview via blob URL before upload.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The initial build_grid_from_words() under-clusters to 1 column while
_build_grid_core() correctly finds 4 columns (marker, EN, DE, example).
Now extracts vocab from grid zones directly, with heuristic to skip
narrow marker columns. Falls back to original cells if zones fail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When columns can't be classified as EN/DE, map them by position:
col 0 → english, col 1 → german, col 2+ → example. This ensures
vocabulary pages are always extracted, even without explicit
language classification. Classified pages still use the proper
EN/DE/example mapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The grid-build zones use generic column types, losing the EN/DE
classification from build_grid_from_words(). Now extracts improved
cells from grid zones but classifies them using the original
columns_meta which has the correct column_en/column_de types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change "PDF wird analysiert..." to "PDF wird hochgeladen..." (accurate)
- Switch to pages tab immediately after upload (before thumbnails load)
- Show progressive status: "5 Seiten erkannt. Vorschau wird geladen..."
- Show backend error detail instead of generic "HTTP 404"
- Backend returns helpful message when session not in memory after restart
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend:
- _run_ocr_pipeline_for_page() now runs the full Kombi pipeline:
orientation → deskew → dewarp → content crop → dual-engine OCR
(RapidOCR + Tesseract merge) → _build_grid_core() with pipe-autocorrect,
word-gap merge, dictionary detection
- Accepts ipa_mode and syllable_mode query params on process-single-page
- Pipeline sessions are visible in admin OCR Kombi UI for debugging
Frontend (vocab-worksheet):
- New "Anzeigeoptionen" section with IPA and syllable toggles
- Settings are passed to process-single-page as query parameters
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extracts page_number from grid_editor_result when opening a session
and displays it as "S. 233" badge in the SessionHeader, next to the
category and GT badges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words ending with "-" where the stem is a known word (e.g. "wunder-"
→ "wunder" is known) are valid line-break hyphenations, not gutter
errors. Gutter problems cause the hyphen to be LOST ("ve" instead of
"ver-"), so a visible hyphen + known stem = intentional word-wrap.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GT badge was only shown on ungrouped SessionRow items. Now also
visible on document group rows (e.g. "GT 1/2") and individual pages
within expanded groups.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
- Apply no longer removes the continuation word from the next row.
"künden" stays in row 31 — only the current row is repaired
("ve" → "ver-"). The original line-break layout is preserved.
- Analysis now skips words that already end with "-" when the direct
join with the next row is a known word (valid hyphenation, not an error).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- StepGroundTruth now shows the split view (original image + table)
so the user can verify the final result before marking as GT
- Backend session list now returns is_ground_truth flag
- SessionList shows amber "GT" badge for marked sessions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The next-row word "künden," had a trailing comma, causing dictionary
lookup to fail for "verkünden,". Now strips .,;:!? before joining.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Lower min word length from 3→2 for hyphen-join candidates so fragments
like "ve" (from "ver-künden") are no longer skipped
- Return all spellchecker candidates instead of just top-1, so user can
pick the correct form (e.g. "stammeln" vs "stammelt")
- Frontend shows clickable alternative buttons for spell_fix suggestions
- Backend accepts text_overrides in apply endpoint for user-selected alternatives
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New step "Wortkorrektur" between Grid-Review and Ground Truth that detects
and fixes words truncated or blurred at the book gutter (binding area) of
double-page scans. Uses pyspellchecker (DE+EN) for validation.
Two repair strategies:
- hyphen_join: words split across rows with missing chars (ve + künden → verkünden)
- spell_fix: garbled trailing chars from gutter blur (stammeli → stammeln)
Interactive frontend with per-suggestion accept/reject and batch controls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When OCR merges adjacent words from different columns into one word box
(e.g. "sichzie" spanning Col 1+2, "dasZimmer" crossing boundary), the
grid builder assigned the entire merged word to one column.
New _split_cross_column_words() function splits these at column
boundaries using case transitions and spellchecker validation to
avoid false positives on real words like "oder", "Kabel", "Zeitung".
Regression: 12/12 GT sessions pass with diff=+0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split now creates independent sessions that appear directly in
the session list. After split, the UI switches to the first child
session. BoxSessionTabs, sub-session state, and parent-child tracking
removed from Kombi code. Legacy ocr-overlay still uses BoxSessionTabs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pyphen is a pattern-based hyphenator that accepts nonsense strings
like "Zeplpelin". Switch to spellchecker (frequency-based word list)
which correctly rejects garbled words and can suggest corrections.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- autocorrect_pipe_artifacts(): strips OCR pipe artifacts from printed
syllable dividers, validates with pyphen, tries char-deletion near
pipe positions for garbled words (e.g. "Ze|plpe|lin" → "Zeppelin")
- Rule (a2): filters isolated non-alphanumeric word boxes (≤2 chars,
no letters/digits) — catches small icons OCR'd as ">", "<" etc.
- Both fixes are generic: pyphen-validated, no session-specific logic
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When OCR merge expands a prefix word box (e.g. "zer" w=42 → w=104),
it heavily overlaps (>75%) with the next fragment ("brech"). The grid
builder's overlap filter previously removed the prefix as a duplicate.
Fix: when overlap > 75% but both boxes are alphabetic with different
text and one is ≤ 4 chars, merge instead of removing. Also enable
chain merging via merge_parent tracking so "zer" + "brech" + "lich"
→ "zerbrechlich" in a single pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add du/dich/dir/mich/mir/uns/euch/ihm/ihn to _STOP_WORDS to prevent
false merges like "du" + "zerlegst" → "duzerlegst"
- Reduce max_short threshold from 6 to 5 to prevent merging multi-word
phrases like "ziehen lassen" → "ziehenlassen"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Syllable "Original" (auto) mode: only normalize cells that already
have | from OCR — don't add new syllable marks via pyphen to words
without printed dividers on the original scan.
2. Syllable "Aus" (none) mode: strip residual | chars from OCR text
so cells display clean (e.g. "Zel|le" → "Zelle").
3. Heading detection: add text length guard in single-cell heuristic —
words > 4 alpha chars starting lowercase (like "zentral") are regular
vocabulary, not section headings.
4. Word-gap merge: new merge_word_gaps_in_zones() step with relaxed
threshold (6 chars) fixes OCR splits like "zerknit tert" → "zerknittert".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
StepPageSplit now:
- Auto-calls POST /page-split on step entry
- Shows oriented image + detection result
- If double page: creates sub-sessions named "Title — S. 1/2"
- If single page: green badge "keine Trennung noetig"
- Manual "Weiter" button (no auto-advance)
Also:
- StepOrientation wrapper simplified (no page-split in orientation)
- StepUpload passes name back via onUploaded(sid, name)
- page.tsx: after page-split "Weiter" switches to first sub-session
- useKombiPipeline exposes setSessionName
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
StepUpload now has 3 phases:
1. File selection: drop zone / file picker → shows preview
2. Review: title input, category, file info → "Hochladen" button
3. Uploaded: shows session image → "Weiter" button
No more auto-advance after upload. User controls every step.
openSession() removed from onUploaded callback to prevent
step-reset race condition.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openSession mapped dbStep=1 to uiStep=0 (upload), overriding handleNext's
advancement to step 1. Fix: sessions always exist post-upload, so always
skip past the upload step in openSession.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 of the clean architecture refactor: Replaces the 751-line ocr-overlay
monolith with a modular pipeline. Each step gets its own component file.
Frontend: /ai/ocr-kombi route with 11 steps (Upload, Orientation, PageSplit,
Deskew, Dewarp, ContentCrop, OCR, Structure, GridBuild, GridReview, GroundTruth).
Session list supports document grouping for multi-page uploads.
Backend: New ocr_kombi/ module with multi-page PDF upload (splits PDF into N
sessions with shared document_group_id). DB migration adds document_group_id
and page_number columns.
Old /ai/ocr-overlay remains fully functional for A/B testing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The page_number was only shown in GridEditor.tsx (ocr-overlay) but
the OCR pipeline uses StepGridReview.tsx which has its own summary bar.
Display the extracted page number (e.g. "S. 233") next to the
dictionary detection badge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_filter_footer_words now returns page number info (text, y_pct, number)
instead of just removing footer words. The page number is included in
the grid result as `page_number` and displayed in the frontend summary
bar as "S. 233".
This preserves page numbers for later page concatenation in the
customer frontend while still removing them from the grid content.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add per-cell artifact filter (4b2): removes single-word cells with
≤2 chars and confidence <65 (e.g. "as" from stray OCR marks)
2. Add narrow connector column normalization (4d2): when ≥60% of cells
in a column share the same short text (e.g. "oder"), normalize
near-match outliers like "oderb" → "oder"
3. Fix footer detection: require short text (≤20 chars) and no commas.
Comma-separated lists like "Uhrzeit, Vergangenheit, Zukunft" are
content continuations, not page numbers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _IPA_RE check in _syllabify_text() skipped entire cells containing
any IPA character. After German IPA insertion adds [bɪltʃøn], the check
blocked syllabification entirely. Now strips bracket content before
checking, so programmatically inserted IPA doesn't prevent syllable
divider insertion on the surrounding text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hybrid approach mirroring English IPA:
- Primary: wiki-pronunciation-dict (636k entries, CC-BY-SA, Wiktionary)
- Fallback: epitran rule-based G2P (MIT license)
IPA modes now use language-appropriate dictionaries:
- auto/en: English IPA (Britfone + eng_to_ipa)
- de: German IPA (wiki-pronunciation-dict + epitran)
- all: EN column gets English IPA, other columns get German IPA
- none: disabled
Frontend shows CC-BY-SA attribution when German IPA is active.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The zone merging function used PageZone but the import was only
in grid_editor_api.py. Caused NameError on sessions that trigger
zone merging (e.g. original_scan_b59a1b1b).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Our IPA system only has English dictionaries (Britfone MIT, eng_to_ipa
MIT). The "IPA: nur DE" option was useless at best and misleading.
Removed from dropdown, type definition, and API validation.
Syllable DE mode stays — pyphen has a German hyphenation dictionary.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Header detection: Add 25% cap to single-cell heading heuristic.
On German synonym dicts where most rows naturally have only 1
content cell, the old logic marked 60%+ of rows as headers.
2. IPA de/all mode: Use "column_text" (light processing) for non-
English columns instead of "column_en" (full processing). The
full path runs _insert_missing_ipa() which splits on whitespace,
matches English prefixes ("bildschön" → "bild"), and truncates
the rest — destroying German comma-separated synonym lists.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dropdowns only updated state but didn't trigger buildGrid().
Now a useEffect watches ipaMode/syllableMode and rebuilds
automatically (skipping the initial mount).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When no IPA signals exist (e.g. German-only dicts), the fallback
that guesses en_col_type was incorrectly triggered for en/de modes,
causing false IPA and syllable insertions. Now only fires for 'all'
mode. Syllable en mode also returns empty set when no EN column found.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend ipa_mode and syllable_mode toggles with language options:
- auto: smart detection (default)
- en: only English headword column
- de: only German definition columns
- all: all content columns
- none: skip entirely
Also improve English column auto-detection: use garbled IPA patterns
(apostrophes, colons) in addition to bracket patterns. This correctly
identifies English dictionary pages where OCR produces garbled ASCII
instead of bracket IPA.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: Remove en_col_type fallback heuristic (longest avg text) that
incorrectly identified German columns as English. IPA now only applied
when OCR bracket patterns are actually found. Add ipa_mode (auto/all/none)
and syllable_mode (auto/all/none) query params to build-grid API.
Frontend: Add IPA and Silben dropdown selects to GridToolbar. Modes
are passed as query params on rebuild. Auto = current smart detection,
All = force for all words, Aus = skip entirely.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5i was overwriting IPA-corrected text from Step 5c when
reconstructing cells from word_boxes. Added _ipa_corrected flag
to preserve corrections. Also tightened merged-token prefix matching
(min prefix 4 chars, min suffix 3 chars) to prevent false positives
like "sis" being extracted from "si:said".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Step 5h: restrict slash-IPA conversion to English headword column
only — prevents converting "der/die/das" to "der [dər]das" in German
columns (confirmed working)
- Fix _text_has_garbled_ipa: detect embedded apostrophes in merged
tokens like "Scotland'skotland" where OCR reads ˈ as '
- Fix _insert_missing_ipa: detect dictionary word prefix in merged
trailing tokens like "fictionsalans'fIkfn" → extract "fiction" with IPA
- Move en_col_type to wider scope for Step 5h access
Note: Fixes 1+2 confirmed working in unit tests but not yet applying
in the full build-grid pipeline — needs further debugging.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Real dictionary pages have only ~3% OCR-detected pipes because the thin
syllable divider lines are hard for OCR to read. The primary false-positive
guard (article_col_index check) already blocks synonym dictionaries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite cv_syllable_detect.py with pyphen-first approach:
- Remove unreliable CV gate (morphological pipe detection)
- Strip existing pipes and re-syllabify via pyphen (DE then EN)
- Merge pipe-gap spaces where OCR split words at divider positions
- Guard merges with function word blacklist and punctuation checks
Add false-positive prevention:
- Pre-check: skip if <5% of cells have existing | from OCR
- Call-site check: require article_col_index (der/die/das column)
- Prevents syllabification of synonym dictionaries and word lists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split sessions (start_step=1) have no orientation_result stored.
StepOrientation now auto-runs orientation detection when loading an
existing session that lacks a result.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split now creates independent sessions (no parent_session_id),
parent marked as status='split' and hidden from list. Navigation uses
useSearchParams for URL-based step tracking (browser back/forward works).
page.tsx reduced from 684 to 443 lines via usePipelineNavigation hook.
Box sub-sessions (column detection) remain unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of returning to parent (which creates a redirect loop), the
handleNext function now finds the next incomplete sub-session and opens
it directly. When all sub-sessions are done, returns to session list.
Also fixes openSession auto-redirect to prefer the first incomplete
sub-session over the most advanced one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When reopening a parent session that has page-split sub-sessions,
the UI was showing the parent's pipeline step (always step 1/Orientation)
instead of navigating to the sub-sessions. Now automatically opens the
most advanced sub-session, matching the behavior of handleOrientationComplete.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>