65 files in klausur-service packages + 3 in backend-lehrer packages
had stale imports referencing deleted shim modules.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
klausur-service: 183 shims deleted, 26 test files + 8 source files updated
backend-lehrer: 59 shims deleted, main.py + 8 source files updated
All imports now use the new package paths directly.
Zero shims remaining in the entire codebase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added ocr_region import to cell_grid/build.py and legacy.py
- Fixed circular import in engines.py via lazy import
- Auto-fixed 22 unused imports via ruff --fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
overflow-hidden → overflow-y-auto so all nav items are reachable.
Added /parent (Eltern-Portal) link with people icon.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1.1 — user_language_api.py: Stores native language preference
per user (TR/AR/UK/RU/PL/DE/EN). Onboarding page with flag-based
language selection for students and parents.
Phase 1.2 — translation_service.py: Batch-translates vocabulary words
into target languages via Ollama LLM. Stores in translations JSONB.
New endpoint POST /vocabulary/translate triggers translation.
Phase 2.1 — Parent Portal (/parent): Simplified UI in parent's native
language showing child's learning progress. Daily tips translated.
Phase 2.2 — Parent Quiz (/parent/quiz/[unitId]): Parents can quiz
their child on vocabulary WITHOUT speaking DE or EN. Shows word in
child's learning language + parent's native language as hint.
Answer hidden by default, revealed on tap.
All UI text translated into 7 languages (DE/EN/TR/AR/UK/RU/PL).
Arabic gets RTL layout support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Turbopack doesn't support tsconfig path aliases pointing outside
the project root. Reverted to copying shared types directly into
each service. The canonical source remains shared/types/*.ts,
synced via scripts/sync-shared-types.sh.
Changes:
- Reverted docker-compose.yml contexts to ./service
- Reverted Dockerfiles to simple COPY . .
- Removed @shared/* from tsconfigs
- Removed symlinks + .gitignore hacks
- Added scripts/sync-shared-types.sh for keeping copies in sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Turbopack only resolves tsconfig paths within the project root.
Changed @shared/* from ../shared/* to ./shared/* in all tsconfigs.
Docker copies shared/ into the project dir at build time.
Local dev uses symlinks (gitignored).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Feature 1 — StarRating: 1-3 stars per exercise (100%=3, 70%=2, <70%=1)
Feature 2 — Progress bar in UnitCard with Leitner box distribution
Feature 3 — Listening exercise: hear word via TTS, choose correct translation
Feature 4 — Matching game: tap-to-match EN↔DE pairs (6 per round)
Feature 5 — Pronunciation: word with syllable bows + mic → STT comparison
Feature 6 — Syllable bows in FlashCards (SyllableBow under word + IPA)
UnitCard now shows 6 exercise types: Karten, Quiz, Tippen, Hoeren,
Zuordnen, Sprechen. Progress bar and star count displayed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shared types extracted to shared/types/:
- companion.ts (33+ types, was 100% duplicated admin-lehrer ↔ studio-v2)
- klausur.ts (18+ types, was 95% duplicated across 4 locations)
- ocr-labeling.ts (11 types, was 100% duplicated admin-lehrer ↔ website)
Original type files replaced with re-exports for backward compat.
tsconfig.json paths updated with @shared/* alias in all 3 services.
Docker: Changed build context from ./service to . (root) so shared/
is accessible. Dockerfiles updated to COPY service/ + shared/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
audio_service.py: Connects to compliance-tts-service (Piper TTS,
MIT license) for high-quality German (Thorsten) and English (Lessac)
voices. Audio cached as MP3 on first request.
vocabulary_api.py: New endpoints:
- GET /vocabulary/word/{id}/audio/{lang} — word pronunciation
- GET /vocabulary/word/{id}/audio-syllables/{lang} — slow syllable-by-syllable
Anton App analysis: identified 5 features to adopt (star system,
games as rewards, progress bars, listening exercises, matching exercises).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sed replacement left orphaned hostname references in story page
and empty lines in getApiBase functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTTPS pages cannot fetch from HTTP backend ports. Added Next.js
API route proxies for /api/vocabulary, /api/learning-units, /api/progress
that forward to backend-lehrer internally (same Docker network, HTTP).
All frontend pages now use same-origin requests (getApiBase = '')
instead of direct port:8001 connections.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trigram extension and index are now created in a separate try/catch
so table creation succeeds even without pg_trgm. Search falls back
to ILIKE when trigram functions are not available.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip SQLAlchemy dialect prefix from DATABASE_URL for asyncpg.
Set search_path via server_settings on pool creation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strategic pivot: Studio-v2 becomes a language learning platform.
Compliance guardrail added to CLAUDE.md — no scan/OCR of third-party
content in customer frontend. Upload of OWN materials remains allowed.
Phase 1.1 — vocabulary_db.py: PostgreSQL model for 160k+ words
with english, german, IPA, syllables, examples, images, audio,
difficulty, tags, translations (multilingual). Trigram search index.
Phase 1.2 — vocabulary_api.py: Search, browse, filters, bulk import,
learning unit creation from word selection. Creates QA items with
enhanced fields (IPA, syllables, image, audio) for flashcards.
Phase 1.3 — /vocabulary page: Search bar with POS/difficulty filters,
word cards with audio buttons, unit builder sidebar. Teacher selects
words → creates learning unit → redirects to flashcards.
Sidebar: Added "Woerterbuch" (/vocabulary) and "Lernmodule" (/learn).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qwen2.5vl:32b needs ~100GB RAM and crashes Ollama.
llama3.2-vision:11b is already installed and fits in memory.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New module vision_ocr_fusion.py: Sends scan image + OCR word
coordinates + document type to Qwen2.5-VL 32B. The LLM reads
the image visually while using OCR positions as structural hints.
Key features:
- Document-type-aware prompts (Vokabelseite, Woerterbuch, etc.)
- OCR words grouped into lines with x/y coordinates in prompt
- Low-confidence words marked with (?) for LLM attention
- Continuation row merging instructions in prompt
- JSON response parsing with markdown code block handling
- Fallback to original OCR on any error
Frontend (admin-lehrer Grid Review):
- "Vision-LLM" checkbox toggle
- "Typ" dropdown (Vokabelseite, Woerterbuch, etc.)
- Steps 1-3 defaults set to inactive
Activate: Check "Vision-LLM", select document type, click "OCR neu + Grid".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New endpoint POST /sessions/{id}/rerun-ocr-and-build-grid that:
1. Runs scan quality assessment
2. Applies CLAHE enhancement if degraded (controlled by enhance toggle)
3. Re-runs dual-engine OCR (RapidOCR + Tesseract) with min_conf filter
4. Merges OCR results and stores updated word_result
5. Builds grid with max_columns constraint
Frontend: Orange "OCR neu + Grid" button in GridToolbar.
Unlike "Neu berechnen" (which only rebuilds grid from existing words),
this button re-runs the full OCR pipeline with quality settings.
Now CLAHE toggle actually has an effect — it enhances the image
before OCR runs, not after.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The max_columns parameter was only implemented in cv_words_first.py
(vocab-worksheet path) but NOT in _build_grid_core which is what
the admin OCR Kombi pipeline uses. The Kombi pipeline uses
grid_editor_helpers._cluster_columns_by_alignment() which has its
own column detection.
Fix: Post-processing step 5k merges narrowest columns after grid
building when zone has more columns than max_columns. Cells from
merged columns get their text appended to the target column.
min_conf word filtering was already working (applied before grid build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev-only toggles belong in admin-lehrer (port 3002) only.
The customer frontend runs the pipeline with optimal defaults
and shows only the finished results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each quality improvement step can now be toggled independently:
- CLAHE checkbox (Step 3: image enhancement on/off)
- MaxCols dropdown (Step 2: 0=unlimited, 2-5)
- MinConf dropdown (Step 1: auto/20/30/40/50/60)
Backend: Query params enhance, max_cols, min_conf on process-single-page.
Response includes active_steps dict showing which steps are enabled.
Frontend: Toggle controls in VocabularyTab above the table.
This allows empirical A/B testing of each step on the same scan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 1: scan_quality.py — Laplacian blur + contrast scoring, adjusts
OCR confidence threshold (40 for good scans, 30 for degraded).
Quality report included in API response + shown in frontend.
Step 2: max_columns parameter in cv_words_first.py — limits column
detection to 3 for vocab tables, preventing phantom columns D/E
from degraded OCR fragments.
Step 3: ocr_image_enhance.py — CLAHE contrast + bilateral filter
denoising + unsharp mask, only for degraded scans (gated by
quality score). Pattern from handwriting_htr_api.py.
Frontend: quality info shown in extraction status after processing.
Reprocess button now derives pages from vocabulary data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Types from deleted ocr-pipeline/types.ts inlined into ocr-kombi/types.ts.
All imports updated across components/ocr-kombi/ and components/ocr-pipeline/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
1. reprocessPages() failed silently after session resume because
successfulPages was empty. Now derives pages from vocabulary
source_page or selectedPages as fallback.
2. process-single-page endpoint built vocabulary entries WITHOUT
applying merge logic (_merge_wrapped_rows, _merge_continuation_rows).
Now applies full merge pipeline after vocabulary extraction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows reprocessing pages from the vocabulary view to apply
new merge logic without navigating back to page selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When textbook authors wrap text within a cell (e.g. long German
translations), OCR treats each physical line as a separate row.
New _merge_wrapped_rows() detects this by checking if the primary
column (EN) is empty — indicating a continuation, not a new entry.
Handles: empty EN + DE text, empty EN + example text, parenthetical
continuations like "(bei)", triple wraps, comma-separated lists.
12 tests added covering all cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3.2 — MicrophoneInput.tsx: Browser Web Speech API for
speech-to-text recognition (EN+DE), integrated for pronunciation practice.
Phase 4.1 — Story Generator: LLM-powered mini-stories using vocabulary
words, with highlighted vocab in HTML output. Backend endpoint
POST /learning-units/{id}/generate-story + frontend /learn/[unitId]/story.
Phase 4.2 — SyllableBow.tsx: SVG arc component for syllable visualization
under words, clickable for per-syllable TTS.
Phase 4.3 — Gamification system:
- CoinAnimation.tsx: Floating coin rewards with accumulator
- CrownBadge.tsx: Crown/medal display for milestones
- ProgressRing.tsx: Circular progress indicator
- progress_api.py: Backend tracking coins, crowns, streaks per unit
Also adds "Geschichte" exercise type button to UnitCard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New feature: After OCR vocabulary extraction, users can generate interactive
learning modules (flashcards, quiz, type trainer) with one click.
Frontend (studio-v2):
- Fortune Sheet spreadsheet editor tab in vocab-worksheet
- "Lernmodule generieren" button in ExportTab
- /learn page with unit overview and exercise type cards
- /learn/[unitId]/flashcards — Flip-card trainer with Leitner spaced repetition
- /learn/[unitId]/quiz — Multiple choice quiz with explanations
- /learn/[unitId]/type — Type-in trainer with Levenshtein distance feedback
- AudioButton component using Web Speech API for EN+DE TTS
Backend (klausur-service):
- vocab_learn_bridge.py: Converts VocabularyEntry[] to analysis_data format
- POST /sessions/{id}/generate-learning-unit endpoint
Backend (backend-lehrer):
- generate-qa, generate-mc, generate-cloze endpoints on learning units
- get-qa/mc/cloze data retrieval endpoints
- Leitner progress update + next review items endpoints
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The tokenizer regex only matches alphabetic characters, so text
before the first word match (like "(= " in "(= I won...") was
silently dropped when reassembling the corrected text.
Now preserves text[:first_match_start] as a leading prefix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of keeping only specific symbols (_KEEP_SYMBOLS), now only
removes explicitly decorative symbols (_REMOVE_SYMBOLS: > < ~ \ ^ etc).
All other punctuation (= ( ) ; : - etc.) is preserved by default.
This is more robust: any new symbol used in textbooks will be kept
unless it's in the small block-list of known decorative artifacts.
Fixes: (= token still being removed on page 5 despite being in
the allow-list (possibly due to Unicode variants or whitespace).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rule (a2) in Step 5i removed word_boxes with no letters/digits as
"graphic OCR artifacts". This incorrectly removed = signs used as
definition markers in textbooks ("film = 1. Film; 2. filmen").
Added exception list _KEEP_SYMBOLS for meaningful punctuation:
= (= =) ; : - – — / + • · ( ) & * → ← ↔
The root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one
block, which gets split into word_boxes ["film", "=", "1.", ...].
The "=" word_box had no alphanumeric chars and was removed as artifact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
= signs are used as definition markers in textbooks ("film = 1. Film").
They were incorrectly removed by two filters:
1. grid_build_core.py Step 5j-pre: _PURE_JUNK_RE matched "=" as
artifact noise. Now exempts =, (=, ;, :, - and similar meaningful
punctuation tokens.
2. cv_ocr_engines.py _is_noise_tail_token: "pure non-alpha" check
removed trailing = tokens. Now exempts meaningful punctuation.
Fixes: "film = 1. Film; 2. filmen" losing the = sign,
"(= I won and he lost.)" losing the (=.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words like "probieren)" or "Englisch)" were incorrectly flagged as
gutter OCR errors because the closing parenthesis wasn't stripped
before dictionary lookup. The spellchecker then suggested "probierend"
(replacing ) with d, edit distance 1).
Two fixes:
1. Strip trailing/leading parentheses in _try_spell_fix before checking
if the bare word is valid — skip correction if it is
2. Add )( to the rstrip characters in the analysis phase so
"probieren)" becomes "probieren" for the known-word check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Text like "Betonung auf der 1. Silbe: profit ['profit]" was
incorrectly detected as garbled IPA and replaced with generated
IPA transcription of the previous row's example sentence.
Added guard: if the cell text contains >=3 recognizable words
(3+ letter alpha tokens), it's normal text, not garbled IPA.
Garbled IPA is typically short and has no real dictionary words.
Fixes: Row 13 C3 showing IPA instead of pronunciation hint text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-line cells (containing \n) that don't already start with a
bullet character get • prepended in the frontend. This ensures
bullet points are visible regardless of whether the backend inserted
them (depends on when boxes were last rebuilt).
Skips header rows and cells that already have •, -, or – prefix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert row expansion — multi-line bullet cells stay as single cells
with \n and text-wrap (tb='2'). This way the text reflows when the
user resizes the column, like normal Excel behavior.
Row height auto-scales by line count (24px * lines).
Vertical alignment: top (vt=0) for multi-line cells.
Removed leading-space indentation hack (didn't work reliably).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hardcodierte REGULATIONS/INDUSTRIES/INDUSTRY_REGULATION_MAP durch
JSON-Import ersetzt. 320 Dokumente in 17 Kategorien mit collapsible
Sektionen pro doc_type. page.tsx von 3672 auf 2655 Zeilen reduziert.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-line cells (\n): expanded into separate rows so each line gets
its own cell. Continuation lines (after •) indented with leading spaces.
Bullet marker lines (•) are bold.
Font-size detection: cells with word_box height >1.3x median get bold
and larger font (fs=12) for box titles.
Headers: is_header rows always bold with light background tint.
Box borders: thick colored outside border + thin inner grid lines.
Content zone: light gray grid borders.
Auto-fit column widths from longest text per column.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Column widths now calculated from the longest text in each column
(~7.5px per character + padding). Takes the maximum of auto-fit
width and scaled original pixel width.
Multi-line cells: uses the longest line for width calculation.
Spanning header cells excluded from width calculation (they span
multiple columns and would inflate single-column widths).
Minimum column width: 60px.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Height: sheet height auto-calculated from row count (26px/row + toolbar),
no more cutoff at 21 rows. Row count set to exact (no padding).
Box borders: thick colored outside border + thin inner grid lines.
Content zone: light gray grid lines on all cells.
Headers: bold (bl=1) for is_header rows. Larger font detected via
word_box height comparison (>1.3x median → fs=12 + bold).
Box cells: light tinted background from box_bg_hex.
Header cells in boxes: slightly stronger tint.
Multi-line cells: text wrap enabled (tb='2'), \n preserved.
Bullet points (•) and indentation preserved in cell text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each zone becomes its own Excel sheet tab with independent column widths:
- Sheet "Vokabeln": main content zone with EN/DE/example columns
- Sheet "Pounds and euros": Box 1 with its own 4-column layout
- Sheet "German leihen": Box 2 with single column for flowing text
This solves the column-width conflict: boxes have different column
widths optimized for their content, which is impossible in a single
unified sheet (Excel limitation: column width is per-column, not per-cell).
Sheet tabs visible at bottom (showSheetTabs: true).
Box sheets get colored tab (from box_bg_hex).
First sheet active by default.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Install @fortune-sheet/react (MIT, v1.0.4) as Excel-like spreadsheet
component. New SpreadsheetView.tsx converts unified grid data to
Fortune Sheet format (celldata, merge config, column/row sizes).
StepAnsicht now has Spreadsheet/Grid toggle:
- Spreadsheet mode: full Fortune Sheet with toolbar (bold, italic,
color, borders, merge cells, text wrap, undo/redo)
- Grid mode: existing GridTable for quick editing
Box-origin cells get light tinted background in spreadsheet view.
Colspan cells converted to Fortune Sheet merge format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend (unified_grid.py):
- build_unified_grid(): merges content + box zones into one zone
- Dominant row height from median of content row spacings
- Full-width boxes: rows integrated directly
- Partial-width boxes: extra rows inserted when box has more text
lines than standard rows fit (e.g., 7 lines in 5-row height)
- Box-origin cells tagged with source_zone_type + box_region metadata
Backend (grid_editor_api.py):
- POST /sessions/{id}/build-unified-grid → persists as unified_grid_result
- GET /sessions/{id}/unified-grid → retrieve persisted result
Frontend:
- GridEditorCell: added source_zone_type, box_region fields
- GridTable: box-origin cells get tinted background + left border
- StepAnsicht: split-view with original image (left) + editable
unified GridTable (right). Auto-builds on first load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Content sections: use dominant (median) row height from all content
rows instead of per-section average. This ensures uniform row height
above and below boxes (the standard case on textbook pages).
Box sections: distribute height proportionally by text line count
per row. A header (1 line) gets 1/7 of box height, a bullet with
3 lines gets 3/7. Fixes Box 2 where row 3 was cut off because
even distribution didn't account for multi-line cells.
Removed overflow:hidden from box container to prevent clipping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Content rows were incorrectly filtered out when their Y overlapped
with a box, even if the box only covered the right half of the page.
Now checks both Y AND X overlap — rows are only excluded if they
start within the box's horizontal range.
Fixes: rows next to Box 2 (lend, coconut, taste) were missing from
reconstruction because Box 2 (x=871, w=525) only covers the right
side, but left-side content rows at x≈148 were being filtered.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major rewrite of reconstruction rendering:
- Page split into vertical sections (content/box) around box boundaries
- Content sections: uniform row height = (last_row - first_row) / (n-1)
- Box sections: rows evenly distributed within box height
- Content rows positioned absolutely at original y-coordinates
- Font size derived from row height (55% of row height)
- Multi-line cells (bullets) get expanded height with indentation
- Boxes render at exact bbox position with colored border
- Preparation for unified grid where boxes become part of main grid
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual word_box positioning (wild/unsnapped) with the
server-rendered words-overlay image from the OCR step endpoint.
This shows the same cleanly snapped red letters as the OCR step.
Endpoint: /sessions/{id}/image/words-overlay
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Font: use font_size_suggestion_px * scale directly (removed 0.85 factor)
- Row height: calculate from row-to-row spacing (y_min of next row
minus y_min of current row) instead of text height (y_max - y_min).
This produces correct line spacing matching the original layout.
- Multi-line cells: height multiplied by line count
Content zone should now span from ~250 to ~2050 matching the original.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Left panel: Original scan + OCR word overlay (red text at exact
word_box positions) + coordinate grid
Right panel: Reconstructed layout + same coordinate grid
Features:
- Coordinate grid toggle with 50/100/200px spacing options
- Grid lines labeled with pixel coordinates in original image space
- Both panels share the same scale for direct visual comparison
- OCR overlay shows detected text in red mono font at original positions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New pipeline step showing the reconstructed page with all zones
positioned at their original coordinates:
- Content zones with vocabulary grid cells
- Box zones with colored borders (from structure detection)
- Colspan cells rendered across multiple columns
- Multi-line cells (bullets) with pre-wrap whitespace
- Toggle to overlay original scan image at 15% opacity
- Proportionally scaled to viewport width
- Pure CSS positioning (no canvas/Fabric.js)
Pipeline: 14 steps (0-13), Ground Truth moved to Step 13.
Added colspan field to GridEditorCell type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chained ternary (colored ? div : multiline ? textarea : input) caused
webpack SWC parser issues. Replaced with IIFE {(() => { if/return })()}
which is more robust and readable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove extra curly braces around the textarea/input ternary that
caused webpack syntax error. The ternary is now a chained condition:
hasColoredWords ? <div> : text.includes('\n') ? <textarea> : <input>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cells containing \n (bullet items with continuation lines) now use
<textarea> instead of <input type=text>, making all lines visible.
Row height auto-expands based on line count in the cell.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auto-build was triggering on every grid.zones.length change, which
happens on every rebuild (zone indices increment). Now uses a ref
to ensure auto-build fires only once. Also removed boxZones.length===0
condition that could trigger unnecessary builds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Boxes whose vertical center falls within top/bottom 7% of image
height are filtered out (page numbers, unit headers, running footers).
At typical scan resolutions, 7% ≈ 2.5cm margin.
Fixes: "Box 1" containing just "3" from "Unit 3" page header being
incorrectly treated as an embedded box.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GridTable calculates column widths from col.x_max_px - col.x_min_px.
Flowing and header_only layouts were missing these fields, producing
NaN widths which collapsed the CSS grid layout and showed empty rows
with only row numbers visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously GridTable only supported full-row spanning (one cell across
all columns). Now renders each spanning_header cell with its actual
colspan, positioned at the correct grid column. This allows rows like
"In Britain..." (colspan=2) + "In Germany..." (colspan=2) to render
side by side instead of only showing the first cell.
Also fix box row fields: is_header always set (was undefined for
flowing/bullet_list), y_min_px/y_max_px for header_only rows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Colspan: use original word-block text instead of split cell texts.
Prevents "euros a nd cents" from split_cross_column_words.
Box rows: add is_header field (was undefined, causing GridTable
rendering issues). Add y_min_px/y_max_px to header_only rows.
These missing fields caused empty rows with only row numbers visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_split_cross_column_words was destroying the colspan information by
cutting word-blocks at column boundaries BEFORE _detect_colspan_cells
could analyze them. Now passes original (pre-split) words to colspan
detection while using split words for cell building.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New _detect_colspan_cells() in grid_editor_helpers.py:
- Runs after _build_cells() for every zone (content + box)
- Detects word-blocks that extend across column boundaries
- Merges affected cells into spanning_header with colspan=N
- Uses column midpoints to determine which columns are covered
- Works for full-page scans and box zones equally
Also fixes box flowing/bullet_list row height fields (y_min_px/y_max_px).
Removed duplicate spanning logic from cv_box_layout.py — now uses
the generic _detect_colspan_cells from grid_editor_helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Box 3 empty rows: flowing/bullet_list rows were missing y_min_px/
y_max_px fields that GridTable uses for row height calculation.
Added _px and _pct variants.
Box 2 spanning cells: rows with fewer word-blocks than columns
(e.g., "In Britain..." spanning 2 columns) are now detected and
merged into spanning_header cells. GridTable already renders
spanning_header cells across the full row width.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PaddleOCR returns multi-word blocks (whole phrases), so ALL inter-word
gaps in small zones (boxes, ≤60 words) are column boundaries. Previous
3x-median approach produced thresholds too high to detect real columns.
New approach for small zones: gap_threshold = max(median_h * 1.0, 25).
This correctly detects 4 columns in "Pounds and euros" box where gaps
range from 50-297px and word height is ~31px.
Also includes SmartSpellChecker fixes from previous commits:
- Frequency-based scoring, IPA protection, slash→l, rare-word threshold
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major improvements:
- Frequency-based boundary repair: always tries repair, uses word
frequency product to decide (Pound sand→Pounds and: 2000x better)
- IPA bracket protection: words inside [brackets] are never modified,
even when brackets land in tokenizer separators
- Slash→l substitution: "p/" → "pl" for italic l misread as slash
- Abbreviation guard uses rare-word threshold (freq < 1e-6) instead
of binary known/unknown — prevents "Can I" → "Ca nI" while still
fixing "ats th." → "at sth."
- Tokenizer includes / character for slash-word detection
43 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, boundary repair was skipped when both words were valid
dictionary words (e.g., "Pound sand", "wit hit", "done euro").
Now uses word-frequency scoring (product of bigram frequencies) to
decide if the repair produces a more common word pair.
Threshold: repair accepted when new pair is >5x more frequent, or
when repair produces a known abbreviation.
New fixes: Pound sand→Pounds and (2000x), wit hit→with it (100000x),
done euro→one euro (7x).
43 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In small zones (boxes), intra-phrase gaps inflate the median gap,
causing gap_threshold to become too large to detect real column
boundaries. Cap at 25% of zone width to prevent this.
Example: Box "Pounds and euros" has 4 columns at x≈148,534,751,1137
but gap_threshold was 531 (larger than the column gaps themselves).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use box_bg_hex for border color (from Step 7 structure detection)
- Numbered color badges per box
- Show color name in box header
- Add box_bg_color/box_bg_hex to GridZone type
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GridTable expects zone (singular), onSelectCell, onCellTextChange,
onToggleColumnBold, onToggleRowHeader, onNavigate — not the
incorrect prop names from the first version.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Source boxes from structure_result (Step 7) instead of grid zones
- Use raw_paddle_words (top/left/width/height) instead of grid cells
- Create new box zones from all detected boxes (not just existing zones)
- Sort zones by y-position for correct reading order
- Include box background color metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New pipeline step between Gutter Repair and Ground Truth that processes
embedded boxes (grammar tips, exercises) independently from the main grid.
Backend:
- cv_box_layout.py: classify_box_layout() detects flowing/columnar/
bullet_list/header_only layout types per box
- build_box_zone_grid(): layout-aware grid building (single-column for
flowing text, independent columns for tabular content)
- POST /sessions/{id}/build-box-grids endpoint with SmartSpellChecker
- Layout type overridable per box via request body
Frontend:
- StepBoxGridReview.tsx: shows each box with cropped image + editable
GridTable. Layout type dropdown per box. Auto-builds on first load.
- Auto-skip when no boxes detected on page
- Pipeline steps updated: 13 steps (0-12), Ground Truth moved to 12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New features:
- Boundary repair: "ats th." → "at sth." (shifted OCR word boundaries)
Tries shifting 1-2 chars between adjacent words, accepts if result
includes a known abbreviation or produces better dictionary matches
- Context split: "anew book" → "a new book" (ambiguous word merges)
Explicit allow/deny list for article+word patterns (alive, alone, etc.)
- Abbreviation awareness: 120+ known abbreviations (sth, sb, adj, etc.)
are now recognized as valid words, preventing false corrections
- Quality gate: boundary repairs only accepted when result scores
higher than original (known words + abbreviations)
40 tests passing, all edge cases covered.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SmartSpellChecker now runs during grid build (not just LLM review),
so corrections are visible immediately in the grid editor.
Language detection per column:
- EN column detected via IPA signals (existing logic)
- All other columns assumed German for vocab tables
- Auto-detection for single/two-column layouts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dropdowns are now in the vocabulary table header (after processing),
not in the worksheet settings (before processing). Changing a mode
automatically reprocesses all successful pages with the new settings.
Same dropdown options as the OCR pipeline grid editor.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vocab worksheet now has the same IPA/syllable mode options as the
OCR pipeline grid editor: Auto, nur EN, nur DE, Alle, Aus.
Previously only had on/off checkboxes mapping to auto/none.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When ipa_mode=none, the entire IPA processing block was skipped,
including the bracket-stripping logic. Now strips ALL square brackets
from content columns BEFORE the skip, so IPA:Aus actually removes
all IPA from the display.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
loadGrid depended on buildGrid (for 404 fallback), which depended on
ipaMode/syllableMode. Every mode change created a new loadGrid ref,
triggering StepGridReview's useEffect to load the OLD saved grid,
overwriting the freshly rebuilt one.
Now loadGrid only depends on sessionId. The 404 fallback builds inline
with current modes. Mode changes are handled exclusively by the
separate rebuild useEffect.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR text contains ASCII IPA approximations like [kompa'tifn] instead
of Unicode [kˈɒmpətɪʃən]. The strip regex required Unicode IPA chars
inside brackets and missed the ASCII ones. Now strips all [bracket]
content from excluded columns since square brackets in vocab columns
are always IPA.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
English IPA from the original OCR scan (e.g. [ˈgrænˌdæd]) was always
shown because fix_cell_phonetics only ADDS/CORRECTS but never removes.
Now strips IPA brackets containing Unicode IPA chars from the EN column
when ipa_mode is "de" or "none".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old guard checked if grid was loaded AND set initialLoadDone in
the same pass, then returned without rebuilding. This meant the first
user-triggered mode change was always swallowed.
Simplified to a mount-skip ref: skip exactly the first useEffect trigger
(component mount), rebuild on every subsequent trigger (user changes).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The useEffect for mode changes called buildGrid() which was a
useCallback closing over stale ipaMode/syllableMode values due to
React's asynchronous state batching. The first click triggered a
rebuild with the OLD mode; only the second click used the new one.
Now inlines the API call directly in the useEffect, reading ipaMode
and syllableMode from the effect's closure which always has the
current values.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Strip IPA brackets [ipa] before attempting word split, so
"makeadecision[dɪsˈɪʒən]" is processed as "makeadecision"
2. Handle contractions: "solet's" → split "solet" → "so let" + "'s"
3. DP tiebreaker: prefer longer first word when scores are equal
("task is" over "ta skis")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"taskis" was split as "ta skis" instead of "task is" because both
have the same DP score. Changed comparison from > to >= so that
later candidates (with longer first words) win ties.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short merged words like "anew" (a new), "Imadea" (I made a),
"makeadecision" (make a decision) were missed because the split
threshold was too high. Now processes tokens >= 4 chars.
English single-letter words (a, I) are already handled by the DP
algorithm which allows them as valid split points.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Footer rows that are page numbers (digits or written-out like
"two hundred and nine") are now removed from the grid entirely
and promoted to the page_number metadata field. Non-page-number
footer content stays as a visible footer row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"two hundred and nine" (22 chars) was kept as a content row because
the footer detection only accepted text ≤20 chars. Now recognizes
written-out number words (English + German) as page numbers regardless
of length.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 5g was extracting page refs (p.55, p.70) as zone metadata and
removing them from the cell table. Users want to see them as a
separate column. Now keeps cells in place while still extracting
metadata for the frontend header display.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rows containing only a page reference (p.55, S.12) were removed as
"oversized stubs" (Rule 2) when their word-box height exceeded the
median. Now skips Rule 2 if any word matches the page-ref pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-cell rows were incorrectly detected as headings when they were
actually continuation lines. Two new guards:
1. Text starting with "(" is a continuation (e.g. "(usw.)", "(TV-Serie)")
2. Single cells beyond the first two content columns are overflow lines,
not headings. Real headings appear in the first columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_insert_missing_ipa stripped "1" from "Theme 1" because it treated
the digit as garbled OCR phonetics. Now treats pure digits/numbering
patterns (1, 2., 3)) as delimiters that stop the garble-stripping.
Also fixes _has_non_dict_trailing which incorrectly flagged "Theme 1"
as having non-dictionary trailing text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Columns with zero cells (e.g. from tertiary detection where the word
was assigned to a neighboring column by overlap) are stripped from the
final result. Remaining columns and cells are re-indexed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Step 5j-pre wrongly classified "p.43", "p.50" etc as artifacts
(mixed digits+letters, <=5 chars). Added exception for page
reference patterns (p.XX, S.XX).
2. IPA spacing regex was too narrow (only matched Unicode IPA chars).
Now matches any [bracket] content >=2 chars directly after a letter,
fixing German IPA like "Opa[oːpa]" → "Opa [oːpa]".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Ensure space before IPA brackets in cell text: "word[ipa]" → "word [ipa]"
Applied as final cleanup in grid-build finalization.
2. Add debug logging for zone-word assignment to diagnose why marker
column cells are empty despite correct column detection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words to the left of the first detected column boundary must always
form their own column, regardless of how few rows they appear in.
Previously required 4+ distinct rows for tertiary (margin) columns,
which missed page references like p.62, p.63, p.64 (only 3 rows).
Now any cluster at the left/right margin with a clear gap to the
nearest significant column qualifies as its own column.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Grid Build step was showing word_result.grid_shape (from the initial
OCR word clustering, often just 1 column) instead of the grid-editor
summary (zone-based, with correct column/row/cell counts). Now reads
summary.total_rows/total_columns/total_cells from the grid-editor result.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Left-side book fold shadows have a V-shape: brightness dips from the
edge toward a peak at ~5-10% of width, then rises again. The previous
algorithm scanned from the edge inward and immediately found a low
dark fraction (0.13 at x=0), missing the gutter entirely.
Now finds the PEAK of the dark fraction profile first, then scans from
that peak toward the page center to find the transition point. Works
for both V-shaped left gutters and edge-darkening right gutters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The frontend expects session_id in the upload response, but multi-page
PDFs returned only document_group_id + pages[]. Now includes session_id
pointing to the first page for backwards compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spell review only runs on vocab entries, but the OCR pipeline's
grid-editor cells also contain merged words (e.g. "atmyschool").
Now splits merged words directly in the grid-build finalization step,
right before returning the result. Uses the same _try_split_merged_word()
dictionary-based DP algorithm.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When uploading a PDF with > 1 page to the OCR pipeline, each page
now gets its own session (grouped by document_group_id). Previously
only page 1 was processed. The response includes a pages array with
all session IDs so the frontend can navigate between them.
Single-page PDFs and images continue to work as before.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"Comeon" was split as "Com eon" instead of "Come on" because both
are 2-word splits. Now uses sum-of-squared-lengths as tiebreaker:
"come"(16) + "on"(4) = 20 > "com"(9) + "eon"(9) = 18.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OCR often merges adjacent words when spacing is tight, e.g.
"atmyschool" → "at my school", "goodidea" → "good idea".
New _try_split_merged_word() uses dynamic programming to find the
shortest sequence of dictionary words covering the token. Integrated
as step 5 in _spell_fix_token() after general spell correction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scanner shadow detection (range > 40, darkest < 180) fails on camera
book scans where the gutter shadow is subtle (range ~25, darkest ~214).
New _detect_gutter_continuity() detects gutters by their unique property:
the shadow runs continuously from top to bottom without interruption.
Divides the image into horizontal strips and checks what fraction of
strips are darker than the page median at each column. A gutter column
has >= 75% of strips darker. The transition point where the smoothed
dark fraction drops below 50% marks the crop boundary.
Integrated as fallback between scanner shadow and binary projection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 1 is now document selection (full width).
After selecting a file, Step 2 shows a side-by-side layout with
document preview (3/5 width, scrollable, with fullscreen modal)
and session naming (2/5 width, with start button).
Also adds PDF preview via blob URL before upload.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The initial build_grid_from_words() under-clusters to 1 column while
_build_grid_core() correctly finds 4 columns (marker, EN, DE, example).
Now extracts vocab from grid zones directly, with heuristic to skip
narrow marker columns. Falls back to original cells if zones fail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When columns can't be classified as EN/DE, map them by position:
col 0 → english, col 1 → german, col 2+ → example. This ensures
vocabulary pages are always extracted, even without explicit
language classification. Classified pages still use the proper
EN/DE/example mapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The grid-build zones use generic column types, losing the EN/DE
classification from build_grid_from_words(). Now extracts improved
cells from grid zones but classifies them using the original
columns_meta which has the correct column_en/column_de types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change "PDF wird analysiert..." to "PDF wird hochgeladen..." (accurate)
- Switch to pages tab immediately after upload (before thumbnails load)
- Show progressive status: "5 Seiten erkannt. Vorschau wird geladen..."
- Show backend error detail instead of generic "HTTP 404"
- Backend returns helpful message when session not in memory after restart
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend:
- _run_ocr_pipeline_for_page() now runs the full Kombi pipeline:
orientation → deskew → dewarp → content crop → dual-engine OCR
(RapidOCR + Tesseract merge) → _build_grid_core() with pipe-autocorrect,
word-gap merge, dictionary detection
- Accepts ipa_mode and syllable_mode query params on process-single-page
- Pipeline sessions are visible in admin OCR Kombi UI for debugging
Frontend (vocab-worksheet):
- New "Anzeigeoptionen" section with IPA and syllable toggles
- Settings are passed to process-single-page as query parameters
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extracts page_number from grid_editor_result when opening a session
and displays it as "S. 233" badge in the SessionHeader, next to the
category and GT badges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words ending with "-" where the stem is a known word (e.g. "wunder-"
→ "wunder" is known) are valid line-break hyphenations, not gutter
errors. Gutter problems cause the hyphen to be LOST ("ve" instead of
"ver-"), so a visible hyphen + known stem = intentional word-wrap.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GT badge was only shown on ungrouped SessionRow items. Now also
visible on document group rows (e.g. "GT 1/2") and individual pages
within expanded groups.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
- Apply no longer removes the continuation word from the next row.
"künden" stays in row 31 — only the current row is repaired
("ve" → "ver-"). The original line-break layout is preserved.
- Analysis now skips words that already end with "-" when the direct
join with the next row is a known word (valid hyphenation, not an error).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- StepGroundTruth now shows the split view (original image + table)
so the user can verify the final result before marking as GT
- Backend session list now returns is_ground_truth flag
- SessionList shows amber "GT" badge for marked sessions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The next-row word "künden," had a trailing comma, causing dictionary
lookup to fail for "verkünden,". Now strips .,;:!? before joining.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Lower min word length from 3→2 for hyphen-join candidates so fragments
like "ve" (from "ver-künden") are no longer skipped
- Return all spellchecker candidates instead of just top-1, so user can
pick the correct form (e.g. "stammeln" vs "stammelt")
- Frontend shows clickable alternative buttons for spell_fix suggestions
- Backend accepts text_overrides in apply endpoint for user-selected alternatives
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New step "Wortkorrektur" between Grid-Review and Ground Truth that detects
and fixes words truncated or blurred at the book gutter (binding area) of
double-page scans. Uses pyspellchecker (DE+EN) for validation.
Two repair strategies:
- hyphen_join: words split across rows with missing chars (ve + künden → verkünden)
- spell_fix: garbled trailing chars from gutter blur (stammeli → stammeln)
Interactive frontend with per-suggestion accept/reject and batch controls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When OCR merges adjacent words from different columns into one word box
(e.g. "sichzie" spanning Col 1+2, "dasZimmer" crossing boundary), the
grid builder assigned the entire merged word to one column.
New _split_cross_column_words() function splits these at column
boundaries using case transitions and spellchecker validation to
avoid false positives on real words like "oder", "Kabel", "Zeitung".
Regression: 12/12 GT sessions pass with diff=+0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split now creates independent sessions that appear directly in
the session list. After split, the UI switches to the first child
session. BoxSessionTabs, sub-session state, and parent-child tracking
removed from Kombi code. Legacy ocr-overlay still uses BoxSessionTabs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pyphen is a pattern-based hyphenator that accepts nonsense strings
like "Zeplpelin". Switch to spellchecker (frequency-based word list)
which correctly rejects garbled words and can suggest corrections.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- autocorrect_pipe_artifacts(): strips OCR pipe artifacts from printed
syllable dividers, validates with pyphen, tries char-deletion near
pipe positions for garbled words (e.g. "Ze|plpe|lin" → "Zeppelin")
- Rule (a2): filters isolated non-alphanumeric word boxes (≤2 chars,
no letters/digits) — catches small icons OCR'd as ">", "<" etc.
- Both fixes are generic: pyphen-validated, no session-specific logic
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When OCR merge expands a prefix word box (e.g. "zer" w=42 → w=104),
it heavily overlaps (>75%) with the next fragment ("brech"). The grid
builder's overlap filter previously removed the prefix as a duplicate.
Fix: when overlap > 75% but both boxes are alphabetic with different
text and one is ≤ 4 chars, merge instead of removing. Also enable
chain merging via merge_parent tracking so "zer" + "brech" + "lich"
→ "zerbrechlich" in a single pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add du/dich/dir/mich/mir/uns/euch/ihm/ihn to _STOP_WORDS to prevent
false merges like "du" + "zerlegst" → "duzerlegst"
- Reduce max_short threshold from 6 to 5 to prevent merging multi-word
phrases like "ziehen lassen" → "ziehenlassen"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Syllable "Original" (auto) mode: only normalize cells that already
have | from OCR — don't add new syllable marks via pyphen to words
without printed dividers on the original scan.
2. Syllable "Aus" (none) mode: strip residual | chars from OCR text
so cells display clean (e.g. "Zel|le" → "Zelle").
3. Heading detection: add text length guard in single-cell heuristic —
words > 4 alpha chars starting lowercase (like "zentral") are regular
vocabulary, not section headings.
4. Word-gap merge: new merge_word_gaps_in_zones() step with relaxed
threshold (6 chars) fixes OCR splits like "zerknit tert" → "zerknittert".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
StepPageSplit now:
- Auto-calls POST /page-split on step entry
- Shows oriented image + detection result
- If double page: creates sub-sessions named "Title — S. 1/2"
- If single page: green badge "keine Trennung noetig"
- Manual "Weiter" button (no auto-advance)
Also:
- StepOrientation wrapper simplified (no page-split in orientation)
- StepUpload passes name back via onUploaded(sid, name)
- page.tsx: after page-split "Weiter" switches to first sub-session
- useKombiPipeline exposes setSessionName
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
StepUpload now has 3 phases:
1. File selection: drop zone / file picker → shows preview
2. Review: title input, category, file info → "Hochladen" button
3. Uploaded: shows session image → "Weiter" button
No more auto-advance after upload. User controls every step.
openSession() removed from onUploaded callback to prevent
step-reset race condition.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openSession mapped dbStep=1 to uiStep=0 (upload), overriding handleNext's
advancement to step 1. Fix: sessions always exist post-upload, so always
skip past the upload step in openSession.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 of the clean architecture refactor: Replaces the 751-line ocr-overlay
monolith with a modular pipeline. Each step gets its own component file.
Frontend: /ai/ocr-kombi route with 11 steps (Upload, Orientation, PageSplit,
Deskew, Dewarp, ContentCrop, OCR, Structure, GridBuild, GridReview, GroundTruth).
Session list supports document grouping for multi-page uploads.
Backend: New ocr_kombi/ module with multi-page PDF upload (splits PDF into N
sessions with shared document_group_id). DB migration adds document_group_id
and page_number columns.
Old /ai/ocr-overlay remains fully functional for A/B testing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The page_number was only shown in GridEditor.tsx (ocr-overlay) but
the OCR pipeline uses StepGridReview.tsx which has its own summary bar.
Display the extracted page number (e.g. "S. 233") next to the
dictionary detection badge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_filter_footer_words now returns page number info (text, y_pct, number)
instead of just removing footer words. The page number is included in
the grid result as `page_number` and displayed in the frontend summary
bar as "S. 233".
This preserves page numbers for later page concatenation in the
customer frontend while still removing them from the grid content.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add per-cell artifact filter (4b2): removes single-word cells with
≤2 chars and confidence <65 (e.g. "as" from stray OCR marks)
2. Add narrow connector column normalization (4d2): when ≥60% of cells
in a column share the same short text (e.g. "oder"), normalize
near-match outliers like "oderb" → "oder"
3. Fix footer detection: require short text (≤20 chars) and no commas.
Comma-separated lists like "Uhrzeit, Vergangenheit, Zukunft" are
content continuations, not page numbers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _IPA_RE check in _syllabify_text() skipped entire cells containing
any IPA character. After German IPA insertion adds [bɪltʃøn], the check
blocked syllabification entirely. Now strips bracket content before
checking, so programmatically inserted IPA doesn't prevent syllable
divider insertion on the surrounding text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hybrid approach mirroring English IPA:
- Primary: wiki-pronunciation-dict (636k entries, CC-BY-SA, Wiktionary)
- Fallback: epitran rule-based G2P (MIT license)
IPA modes now use language-appropriate dictionaries:
- auto/en: English IPA (Britfone + eng_to_ipa)
- de: German IPA (wiki-pronunciation-dict + epitran)
- all: EN column gets English IPA, other columns get German IPA
- none: disabled
Frontend shows CC-BY-SA attribution when German IPA is active.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The zone merging function used PageZone but the import was only
in grid_editor_api.py. Caused NameError on sessions that trigger
zone merging (e.g. original_scan_b59a1b1b).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Our IPA system only has English dictionaries (Britfone MIT, eng_to_ipa
MIT). The "IPA: nur DE" option was useless at best and misleading.
Removed from dropdown, type definition, and API validation.
Syllable DE mode stays — pyphen has a German hyphenation dictionary.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Header detection: Add 25% cap to single-cell heading heuristic.
On German synonym dicts where most rows naturally have only 1
content cell, the old logic marked 60%+ of rows as headers.
2. IPA de/all mode: Use "column_text" (light processing) for non-
English columns instead of "column_en" (full processing). The
full path runs _insert_missing_ipa() which splits on whitespace,
matches English prefixes ("bildschön" → "bild"), and truncates
the rest — destroying German comma-separated synonym lists.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dropdowns only updated state but didn't trigger buildGrid().
Now a useEffect watches ipaMode/syllableMode and rebuilds
automatically (skipping the initial mount).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When no IPA signals exist (e.g. German-only dicts), the fallback
that guesses en_col_type was incorrectly triggered for en/de modes,
causing false IPA and syllable insertions. Now only fires for 'all'
mode. Syllable en mode also returns empty set when no EN column found.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend ipa_mode and syllable_mode toggles with language options:
- auto: smart detection (default)
- en: only English headword column
- de: only German definition columns
- all: all content columns
- none: skip entirely
Also improve English column auto-detection: use garbled IPA patterns
(apostrophes, colons) in addition to bracket patterns. This correctly
identifies English dictionary pages where OCR produces garbled ASCII
instead of bracket IPA.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: Remove en_col_type fallback heuristic (longest avg text) that
incorrectly identified German columns as English. IPA now only applied
when OCR bracket patterns are actually found. Add ipa_mode (auto/all/none)
and syllable_mode (auto/all/none) query params to build-grid API.
Frontend: Add IPA and Silben dropdown selects to GridToolbar. Modes
are passed as query params on rebuild. Auto = current smart detection,
All = force for all words, Aus = skip entirely.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5i was overwriting IPA-corrected text from Step 5c when
reconstructing cells from word_boxes. Added _ipa_corrected flag
to preserve corrections. Also tightened merged-token prefix matching
(min prefix 4 chars, min suffix 3 chars) to prevent false positives
like "sis" being extracted from "si:said".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Step 5h: restrict slash-IPA conversion to English headword column
only — prevents converting "der/die/das" to "der [dər]das" in German
columns (confirmed working)
- Fix _text_has_garbled_ipa: detect embedded apostrophes in merged
tokens like "Scotland'skotland" where OCR reads ˈ as '
- Fix _insert_missing_ipa: detect dictionary word prefix in merged
trailing tokens like "fictionsalans'fIkfn" → extract "fiction" with IPA
- Move en_col_type to wider scope for Step 5h access
Note: Fixes 1+2 confirmed working in unit tests but not yet applying
in the full build-grid pipeline — needs further debugging.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Real dictionary pages have only ~3% OCR-detected pipes because the thin
syllable divider lines are hard for OCR to read. The primary false-positive
guard (article_col_index check) already blocks synonym dictionaries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite cv_syllable_detect.py with pyphen-first approach:
- Remove unreliable CV gate (morphological pipe detection)
- Strip existing pipes and re-syllabify via pyphen (DE then EN)
- Merge pipe-gap spaces where OCR split words at divider positions
- Guard merges with function word blacklist and punctuation checks
Add false-positive prevention:
- Pre-check: skip if <5% of cells have existing | from OCR
- Call-site check: require article_col_index (der/die/das column)
- Prevents syllabification of synonym dictionaries and word lists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split sessions (start_step=1) have no orientation_result stored.
StepOrientation now auto-runs orientation detection when loading an
existing session that lacks a result.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split now creates independent sessions (no parent_session_id),
parent marked as status='split' and hidden from list. Navigation uses
useSearchParams for URL-based step tracking (browser back/forward works).
page.tsx reduced from 684 to 443 lines via usePipelineNavigation hook.
Box sub-sessions (column detection) remain unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of returning to parent (which creates a redirect loop), the
handleNext function now finds the next incomplete sub-session and opens
it directly. When all sub-sessions are done, returns to session list.
Also fixes openSession auto-redirect to prefer the first incomplete
sub-session over the most advanced one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When reopening a parent session that has page-split sub-sessions,
the UI was showing the parent's pipeline step (always step 1/Orientation)
instead of navigating to the sub-sessions. Now automatically opens the
most advanced sub-session, matching the behavior of handleOrientationComplete.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After refactoring grid_editor_api.py into helpers, the function
_text_has_garbled_ipa was used in _detect_heading_rows_by_single_cell
but never imported from cv_ocr_engines. This caused HTTP 500 on
build-grid for sessions that trigger single-cell heading detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed overflow-y-auto and maxHeight from the grid container div.
The page itself handles scrolling — nested scroll containers caused
the bottom rows to be cut off after editing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Extracted 1367 lines of helper functions from grid_editor_api.py
(3051→1620 lines) into grid_editor_helpers.py (filters, detectors,
zone grid building).
2. Created cv_syllable_detect.py with generic CV+pyphen logic:
- Checks EVERY word_box for vertical pipe lines (not just first word)
- No article-column dependency — works with any dictionary layout
- CV morphological detection gates pyphen insertion
3. Grid editor scroll: calc(100vh-200px) for reliable scrolling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Syllable dividers now require CV validation: morphological vertical
line detection checks if word_box image actually shows thin isolated
pipe lines before applying pyphen. Only first word per cell gets
pipes (matching dictionary print layout).
2. Grid editor scroll: changed maxHeight from 80vh to calc(100vh-200px)
so editor remains scrollable after edits.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OCR engines don't detect | pipe chars used as syllable dividers in
dictionaries. After dictionary detection (is_dict=True), use pyphen
(MIT) to insert syllable breaks into headword cells. Tries DE first,
then EN. Skips IPA content, short words, and cells already containing |.
Also adds pyphen>=0.16.0 to requirements.txt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column_1 data showed avg_len=1.0 with 13 single-char cells (alphabet
letters from sidebar). Old fill_ratio check (76% > 35%) missed it.
New criteria: avg_len ≤ 1.5 AND ≥ 70% single chars → removes column.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Pipe divider fix: Changed OCR char-confusion regex so | between
letters (Ka|me|rad) is NOT converted to I. Only standalone/
word-boundary pipes are converted (|ch → Ich, | want → I want).
2. Alphabet sidebar detection improvements:
- _filter_decorative_margin() now considers 2-char words (OCR reads
"Aa", "Bb" from sidebars), lowered min strip from 8→6
- _filter_border_strip_words() lowered decorative threshold from 50%→45%
- New step 4f: grid-level thin-edge-column filter as safety net —
removes edge columns with <35% fill rate and >60% short text
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The right panel (grid area) had no vertical overflow handling, causing
the last ~5 rows to be clipped and invisible. Added overflow-y-auto
with max-height 80vh, and removed overflow-hidden from the GridTable
wrapper that was cutting off content.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- build-grid now saves the automatic OCR result as ground_truth.auto_grid_snapshot
- mark-ground-truth includes a correction_diff comparing auto vs corrected
- New endpoint GET /correction-diff returns detailed diff with per-col_type
accuracy breakdown (english, german, ipa, etc.)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page-split sub-sessions (current_step=2) had orientation marked as skipped
but uiStep remained at 0 (orientation step), causing StepOrientation to
render for a sub-session that has no orientation data. Now advances to
uiStep=1 (deskew) when orientation is skipped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The reset useEffects in StepOrientation/Deskew/Dewarp/Crop were clearing
orientationResult when sessionId changed (e.g. during handleOrientationComplete),
causing the right side of ImageCompareView to show nothing. Using key={sessionId}
on the step components instead forces React to remount with fresh state when
switching sessions, without interfering with the upload/orientation flow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step components (Deskew, Dewarp, Crop, Orientation) had local state
guards that prevented reloading when sessionId changed via sub-session
tab clicks. Added useEffect reset hooks that clear all local state
when sessionId changes, allowing the component to properly reload
the new session's data.
Also renamed "Box N" to "Seite N" in BoxSessionTabs per user feedback.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The page-split detection was only implemented in the regular pipeline
page but not in the OCR Overlay page where the user actually tests
with Kombi mode. Now the overlay page has full sub-session support:
- openSession: handles sub_sessions, parent_session_id, skip logic
for page-split vs crop-based sub-sessions, preserves current mode
- handleOrientationComplete: async, fetches API to detect sub-sessions
- BoxSessionTabs: shown between stepper and step content
- handleNext: returns to parent after sub-session completion
- handleSessionChange/handleBoxSessionsCreated: session switching
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
handleOrientationComplete was checking subSessions from React state,
but due to batching the state was still empty when the user clicked
"Seiten verarbeiten". Now fetches session data directly from the API
to reliably detect sub-sessions and auto-open the first one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After orientation detection, the frontend now automatically calls the
page-split endpoint. When a double-page book spread is detected, two
sub-sessions are created and each goes through the full pipeline
(deskew/dewarp/crop) independently — essential because each page of a
spread tilts differently due to the spine.
Frontend changes:
- StepOrientation: calls POST /page-split after orientation, shows
split info ("Doppelseite erkannt"), notifies parent of sub-sessions
- page.tsx: distinguishes page-split sub-sessions (current_step < 5)
from crop-based sub-sessions (current_step >= 5). Page-split subs
only skip orientation, not deskew/dewarp/crop.
- page.tsx: handleOrientationComplete opens first sub-session when
page-split was detected
Backend changes (orientation_crop_api.py):
- page-split endpoint falls back to original image when orientation
rotated a landscape spread to portrait
- start_step parameter: 1 if split from original, 2 if from oriented
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each page of a double-page scan tilts differently due to the book spine.
The new POST /page-split endpoint detects spreads after orientation and
creates sub-sessions that go through the full pipeline (deskew, dewarp,
crop, etc.) individually, so each page gets its own deskew correction.
Also fixes border-strip filter incorrectly removing German translation
words by adding a decorative-strip validation check.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. page_crop: Score all dark runs by center-proximity × darkness ×
narrowness instead of picking the widest. Fixes ad810209 where a
wide dark area at 35% was chosen over the actual spine at 50%.
2. cv_words_first: Replace x-center-only word→column assignment with
overlap-based three-pass strategy (overlap → midpoint-range → nearest).
Fixes truncated German translations like "Schal" instead of
"Schal - die Schals" in session 079cd0d9.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Users can now right-click any cell to set text color (red, green, blue,
orange, purple, black) or remove the color bar without changing text.
A "reset" option restores the OCR-detected color. This enables accurate
Ground Truth marking when OCR assigns colors to wrong cells.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Color bar (red/colored indicator) now only shows when word_boxes
text still matches the cell text — editing the cell hides stale colors
- New "Auto-Korrektur" button: detects dominant prefix+number patterns
per column (e.g. p.70, p.71) and completes partial entries (.65 → p.65)
— requires 3+ matching entries before correcting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resize handle: wider (9px), z-40 (above z-30 buttons).
Add-column button moved to bottom-right corner to avoid overlap.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New ImageLayoutEditor: SVG overlay on original scan with draggable
column dividers, horizontal guidelines (margins/header/footer),
double-click to add columns, x-button to delete
- GridTable: MIN_COL_WIDTH 40→80px for better readability
- Arrow up/down keys navigate between rows in the grid editor
- Ctrl+Click for multi-cell selection, Ctrl+B to toggle bold on selection
- getAdjacentCell works for cells that don't exist yet (new rows/cols)
- deleteColumn now merges x-boundaries correctly
- Session restore fix: grid_editor_result/structure_result in session GET
- Footer row 3-state cycle, auto-create cells for empty footer rows
- Grid save/build/GT-mark now advance current_step=11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New document category "Woerterbuch" (frontend type + backend validation)
- Column delete: hover column header → red "x" button (with confirmation)
- Column add: hover column header → "+" button inserts after that column
- Both operations support undo/redo, update cell IDs and summary
- Available in both GridEditor and StepGridReview (Kombi last step)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New StepGridReview component: split-view (scan image left, grid right),
confidence stats, row-accept buttons, zoom controls
- Kombi Pipeline case 6 now uses StepGridReview instead of plain GridEditor
- Kombi step label changed to "Review & GT"
- Ground Truth queue page simplified to overview/navigation only
(links to Kombi pipeline for actual review work)
- Deep-link support: /ai/ocr-overlay?session=xxx&mode=kombi
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Ground-truth: zone.columns use 'label' not 'col_type' — calling
.replace() on undefined crashed the page after grid data loaded
- Model-management: same AIToolsSidebarResponsive wrapper bug as the
other pages — does not render children
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AIToolsSidebarResponsive does not accept children — it renders only a
sidebar nav. Using it as a wrapper caused page content to never render.
Replaced with plain div, matching the pattern used by ocr-pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both pages passed `moduleId` which is not a valid prop for PagePurpose.
The component expects explicit title/purpose/audience — calling
audience.join() on undefined caused the client-side crash.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add three new Projekt documentation pages covering product vision
(offline-first desktop app for teachers), 6-phase development roadmap,
and 3-tier hardware strategy with distribution plan.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run detect_graphic_elements() in the grid pipeline after image loading
and remove ALL words whose centroids fall inside detected graphic regions,
regardless of confidence. Previously only low-confidence words (conf < 50)
were removed, letting artifacts like "Tr", "Su" survive.
Changes:
- grid_editor_api.py: Import and call detect_graphic_elements() at Step 3a,
passing only significant words (len >= 3) to avoid short artifacts fooling
the text-vs-graphic heuristic. Hard-filter all words in graphic regions.
- cv_graphic_detect.py: Lower density threshold from 20% to 5% for large
regions (>100x80px) — photos/illustrations have low color saturation.
Raise page-spanning limit from 50% to 60% width/height.
Tested: 5 ground-truth sessions pass regression (079cd0d9, d8533a2c,
2838c7a7, 4233d7e3, 5997b635). Session 5997 now detects 2 graphic regions
and removes 29 artifact words including "Tr" and "Su".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
OCR splits words at syllable marks into overlapping word_boxes (e.g.
"zu" + "tiefst" with 52% x-overlap). Step 5i previously removed the
lower-confidence box, losing the prefix. Now: when both boxes are
alphabetic text with 20-75% overlap, MERGE them into one word_box
("zutiefst") instead of removing.
Also relaxed artifact cell filter: 2-char alphabetic text like "Zw"
(dictionary guide word) is no longer removed. Only non-alphabetic
short text like "a=" is filtered.
Results for session 5997: "tiefst"→"zutiefst", "zu"→"zuständig",
"Zu die Zuschüsse"→"Zuschuss, die Zuschüsse", "Zw" restored.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes for dictionary page session 5997:
1. Heading detection: column_1 cells with article words (die/der/das)
now count as content cells, preventing "die Zuschrift, die Zuschriften"
from being falsely merged into a spanning heading cell.
2. Step 5j-pre: new artifact cell filter removes short garbled text from
OCR on image areas (e.g. "7 EN", "Tr", "\\", "PEE", "a="). Cells
survive earlier filters because their rows have real content in other
columns. Also cleans up empty rows after removal.
3. Footer "PEE" auto-fixed: artifact filter removes the noise cell,
empty row gets cleaned up, footer detection no longer sees it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dictionary pages have 2 dictionary columns, each with article + headword
sub-columns. The right article column (die/der at x≈626) had only 14.3%
row coverage — below the 20% secondary threshold. Lowered to 12% so
dictionary article columns qualify. Also strip pipe characters from
individual word_box text (not just cell text) to remove OCR syllable
separation marks (e.g. "zu|trau|en" → "zutrauen").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The border strip filter (Step 4e) used the LARGEST x-gap which incorrectly
removed base words along with edge artifacts. Now uses a two-stage approach:
1. _filter_border_strip_words() pre-filters raw words BEFORE column detection,
scanning from the page edge inward to find the FIRST significant gap (>30px)
2. Step 4e runs as fallback only when pre-filter didn't apply
Session 4233 now correctly detects 3 columns (base word | oder | synonyms)
instead of 2. Threshold raised from 15% to 20% to handle pages with many
edge artifacts. All 4 ground-truth sessions pass regression.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5i rule (a) only caught blue tiny symbols. Graphic fragments from
page illustrations (e.g. orange quote mark from man illustration) were
missed. Now filters any non-black colored word_box with area < 200 and
confidence < 85.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content word_boxes in test used x-spacing (i%3)*100 which created
internal gaps larger than the border-to-content gap. Changed to
(i%2)*51 so content words overlap and the border gap remains dominant.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Textbooks with decorative alphabet strips along page edges produce
OCR artifacts (scattered colored letters at x<150 while real content
starts at x>=179). Step 4e detects a significant x-gap (>30px) between
a small cluster (<15% of total word_boxes) near the page edge and the
main content, then removes the border-strip word_boxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The frontend renders colored cells from the word_boxes array order,
not from cell.text. After post-processing steps (5i bullet removal etc),
word_boxes could remain in their original insertion order instead of
left-to-right reading order. Step 5j now explicitly sorts word_boxes
using _group_words_into_lines before the result is built.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell text was rebuilt using naive (top, left) sorting after removing
word_boxes in Steps 4c/4d/5i. This produced wrong word order when
words on the same visual line had slightly different top values (1-6px).
Now uses _words_to_reading_order_text() which groups words into visual
lines by y-tolerance before sorting by x within each line, matching
the initial cell text construction in _build_cells.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5i: For word_boxes with >90% x-overlap and different text, use IPA
dictionary to decide which to keep (e.g. "tightly" in dict, "fighily" not).
Red threshold raised from 80 to 90 to catch remaining scanner artifacts
like "tight" and "5" that were still misclassified as red.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scanner artifacts on black text produce slight warm tint (hue ~0, sat ~60)
that was misclassified as red. Now requires median_sat >= 80 specifically
for red classification, since genuine red text always has high saturation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed `typeof grid.zones[][]` to `GridZone[][]` which was causing
a silent build error, preventing the vsplit zone grouping logic from
being compiled into the production bundle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pages with two side-by-side vocabulary columns separated by a vertical
black line are now split into independent sub-zones before row/column
detection. Each sub-zone gets its own rows, preventing misalignment from
different heading rhythms.
- _detect_vertical_dividers(): finds pipe word_boxes at consistent x
positions spanning >50% of zone height
- _split_zone_at_vertical_dividers(): creates left/right PageZone objects
with layout_hint and vsplit_group metadata
- Column union skips vsplit zones (independent column sets)
- Frontend renders vsplit zones side by side via flex layout
- PageZone gets layout_hint + vsplit_group fields
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Post-processing steps like 5h (slash-IPA conversion) modify cell.text
but not individual word_boxes. The colored per-word display showed
stale word_box text instead of the corrected cell text. Now falls
back to the plain input when texts don't match.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents false narrow columns from text overflow at page edges.
Session 355f3c84 had a 3-row/4% tertiary cluster creating a spurious
third column from right-column text overflow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reject /.../ matches containing spaces, parens, or commas (e.g. sb/sth up)
- Second pass converts trailing /ipa2/ after [ipa1] (double pronunciation)
- Validate standalone /ipa/ at start against same reject pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dictionary-style pages print IPA between slashes (e.g. tiger /'taiga/).
Step 5h detects these patterns, looks up the headword in the IPA dictionary
for proper Unicode IPA, and falls back to OCR text when not found.
Converts /ipa/ to [ipa] bracket notation matching the rest of the pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 4d removes "|" and "||" word_boxes that OCR produces when reading
physical vertical divider lines between columns. Also strips stray pipe
chars from cell text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_is_grammar_bracket_content now splits "no pl" into ["no", "pl"]
instead of treating it as single token "no pl".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Add pl, sg, no, also, ae, be etc. to _GRAMMAR_BRACKET_WORDS so
annotations like (pl) and (no pl) are not replaced with IPA.
2. Skip articles (the, a, an) in fix_ipa_continuation_cell — they
never get IPA in vocabulary books.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Footer rows (e.g. page numbers) now show "F" in amber below the row
number, mirroring the blue "H" label for headers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes:
1. fix_ipa_continuation_cell: when headword has inline IPA like
"beat [bˈiːt] , beat, beaten", only generate IPA for uncovered
words (beaten), not words already shown (beat). When bracket is
at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly.
2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied
the cell text (e.g. "[n, nn]" → "").
3. Added 2 tests for inline IPA behavior (35 total).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Footer rows like "two hundred and twelve" are no longer removed from
the grid. Instead they stay in cells/rows and get tagged so the
frontend can render them differently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column_1 cells like "to" (infinitive markers) were incorrectly extracted
as page_refs. Now only cells matching p.70, ,.65, or bare digits are
treated as page references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5g now extracts column_1 cells individually as page_refs (instead of
requiring the whole row to be column_1-only), and footer detection skips
rows containing real IPA Unicode symbols to avoid false positives on
IPA continuation rows like [sˈiː] – [sˈɔː] – [sˈiːn].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Step 5f: Remove dictionary IPA from headings detected after IPA
correction (e.g. "Theme [θˈiːm]" → "Theme")
- Step 5g: Extract page_ref rows (column_1 only, e.g. "p.70") and
footer rows (last single-cell row, e.g. page number "212") from
the vocabulary table into zone-level metadata (page_refs, footer)
so the frontend can render them separately
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 5d now also processes IPA continuations without brackets (e.g.
"ska:f – ska:vz", "'sekandarr sku:l") when the row has only 1 content
cell and the text is pure-ASCII garbled IPA (no real IPA Unicode symbols).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Theme [θˈiːm]" contains real IPA symbols (θ, ˈ) and should NOT be filtered.
Only filter text that has garbled IPA markers (:, ') but no real Unicode IPA chars.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Unbracketed IPA continuations like "ska:f – ska:vz" were falsely detected
as headings. Now _text_has_garbled_ipa() filters them out.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page numbers like "two hundred and twelve" in the last row were falsely
detected as headings. Now first and last non-header rows are excluded.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Color headings now preserve actual starting col_index instead of hardcoded 0
- New _detect_heading_rows_by_single_cell: detects rows with only 1 content
cell (excl. page_ref) as headings — catches black headings like "Theme"
that have normal color/height but are alone in their row
- Runs after Step 5d (IPA continuation) to avoid false positives
- 5 new tests (32 total)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous heuristic picked the column with the longest average text as
the English headword column. In layouts with long example sentences, this
picked the wrong column (examples instead of headwords). Now counts cells
with bracket patterns per column — the column with the most brackets is
the headword column where IPA needs fixing.
Fixes garbled OCR-IPA like "change [tfeind3]" → "change [tʃˈeɪndʒ]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Step 5d now only treats cells as continuation when text is entirely
inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets
(e.g. "employee [im'ploi:]") are no longer overwritten.
2. fix_ipa_continuation_cell no longer skips grammar words like "down" —
they are part of the headword in phrasal verbs like "close sth. down".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The en_col_type heuristic (longest avg text) picks the example column,
missing IPA continuation cells in the actual headword column. Now Step 5d
checks all column_* cells for garbled IPA patterns independently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect bracketed text without real IPA symbols as garbled OCR phonetics
- Allow IPA continuation fix even when other columns have content (for rows
where EN cell is clearly garbled bracketed IPA)
- Strip parenthetical grammar annotations like (no pl) from headword before
IPA lookup in fix_ipa_continuation_cell
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics
have no border lines to produce OCR artifacts like |, I)
2. Remove individual word_boxes with height > 3x zone median (OCR from
graphics like a huge "N" from a map image below text)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter words inside image_overlays (removes OCR from images)
2. Ghost filter: only remove single-char border artifacts, not multi-char
like (= which is real content
3. Skip first-row header detection for zones with image_overlays
(merged geometry creates artificial gaps)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone merging: content zones separated by box zones (images) are merged
into a single zone with image_overlays, so split tables reconnect.
Heading detection: after color annotation, rows where all words are
non-black and taller than 1.2x median are merged into spanning heading cells.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The walk-back was going 4 chars, eating the last letter of the
headword: "schoolbag" → "schoolba". Limiting to 3 gives correct
split: "schoolbag" + "[sku:lbæg]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For merged tokens like "schoolbagsku:lbæg", split at IPA marker
boundary instead of prefix-matching to a shorter dictionary word.
Result: "schoolbag [sku:lbæg]" instead of "school [skˈuːl]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents false positives where punctuation (apostrophes) in merged
tokens caused wrong dictionary matches (e.g. "'se" from "'sekandarr"
matching as a word, breaking IPA continuation row fix).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IPA continuation rows (phonetic transcription that wraps below the
headword) now get proper IPA by looking up headwords from the row
above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stop removing rows that contain only phonetic transcription below
the headword. These rows are valid content that users need to see.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_insert_missing_ipa was adding dictionary IPA to cells that had NO
phonetic transcription on the original page (e.g. "scissors" heading,
"scarf - scarves" without IPA). Now guarded by _text_has_garbled_ipa()
which checks for OCR-mangled phonetic markers (stress marks, length
marks, IPA special chars) before allowing insertion.
Rule: if a line has no phonetics, don't add any. Where garbled IPA
exists, replace it with correct IPA notation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _filter_decorative_margin: Phase 2 now also removes short words (<=3
chars) in the same narrow x-range as the detected single-char strip,
catching multi-char OCR artifacts like "Vv" from alphabet graphics.
- _filter_header_junk: New filter detects the content start (first row
with 3+ high-confidence words) and removes low-conf short fragments
above it that are OCR artifacts from header illustrations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When exclude regions are saved or deleted, the cached grid result is
cleared so the grid rebuilds with updated exclusions on the next step.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Users can now draw rectangles on the document image in the Structure
Detection step to mark areas (e.g. header graphics, alphabet strips)
that should be excluded from OCR results during grid building.
- Backend: PUT/DELETE endpoints for exclude regions stored in structure_result
- Backend: _build_grid_core() filters all words inside user-defined exclude regions
- Frontend: Interactive rectangle drawing with visual overlay and delete buttons
- Preserve exclude regions when re-running structure detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _detect_spine_shadow function was triggering on normal text content
because shadow_range > 20 was too low and convolution edge artifacts
created artificially low values. Now requires: range > 40, darkest < 180,
narrow valley (not text plateau), and brightness rise toward page content.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor left/right shadow detection into shared _detect_spine_shadow()
that finds the darkest column (= book spine center) via argmin of
smoothed brightness. Both sides now cut at the spine center, ensuring
equal page sizes in double-page scans regardless of shadow position.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirror the left-edge shadow detection for the right side: analyze
brightness gradient in the right 25% to find scanner gray strips
from book spines. Cuts at the last bright column before the shadow
dip. Fixes cropping of book scans where the next page bleeds in.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Ground Truth button on last step of Pipeline/Kombi modes in ocr-overlay
- Prominent category picker in active session info bar (pulses when unset)
- GT badge shown when session has ground truth reference
- Backend: auto-detect pipeline from ocr_engine, store in GT snapshot
- Pipeline info shown in GT session list and regression reports
- Also pass pipeline param from ocr-pipeline StepGroundTruth
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract _build_grid_core() from build_grid() endpoint for reuse.
New ocr_pipeline_regression.py with endpoints to mark sessions as
ground truth, list them, and run regression comparisons after code
changes. Frontend button in StepGroundTruth.tsx to mark/update GT.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip IPA brackets that fix_cell_phonetics may have added for short
dictionary words (e.g. "si" → "[si]") before checking if the row is
a garbled phonetic continuation. Detect phonetic text by presence of
':' (length marks), leading apostrophe (stress marks), or absence of
any word with ≥3 letters.
Fixes Row 39 ("si: [si] — So: - si:n") not being removed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- grid_editor_api: After IPA correction, detect rows containing only
garbled phonetics in the English column (no German translation, no
IPA brackets inserted). These are wrap-around lines where printed
IPA extends to the line below the headword. Remove them since the
headword row already has correct IPA.
- cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form
as fallback (e.g. "second-hand" → "secondhand") for dictionary
lookup, fixing IPA insertion for compound words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Single-column German text pages were getting IPA inserted for words
that happen to exist in the English dictionary ("die" → [dˈaɪ],
"Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3
columns, which is the minimum for a vocabulary table layout
(English | article | German).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_insert_missing_ipa now removes garbled phonetic text (e.g. "skea",
"sku:l", "'sizaz") that follows the inserted IPA bracket. Keeps
delimiters (–, -), uppercase words (German), and known English words.
Fixes: "scare [skˈɛə] skea" → "scare [skˈɛə]"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _merge_inline_marker_columns: skip merge when ≥50% of words are
alphabetic (preserves "to", "in", "der" columns)
- Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &")
- IPA phonetics: map longest-avg-text column to column_en so
fix_cell_phonetics runs in the grid editor
- ocr_pipeline_overlays: add missing split_page_into_zones import
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Rule 3 to junk-row filter: rows where no word is longer than
2 chars are removed as scattered OCR debris from illustrations
- Fully disable spanning-header detection which falsely flagged IPA
transcriptions and vocabulary entries as spanning headers
- First-row heuristic remains for genuine header detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Black text has median_sat ~6-7, green text ~63-65. At threshold 50,
scanner blue tints (median_sat ~50-54) on words like "Wasser" were
falsely classified as blue. Threshold 55 has good margin on both sides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median
are removed as non-content elements (e.g. red page number "( 9").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Apply recovered-artifact filter to ALL zones (was box-zones only)
- Filter any recovered word with text ≤ 2 chars (not just !?•·)
- Add post-grid junk-row removal: rows where all word_boxes have
conf < 50 and text ≤ 3 chars are dropped as OCR noise
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extracted 4 overlay functions (_get_structure_overlay, _get_columns_overlay,
_get_rows_overlay, _get_words_overlay) that were missing from the initial
split. Provides render_overlay() dispatcher used by sessions module.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit added `cached["word_result"]` but `cached` was
not defined in these functions. Changed to safely check `_cache` dict
first. Also includes sat_threshold fix (70→50) for green text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Green text words like "Insel" and "Internet" had median_sat=65, just
below the threshold of 70, causing them to be classified as black.
Black text has median_sat=6-7, so threshold=50 provides clear
separation (6-7 vs 63-65) without false positives.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The frontend was checking for an existing structure_result and reusing
it, which meant the backend fix (passing word_boxes to graphic detection)
never had a chance to run on existing sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both kombi OCR functions wrote word_result to DB but not to the
in-memory cache. When detect-structure ran next, it found no words
and passed an empty list to graphic detection, making all word-overlap
heuristics ineffective. This caused green text words to be wrongly
classified as graphic regions.
Also adds a fallback in detect-structure to use raw OCR word lists
if cell word_boxes are empty.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two issues in paddle-kombi word merge:
1. Overlap threshold too strict: PaddleOCR "Stick" and Tesseract
"Stück" overlap at 48.6%, just below the 50% threshold. Both words
ended up in the result, overlapping on the same position.
Fix: lower threshold from 50% to 40%.
2. Text selection blind to confidence: always took PaddleOCR text
even when Tesseract had higher confidence and correct text.
Fix: when texts differ due to spatial-only match, prefer the
engine with higher confidence.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When union columns from multiple content zones are applied, column
boundaries can span wider than any single zone's bbox. Using
zone.bbox_px.w as the scale reference caused the total scaled width
to exceed the container, pushing the table off-screen.
Now uses the actual total column width sum as the scale reference,
guaranteeing columns always fit within the container.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
flowing text where random gaps < 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)
Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 25x25 dilation kernel merges nearby green words into large regions,
so pixel-overlap with OCR word boxes drops below 50%. Previous density
checks alone weren't sufficient.
New multi-layered approach:
- Count OCR word CENTROIDS inside each colored region
- ≥2 centroids → definitely text (images don't produce multiple words)
- 1 centroid + 10%+ pixel overlap → likely text
- Lower pixel overlap threshold from 50% to 40%
- Raise density+height thresholds for text-line detection
- Use INFO logging to diagnose remaining false positives
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add color pixel density checks to cv_graphic_detect.py Pass 1:
- density < 20% → skip (text strokes are thin, images are filled)
- density < 30% + height < 4% page → skip (colored text line)
This fixes green headings (Insel, Internet, Inuit) being removed
as graphic regions, which also caused word reordering in lines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous algorithm used binary ink projection and found false
splits at normal text column gaps. The spine of a book on a scanner
has a characteristic DARK gray strip (scanner bed) flanked by bright
white paper on both sides.
New approach: column-mean brightness with heavy smoothing, looking for
a dark valley (< 88% of paper brightness) in the center region that
has bright paper on both sides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: merge gaps within 5% of image width — the spine area may have
thin ink strips splitting one physical gap into multiple detected gaps.
Only use gaps >= 2% width as split points.
Frontend: StepCrop now handles multi_page crop responses without
crashing on missing original_size/cropped_size fields.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OSD 'rotate' returns the clockwise correction needed,
but the code was applying counterclockwise for 90° and clockwise
for 270° — exactly reversed. This caused pages scanned sideways
to be flipped upside down instead of corrected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a book scan (double-page spread) is detected during the crop step,
the system automatically:
1. Detects vertical center gaps (spine area) via ink density projection
2. Splits into N page sub-sessions (reusing existing sub-session mechanism)
3. Individually crops each page (removing its own borders)
4. Returns sub-session IDs for downstream pipeline processing
Detection: landscape images (w > h * 1.15), vertical gap < 15% peak
density in center region (25-75%), gap width >= 0.8% of image width.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.
Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).
Font: add Noto Sans (supports IPA characters) via next/font/google.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix_cell_phonetics was only called in the OCR pipeline endpoints
(/words, /cells) but not in the combo mode (build-grid / ocr-overlay).
Garbled IPA like [teist] is now corrected to [teɪst] using the
IPA dictionary, same as in the pipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Filter recovered single-char artifacts (!, ?, •) from box zones
where they are decorative noise, not real text markers
2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that
stretch across multiple columns with colored text. Merge their
cells into a single spanning cell in column 0.
3. Fix missing opening parentheses: when cell text has ")" but no
matching "(", prepend "(" to the text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Load structure_result from session to get detected graphic bounds
- Exclude OCR words whose center falls inside a graphic region
- Exclude recovered colored text inside graphic regions
- Reject color recovery regions wider than 4x median word height
Fixes garbage characters (!, ?, •) in box zones and false OCR
detections (N, ?) in image areas.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous version only checked X overlap, causing false positives for
short words like "=" and "I" that appear at similar X positions in
different rows. Now requires >=50% overlap in both dimensions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR can return overlapping phrases (e.g. "von jm." and "jm. =")
that produce duplicate words after splitting. Added _deduplicate_words()
post-merge pass that removes words with same text at overlapping positions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Words on the same visual line can have slightly different top values
(1-6px). Sorting by (top, left) produced wrong word order in the
frontend display. Now uses _group_words_into_lines to group by Y
proximity first, then sort by X within each line.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Colored-pixel fragments in narrow inter-word gaps were being recovered
as false characters (e.g., "!" between "lend" and "sb."), disrupting
word order. Use adaptive padding based on median word height instead
of fixed 4px.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
avg word length <=2 chars (bullets, numbering) are merged into
the adjacent text column. Fixes box zones getting 2 columns when
bullet points are just indentation markers.
2. Improve ghost filter: check word edges (left/right/top/bottom)
against border bands instead of center-only. Catches = at x=947
whose left edge touches the box border.
3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts
like | sitting on box borders before row/column clustering.
The tall | (h=55) was inflating row 0's y_max, causing row overlap.
2. Fix _assign_word_to_row() to prefer closest y_center when rows
overlap, instead of always returning the first matching row.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs word positions, median height, Y tolerance, and resulting
rows for zones with <= 30 words to diagnose row merging issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a cell has colored words (red !, blue phonetics), render each
word as a separate span with its own color instead of coloring the
entire input text with the first non-black color found.
Switches to editable input on cell selection (click).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zone 4 found 4 columns incl. page_ref, union also yields 4.
The strict > check prevented union from applying to Zone 0.
Changed to >= so all content zones get the merged column set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Page references (p.55, p.57) and marker columns (!) appear in very few
rows (< 12% coverage) but sit at the far left/right margin with a clear
gap to the main content. Add a third detection tier that catches these
narrow margin columns when they have >= 2 distinct rows and are within
15% of the content edge with >= 40px gap to the nearest main column.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Global column detection diluted narrow sub-columns (page refs, markers)
because they appeared in too few rows relative to the total. Instead,
detect columns per zone independently, then propagate the best columns
(from the content zone with the most words) to smaller content zones.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content zones (above/between/below boxes) now share the same column
structure: columns are detected once from ALL content-zone words, then
applied to each content zone. Box zones still detect columns independently.
This fixes the issue where narrow columns (page refs like p.55) were not
detected in small content zones above boxes, even though the same column
existed in the larger content zone below the box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Enrich column geometries with original full-page words (box-filtered)
so _detect_sub_columns() finds narrow sub-columns across box boundaries
- Add inline marker guard: bullet points (1., 2., •) are not split into
sub-columns (minimum gap check: 1.2× word height or 20px)
- Add box_rects parameter to build_grid_from_words() — words inside boxes
are excluded from X-gap column clustering
- Pass box rects from zones to words_first grid builder
- Add 9 tests for box-aware column detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New approach: dilate color mask heavily (25x25) to merge nearby colored
pixels into regions, then check word overlap:
- >50% overlap with OCR word boxes → colored text → skip
- <50% overlap → colored image/graphic → keep
This detects balloon clusters as one "image" region instead of trying
to classify individual shapes. Red words like "borrow/lend" are filtered
because they overlap with their word boxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 5x5 MORPH_CLOSE was connecting scattered color pixels into one
page-spanning contour that swallowed individual balloons. Fix:
- Remove MORPH_CLOSE, keep only MORPH_OPEN for speckle removal
- Lower sat threshold 50→40 to catch more colored elements
- Filter contours spanning >50% of width OR height (was AND)
- Filter contours >10% of image area
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass 1 (color): Detect colored graphics on HSV saturation channel.
Black text is invisible on this channel, so no word exclusion needed.
Catches colored balloons, arrows, icons reliably.
Pass 2 (ink): Detect large black illustrations on dark ink mask
minus word exclusion. Only keeps area > 5000 to avoid text fragments.
Fixes: all 5 balloons now detectable (previously word exclusion zones
were eating colored graphics that overlapped with nearby OCR words).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Text fragments after word exclusion are indistinguishable from arrows
and icons via contour metrics. Since the goal is detecting graphics,
images, boxes and colors (not arrows/icons), simplify to only:
- circle/balloon (circularity > 0.55 — very reliable)
- illustration (area > 3000 — clearly non-text)
Boxes and colors are handled by cv_box_detect and cv_color_detect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Lower min_area from 200 to 80 (small balloons ~100-300px²)
- Lower word_pad from 10 to 5 (10px was eating nearby graphics)
- Relax circle detection: circularity>0.55, min_dim>15 (was 0.70/25)
- Text fragments still filtered by _classify_shape noise threshold
- Add ACCEPT logging for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise min_area from 30 to 200 (text fragments are small)
- Raise word_pad from 3 to 10px (OCR bboxes are tight)
- Reduce morph close kernel from 5x5 to 3x3 (avoid reconnecting text)
- Tighten arrow detection: min 20px, circularity<0.35, >=2 defects
- Add 'noise' category for too-small elements, filter them out
- Raise min dimension from 4 to 8px
- Add debug logging for word count and exclusion coverage
- Raise max_area_ratio to 0.25 (allow larger illustrations)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graphic detection needs word positions to exclude text from the ink mask.
Previously Struktur ran before OCR, causing every word to be detected as
a graphic element. Now:
- Pipeline: Struktur at index 7 (after Wörter)
- Kombi: Struktur at index 5 (after PP-OCRv5+Tesseract, before Tabelle)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cv_graphic_detect.py for detecting non-text visual elements (arrows,
circles, lines, exclamation marks, icons, illustrations). Draw detected
graphics on structure overlay image and display them in the frontend
StepStructureDetection component with shape counts and individual listings.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Insert the Struktur detection step between Zuschneiden and
PP-OCRv5+Tesseract in the Kombi pipeline on /ai/ocr-overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline step between Crop and Columns that visualizes detected
document structure: boxes (line-based + shading), page zones, and
color regions. Shows original image on the left, annotated overlay
on the right.
Backend: POST /detect-structure endpoint + /image/structure-overlay
Frontend: StepStructureDetection component with zone/box/color details
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously color/shading detection only ran as fallback when no line-based
boxes were found. Now both methods run in parallel with result merging,
so smaller shaded boxes (like "German leihen") get detected even when
larger bordered boxes are already found. Uses median-blur background
analysis that works for both colored and grayscale/B&W scans.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Median hue instead of mean (robust to background contamination)
- Otsu threshold instead of fixed 180 (adapts to colored backgrounds)
- Background sampling from border pixels with hue-distance filter
- Higher sat_threshold (70) + min_sat_ratio (25%) to reduce false positives
- Classify using saturated pixels only for cleaner hue signal
Fixes: borrow/lend misdetected as orange (actually red, median_H=5)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add color/color_name/recovered fields to OcrWordBox type
- GridTable: show colored text + left-edge color indicator strip
- GridEditor: show color stats and recovered count in summary bar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
(e.g. red ! markers) using HSV masks + contour detection
Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.
Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Since ocr_region_paddle() now runs RapidOCR locally (same PP-OCRv5 models),
the "PaddleOCR (Hetzner)" labels were misleading. Renamed to "PP-OCRv5 (lokal)".
Removed the Kombi-Vergleich tab since both sides would produce identical results.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts
from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge,
KombiCompareStep component for parallel execution and side-by-side overlay,
and wordResultOverride prop on OverlayReconstruction for direct data injection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge
so words at the same position are deduplicated even when OCR text differs.
Frontend: Add yPct/hPct to WordPosition so each word renders at its actual
vertical position instead of all words collapsing to the cell center Y.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns entire phrases as single boxes (e.g. "More than 200
singers took part in the"). The merge algorithm compared word-by-word
but Paddle had multi-word boxes vs Tesseract's individual words, so
nothing matched and all Tesseract words were added as "extras" causing
duplicates. Now splits Paddle boxes into individual words before merge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR 3.4.0 removed 'latin' language support. Use 'en' with
explicit ocr_version='PP-OCRv5' instead, with fallback for older API.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces position-based word matching with row-based sequence alignment
to fix doubled words and cross-line averaging in Kombi-Modus.
New algorithm:
1. Group words into rows by Y-position clustering
2. Match rows between engines by vertical center proximity
3. Within each row: walk both sequences left-to-right, deduplicating
4. Unmatched rows kept as-is
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Even after multi-criteria matching, near-duplicate words can slip through
(same text, centers within 30px horizontal / 15px vertical). The new
_deduplicate_words() removes these, keeping the higher-confidence copy.
Regression test with real session data (row 2 with 145 near-dupes)
confirms no duplicates remain after merge + deduplication.
Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The merge algorithm now uses 3 criteria instead of just IoU > 0.3:
1. IoU > 0.15 (relaxed threshold)
2. Center proximity < word height AND same row
3. Text similarity > 0.7 AND same row
This prevents doubled overlapping words when both PaddleOCR and
Tesseract find the same word at similar positions. Unique words
from either engine (e.g. bullets from Tesseract) are still added.
Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity,
_words_match tests + deduplication regression test).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs both OCR engines on the preprocessed image and merges results:
word boxes matched by IoU, coordinates averaged by confidence weight.
Unmatched Tesseract words (bullets, symbols) are added for better coverage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The slide positioning hook was re-matching cell.text tokens against
word_boxes via fuzzy text similarity, which broke positioning for
special characters (!, bullet points, IPA). Now uses word_box
coordinates directly — exact OCR positions without re-interpretation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When PaddleOCR returns "!Betonung" as a single word box, the overlay
positions text starting at the "!" instead of the actual word. Split
such boxes into ["!", "Betonung"] with proportional position splitting,
matching the existing IPA bracket splitting logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces custom _paddle_words_to_grid_cells with the proven
build_grid_from_words from cv_words_first.py — same function the
regular pipeline uses with PaddleOCR. Handles phrase splitting,
column clustering, and produces cells with word_boxes that the
slide/cluster positioning hooks expect.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
One cell per row with all words as word_boxes instead of one cell per
word. Gives OverlayReconstruction a row-spanning bbox_pct for correct
font sizing and per-word positions for slide/cluster placement.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Uses the cropped/dewarped image instead of the original so the overlay
shows the correctly oriented page. 5 steps instead of 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing
7-step pipeline. Backend endpoint runs PaddleOCR on the original image
and clusters words into rows/cells directly. Frontend adds a mode
toggle and PaddleDirectStep component.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace sequential 1:1 token-to-box mapping with fuzzy text matching.
Each token from cell.text finds its best matching word_box by text
similarity (normalized prefix match + substring bonus). Handles:
- Reordered boxes (different sort between text and boxes)
- IPA corrections changing token boundaries
- Token/box count mismatches
Unmatched tokens get interpolated positions from matched neighbors.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer
produces "badge [bˈædʒ]" with space, creating a token count mismatch
between cell.text and word_boxes. Now also split at "[" boundaries
so each IPA bracket gets its own sub-box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR returns phrase-level bounding boxes (e.g. "competition
[kompa'tifn]" as one box) but the overlay slide mechanism expects
one box per word for accurate positioning. Multi-word boxes are now
split proportionally by character count with small gaps between words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
words_first was storing word_boxes in percent coordinates while
cv_cell_grid.py uses absolute pixel coordinates. The overlay slide
mechanism divides by imgW to get percentages, so percent-in-percent
caused positions near zero. Now both grid builders use the same format.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When engine=paddle is selected, the backend overrides grid_method to
words_first and returns plain JSON (no SSE streaming). The frontend
was not aware of this override — it sent stream=true and tried to parse
SSE events from a JSON response, resulting in "Keine Daten".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bilder > 1500px werden vor dem Upload verkleinert. Koordinaten
werden zurueckskaliert. JPEG statt PNG fuer schnelleren Upload.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update chunk counts for 8 successfully ingested DE laws (Phase H1)
- Add 6 new BGB-Teile entries (AGB, Fernabsatz, Kaufrecht, Widerruf, Digital)
- Add EGBGB Widerrufsbelehrung entry
- Update COLLECTION_TOTALS: gesetze 58304→63567 (+5263 Phase H chunks)
- Add Verbraucherschutz thematic group to Landkarte
- Extend ecommerce industry map with consumer protection regulations
- Update date to March 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleOCR als neue engine=paddle Option in der OCR-Pipeline.
Microservice auf Hetzner (paddleocr-service/), async HTTP-Client
(paddleocr_remote.py), Frontend-Dropdown, automatisch words_first.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 15 new regulations from Phase H ingestion:
- DE: PAngV, VSBG, ProdHaftG, VerpackG, ElektroG, BattDG, BFSG, UWG, GewO
- EU: Warenkauf-RL, Klausel-RL, UGP-RL, Preisangaben-RL, Omnibus-RL, BattVO
Chunk counts set to 0 (will be updated after successful ingestion).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_has_ipa_gap() prüft ob Tesseract eine IPA-Klammer übersehen hat anhand
des physischen Abstands zwischen Headword und nächstem Wort. Ohne Gap
(z.B. "be good at sth.", "Focus on language") wird kein IPA eingefügt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_insert_missing_ipa ueberspringe Texte mit >6 Woertern oder Klammern.
Neue _insert_headword_ipa fuer column_text: prueft nur das erste Wort
der Zeile, unabhaengig von Textlaenge oder vorhandenen Klammern.
Ausserdem _sync_word_boxes_after_ipa_insert gefixt: Token-Vergleich
nutzt jetzt paralleles Durchlaufen statt zip (verschobene Positionen).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fuer column_text werden fehlende IPA-Lautschriften (challenge, profit,
film, badge) wieder eingefuegt, aber gleichzeitig eine synthetische
word_box erzeugt, damit die 1:1 Token-zu-Box Zuordnung im Overlay
erhalten bleibt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_strip_orphan_bracket entfernte deutsche Bedeutungsangaben in Klammern,
weil sie weder als Grammar-Partikel noch als IPA erkannt wurden.
Fix: Klammerinhalte mit echten Wörtern (>=4 Buchstaben) werden behalten.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
word_boxes wurden nur im Cell-Crop-Pfad (narrow columns) gesetzt,
aber nicht im Full-Page Word-Assignment-Pfad (broad columns).
Jetzt werden die Tesseract-Wort-Koordinaten in beiden Pfaden gespeichert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TEXT kommt aus cell.text (bereinigt, IPA-korrigiert).
POSITIONEN kommen aus word_boxes (exakte OCR-Koordinaten).
Tokens werden 1:1 in Leserichtung zugeordnet.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: _ocr_cell_crop speichert jetzt word_boxes mit exakten
Tesseract/RapidOCR Wort-Koordinaten (left, top, width, height)
im Cell-Ergebnis. Absolute Bildkoordinaten, bereits zurueckgemappt.
Frontend: Slide-Hook nutzt word_boxes direkt wenn vorhanden —
jedes Wort wird exakt an seiner OCR-Position platziert. Kein
Pixel-Scanning noetig. Fallback auf alten Slide wenn keine Boxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gruppen-Sliding schob nicht weit genug nach rechts. Zurueck zum
Original-Einzelwort-Slide, aber mit den Fixes:
- fontRatio=1.0 (konsistente Schriftgroesse wie Fallback)
- Token-Breiten aus medianCh * 0.7 / refFontSize (statt totalInk)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vorher: split(/\s+/) zerlegte alles in Einzelwoerter, verlor die
Spaltenstruktur (3+ Spaces zwischen Gruppen). Woerter stauten sich links.
Jetzt: split(/\s{3,}/) erhält Gruppen wie im Cluster-Modus. Jede Gruppe
wird als Einheit von links nach rechts geschoben bis Tinte gefunden.
Breite = max(gemessene Textbreite, tatsaechliche Tintenbreite).
fontRatio=1.0, kein Wort geht verloren.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fontRatio war 0.65 (35% kleiner als Fallback-Rendering). Jetzt 1.0
wie beim Fallback. Token-Breiten berechnet aus measureText skaliert
auf die tatsaechlich gerenderte Schriftgroesse (medianCh * 0.7).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Schriftgroesse wird jetzt GLOBAL aus der medianen Zellhoehe berechnet
(65% der Zellhoehe als Ziel-Font). Alle Tokens bekommen dieselbe
konsistente Groesse. Die Slide-Logik bestimmt nur noch die x-Position.
Vorher: Scale pro Zelle aus Ink-Span/Textbreite -> inkonsistente Groessen.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
totalInk zaehlte nur dunkle Pixel-Spalten (Striche), ignorierte
Luecken zwischen Buchstaben. Scale war dadurch viel zu klein,
Schrift unlesbar. Jetzt wird der Ink-Span (erstes bis letztes
dunkles Pixel) als Referenz fuer die Textbreite verwendet.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Neuer Hook useSlideWordPositions: Schiebt alle erkannten Woerter von links
nach rechts ueber die Pixel-Projektion bis jedes Wort auf seiner Tinte
einrastet. Kein Wort geht verloren, keine Cluster-Matching-Regeln noetig.
Toggle-Button (Slide/Cluster) in der Overlay-Toolbar zum Umschalten.
Bestehender Cluster-Algorithmus bleibt als Alternative erhalten.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix_cell_phonetics() ersetzt fehlerhafte IPA-Klammern UND fuegt fehlende
Lautschrift fuer englische Woerter ein (z.B. badge, film, challenge, profit).
Wird auf alle Zellen mit col_type column_en/column_text angewandt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zwei wesentliche Verbesserungen:
1. Multi-group: Gruppen werden per Best-Fit-Breite den Clustern
zugeordnet statt naiv links-nach-rechts. Damit wird z.B.
"Kokosnuss" dem DE-Spalten-Cluster zugeordnet statt dem
breiteren Box-Cluster.
2. Single-group Fallback: verwendet den BREITESTEN Cluster statt
first-to-last Span. Verhindert dass Streupixel von benachbarten
Seitenbereichen den Text nach links ziehen.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Box-Rahmen werden vom OCR als einzelne Symbole wie "|" oder ">"
erkannt und als eigene Text-Gruppen behandelt. Das verfaelscht die
Cluster-Zuordnung weil diese Artefakte entweder keinen eigenen
Cluster erzeugen oder den falschen Cluster zugewiesen bekommen.
Fix: Gruppen mit max 2 Zeichen ohne Buchstaben/Ziffern werden mit
der benachbarten Gruppe zusammengefuehrt bevor die Cluster-Zuordnung
laeuft.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wenn mehr Pixel-Cluster als Text-Gruppen existieren (z.B. wegen
Box-Rahmenlinien), werden jetzt die N breitesten Cluster ausgewaehlt
statt naiv clusters[i]→groups[i] zuzuordnen. Text-Cluster sind
breiter als Rahmenlinien-Cluster.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vorher: wenn Text mehr Wort-Gruppen hatte als Pixel-Cluster gefunden
wurden (z.B. bei Box-Rahmen die Cluster zusammenmergen), wurde die
Zelle komplett uebersprungen → Fallback bei x=0%.
Jetzt: Fallback auf Single-Span Positionierung (first→last Cluster)
statt Skip. Damit wird der Text immer korrekt horizontal platziert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Schriftgroesse basiert jetzt auf Median-Zeilenhoehe statt
individueller Zellhoehe — keine Groessensprunge in Box-Bereichen
2. Sehr schmale Pixel-Cluster (< 0.5% Zellbreite) werden gefiltert,
damit Box-Rahmen nicht als Textposition erkannt werden
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NameError behoben: skip_heal_gaps war nicht im Scope der
_word_batch_stream_generator Funktion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_heal_row_gaps verschiebt Zell-Positionen nach Entfernung von Artefakt-Zeilen,
was im Overlay zu sichtbarem Versatz fuehrt (z.B. 23px bei "badge").
Neuer skip_heal_gaps Parameter in build_cell_grid_v2 und words-Endpoint
behaelt die exakten Zeilen-Positionen bei.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seiten mit Info-Boxen (andere Zeilenhoehe) fuehren dazu, dass _regularize_row_grid
die Zeilenpositionen verzerrt. Neuer skip_regularize Parameter nutzt stattdessen
die gap-basierten Zeilen, die der tatsaechlichen Seitengeometrie folgen.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Statt Side-by-Side wird der erkannte Text jetzt direkt ueber das
Originalbild gelegt. Textfarbe (rot/blau/schwarz) und Deckkraft
per Slider einstellbar fuer einfache visuelle Fehlersuche.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wenn column_result fehlt (z.B. OCR Overlay Pipeline), wird automatisch
eine einzelne ganzseitige Pseudo-Spalte erzeugt statt einen Fehler zu werfen.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Das cropped image ist bereits orientierungskorrigiert. Die zusaetzliche
180°-Rotation ueber imageRotation drehte das Bild falsch herum.
imageRotation wird weiter fuer Pixel-Matching genutzt, aber nicht mehr
fuer die Bildanzeige.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Euro/Badge-Zeilen hatten ihren Center innerhalb der Box-Zone, weshalb
das Clamping nicht griff. Jetzt wird anhand der Box-Mitte entschieden
ob eine Zelle nach oben (clamp height) oder unten (push y) gehoert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
boxZonesPct useMemo war nach bedingten Returns platziert, was gegen
Reacts Rules of Hooks verstoesst und einen Client-Side Crash ausloest.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zellen oberhalb der Box werden in der Hoehe begrenzt, Zellen unterhalb
werden nach unten verschoben. Sub-Session-Zellen bleiben unveraendert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- detect_rows: Content-Strips nutzen jetzt box_ranges_inner (geschrumpft
um border_thickness, min 5px) statt der vollen Box-Range
- detect_words: _row_in_box Filter nutzt ebenfalls inner Range
- Dadurch wird die letzte Zeile oberhalb einer Box nicht mehr
faelschlicherweise der Box zugeordnet und ausgeschlossen
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- usePixelWordPositions: neuer rotation-Parameter (0 | 180)
- Bei 180°: Bild auf Canvas rotiert, Zell-Koordinaten transformiert,
Cluster-Positionen zurueck-gespiegelt
- StepReconstruction: 180°-Toggle-Button in Overlay-Toolbar
- Default 180° bei Parent-Sessions mit Boxen
- Linkes Originalbild wird ebenfalls CSS-rotiert
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- usePixelWordPositions Hook extrahiert (shared zwischen StepLlmReview und StepReconstruction)
- StepReconstruction: neuer Overlay-Modus mit 50/50 Layout (Original + Rekonstruktion)
- Sub-Session-Zellen werden in Parent-Koordinaten konvertiert und zusammengefuehrt
- Spalten-/Zeilenlinien und Box-Zone-Markierung aus column_result/row_result
- Schriftgroesse-Slider und Bold-Toggle fuer Overlay
- StepLlmReview: ~140 Zeilen Pixel-Analyse durch Hook ersetzt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alle Wortgruppen bekommen die gleiche fontRatio (gerundet auf 0.02),
basierend auf der haeufigsten berechneten Groesse. Ueberschriften
und Fliesstext haben damit einheitliche Schriftgroesse.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Analysiert Schwarzpixel-Verteilung auf dem Originalbild per Canvas.
Findet Wort-Cluster pro Zeile und positioniert erkannte Textgruppen
an den exakten Pixel-Positionen. Monospace-Font zurueck auf Sans-Serif.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
column_text Zellen enthalten proportionale Leerzeichen zur Ausrichtung.
Mit Monospace-Font stehen Waehrungswerte korrekt untereinander.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alle Zellen einer Spalte bekommen die gleiche x-Position (Median)
damit Werte vertikal korrekt untereinander stehen.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Neuer Overlay-Modus zeigt OCR-Text per bbox_pct ueber weissem
Hintergrund neben dem Originalbild. Steuerelemente fuer Schriftgroesse,
Einrueckung und Bold. Inline-Editing per contentEditable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_cells_to_vocab_entries wurde nur bei is_vocab (column_en/column_de)
aufgerufen. Fuer Sub-Sessions mit column_text wurden keine Eintraege
erzeugt, daher blieb die Korrektur-Tabelle leer.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_cells_to_vocab_entries kannte column_text nicht, daher wurden
keine Eintraege erzeugt. Jetzt mappt column_text -> 'text' Feld.
Frontend: column_text in FIELD_LABELS/COL_TYPE_TO_FIELD/COL_TYPE_COLOR.
Label: "Tabelle" statt "Vokabeltabelle" fuer Sub-Sessions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_is_noise_tail_token() stuft rein nicht-alphabetische Tokens wie
€0.50, £1, €2.50 als OCR-Noise ein und entfernt sie. Zusaetzlich
zerstoert ' '.join(tokens) das proportionale Spacing.
Fuer Single-Column Sub-Sessions wird _clean_cell_text uebersprungen.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zeigt low-confidence Woerter (conf<30) und Zellinhalte pro Zeile,
um fehlende Euro/Pfund-Betraege zu diagnostizieren.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gap-basierte Erkennung findet bei kleinen Box-Bildern zu wenige Gaps
und mergt Zeilen (7 raw gaps -> 4 validated -> nur 3 rows statt 6).
Sub-Sessions nutzen jetzt direkt _build_rows_from_word_grouping(),
das Woerter nach Y-Position clustert — robuster fuer komplexe Box-Layouts.
Zusaetzlich: alle zones=None Crashes gefixt (replace_all .get("zones") or []).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
column_result.get("zones", []) gibt None zurueck wenn der Key mit
Wert None existiert. Geaendert zu .get("zones") or [].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bisher wurden _word_dicts, _inv und _content_bounds fuer Sub-Sessions
nicht gecacht, sodass detect_rows auf detect_column_geometry() zurueckfiel.
Das konnte bei kleinen Box-Bildern mit <5 Woertern fehlschlagen.
Jetzt laeuft Tesseract + Binarisierung direkt im Pseudo-Spalten-Block,
und die Intermediates werden gecacht. Zusaetzlich ausfuehrliche Kommentare
zur Zeilenerkennung (detect_row_geometry, _regularize_row_grid).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-Sessions ueberspringen Spaltenerkennung und nutzen stattdessen eine
Pseudo-Spalte ueber die volle Breite. Text wird mit proportionalem
Spacing aus Wort-Positionen rekonstruiert, um raeumliches Layout zu erhalten.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alle Bild-Endpoints (cropped, columns-overlay, rows-overlay,
words-overlay) suchten nur nach cropped/dewarped. Sub-Sessions haben
nur ein original-Bild. Neue Hilfsfunktion _get_base_image_png() mit
Fallback-Kette: cropped > dewarped > original.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spalten-/Zeilen-/Woerter-Erkennung suchen nach cropped_bgr oder
dewarped_bgr. Bei Sub-Sessions existiert nur original_bgr (der
Box-Ausschnitt). Jetzt wird original_bgr automatisch als cropped_bgr
gesetzt, sowohl im Cache-Aufbau als auch bei der Erstellung.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Box-Sub-Sessions haben bereits ein zugeschnittenes Bild. Orientierung,
Begradigung, Entzerrung und Crop werden uebersprungen (skipped).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rote semi-transparente Box-Markierung in allen Overlays (Spalten, Zeilen, Woerter)
- Zeilenerkennung: Combined-Image-Ansatz schliesst Box-Bereiche aus
- Woerter-Erkennung: Zeilen innerhalb von Box-Zonen werden gefiltert
- Sub-Sessions: parent_session_id/box_index in DB-Schema
- POST /sessions/{id}/create-box-sessions erstellt Sub-Sessions aus Box-Regionen
- Session-Info zeigt Sub-Sessions bzw. Parent-Verknuepfung
- Sessions-Liste blendet Sub-Sessions per Default aus
- Rekonstruktion: Fabric-JSON merged Sub-Session-Zellen an Box-Positionen
- Save-Reconstruction routet box{N}_* Updates an Sub-Sessions
- GET /sessions/{id}/vocab-entries/merged fuer zusammengefuehrte Eintraege
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content-Streifen oberhalb/unterhalb von Boxen werden zu einem Bild zusammengefügt,
Spaltenerkennung läuft einmal auf dem kombinierten Bild. Entfernt Step 5c
(suspicion-based gap alignment), da der neue Ansatz das Problem an der Wurzel löst.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vorher wurden alle internen Gaps geprueft, was echte Spaltentrennungen
(EN→DE) faelschlicherweise entfernte. Jetzt werden nur Gaps geprueft,
die eine unverhaeltnismaessig breite rechte Spalte erzeugen wuerden
(>2x Median-Spaltenbreite). Schwelle auf 15% gesenkt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Variable wurde vor ihrer Definition in Step 7 referenziert.
Eigene margin_thresh Variable fuer Step 5c eingefuehrt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Interiore Gaps werden jetzt geprueft: rechts des Gaps muessen
mindestens 25% der Woerter eine gemeinsame linke Kante teilen.
Verhindert falsche Spaltentrennungen innerhalb breiter Spalten
(z.B. Example-Spalte mit kurzen und langen Eintraegen).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Weitere NameError-Probleme vom Modul-Refactoring: beide Symbole
werden in cv_cell_grid.py benutzt, sind aber in cv_ocr_engines.py
definiert und waren nicht importiert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Funktion war nur in cv_review.py definiert, wurde aber auch in
cv_ocr_engines.py und cv_layout.py benutzt — NameError zur Laufzeit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NameError in build_cell_grid_v2 weil _MIN_WORD_CONF nur in
_ocr_cell_crop und build_cell_grid lokal definiert war.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spalten-, Zeilen-, Woerter-Overlay und alle nachfolgenden Steps
(LLM-Review, Rekonstruktion) lesen jetzt image/cropped mit Fallback
auf image/dewarped. Tests fuer page_crop.py hinzugefuegt (25 Tests).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
detect_and_fix_orientation() wird jetzt vor dem Deskew-Schritt in der
OCR-Pipeline ausgefuehrt, sodass 90/180/270°-gedrehte Scans automatisch
korrigiert werden. Frontend zeigt Orientierungskorrektur als Info-Banner.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wenn bereits 2+ breite Content-Spalten existieren, ist das Layout
wahrscheinlich korrekt in EN/DE getrennt. Split wird nur ausgefuehrt
wenn eine einzelne breite Spalte EN+DE kombiniert enthaelt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gaps die den Spaltenrand beruehren (Margins) werden jetzt ausgeschlossen,
nur interne Gaps werden als Split-Kandidaten betrachtet. Behebt das
Problem dass trailing whitespace faelschlich als groesster Gap gewaehlt
wurde. Early-return in _run_ocr_pipeline_for_page gibt jetzt korrekt
([], rotation) statt [] zurueck.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spalten mit <=2 Woertern und <15% Breite werden jetzt als column_marker
statt als content-Spalte klassifiziert. Bei 2 breiten Content-Spalten
wird die rechte als column_example statt column_de gelabelt, da die
linke Spalte EN+DE kombiniert enthaelt.
OSD-Zoom von 1.0 auf 2.0 erhoeht fuer zuverlaessigere Orientierungserkennung.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rotation wird jetzt in upload_pdf_get_info() erkannt, damit Thumbnails
bei der Seitenauswahl bereits richtig herum angezeigt werden.
Debug-Logging fuer _split_broad_columns hinzugefuegt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_split_broad_columns() erkennt EN/DE-Gemisch in breiten Spalten via
Word-Coverage-Analyse und trennt sie am groessten Luecken-Gap.
Thumbnails und Page-Images werden serverseitig per fitz rotiert,
Frontend laedt Thumbnails nach OCR-Processing neu.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OSD erkennt 0/90/180/270° Rotation und korrigiert
automatisch vor dem Deskew. Loest das Problem mit Buchscannern,
bei denen jede 2. Seite auf dem Kopf steht.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shared Funktion positional_column_regions() in cv_vocab_pipeline.py,
wird jetzt von beiden Pfaden (Vocab-Worksheet + OCR Pipeline Admin)
genutzt. classify_column_types() bleibt als Legacy erhalten.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zeigt die ersten 8 Zeichen der Session-ID neben dem Untertitel an,
damit die Session einfach identifiziert und kommuniziert werden kann.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sprachbasiertes Scoring (classify_column_types) verursachte vertauschte
Spalten auf Seite 3 bei Beispielsaetzen mit vielen englischen Funktionswoertern.
Neue _positional_column_regions() ordnet Spalten rein geometrisch (links→rechts)
zu. OCR Pipeline Admin bleibt unveraendert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Woerter aus Sub-Header-Bereichen ueberlappten korrekte Spaltenluecken
und liessen die Word-Validation faelschlich Gaps verwerfen. Jetzt werden
nur Woerter aus dem gewaehlten Segment fuer die Validation verwendet.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Statt full-width Zeilen zu maskieren wird die Seite jetzt an grossen
horizontalen Luecken (Sub-Header, Kapitelgrenzen) in Segmente unterteilt.
Das groesste Segment wird fuer die vertikale Projektion verwendet.
Dadurch stoeren Illustrationen und Ueberschriften nicht mehr.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wenn pixel-basierte Projektion zu wenige Spaltenluecken findet (z.B.
durch Illustrationen/Grafiken die Luecken fuellen), wird jetzt eine
wort-basierte Gap-Detection als Zwischenschritt vor dem Clustering
ausgefuehrt. Tesseract-Wort-BBs sind immun gegen dekorative Grafiken.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Farbige Sub-Header (z.B. "Unit 4: Bonnie Scotland") mit voller Breite
fuellten die Spaltenluecken im vertikalen Projektionsprofil auf und
fuehrten zu 11 statt 5 erkannten Spalten. Zeilen mit >40% Tintendichte
werden jetzt vor der Projektion maskiert.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After iterative projection (pass 1) and word-alignment (pass 2), a third
pass uses Tesseract word positions + linear regression per text line to
measure and correct residual rotation. This catches cases where passes 1-2
leave significant slope (e.g. 1.7° residual on heavily skewed scans).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase iterative deskew coarse_range from ±2° to ±5° to handle
heavily skewed scans
- New deskew_two_pass(): runs iterative projection first, then
word-alignment on the corrected image to detect/fix residual skew
(applied when residual ≥ 0.3°)
- OCR pipeline API auto_deskew now uses deskew_two_pass by default
- Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass
- Deskew result now includes angle_residual and two_pass_debug
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- LLM Compare Seiten, Configs und alle Referenzen geloescht
- Kommunikation-Kategorie in Sidebar mit Video & Chat, Voice Service, Alerts
- Compliance SDK Kategorie aus Sidebar entfernt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Original pages rendered at full resolution (pdf-page-image endpoint, zoom=2.0)
instead of downscaled thumbnails
- Insert-row triangles on left margin between every row (hover to reveal)
- Dynamic extra columns: "+" button in header adds custom columns
(e.g. Aussprache, Wortart), removable via hover-x on column header
- Extra columns stored per-page (pageExtraColumns state) so different
source pages can have different column structures
- Grid template adjusts dynamically based on number of columns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove max-w-7xl constraint on content area so panels stretch to edges
- Fall back to direct API thumbnail URLs when blob URLs are empty
- Original pages now reliably show even if preloaded thumbnails failed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Swap from 3/5-2/5 grid to 1/3-2/3 flexbox (original left, table right)
- Table uses 3 equal 1fr columns for EN/DE/example instead of cramped 13-col grid
- Full viewport height minus header (calc(100vh - 240px)) for more visible rows
- Show only processed pages in original preview (filtered by selectedPages)
- Remove per-row insert buttons to reduce vertical noise
- Compact row spacing (py-1.5) to fit ~15+ rows without scrolling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
process-single-page now runs the full CV pipeline (deskew → dewarp → columns →
rows → cell-first OCR v2 → LLM review) for much better extraction quality.
Falls back to LLM vision if pipeline imports are unavailable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
handleNext() did nothing on the last step (early return). Now resets
session, steps and navigates back to the session overview.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The side-by-side panels used calc(100vh - 380px) pushing the Speichern/
Abschliessen buttons below the viewport. Reduced to calc(100vh - 580px)
and made the action bar sticky at the bottom.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Horizontal projection of binary image is insensitive at 0.5° because
text rows look nearly identical. The real discriminator is vertical edge
alignment: at the correct angle, word left-edges and column borders
become truly vertical, producing sharp peaks in the vertical projection
of Sobel-X edges. Also: BORDER_REPLICATE + trim to avoid artifacts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Variance is insensitive to 0.5° differences. Gradient score (L2 norm of
first derivative) detects sharp text-line transitions much better.
Also: use horizontal profile in both phases, finer coarse step (0.1°).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds deskew_image_iterative() as 3rd deskew method that directly optimizes
for projection-profile sharpness instead of proxy signals (Hough/word alignment).
Coarse sweep on horizontal profile, fine sweep on vertical profile.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "example" to spell correction loop — was only correcting
"english" and "german" fields, missing umlauts in example sentences
- Use "german" language for example field (mixed-language, umlauts needed)
- Disable cell-level bold detection — cannot distinguish bold from
non-bold in mixed-format cells (e.g. "cookie ['kuki]")
- Keep _measure_stroke_width and _classify_bold_cells for future
word-level bold detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bold detection:
- Replace absolute threshold with page-level relative comparison
- Measure stroke width for all cells, then mark cells >1.4× median as bold
- Adapts automatically to font, DPI and scan quality
Save buttons:
- Fix status stuck on 'error' preventing re-click
- Better error messages with response body
- Fallback score to 0 when null
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token
for German text — fixes "iberqueren" → "überqueren" etc.
- Add _detect_bold() using OpenCV stroke-width analysis on cell crops
- Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths
- Add is_bold field to GridCell TypeScript interface
- Render bold text in StepGroundTruth reconstruction view
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prepend /klausur-api prefix to original image URL (nginx proxy)
- Remove colored column background stripes, use white background
- Change cell text color to black instead of per-column-type colors
- Calculate font size dynamically from cell bbox height via ResizeObserver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the stub StepGroundTruth with a full side-by-side Original vs
Reconstruction view. Adds VLM-based image region detection (qwen2.5vl),
mflux image generation proxy, sync scroll/zoom, manual region drawing,
and score/notes persistence.
New backend endpoints: detect-images, generate-image, validate, get validation.
New standalone mflux-service (scripts/mflux-service.py) for Metal GPU generation.
Dockerfile.base: adds fonts-liberation (Apache-2.0).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove _fix_character_confusion() from words endpoint (now only in Phase 0)
- Extend spell checker to find real OCR errors via spell.correction()
- Add field-aware dictionary selection (EN/DE) for spell corrections
- Add _normalize_page_ref() for page_ref column (p-60 → p.60)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Genericity audit findings:
- Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field
is processed, German prefixes were unreachable dead code)
- Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants
- Document _NARROW_COL_THRESHOLD_PCT with empirical rationale
- Document _PAD=3 with DPI context
- Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching
- Reduce all diagnostic logger.info() to logger.debug() in:
_ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets
- Keep only summary-level info logging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fundamentally rearchitect build_cell_grid_v2 to combine the best of
both approaches:
- Broad columns (>15% image width): Use full-page Tesseract word
assignment. Handles IPA brackets, punctuation, sentence flow,
and ellipsis correctly. No garbled phonetics.
- Narrow columns (<15% image width): Use isolated cell-crop OCR
to prevent neighbour bleeding from adjacent broad columns.
This eliminates the need for complex phonetic bracket replacement
on broad columns since full-page Tesseract reads them correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Only process 'english' field for IPA replacement. German and example
fields contain meaningful parenthetical content like (gefrorenes Wasser),
(sich beschweren) that must never be replaced.
- Simplify _is_grammar_bracket_content: only known grammar particles
(with, about/of, sth, etc.) are preserved. Removes the >= 4 chars
heuristic that incorrectly preserved garbled IPA like [breik], [maus].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace _is_meaningful_bracket_content with _is_grammar_bracket_content
that uses a whitelist of grammar particles (with, about/of, auf, etc.)
- Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic
- Strip orphan brackets (no word before them) that are garbled IPA
- Preserve correct IPA (contains Unicode IPA chars) and grammar info
- Fix variable name bug (result → text)
Fixes: break [breik] now correctly replaced, cross (with) preserved,
orphan [mais] and {'mani setva] stripped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract mangles IPA square brackets into curly braces or parentheses
(e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only
matched [...], missing all garbled variants.
- Match any bracket type: [...], {...}, (...) including mixed pairs
- Add _is_meaningful_bracket_content() to preserve legitimate German
prefixes like (zer)brechen and Tanz(veranstaltung)
- Trigger IPA replacement on any bracket character, not just [
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RapidOCR (PaddleOCR) is optimized for full-page scene text and produces
artifacts on small isolated cell crops: extra characters ("Tanz z",
"er r wollte"), missing punctuation, garbled phonetic transcriptions.
Tesseract works much better on isolated binarized crops with upscaling,
which is exactly what cell-first OCR provides. RapidOCR remains available
as explicit engine choice via the dropdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Batch OCR takes 30-60s with 3x upscaling. Without keepalive events,
proxy servers (Nginx) drop the SSE connection after their read timeout.
Now sends keepalive events every 5s to prevent timeout, with elapsed
time for debugging. Also checks for client disconnect between keepalives.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Frontend: retry /words POST once after 2s delay if it gets 400/404,
which happens when navigating via wizard after container restart
(session cache not yet warm).
Backend: log when session needs DB reload and when dewarped_bgr is missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Der Compliance Advisor gehoert ins Compliance SDK (macmini:3007/sdk/agents),
nicht ins Lehrer-Admin. Die verbleibenden 5 Agenten (TutorAgent, GraderAgent,
QualityJudge, AlertAgent, Orchestrator) bleiben erhalten.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Short cell crops (<80px height) are always 3x upscaled for RapidOCR
to improve recognition of periods, ellipsis, and phonetic symbols
- Lowered Det.box_thresh from 0.6 to 0.4 to detect small characters
that were being filtered out (dots, brackets, IPA symbols)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell crops of 35-54px height were too small for RapidOCR to detect
text reliably. Uses _ensure_minimum_crop_size(min_dim=150) for
consistent upscaling across all OCR engines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 3px padding around cell crops to avoid clipping edge characters
(parentheses in "Tanz(veranstaltung)", descenders, etc.)
- Upscale small BGR crops for RapidOCR, same as Tesseract path
- Add info-level diagnostic logging to _ocr_cell_crop for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents first content row from expanding into header area (causing
"ulary" from "VOCABULARY" to appear in DE column) and last content row
from expanding into footer area (causing page numbers to appear as content).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old per-cell streaming timed out because sequential cell OCR was
too slow to send the first event before proxy timeout. Now uses
build_cell_grid_v2 (parallel ThreadPoolExecutor) via run_in_executor,
then streams all cells at once after batch completes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.
Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The HSV-based coloured marker detection caused false positives in
nearly every marker cell. Coloured markers like red "!" are an
extreme edge case — better handled manually in reconstruction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for marker column content (e.g. red "!" marks):
1. Skip _clean_cell_text() noise filter for column_marker — it
requires 2+ consecutive letters, which drops punctuation-only
markers like "!" or "*".
2. For marker columns, detect coloured pixels via HSV saturation
check (S>80) in addition to grayscale darkness. Create a
binarized image where both dark AND saturated pixels become
black foreground, so Tesseract can see red markers that appear
near-white in standard grayscale conversion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a narrow column expands into neighbor space, the neighbor's
boundaries must be adjusted to avoid overlap. After expansion, left
neighbor's right edge and right neighbor's left edge are trimmed to
match the expanded column's new boundaries, with words re-assigned.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-column splits create adjacent columns with 0px gap between them.
The previous expansion only worked with explicit gaps. Now it looks at
where the neighbor's actual words are and claims unused space up to
MIN_WORD_MARGIN (4px) from the nearest word, even if there's no gap
in the column boundaries.
Also added debug logging for expansion input.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After deskew, horizontal text lines are already straight (~0° slope).
Method D was measuring this (always ~0°) instead of the actual vertical
shear (column edge drift). This caused it to report 0.112° with 0.96
confidence, overwhelming Method A's correct detection of negative shear.
New Method D groups words by X-position into vertical columns, then
measures how left-edge X drifts with Y position via linear regression.
dx/dy = tan(shear_angle), directly measuring column tilt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The narrow column expansion was running inside detect_column_geometry()
on the 4 main columns, but the narrowest columns (marker ~14px,
page_ref ~93px) are created AFTERWARDS by _detect_sub_columns().
Extracted expand_narrow_columns() as standalone function and call it
after sub-column splitting in the columns API endpoint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for edge case where residual shear pushes content out of
narrow columns (marker, page_ref):
1. Column expansion (Step 10): After detection, narrow columns (<10%
content width) expand into adjacent whitespace gaps, claiming up to
40% of the gap but never past the nearest word in the neighbor
column. This gives marker/page_ref columns breathing room.
2. Dewarp sensitivity: Lower minimum angle from 0.15° to 0.08°, lower
ensemble min confidence from 0.5 to 0.35, lower final threshold
from 0.5 to 0.4, and skip quality gate for small corrections
(<0.5°) where projection variance change is negligible.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The detections array was empty when shear was below threshold, hiding
all 4 method results from the frontend Details panel.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DewarpDetection type with per-method results
- Expand method labels for all 4 detectors (A-D)
- Show green/amber banner: applied vs quality-gate-rejected
- Expandable "Details" panel showing all 4 methods with confidence bars
- Visual confidence bars instead of plain percentage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace setBackgroundImage() with backgroundImage property (v6 breaking change)
- Replace setWidth/setHeight with Canvas constructor options
- Fix opacity handler to use direct property access
- Update CLAUDE.md: use git -C and docker compose -f instead of cd
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Rename Step 6 label to "Korrektur" (was "OCR-Zeichenkorrektur")
2. Move _fix_character_confusion from pipeline Step 1 into
llm_review_entries_streaming so corrections are visible in the UI:
char changes (| → I, 1 → I, 8 → B) are now emitted as a batch event
right after the meta event, appearing in the corrections list
3. StepReconstruction: all cells (including empty) are now rendered as
editable inputs — removed filter that hid empty cells from the editor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
content_roi was cropped to [left_x:right_x] — the detected content boundary.
Words at the right edge of the last column (beyond right_x) were never
found in the initial scan, so they remained missing even after the column
geometry was extended to full image width (w).
Fix: crop to [left_x:w] so all words including those near the right margin
are detected and assigned correctly to the last column.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_CHAR_CONFUSION_RULES: standalone "1" → "I" now skips "1." and "1,"
Cross-language fallback rule: same lookahead (?![\d.,]) added
Fixes: "cross = 1. Kreuz" being converted to "cross = I. Kreuz" in Step 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
right_x is the detected content boundary, which can still be several
pixels short of actual text near the page margin. Since the page margin
contains only white space, extending the last column's OCR crop to the
full image width (w) is always safe and prevents right-edge text cutoff.
Affects three locations in detect_column_geometry():
- Word count logging loop
- ColumnGeometry boundary building (Step 8)
- Phantom filter boundary adjustment (Step 9)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phantom column fix:
Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns
(< 3% of content width) with 0 words. These are scan artefacts, not
real columns. New Step 9 in detect_column_geometry():
- Filter columns where width < max(20px, 3% content_w) AND words < 3
- After filtering, extend each remaining column to close the gap with
its right neighbor, and re-assign words to correct column
Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px
eliminated; neighbors expanded to cover the gap)
UI rename:
- 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur'
- 'LLM-Korrektur starten' → 'Zeichenkorrektur starten'
- Error message updated accordingly
(No LLM involved anymore — spell-checker is the active engine)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new functions:
- _is_artifact_row(): marks rows as artifacts if all detected tokens
are single characters (scanner shadows produce dots/dashes, not words).
A real vocabulary row always contains at least one 2+ char word.
- _heal_row_gaps(): after removing empty/artifact rows, expands each
remaining content row to the midpoint of adjacent gaps, so OCR crops
are not artificially narrow. First row extends to content top_bound;
last row to content bottom_bound.
Applied in both build_cell_grid() and build_cell_grid_streaming() after
the word_count>0 filter and before OCR.
Addresses cases like:
- Row 21: scan shadow → single-char artifacts → filtered before OCR
- Row 23: completely empty (word_count=0) → already filtered
- Row 22: real content → now expanded upward/downward to fill the space
that rows 21 and 23 occupied, giving OCR the correct full height
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously detect_column_geometry() ended the last column at the start
of the detected right-margin gap (left_x + right_boundary), which could
cut into actual text near the right edge of the Example column.
Since only the page margin lies to the right of the last column, the
rightmost column now always extends to right_x regardless of whether
a right-margin gap was detected. This prevents OCR crops from missing
words at the right edge of wide columns like column_example.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup
- New spell_review_entries_sync() + spell_review_entries_streaming():
- Dictionary-backed substitution: checks if corrected word is known
- Structural rule: digit at pos 0 + lowercase rest → most likely letter
(e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld")
- Pattern rule: "|." → "1." for numbered list prefixes
- Standalone "|" → "I" (capital I)
- IPA entries still protected via existing _entry_needs_review filter
- Headings/untranslated words (e.g. "Story") are untouched (no susp. chars)
- llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE
env var ("spell" default, "llm" to restore previous behaviour)
- docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell}
- LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend _OCR_CHAR_MAP to treat '|' as a possible misread of digit '1'
in addition to letters l/L/i/I. Fixes cases like 'cross = |. Kreuz'
→ 'cross = 1. Kreuz' (numbered list prefix) being rejected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Die UI nutzt llm_review_entries_streaming, nicht llm_review_entries.
Die Streaming-Version hatte kein think:false → qwen3:0.6b verbrachte
9 Sekunden im Denkprozess ohne Token-Budget für die eigentliche Antwort.
- think: false in Streaming-Version ergänzt
- num_predict: 4096 → 8192 (konsistent mit nicht-streaming)
- Logging für batch-Fortschritt, Response-Länge, geparste Einträge
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log zeigte: "Invalid control character at: line 28 column 27"
Das Pipe-Zeichen | in OCR-Texten (z.B. "| want" statt "I want")
bricht den JSON-Parser wenn es als Literal im LLM-Response steht.
Fixes:
- _sanitize_for_json(): entfernt ASCII Control-Chars 0x00-0x1f
(außer Tab/LF/CR die in JSON valid sind)
- | → I als erlaubte OCR-Korrektur in _is_spurious_change und Prompt
- Reverse-Check in _is_spurious_change (l→I etc.)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Der digit-in-word Pre-Filter hat alle 41 Einträge geblockt (skipped=41
im Log). OCR-Fehler können nicht im voraus erkannt werden.
Zurück zum ursprünglichen Ansatz: alle nicht-leeren Einträge ohne
IPA-Klammern werden ans LLM gesendet. Schutz gegen Übersetzungen
erfolgt ausschließlich über den strikten Prompt und _is_spurious_change().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- think: false in Ollama API Request (qwen3 disables CoT nativ)
- <think>...</think> Stripping in _parse_llm_json_array (Fallback falls
think:false nicht greift)
- INFO-Logging: wie viele Einträge gesendet werden, Response-Länge,
Anzahl geparster Einträge
- DEBUG-Logging: erste 3 Eingabe-Einträge, ersten 500 Zeichen der Antwort
- Bessere Fehlermeldung wenn JSON-Parsing fehlschlägt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Problem
qwen3:0.6b interpretierte den Prompt zu weit und versuchte:
- Englische Wörter zu übersetzen (EN-Spalte umschreiben)
- Korrekte deutsche Wörter neu zu übersetzen
- IPA-Einträge in Klammern zu 'korrigieren'
## Fixes
### 1. Strengerer Pre-Filter (entry_needs_review)
Sendet jetzt NUR Einträge ans LLM, die tatsächlich ein
Ziffer-in-Wort-Muster haben (0158 zwischen Buchstaben).
→ Korrekte Einträge werden gar nicht erst gesendet.
### 2. Viel restriktiverer Prompt
- Explizites Verbot: "du übersetzt NICHTS, weder EN→DE noch DE→EN"
- Nur die 5 Ziffer→Buchstaben-Fälle sind erlaubt
- Konkrete Beispiele für erlaubte Korrekturen
- Kein vager "Im Zweifel nicht ändern" — sondern explizites VERBOTEN
### 3. Stärkerer Spurious-Change-Filter
Verwirft LLM-Änderungen, die keine Ziffer→Buchstabe-Substitution sind.
Verhindert Übersetzungen und Neuformulierungen auch wenn der Prompt
sie nicht vollständig unterdrückt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The try/except block for the deskew step had 4 extra spaces of
indentation from a previous edit. Python rejected the file with
IndentationError at startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Word-to-column assignment now uses overlap-based matching instead of
center-point matching. This fixes narrow page_ref columns losing
their last digit (e.g. "p.59" → "p.5") when the digit's center
falls slightly past the midpoint boundary into the next column.
2. Post-OCR empty row filter: rows where ALL cells have empty text
are removed after OCR. This catches inter-row gaps that had stray
Tesseract artifacts giving word_count > 0 but no actual content.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add marker/bbox_marker fields to WordEntry type
- Add page_ref/column_marker colors to StepReconstruction
- Make StepLlmReview table dynamic based on columns_used metadata,
showing all detected columns (EN, DE, Example, page_ref, marker)
instead of hardcoded EN/DE/Beispiel only
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Decouple display bbox from OCR crop region. Display bbox now uses exact
col.x/row.y/col.width/row.height (no padding), so adjacent cells touch
without gaps. OCR crop keeps 4px internal padding for edge character
detection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic
table driven by columns_used from backend. Labeling, confirmation, counts,
and summary badges are now all cell-based instead of branching on isVocab.
Backend: Change _cells_to_vocab_entries() entry filter from checking only
english/german/example to checking ANY mapped field. This preserves rows
with only marker or source_page content, fixing the issue where marker
sub-columns disappeared at the end of OCR processing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes for sub-columns disappearing at end of streaming:
1. Backend: add column_marker mapping in _cells_to_vocab_entries()
so marker text is included in vocab entries (not silently dropped)
2. Frontend types: add source_page and bbox_ref to WordEntry interface
3. Frontend table: show page_ref column (Seite) in vocab table when
entries have source_page data, instead of only EN/DE/Example
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add is_sub_column flag to ColumnGeometry. Sub-columns created by
_detect_sub_columns() are now exempt from the edge-column word_count<8
rule that converts them to column_ignore.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The streaming word endpoint excluded page_ref from _skip_types,
causing sub-column splits to be lost in the meta event and final
grid_shape. Aligned _skip_types with build_cell_grid_streaming().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Header/footer words (page numbers, chapter titles) could pollute the
left-edge alignment bins and trigger false sub-column splits. Now
_detect_header_footer_gaps() runs early and its boundaries are passed
to _detect_sub_columns() to filter those words from clustering and
the split threshold check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Word 'left' values in ColumnGeometry.words are relative to the content
ROI (left_x), but geo.x is in absolute image coordinates. The split
position was computed from relative word positions and then compared
against absolute geo.x, resulting in negative widths and no splits on
real data. Pass left_x through to _detect_sub_columns to bridge the
two coordinate systems.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace gap-based splitting with alignment-bin approach: cluster word
left-edges within 8px tolerance, find the leftmost bin with >= 10% of
words as the true column start, split off any words to its left as a
sub-column. This correctly handles both page references ("p.59") and
misread exclamation marks ("!" → "I") even when the pixel gap is small.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Detects hidden sub-columns (e.g. page references like "p.59") within
already-recognized columns by clustering word left-edge positions and
splitting when a clear minority cluster exists. The sub-column is then
classified as page_ref and mapped to VocabRow.source_page.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gaps that extend to the image boundary (top/bottom edge) are not valid
content separators — they typically represent dewarp padding. Only gaps
with content on both sides qualify as header/footer boundaries.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The inverted image can be taller than img_h after dewarp shear
correction, causing footer_y to be detected outside the visible page.
Now clamps the horizontal projection to actual_h = min(inv.shape[0], img_h).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Check for actual ink content in detected top/bottom regions:
- 'header'/'footer' when text is present (e.g. title, page number)
- 'margin_top'/'margin_bottom' when the region is empty page margin
Also update all skip-type sets and color maps for the new types.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the trivial top_y/bottom_y threshold check with horizontal
projection gap analysis that finds large whitespace gaps separating
header/footer content from the main body. This correctly detects
headers (e.g. "VOCABULARY" banners) and footers (page numbers) even
when _find_content_bounds includes them in the content area.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The ocr_pipeline_api.py code path called classify_column_types without
left_x/right_x, so margin regions were never created. Also add logging
to _build_margin_regions for debugging.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thin black lines (1-5px) at page edges from scanning were incorrectly
detected as content, shifting content bounds and creating spurious
IGNORE columns. This filters narrow projection runs (<1% of image
dimension) and introduces explicit margin_left/margin_right regions
for downstream page reconstruction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
(lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
any step, with reprocess button on completed pipeline steps
Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Tokens ending with ] (e.g. "serva]") were stripped by the noise
filter because ] was not in the allowed punctuation list.
2. Rows containing only phonetic transcription (e.g. ['mani serva])
are now merged into the previous vocab entry instead of creating
a separate (invalid) entry. This prevents the LLM from trying
to "correct" phonetic fragments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The noise filter was stripping words containing hyphens, parentheses,
slashes, and dots (e.g. "money-saver", "Schild(chen)", "(Salat-)Gurke",
"Tanz(veranstaltung)"). Now strips all common dictionary punctuation
before checking for internal noise characters.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace containment-with-padding approach with midpoint-based column
ranges. For adjacent columns, the assignment boundary is the midpoint
between them (Voronoi-style). This prevents padding overlap where words
near column borders (e.g. "We" at the start of example sentences) were
assigned to the preceding column. The last column extends generously to
capture all rightmost text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_is_noise_tail_token() treated words with unbalanced parentheses like
"selbst)" or "(wir" as OCR noise because the parenthesis counted as
"internal noise". Now strips leading/trailing parentheses before the
noise check, so legitimate words in example sentences like
"We baked ... (wir ... selbst)" are preserved.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Word assignment: Replace nearest-center-distance with containment-first
strategy. Words whose center falls within a column's bounds (+ 15% pad)
are assigned to that column before falling back to nearest-center. This
fixes long example sentences losing their rightmost words to adjacent
columns.
LLM review: Strengthen prompt to explicitly forbid changing proper nouns,
place names, and correctly-spelled words. Add _is_spurious_change()
post-filter that rejects case-only changes and hallucinated word
replacements (< 50% character overlap).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
StepLlmReview: Show full vocab table with image overlay, row-level
status tracking (pending/active/reviewed/corrected/skipped), and
auto-scroll during SSE streaming. Load previous results on mount.
StepReconstruction: New step 7 with editable text fields at original
bbox positions over dewarped image. Zoom controls, tab navigation,
color-coded columns, save to backend.
Backend: Add POST /sessions/{id}/reconstruction endpoint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Stream LLM review results batch-by-batch (8 entries per batch) via SSE
- Frontend shows live progress bar, batch log, and corrections appearing
- Skip entries with IPA phonetic transcriptions (already dictionary-corrected)
- Refactor llm_review_entries into reusable helpers for both streaming and non-streaming paths
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add /no_think tag to prompt (qwen3 thinking mode causes massive slowdown)
- Increase httpx timeout from 120s to 300s for large vocab tables
- Improve error logging with traceback and exception type
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the placeholder "Koordinaten" step with an LLM review step that
sends vocab entries to qwen3:30b-a3b via Ollama for OCR error correction
(e.g. "8en" → "Ben"). Teachers can review, accept/reject individual
corrections in a diff table before applying them.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _KNOWN_ABBREVIATIONS set with ~150 common EN/DE abbreviations
(sth, sb, etc, eg, ie, usw, bzw, vgl, adj, adv, prep, sg, pl, ...).
Tokens matching known abbreviations are never stripped as noise.
Also handle dotted abbreviations (e.g., z.B., i.e.) that have no
2+ consecutive alpha chars by checking the abbreviation set before
the _RE_REAL_WORD filter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _clean_cell_text() with three sub-filters to remove OCR noise:
- _is_garbage_text(): vowel/consonant ratio check for phantom row garbage
- _is_noise_tail_token(): dictionary-based trailing noise detection
- _RE_REAL_WORD check for cells with no real words (just fragments)
Handles balanced parentheses "(auf)" and trailing hyphens "under-"
as legitimate tokens while stripping noise like "Es)", "3", "ee", "B".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two generic noise filters added to _ocr_single_cell():
1. Word confidence filter (conf < 30): removes low-confidence words
before text assembly. Catches trailing artifacts like "Es)" after
real text, and standalone noise from image edges.
2. Cell noise filter: clears cells whose entire text has no real
alphabetic word (>= 2 letters). Catches fragments like "E:", "3",
"u", "D", "2.77", "and )" from image areas, while keeping real
short words like "Ei", "go", "an".
Both filters apply to word-lookup AND cell-OCR fallback results.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cell grid IS the result. Each cell stays at its detected position.
Removed _split_comma_entries and _attach_example_sentences from the
pipeline — they were shuffling content between rows/columns, causing
"Mäuse" to appear in a separate row, "stand..." to move to Example,
and "Ei" to disappear.
Now: cells → _cells_to_vocab_entries (1:1 row mapping) →
_fix_character_confusion → _fix_phonetic_brackets → done.
Also lowered pixel-density threshold from 2% to 0.5% for the cell-OCR
fallback so small text like "Ei" is not filtered out.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three non-generic solutions replaced with universal heuristics:
1. Cell-OCR fallback: instead of restricting to column_en/column_de,
now checks pixel density (>2% dark pixels) for ANY column type.
Truly empty cells are skipped without running Tesseract.
2. Example-sentence detection: instead of checking for example-column
text (worksheet-specific), now uses sentence heuristics (>=4 words
or ends with sentence punctuation). Short EN text without DE is
kept as a vocab entry (OCR may have missed the translation).
3. Comma-split: re-enabled with singular/plural detection. Pairs like
"mouse, mice" / "Maus, Mäuse" are kept together. Verb forms like
"break, broke, broken" are still split into individual entries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs in the post-processing pipeline were overwriting correct
streaming results with wrong ones:
1. _split_comma_entries was splitting "Maus, Mäuse" into two separate
entries. Disabled — word forms belong together.
2. _attach_example_sentences treated "Ei" (2 chars) as OCR noise due
to `len(de) > 2` threshold. Lowered to `len(de) > 1`.
3. _attach_example_sentences wrongly classified rows with EN text but
no DE (like "stand ...") as example sentences, merging them into
the previous entry. Now only treats rows as examples if they also
have no text in the example column.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip Tesseract fallback for column_example cells which are often
legitimately empty. This reduces ~48 Tesseract calls to ~10,
cutting Step 5 fallback time from ~13s to ~3s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Next.js was producing the same chunk hash across builds, causing
browsers to serve stale cached JS even after redeployment.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Word-lookup from full-page Tesseract is fast but can miss small or
isolated words (e.g. "Ei"). Now falls back to per-cell Tesseract OCR
for cells that remain empty after word-lookup. The ocr_engine field
reports 'cell_ocr_fallback' for cells that needed the fallback.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Word-lookup is now ~0.03s (vs seconds with per-cell Tesseract), so
always re-run detection when entering Step 5 instead of showing
potentially stale cached word_result from the session DB.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace per-cell word filtering (which allowed the same word to appear in
multiple columns due to padded overlap) with exclusive nearest-center
assignment. Each word is assigned to exactly one column per row.
Also use row height as Y-tolerance for text assembly so words within
the same row (e.g. "Maus, Mäuse") are always grouped on one line.
Fixes: words leaking into wrong columns, missing words, duplicate words.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The row_result stored in DB excludes words to keep payload small.
When Step 5 reconstructs RowGeometry from DB, words were empty,
causing word-lookup to find nothing and return blank cells.
Now re-populates row.words from cached _word_dicts (or re-runs
detect_column_geometry if cache is cold) before cell grid building.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace per-cell Tesseract re-runs with lookup of pre-existing full-page
words from row.words. Words are filtered by X-overlap with column bounds.
This fixes phantom rows with garbage text, missing last words, and
incomplete example text by using the more reliable full-page OCR results.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rows in inter-line whitespace gaps have no Tesseract words assigned but
were still processed by build_cell_grid, producing garbage OCR output.
Filter these phantom rows using the word_count field set during Step 4.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cells now appear one-by-one in the UI as they are OCR'd, with a live
progress bar, instead of waiting for the full result.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract build_cell_grid() as layout-agnostic foundation from
build_word_grid(). Step 5 now produces a generic cell grid (columns x
rows) and auto-detects whether vocab layout is present. Frontend
dynamically switches between vocab table (EN/DE/Example) and generic
cell table based on layout type.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The validation that rejected word-center grid when it produced more rows
than gap-based detection was causing fallback to gap-based rows (large
boxes). The word-center grid regularization works correctly after the
center-based grouping and cluster merging fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Grid validation: reject word-center grid if it produces MORE rows
than gap-based detection (more rows = lines were split = worse).
Falls back to gap-based rows in that case.
2. Words overlay: draw clean grid cells (column × row intersections)
instead of padded entry bboxes. Eliminates confusing double lines.
OCR text labels are placed inside the grid cells directly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The y_tolerance for word-center clustering was based on median word
height (21px → 12px tolerance), which was too small. Words on the
same line can have centers 15-20px apart due to different heights.
Now uses 40% of the gap-based median row height as tolerance (e.g.
40px row → 16px tolerance), and 30% for merge threshold. This
produces correct cluster counts matching actual text lines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When columns change (Step 3), invalidate row_result and word_result.
When rows change (Step 4), invalidate word_result.
This ensures Step 5 always uses the latest row boundaries instead of
showing stale cached word_result from a previous run.
Applies to both auto-detection and manual override endpoints.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix half-height rows caused by tall special characters (brackets, IPA
symbols) being split into separate line clusters:
- Group words by vertical CENTER instead of TOP position, so tall
characters on the same line stay in one cluster
- Filter outlier-height words (>2× median) when computing letter_h
so brackets/IPA don't skew the row height
- Merge clusters closer than 0.4× median word height (definitely
same text line despite slight center differences)
- Increased y_tolerance from 0.5× to 0.6× median word height
- Enhanced logging with cluster merge count and row height range
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace rigid uniform grid with bottom-up approach that derives row
boundaries from word vertical centers:
- Group words into line clusters, compute center_y per cluster
- Compute pitch (distance between consecutive centers)
- Detect section breaks where gap > 1.8× median pitch
- Place row boundaries at midpoints between consecutive centers
- Per-section local pitch adapts to heading/paragraph spacing
- Validate ≥85% word placement, fallback to gap-based rows
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _split_oversized_rows() with _regularize_row_grid(). When ≥60%
of content rows have consistent height (±25% of median), overlay a
uniform grid with the standard row height over the entire content area.
This leverages the fact that books/vocab lists use constant row heights.
Validates grid by checking ≥85% of words land in a grid row. Falls back
to gap-based rows if heights are too irregular or words don't fit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement _split_oversized_rows() in detect_row_geometry() (Step 7) to
split content rows >1.5× median height using local horizontal projection.
This produces correctly-sized rows before word OCR runs, instead of
working around the issue in Step 5 with sub-cell splitting hacks.
Removed Step 5 workarounds: _split_oversized_entries(), sub-cell
splitting in build_word_grid(), and median_row_h calculation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For cells taller than 1.5× median row height, split vertically into
sub-cells and OCR each separately. This fixes RapidOCR losing text
at the bottom of tall cells (e.g. "floor/Fußboden" below "egg/Ei"
in a merged row). Generic fix — works for any oversized cell.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds entries for all regulation codes in REGULATIONS_IN_RAG that were
missing from RAG_PDF_MAPPING, fixing "Kein PDF-Mapping" messages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Integrate Britfone dictionary (MIT, 15k British English IPA entries)
- Add pronunciation parameter: 'british' (default) or 'american'
- British uses Britfone (Received Pronunciation), falls back to CMU
- American uses eng_to_ipa/CMU, falls back to Britfone
- Frontend: dropdown to switch pronunciation, default = British
- API: ?pronunciation=british|american query parameter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Docker container cannot reach Google Fonts, causing build failures.
Switch to bundled local font file using next/font/local.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Semantic example matching: instead of attaching example sentences
to the immediately preceding entry, find the vocab entry whose
English word(s) appear in the example. "a broken arm" → matches
"broken" via word overlap, not "egg/Ei". Uses stem matching for
word form variants (break/broken share stem "bro").
2. Cell padding: add 8px padding to each cell region so words at
column/row edges don't get clipped by OCR (fixes "er wollte"
missing at cell boundaries).
3. Treat very short DE text (≤2 chars) as OCR noise, not real
translation — prevents false positives in example detection.
All fixes are generic and deterministic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allows viewing chunks side-by-side with original PDF in fullscreen mode
for large screen QA review. Toggle via button or close with Escape key.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default collection changed from bp_compliance_gesetze (DE/AT/CH laws where
PDFs need manual download) to bp_compliance_ce (EU regulations where PDFs
are auto-downloaded). Added HEAD request check so missing PDFs show a clear
"PDF nicht vorhanden" message instead of a 404 in the iframe.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 4 post-processing steps after OCR (no LLM needed):
1. Character confusion fix: I/1/l/| correction using cross-language
context (if DE has "Ich", EN "1" → "I")
2. IPA dictionary replacement: detect [phonetics] brackets, look up
correct IPA from eng_to_ipa (MIT, 134k words) — replaces OCR'd
phonetic symbols with dictionary-correct transcription
3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen"
→ 3 individual entries when part counts match
4. Example sentence attachment: rows with EN but no DE translation
get attached as examples to the preceding vocab entry
All fixes are deterministic and generic — no hardcoded word lists.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Root container uses calc(100vh - 220px) for fixed viewport height
- All flex children use min-h-0 to enable proper overflow scrolling
- Removed duplicate bottom nav buttons (Zurueck/Weiter) that appeared
in the middle of the chunk text — navigation is only in the header now
- Chunk text panel scrolls internally with fixed header
- Added prominent article/section badges in header and panel header
- Added chunk length quality indicator (warns on very short/long chunks)
- Structural metadata keys (article, section, pages) sorted first
- Sidebar shows regulation name instead of code for better readability
- PDF viewer uses pages metadata from payload when available
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Preserve \n between visual lines within cells (instead of joining with space)
- Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden)
- Split oversized rows (>1.5× median height) into sub-entries when EN/DE
line counts match — deterministic fix for missed Step 4 row boundaries
- Frontend: render \n as <br/>, use textarea for multiline editing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Qdrant collections use regulation_id (e.g. eu_2016_679) as the filter key,
not regulation_code (e.g. GDPR). Updated rag-constants.ts with correct qdrant_id
mappings from actual Qdrant data, fixed API to filter on regulation_id, and updated
ChunkBrowserQA to pass qdrant_id values.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch to PP-OCRv5 Latin model (supports ä, ö, ü, ß)
- Use SERVER model for better accuracy
- Lower Det.unclip_ratio 1.6→1.3 to reduce word merging
- Raise Det.box_thresh 0.5→0.6 for stricter detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New ChunkBrowserQA component replaces inline chunk browser with:
- Document sidebar with live chunk counts per regulation (batched Qdrant count API)
- Sequential chunk navigation with arrow keys (1/N through all chunks of a document)
- Overlap display showing previous/next chunk boundaries (amber-highlighted)
- Split-view with original PDF via iframe (estimated page from chunk index)
- Adjustable chunks-per-page ratio for PDF page estimation
Extracts REGULATIONS_IN_RAG and REGULATION_INFO to shared rag-constants.ts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to
correctly order words in multi-line cells (fixes word reordering bug).
RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX
Runtime, native ARM64). Engine selectable via dropdown in UI or
?engine= query param. Auto mode prefers RapidOCR when available.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Document Chunk-Browser tab functionality and API
- Cover scroll endpoint, text search, pagination
- Document Originalquelle links and low-chunk warnings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PSM 7 (single line) missed the second line in cells with two lines.
PSM 6 handles multi-line content. Also fix sort order to Y-then-X
for correct reading order.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New 'Chunk-Browser' tab for sequential chunk browsing
- Qdrant scroll API proxy (scroll + collection-count actions)
- Pagination with prev/next through all chunks in a collection
- Text search filter with highlighting
- Click to expand chunk and see all metadata
- 'In Chunks suchen' button now navigates to Chunk-Browser with correct collection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: build_word_grid() intersects column regions with content rows,
OCRs each cell with language-specific Tesseract, and returns vocabulary
entries with percent-based bounding boxes. New endpoints: POST /words,
GET /image/words-overlay, ground-truth save/retrieve for words.
Frontend: StepWordRecognition with overview + step-through labeling modes,
goToStep callback for row correction feedback loop.
MkDocs: OCR Pipeline documentation added.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add REGULATION_SOURCES map with 88 original document URLs for all
regulations (EUR-Lex, gesetze-im-internet.de, RIS, Fedlex, etc.)
- Render "Originalquelle →" link in regulation detail panel
- Add amber warning indicator for suspiciously low chunk counts (<10)
- Add EU_IFRS_DE, EU_IFRS_EN, EFRAG_ENDORSEMENT to RAG tracking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build a word-coverage mask so only pixels near Tesseract word bounding
boxes contribute to the horizontal projection. Image regions (high ink
but no words) are treated as white, preventing illustrations from
merging multiple vocabulary rows into one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Insert rows step between columns and words in the pipeline wizard.
Shows overlay image, row list with type badges, and ground truth controls.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Step 4 (row detection) between column detection and word recognition.
Uses horizontal projection profiles + whitespace gaps (same method as columns).
Includes header/footer classification via gap-size heuristics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New regulations across bp_compliance_ce (11), bp_compliance_gesetze (31),
and bp_compliance_datenschutz (1). Collection totals updated:
gesetze 58304, ce 18183, datenschutz 2448, total 103912.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now uses vertical projection profiles to find whitespace
gaps between columns, then validates gaps against word bounding boxes to
prevent splitting through words. Old clustering algorithm extracted as
fallback (_detect_columns_by_clustering) for pages with < 2 detected gaps.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Local variables named 'isInRag' shadowed the outer function, causing
"isInRag is not a function" error. Renamed to regInRag/codeInRag.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunks column now uses getKnownChunks() instead of API-based getRegulationChunks()
- Status column uses isInRag() check (green/red) instead of ratio-based calculation
- Key Intersections chips show green/red with checkmark/cross based on RAG status
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce left-side threshold from 35% to 20% of content width
- Strong language signal (eng/deu > 0.3) now prevents page_ref assignment
- Increase column_ignore word threshold from 3 to 8 for edge columns
- Apply language guard to Level 1 and Level 2 classification
Fixes: column with deu=0.921 was misclassified as page_ref because
reference score check ran before language analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add REGULATIONS_IN_RAG Set tracking all 42 regulations currently in Qdrant
- Add 4 new regulation entries: E-Commerce-RL, Verbraucherrechte-RL,
Digitale-Inhalte-RL, DMA (all ingested Feb 2026)
- Add RAG column to regulations table with green check/red x indicators
- Update Landkarte tab: green/x on industry cards, thematic clusters,
and regulation matrix
- Replace old "Integrated Regulations" section with full RAG coverage overview
- Update hardcoded chunk counts (Templates: 7689, NiBiS: 7996)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address 5 weaknesses found via ground-truth comparison on session df3548d1:
- Add column_ignore for edge columns with < 3 words (margin detection)
- Absorb tiny clusters (< 5% width) into neighbors post-merge
- Restrict page_ref to left 35% of content area across all 3 levels
- Loosen marker thresholds (width < 6%, words <= 15) and add strong
marker score for very narrow non-edge columns (< 4%)
- Add EN/DE position tiebreaker when language signals are both weak
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Side-by-side view: auto result (readonly) vs GT editor where teacher
draws correct columns. Diff table shows Auto vs GT with IoU matching.
GT data persisted per session for algorithm tuning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-alignments within a column (indented words, etc.) were 60-90px apart
and not getting merged at 3%. On a typical 5-col page (~1500px), 6% = ~90px
merges sub-alignments while keeping real column boundaries (~300px) separate.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clusters now track Y-positions of their words and filter by vertical
coverage (>=30% primary, >=15%+5words secondary) to reject noise from
indentations or page numbers. Merge distance widened to 3% content width.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ManualColumnEditor now uses grid-cols-2 layout (image left, controls right)
matching the normal view size so the image doesn't zoom in
- StepColumnDetection only runs auto-detection when no cached result exists;
revisiting step 3 loads cached columns without re-running detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ersetzt hardcodierte Positionsregeln durch ein zweistufiges System:
Phase A erkennt Spaltengeometrie (Clustering), Phase B klassifiziert
Typen per Inhalt (Sprache/Rolle) mit 3-stufiger Fallback-Kette.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace projection-profile layout analysis with Tesseract word bounding
box clustering to detect 5-column vocabulary layouts (page_ref, EN, DE,
markers, examples). Falls back to projection profiles when < 3 clusters.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sessions werden jetzt in PostgreSQL gespeichert statt in-memory.
Neue Session-Liste mit Name, Datum, Schritt. Sessions ueberleben
Browser-Refresh und Container-Neustart. Step 3 nutzt analyze_layout()
fuer automatische Spaltenerkennung mit farbigem Overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old displacement-map approach shifted entire rows by a parabolic
profile, creating a circle/barrel distortion. The actual problem is
a linear vertical shear: after deskew aligns horizontal lines, the
vertical column edges are still tilted by ~0.5°.
New approach:
- Detect shear angle from strongest vertical edge slope (not curvature)
- Apply cv2.warpAffine shear to straighten vertical features
- Manual slider: -2.0° to +2.0° in 0.05° steps
- Slider initializes to auto-detected shear angle
- Ground truth question: "Spalten vertikal ausgerichtet?"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old -3.0 to +3.0 scale multiplied the full displacement map (up to ~79px)
directly, causing extreme distortion at values >1. New slider:
- 0% = no correction
- 100% = auto-detected correction (default)
- 200% = double correction
- Step size: 5%
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OCR + 70 Debian packages + pip dependencies are now in a
separate base image (klausur-base:latest) that is built once and reused.
A --no-cache build now only rebuilds the code layer (~seconds) instead
of re-downloading 33 MB of system packages (~9 minutes).
Rebuild base when requirements.txt or system deps change:
docker build -f klausur-service/Dockerfile.base -t klausur-base:latest klausur-service/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix dewarp method selection: prefer methods with >5px curvature over
higher confidence (vertical_edge 79px was being ignored for text_baseline 2px)
- Add grid overlay on left image in Dewarp step for side-by-side comparison
- Add GET /sessions/{id} endpoint to reload session data
- StepDeskew accepts sessionId prop to restore state when navigating back
- SessionInfo type extended with optional deskew_result and dewarp_result
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Korrekt ausgerichtet?" → "Rotation korrekt?" mit Hinweis,
dass Woelbung/Verzerrung im naechsten Schritt korrigiert wird.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**/ocr_pipeline_auto_steps.py | owner=klausur | reason=run_auto is a single async generator yielding SSE events across 6 steps (528 LOC) | review=2026-10-01
**/ocr/pipeline/auto_steps.py | owner=klausur | reason=Same file moved to ocr/ package | review=2026-10-01
# Legacy — TEMPORAER bis Refactoring abgeschlossen
# Dateien hier werden Phase fuer Phase abgearbeitet und entfernt.
# KEINE neuen Ausnahmen ohne [guardrail-change] Commit-Marker!
purpose="Verwalten Sie die vast.ai GPU-Instanzen fuer LLM-Verarbeitung und OCR. Starten/Stoppen Sie GPUs bei Bedarf und ueberwachen Sie Kosten in Echtzeit."
purpose="Vergleichen Sie Antworten verschiedener KI-Provider (OpenAI, Claude, Self-hosted) fuer Qualitaetssicherung. Optimieren Sie Parameter und System Prompts fuer beste Ergebnisse. Standalone-Werkzeug ohne direkten Datenfluss zur KI-Pipeline."
description="Das TrOCR-Modell von Microsoft ist speziell fuer Handschrifterkennung trainiert. Es verwendet eine Vision-Transformer (ViT) Architektur fuer Bildverarbeitung und einen Text-Decoder fuer die Textgenerierung."
description="LoRA fuegt kleine, trainierbare Matrizen zu bestimmten Schichten hinzu, ohne das Basismodell zu veraendern. Dies ermoeglicht effizientes Fine-Tuning mit minimaler Speichernutzung."
specs={[
{label:'Methode',value:'Low-Rank Adaptation'},
{label:'Adapter-Groesse',value:'~10 MB'},
{label:'Trainingszeit',value:'5-15 Min (CPU)'},
{label:'Min. Beispiele',value:'10'},
]}
/>
<ComponentCard
icon="🔒"
title="Pseudonymisierung"
description="Schuelernamen werden durch anonyme Tokens ersetzt, bevor Daten die lokale Umgebung verlassen. Das Mapping wird ausschliesslich lokal gespeichert."
specs={[
{label:'Methode',value:'QR-Code Tokens'},
{label:'Token-Format',value:'UUID v4'},
{label:'Mapping',value:'Lokal beim Lehrer'},
{label:'Cloud-Daten',value:'Nur Tokens + Text'},
]}
/>
<ComponentCard
icon="☁️"
title="Cloud LLM"
description="Die KI-Korrektur erfolgt auf deutschen Servern mit strikter Mandantentrennung. Es werden keine Klarnamen oder identifizierenden Informationen uebertragen."
purpose="Uebersicht aller OCR-Sessions und deren Ground-Truth-Status. Zum Pruefen und Korrigieren eine Session oeffnen — sie wird im Kombi-Modus (OCR Overlay) bearbeitet."
audience={['Entwickler','QA']}
defaultCollapsed
architecture={{
services:['klausur-service (FastAPI, Port 8086)'],
databases:['PostgreSQL (ocr_pipeline_sessions)'],
}}
relatedPages={[
{
name:'Kombi Pipeline',
href:'/ai/ocr-overlay',
description:'Sessions bearbeiten und GT markieren',
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.