Two fixes:
1. Grid validation: reject word-center grid if it produces MORE rows
than gap-based detection (more rows = lines were split = worse).
Falls back to gap-based rows in that case.
2. Words overlay: draw clean grid cells (column × row intersections)
instead of padded entry bboxes. Eliminates confusing double lines.
OCR text labels are placed inside the grid cells directly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The y_tolerance for word-center clustering was based on median word
height (21px → 12px tolerance), which was too small. Words on the
same line can have centers 15-20px apart due to different heights.
Now uses 40% of the gap-based median row height as tolerance (e.g.
40px row → 16px tolerance), and 30% for merge threshold. This
produces correct cluster counts matching actual text lines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When columns change (Step 3), invalidate row_result and word_result.
When rows change (Step 4), invalidate word_result.
This ensures Step 5 always uses the latest row boundaries instead of
showing stale cached word_result from a previous run.
Applies to both auto-detection and manual override endpoints.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix half-height rows caused by tall special characters (brackets, IPA
symbols) being split into separate line clusters:
- Group words by vertical CENTER instead of TOP position, so tall
characters on the same line stay in one cluster
- Filter outlier-height words (>2× median) when computing letter_h
so brackets/IPA don't skew the row height
- Merge clusters closer than 0.4× median word height (definitely
same text line despite slight center differences)
- Increased y_tolerance from 0.5× to 0.6× median word height
- Enhanced logging with cluster merge count and row height range
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace rigid uniform grid with bottom-up approach that derives row
boundaries from word vertical centers:
- Group words into line clusters, compute center_y per cluster
- Compute pitch (distance between consecutive centers)
- Detect section breaks where gap > 1.8× median pitch
- Place row boundaries at midpoints between consecutive centers
- Per-section local pitch adapts to heading/paragraph spacing
- Validate ≥85% word placement, fallback to gap-based rows
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _split_oversized_rows() with _regularize_row_grid(). When ≥60%
of content rows have consistent height (±25% of median), overlay a
uniform grid with the standard row height over the entire content area.
This leverages the fact that books/vocab lists use constant row heights.
Validates grid by checking ≥85% of words land in a grid row. Falls back
to gap-based rows if heights are too irregular or words don't fit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement _split_oversized_rows() in detect_row_geometry() (Step 7) to
split content rows >1.5× median height using local horizontal projection.
This produces correctly-sized rows before word OCR runs, instead of
working around the issue in Step 5 with sub-cell splitting hacks.
Removed Step 5 workarounds: _split_oversized_entries(), sub-cell
splitting in build_word_grid(), and median_row_h calculation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For cells taller than 1.5× median row height, split vertically into
sub-cells and OCR each separately. This fixes RapidOCR losing text
at the bottom of tall cells (e.g. "floor/Fußboden" below "egg/Ei"
in a merged row). Generic fix — works for any oversized cell.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Integrate Britfone dictionary (MIT, 15k British English IPA entries)
- Add pronunciation parameter: 'british' (default) or 'american'
- British uses Britfone (Received Pronunciation), falls back to CMU
- American uses eng_to_ipa/CMU, falls back to Britfone
- Frontend: dropdown to switch pronunciation, default = British
- API: ?pronunciation=british|american query parameter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Semantic example matching: instead of attaching example sentences
to the immediately preceding entry, find the vocab entry whose
English word(s) appear in the example. "a broken arm" → matches
"broken" via word overlap, not "egg/Ei". Uses stem matching for
word form variants (break/broken share stem "bro").
2. Cell padding: add 8px padding to each cell region so words at
column/row edges don't get clipped by OCR (fixes "er wollte"
missing at cell boundaries).
3. Treat very short DE text (≤2 chars) as OCR noise, not real
translation — prevents false positives in example detection.
All fixes are generic and deterministic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 4 post-processing steps after OCR (no LLM needed):
1. Character confusion fix: I/1/l/| correction using cross-language
context (if DE has "Ich", EN "1" → "I")
2. IPA dictionary replacement: detect [phonetics] brackets, look up
correct IPA from eng_to_ipa (MIT, 134k words) — replaces OCR'd
phonetic symbols with dictionary-correct transcription
3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen"
→ 3 individual entries when part counts match
4. Example sentence attachment: rows with EN but no DE translation
get attached as examples to the preceding vocab entry
All fixes are deterministic and generic — no hardcoded word lists.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Preserve \n between visual lines within cells (instead of joining with space)
- Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden)
- Split oversized rows (>1.5× median height) into sub-entries when EN/DE
line counts match — deterministic fix for missed Step 4 row boundaries
- Frontend: render \n as <br/>, use textarea for multiline editing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch to PP-OCRv5 Latin model (supports ä, ö, ü, ß)
- Use SERVER model for better accuracy
- Lower Det.unclip_ratio 1.6→1.3 to reduce word merging
- Raise Det.box_thresh 0.5→0.6 for stricter detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to
correctly order words in multi-line cells (fixes word reordering bug).
RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX
Runtime, native ARM64). Engine selectable via dropdown in UI or
?engine= query param. Auto mode prefers RapidOCR when available.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PSM 7 (single line) missed the second line in cells with two lines.
PSM 6 handles multi-line content. Also fix sort order to Y-then-X
for correct reading order.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: build_word_grid() intersects column regions with content rows,
OCRs each cell with language-specific Tesseract, and returns vocabulary
entries with percent-based bounding boxes. New endpoints: POST /words,
GET /image/words-overlay, ground-truth save/retrieve for words.
Frontend: StepWordRecognition with overview + step-through labeling modes,
goToStep callback for row correction feedback loop.
MkDocs: OCR Pipeline documentation added.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build a word-coverage mask so only pixels near Tesseract word bounding
boxes contribute to the horizontal projection. Image regions (high ink
but no words) are treated as white, preventing illustrations from
merging multiple vocabulary rows into one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Step 4 (row detection) between column detection and word recognition.
Uses horizontal projection profiles + whitespace gaps (same method as columns).
Includes header/footer classification via gap-size heuristics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now uses vertical projection profiles to find whitespace
gaps between columns, then validates gaps against word bounding boxes to
prevent splitting through words. Old clustering algorithm extracted as
fallback (_detect_columns_by_clustering) for pages with < 2 detected gaps.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce left-side threshold from 35% to 20% of content width
- Strong language signal (eng/deu > 0.3) now prevents page_ref assignment
- Increase column_ignore word threshold from 3 to 8 for edge columns
- Apply language guard to Level 1 and Level 2 classification
Fixes: column with deu=0.921 was misclassified as page_ref because
reference score check ran before language analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address 5 weaknesses found via ground-truth comparison on session df3548d1:
- Add column_ignore for edge columns with < 3 words (margin detection)
- Absorb tiny clusters (< 5% width) into neighbors post-merge
- Restrict page_ref to left 35% of content area across all 3 levels
- Loosen marker thresholds (width < 6%, words <= 15) and add strong
marker score for very narrow non-edge columns (< 4%)
- Add EN/DE position tiebreaker when language signals are both weak
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Side-by-side view: auto result (readonly) vs GT editor where teacher
draws correct columns. Diff table shows Auto vs GT with IoU matching.
GT data persisted per session for algorithm tuning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-alignments within a column (indented words, etc.) were 60-90px apart
and not getting merged at 3%. On a typical 5-col page (~1500px), 6% = ~90px
merges sub-alignments while keeping real column boundaries (~300px) separate.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clusters now track Y-positions of their words and filter by vertical
coverage (>=30% primary, >=15%+5words secondary) to reject noise from
indentations or page numbers. Merge distance widened to 3% content width.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ersetzt hardcodierte Positionsregeln durch ein zweistufiges System:
Phase A erkennt Spaltengeometrie (Clustering), Phase B klassifiziert
Typen per Inhalt (Sprache/Rolle) mit 3-stufiger Fallback-Kette.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace projection-profile layout analysis with Tesseract word bounding
box clustering to detect 5-column vocabulary layouts (page_ref, EN, DE,
markers, examples). Falls back to projection profiles when < 3 clusters.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sessions werden jetzt in PostgreSQL gespeichert statt in-memory.
Neue Session-Liste mit Name, Datum, Schritt. Sessions ueberleben
Browser-Refresh und Container-Neustart. Step 3 nutzt analyze_layout()
fuer automatische Spaltenerkennung mit farbigem Overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old displacement-map approach shifted entire rows by a parabolic
profile, creating a circle/barrel distortion. The actual problem is
a linear vertical shear: after deskew aligns horizontal lines, the
vertical column edges are still tilted by ~0.5°.
New approach:
- Detect shear angle from strongest vertical edge slope (not curvature)
- Apply cv2.warpAffine shear to straighten vertical features
- Manual slider: -2.0° to +2.0° in 0.05° steps
- Slider initializes to auto-detected shear angle
- Ground truth question: "Spalten vertikal ausgerichtet?"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old -3.0 to +3.0 scale multiplied the full displacement map (up to ~79px)
directly, causing extreme distortion at values >1. New slider:
- 0% = no correction
- 100% = auto-detected correction (default)
- 200% = double correction
- Step size: 5%
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OCR + 70 Debian packages + pip dependencies are now in a
separate base image (klausur-base:latest) that is built once and reused.
A --no-cache build now only rebuilds the code layer (~seconds) instead
of re-downloading 33 MB of system packages (~9 minutes).
Rebuild base when requirements.txt or system deps change:
docker build -f klausur-service/Dockerfile.base -t klausur-base:latest klausur-service/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix dewarp method selection: prefer methods with >5px curvature over
higher confidence (vertical_edge 79px was being ignored for text_baseline 2px)
- Add grid overlay on left image in Dewarp step for side-by-side comparison
- Add GET /sessions/{id} endpoint to reload session data
- StepDeskew accepts sessionId prop to restore state when navigating back
- SessionInfo type extended with optional deskew_result and dewarp_result
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Neue Route /ai/ocr-pipeline mit schrittweiser Begradigung (Deskew),
Raster-Overlay und Ground Truth. Schritte 2-6 als Platzhalter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>