Variance is insensitive to 0.5° differences. Gradient score (L2 norm of
first derivative) detects sharp text-line transitions much better.
Also: use horizontal profile in both phases, finer coarse step (0.1°).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds deskew_image_iterative() as 3rd deskew method that directly optimizes
for projection-profile sharpness instead of proxy signals (Hough/word alignment).
Coarse sweep on horizontal profile, fine sweep on vertical profile.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "example" to spell correction loop — was only correcting
"english" and "german" fields, missing umlauts in example sentences
- Use "german" language for example field (mixed-language, umlauts needed)
- Disable cell-level bold detection — cannot distinguish bold from
non-bold in mixed-format cells (e.g. "cookie ['kuki]")
- Keep _measure_stroke_width and _classify_bold_cells for future
word-level bold detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bold detection:
- Replace absolute threshold with page-level relative comparison
- Measure stroke width for all cells, then mark cells >1.4× median as bold
- Adapts automatically to font, DPI and scan quality
Save buttons:
- Fix status stuck on 'error' preventing re-click
- Better error messages with response body
- Fallback score to 0 when null
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token
for German text — fixes "iberqueren" → "überqueren" etc.
- Add _detect_bold() using OpenCV stroke-width analysis on cell crops
- Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths
- Add is_bold field to GridCell TypeScript interface
- Render bold text in StepGroundTruth reconstruction view
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prepend /klausur-api prefix to original image URL (nginx proxy)
- Remove colored column background stripes, use white background
- Change cell text color to black instead of per-column-type colors
- Calculate font size dynamically from cell bbox height via ResizeObserver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the stub StepGroundTruth with a full side-by-side Original vs
Reconstruction view. Adds VLM-based image region detection (qwen2.5vl),
mflux image generation proxy, sync scroll/zoom, manual region drawing,
and score/notes persistence.
New backend endpoints: detect-images, generate-image, validate, get validation.
New standalone mflux-service (scripts/mflux-service.py) for Metal GPU generation.
Dockerfile.base: adds fonts-liberation (Apache-2.0).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove _fix_character_confusion() from words endpoint (now only in Phase 0)
- Extend spell checker to find real OCR errors via spell.correction()
- Add field-aware dictionary selection (EN/DE) for spell corrections
- Add _normalize_page_ref() for page_ref column (p-60 → p.60)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Genericity audit findings:
- Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field
is processed, German prefixes were unreachable dead code)
- Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants
- Document _NARROW_COL_THRESHOLD_PCT with empirical rationale
- Document _PAD=3 with DPI context
- Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching
- Reduce all diagnostic logger.info() to logger.debug() in:
_ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets
- Keep only summary-level info logging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fundamentally rearchitect build_cell_grid_v2 to combine the best of
both approaches:
- Broad columns (>15% image width): Use full-page Tesseract word
assignment. Handles IPA brackets, punctuation, sentence flow,
and ellipsis correctly. No garbled phonetics.
- Narrow columns (<15% image width): Use isolated cell-crop OCR
to prevent neighbour bleeding from adjacent broad columns.
This eliminates the need for complex phonetic bracket replacement
on broad columns since full-page Tesseract reads them correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Only process 'english' field for IPA replacement. German and example
fields contain meaningful parenthetical content like (gefrorenes Wasser),
(sich beschweren) that must never be replaced.
- Simplify _is_grammar_bracket_content: only known grammar particles
(with, about/of, sth, etc.) are preserved. Removes the >= 4 chars
heuristic that incorrectly preserved garbled IPA like [breik], [maus].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace _is_meaningful_bracket_content with _is_grammar_bracket_content
that uses a whitelist of grammar particles (with, about/of, auf, etc.)
- Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic
- Strip orphan brackets (no word before them) that are garbled IPA
- Preserve correct IPA (contains Unicode IPA chars) and grammar info
- Fix variable name bug (result → text)
Fixes: break [breik] now correctly replaced, cross (with) preserved,
orphan [mais] and {'mani setva] stripped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract mangles IPA square brackets into curly braces or parentheses
(e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only
matched [...], missing all garbled variants.
- Match any bracket type: [...], {...}, (...) including mixed pairs
- Add _is_meaningful_bracket_content() to preserve legitimate German
prefixes like (zer)brechen and Tanz(veranstaltung)
- Trigger IPA replacement on any bracket character, not just [
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RapidOCR (PaddleOCR) is optimized for full-page scene text and produces
artifacts on small isolated cell crops: extra characters ("Tanz z",
"er r wollte"), missing punctuation, garbled phonetic transcriptions.
Tesseract works much better on isolated binarized crops with upscaling,
which is exactly what cell-first OCR provides. RapidOCR remains available
as explicit engine choice via the dropdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Batch OCR takes 30-60s with 3x upscaling. Without keepalive events,
proxy servers (Nginx) drop the SSE connection after their read timeout.
Now sends keepalive events every 5s to prevent timeout, with elapsed
time for debugging. Also checks for client disconnect between keepalives.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Frontend: retry /words POST once after 2s delay if it gets 400/404,
which happens when navigating via wizard after container restart
(session cache not yet warm).
Backend: log when session needs DB reload and when dewarped_bgr is missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Der Compliance Advisor gehoert ins Compliance SDK (macmini:3007/sdk/agents),
nicht ins Lehrer-Admin. Die verbleibenden 5 Agenten (TutorAgent, GraderAgent,
QualityJudge, AlertAgent, Orchestrator) bleiben erhalten.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Short cell crops (<80px height) are always 3x upscaled for RapidOCR
to improve recognition of periods, ellipsis, and phonetic symbols
- Lowered Det.box_thresh from 0.6 to 0.4 to detect small characters
that were being filtered out (dots, brackets, IPA symbols)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell crops of 35-54px height were too small for RapidOCR to detect
text reliably. Uses _ensure_minimum_crop_size(min_dim=150) for
consistent upscaling across all OCR engines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 3px padding around cell crops to avoid clipping edge characters
(parentheses in "Tanz(veranstaltung)", descenders, etc.)
- Upscale small BGR crops for RapidOCR, same as Tesseract path
- Add info-level diagnostic logging to _ocr_cell_crop for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents first content row from expanding into header area (causing
"ulary" from "VOCABULARY" to appear in DE column) and last content row
from expanding into footer area (causing page numbers to appear as content).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old per-cell streaming timed out because sequential cell OCR was
too slow to send the first event before proxy timeout. Now uses
build_cell_grid_v2 (parallel ThreadPoolExecutor) via run_in_executor,
then streams all cells at once after batch completes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.
Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The HSV-based coloured marker detection caused false positives in
nearly every marker cell. Coloured markers like red "!" are an
extreme edge case — better handled manually in reconstruction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for marker column content (e.g. red "!" marks):
1. Skip _clean_cell_text() noise filter for column_marker — it
requires 2+ consecutive letters, which drops punctuation-only
markers like "!" or "*".
2. For marker columns, detect coloured pixels via HSV saturation
check (S>80) in addition to grayscale darkness. Create a
binarized image where both dark AND saturated pixels become
black foreground, so Tesseract can see red markers that appear
near-white in standard grayscale conversion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a narrow column expands into neighbor space, the neighbor's
boundaries must be adjusted to avoid overlap. After expansion, left
neighbor's right edge and right neighbor's left edge are trimmed to
match the expanded column's new boundaries, with words re-assigned.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-column splits create adjacent columns with 0px gap between them.
The previous expansion only worked with explicit gaps. Now it looks at
where the neighbor's actual words are and claims unused space up to
MIN_WORD_MARGIN (4px) from the nearest word, even if there's no gap
in the column boundaries.
Also added debug logging for expansion input.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After deskew, horizontal text lines are already straight (~0° slope).
Method D was measuring this (always ~0°) instead of the actual vertical
shear (column edge drift). This caused it to report 0.112° with 0.96
confidence, overwhelming Method A's correct detection of negative shear.
New Method D groups words by X-position into vertical columns, then
measures how left-edge X drifts with Y position via linear regression.
dx/dy = tan(shear_angle), directly measuring column tilt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The narrow column expansion was running inside detect_column_geometry()
on the 4 main columns, but the narrowest columns (marker ~14px,
page_ref ~93px) are created AFTERWARDS by _detect_sub_columns().
Extracted expand_narrow_columns() as standalone function and call it
after sub-column splitting in the columns API endpoint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for edge case where residual shear pushes content out of
narrow columns (marker, page_ref):
1. Column expansion (Step 10): After detection, narrow columns (<10%
content width) expand into adjacent whitespace gaps, claiming up to
40% of the gap but never past the nearest word in the neighbor
column. This gives marker/page_ref columns breathing room.
2. Dewarp sensitivity: Lower minimum angle from 0.15° to 0.08°, lower
ensemble min confidence from 0.5 to 0.35, lower final threshold
from 0.5 to 0.4, and skip quality gate for small corrections
(<0.5°) where projection variance change is negligible.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The detections array was empty when shear was below threshold, hiding
all 4 method results from the frontend Details panel.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DewarpDetection type with per-method results
- Expand method labels for all 4 detectors (A-D)
- Show green/amber banner: applied vs quality-gate-rejected
- Expandable "Details" panel showing all 4 methods with confidence bars
- Visual confidence bars instead of plain percentage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace setBackgroundImage() with backgroundImage property (v6 breaking change)
- Replace setWidth/setHeight with Canvas constructor options
- Fix opacity handler to use direct property access
- Update CLAUDE.md: use git -C and docker compose -f instead of cd
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Rename Step 6 label to "Korrektur" (was "OCR-Zeichenkorrektur")
2. Move _fix_character_confusion from pipeline Step 1 into
llm_review_entries_streaming so corrections are visible in the UI:
char changes (| → I, 1 → I, 8 → B) are now emitted as a batch event
right after the meta event, appearing in the corrections list
3. StepReconstruction: all cells (including empty) are now rendered as
editable inputs — removed filter that hid empty cells from the editor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
content_roi was cropped to [left_x:right_x] — the detected content boundary.
Words at the right edge of the last column (beyond right_x) were never
found in the initial scan, so they remained missing even after the column
geometry was extended to full image width (w).
Fix: crop to [left_x:w] so all words including those near the right margin
are detected and assigned correctly to the last column.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_CHAR_CONFUSION_RULES: standalone "1" → "I" now skips "1." and "1,"
Cross-language fallback rule: same lookahead (?![\d.,]) added
Fixes: "cross = 1. Kreuz" being converted to "cross = I. Kreuz" in Step 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
right_x is the detected content boundary, which can still be several
pixels short of actual text near the page margin. Since the page margin
contains only white space, extending the last column's OCR crop to the
full image width (w) is always safe and prevents right-edge text cutoff.
Affects three locations in detect_column_geometry():
- Word count logging loop
- ColumnGeometry boundary building (Step 8)
- Phantom filter boundary adjustment (Step 9)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phantom column fix:
Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns
(< 3% of content width) with 0 words. These are scan artefacts, not
real columns. New Step 9 in detect_column_geometry():
- Filter columns where width < max(20px, 3% content_w) AND words < 3
- After filtering, extend each remaining column to close the gap with
its right neighbor, and re-assign words to correct column
Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px
eliminated; neighbors expanded to cover the gap)
UI rename:
- 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur'
- 'LLM-Korrektur starten' → 'Zeichenkorrektur starten'
- Error message updated accordingly
(No LLM involved anymore — spell-checker is the active engine)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new functions:
- _is_artifact_row(): marks rows as artifacts if all detected tokens
are single characters (scanner shadows produce dots/dashes, not words).
A real vocabulary row always contains at least one 2+ char word.
- _heal_row_gaps(): after removing empty/artifact rows, expands each
remaining content row to the midpoint of adjacent gaps, so OCR crops
are not artificially narrow. First row extends to content top_bound;
last row to content bottom_bound.
Applied in both build_cell_grid() and build_cell_grid_streaming() after
the word_count>0 filter and before OCR.
Addresses cases like:
- Row 21: scan shadow → single-char artifacts → filtered before OCR
- Row 23: completely empty (word_count=0) → already filtered
- Row 22: real content → now expanded upward/downward to fill the space
that rows 21 and 23 occupied, giving OCR the correct full height
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>