Words from sub-header areas overlapped correct column gaps and caused
the word validation to wrongly reject those gaps. Now only words from
the selected segment are used for validation.
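
A minimal sketch of the segment filter, assuming each word is a dict with
`top` and `height` keys (as in Tesseract's word output) and that the segment
is given as a vertical pixel range. The function name and the centre-based
membership test are illustrative assumptions, not the actual implementation.

```python
def words_in_segment(words, seg_top, seg_bottom):
    """Keep only words whose vertical centre lies inside [seg_top, seg_bottom)."""
    kept = []
    for word in words:
        centre_y = word["top"] + word["height"] / 2
        if seg_top <= centre_y < seg_bottom:
            kept.append(word)
    return kept
```

Words from a sub-header above the segment are dropped, so they can no longer
cover a genuine column gap during validation.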
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of masking full-width rows, the page is now split into segments
at large horizontal gaps (sub-headers, chapter boundaries). The largest
segment is used for the vertical projection. As a result, illustrations
and headings no longer interfere.
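
The segmentation can be sketched as follows, assuming a per-row ink fraction
has already been computed. The names `split_into_segments` and
`largest_segment`, and the defaults `min_gap=20` and `ink_threshold=0.01`,
are hypothetical values chosen for illustration.

```python
def split_into_segments(row_ink, min_gap=20, ink_threshold=0.01):
    """Split rows into (start, end) ranges separated by >= min_gap blank rows."""
    segments = []
    start = None      # first inked row of the current segment
    last_ink = None   # most recent inked row seen
    for y, ink in enumerate(row_ink):
        if ink <= ink_threshold:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank run was long enough: close the segment, open a new one.
            segments.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        segments.append((start, last_ink + 1))
    return segments


def largest_segment(segments):
    """Return the tallest segment; raises ValueError if there are none."""
    return max(segments, key=lambda seg: seg[1] - seg[0])
```

Short blank runs (normal line spacing) stay inside a segment; only runs of at
least `min_gap` rows split the page.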
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When pixel-based projection finds too few column gaps (e.g. because
illustrations/graphics fill the gaps), a word-based gap detection now
runs as an intermediate step before clustering. Tesseract word bounding
boxes are immune to decorative graphics.
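
A rough sketch of word-based gap detection, assuming word boxes are
`(left, top, width, height)` tuples in pixels. The function name, the
`min_gap_px=15` default and the decision to ignore page margins are
assumptions for illustration only.

```python
def find_column_gaps(word_boxes, page_width, min_gap_px=15):
    """Return interior x-ranges (start, end) that no word box covers."""
    covered = [False] * page_width
    for left, _top, width, _height in word_boxes:
        for x in range(max(0, left), min(page_width, left + width)):
            covered[x] = True

    gaps = []
    run_start = None
    # Iterate one past the end so a trailing free run is closed as well.
    for x in range(page_width + 1):
        free = x < page_width and not covered[x]
        if free and run_start is None:
            run_start = x
        elif not free and run_start is not None:
            is_interior = run_start > 0 and x < page_width
            if is_interior and x - run_start >= min_gap_px:
                gaps.append((run_start, x))
            run_start = None
    return gaps
```

Because only word boxes mark pixels as covered, an illustration that spans a
gap in the pixel projection does not close the gap here.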
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full-width colored sub-headers (e.g. "Unit 4: Bonnie Scotland") filled
in the column gaps in the vertical projection profile and led to 11
detected columns instead of 5. Rows with >40% ink density are now
masked before the projection.
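
The masking step can be illustrated on a plain list-of-lists binary image
(1 = ink). The helper names are hypothetical; only the 40% threshold comes
from the description above.

```python
def mask_dense_rows(binary, max_density=0.40):
    """Return a copy of the image with rows above max_density ink blanked."""
    masked = []
    for row in binary:
        density = sum(row) / len(row) if row else 0.0
        masked.append([0] * len(row) if density > max_density else list(row))
    return masked


def vertical_projection(binary):
    """Sum ink per column (one value per x position)."""
    return [sum(column) for column in zip(*binary)]
```

With the full-width header row blanked, the gap columns in the projection
drop back to zero and can be detected again.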
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After iterative projection (pass 1) and word-alignment (pass 2), a third
pass uses Tesseract word positions + linear regression per text line to
measure and correct residual rotation. This catches cases where passes 1-2
leave significant slope (e.g. 1.7° residual on heavily skewed scans).
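
A minimal sketch of the third pass, assuming words have already been grouped
into text lines and reduced to `(x, y)` centre points. Taking the median
slope across lines and the `min_words=3` cutoff are assumptions; the sign of
the result follows image coordinates (y grows downwards).

```python
import math
from statistics import median


def line_slope(points):
    """Least-squares slope of y over x for one text line, or None if undefined."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all words share one x position
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def residual_angle_deg(lines, min_words=3):
    """Median per-line slope, converted to degrees."""
    slopes = []
    for points in lines:
        if len(points) >= min_words:
            slope = line_slope(points)
            if slope is not None:
                slopes.append(slope)
    if not slopes:
        return 0.0
    return math.degrees(math.atan(median(slopes)))
```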
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase iterative deskew coarse_range from ±2° to ±5° to handle
heavily skewed scans
- New deskew_two_pass(): runs iterative projection first, then
word-alignment on the corrected image to detect/fix residual skew
(applied when residual ≥ 0.3°)
- OCR pipeline API auto_deskew now uses deskew_two_pass by default
- Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass
- Deskew result now includes angle_residual and two_pass_debug
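
The control flow of the two-pass approach might look roughly like this. It
is a sketch only: the three callables are injected stand-ins for the real
projection, word-alignment and rotation steps, and every key other than
`angle_residual` and `two_pass_debug` is hypothetical.

```python
def deskew_two_pass_sketch(image, iterative_pass, measure_residual, rotate,
                           min_residual_deg=0.3):
    """Run pass 1, then correct residual skew only if it is large enough."""
    corrected, angle_iterative = iterative_pass(image)
    angle_residual = measure_residual(corrected)
    applied = abs(angle_residual) >= min_residual_deg
    if applied:
        corrected = rotate(corrected, angle_residual)
    info = {
        "angle_iterative": angle_iterative,
        "angle_residual": angle_residual,
        "two_pass_debug": {"residual_applied": applied},
    }
    return corrected, info
```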
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Deleted LLM Compare pages, configs and all references
- Kommunikation category in sidebar with Video & Chat, Voice Service, Alerts
- Removed Compliance SDK category from sidebar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Original pages rendered at full resolution (pdf-page-image endpoint, zoom=2.0)
instead of downscaled thumbnails
- Insert-row triangles on left margin between every row (hover to reveal)
- Dynamic extra columns: "+" button in header adds custom columns
(e.g. Aussprache, Wortart), removable via hover-x on column header
- Extra columns stored per-page (pageExtraColumns state) so different
source pages can have different column structures
- Grid template adjusts dynamically based on number of columns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove max-w-7xl constraint on content area so panels stretch to edges
- Fall back to direct API thumbnail URLs when blob URLs are empty
- Original pages now reliably show even if preloaded thumbnails failed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Swap from 3/5-2/5 grid to 1/3-2/3 flexbox (original left, table right)
- Table uses 3 equal 1fr columns for EN/DE/example instead of cramped 13-col grid
- Full viewport height minus header (calc(100vh - 240px)) for more visible rows
- Show only processed pages in original preview (filtered by selectedPages)
- Remove per-row insert buttons to reduce vertical noise
- Compact row spacing (py-1.5) to fit ~15+ rows without scrolling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
process-single-page now runs the full CV pipeline (deskew → dewarp → columns →
rows → cell-first OCR v2 → LLM review) for much better extraction quality.
Falls back to LLM vision if pipeline imports are unavailable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
handleNext() did nothing on the last step (early return). It now resets
the session and steps, then navigates back to the session overview.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The side-by-side panels used calc(100vh - 380px) pushing the Speichern/
Abschliessen buttons below the viewport. Reduced to calc(100vh - 580px)
and made the action bar sticky at the bottom.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Horizontal projection of binary image is insensitive at 0.5° because
text rows look nearly identical. The real discriminator is vertical edge
alignment: at the correct angle, word left-edges and column borders
become truly vertical, producing sharp peaks in the vertical projection
of Sobel-X edges. Also: BORDER_REPLICATE + trim to avoid artifacts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Variance is insensitive to 0.5° differences. Gradient score (L2 norm of
first derivative) detects sharp text-line transitions much better.
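
To make the two scores concrete, here is a small sketch on a 1-D projection
profile given as a list of numbers. The function names are illustrative; the
gradient score is the L2 norm of the first difference, as described above.

```python
import math


def variance_score(profile):
    """Population variance of the profile values."""
    mean = sum(profile) / len(profile)
    return sum((v - mean) ** 2 for v in profile) / len(profile)


def gradient_score(profile):
    """L2 norm of the first difference: large when transitions are abrupt."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(profile, profile[1:])))
```

A profile with crisp text-line edges such as `[0, 0, 10, 10, 0, 0, 10, 10]`
scores higher on the gradient measure than a smeared one such as
`[0, 5, 10, 5, 0, 5, 10, 5]`.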
Also: use horizontal profile in both phases, finer coarse step (0.1°).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds deskew_image_iterative() as 3rd deskew method that directly optimizes
for projection-profile sharpness instead of proxy signals (Hough/word alignment).
Coarse sweep on horizontal profile, fine sweep on vertical profile.
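
The coarse-then-fine search can be sketched independently of any image code
by passing in the two scoring functions. All names and the fine-sweep
defaults are assumptions; the coarse defaults (±5°, 0.1° step) are borrowed
from other entries in this log.

```python
def frange(start, stop, step):
    """Inclusive list of evenly spaced floats from start to stop."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def coarse_to_fine_angle(coarse_score, fine_score,
                         coarse_range=5.0, coarse_step=0.1,
                         fine_range=0.1, fine_step=0.01):
    """Pick the best coarse angle, then refine in a narrow window around it."""
    coarse_best = max(frange(-coarse_range, coarse_range, coarse_step),
                      key=coarse_score)
    fine_angles = frange(coarse_best - fine_range, coarse_best + fine_range,
                         fine_step)
    return max(fine_angles, key=fine_score)
```

In the real pipeline each score call would rotate the image by the candidate
angle and measure the sharpness of the resulting projection profile.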
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "example" to spell correction loop — was only correcting
"english" and "german" fields, missing umlauts in example sentences
- Use "german" language for example field (mixed-language, umlauts needed)
- Disable cell-level bold detection — cannot distinguish bold from
non-bold in mixed-format cells (e.g. "cookie ['kuki]")
- Keep _measure_stroke_width and _classify_bold_cells for future
word-level bold detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bold detection:
- Replace absolute threshold with page-level relative comparison
- Measure stroke width for all cells, then mark cells >1.4× median as bold
- Adapts automatically to font, DPI and scan quality
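
A minimal sketch of the relative comparison, assuming one stroke-width
measurement per cell with `None` for cells that could not be measured. The
function name is hypothetical; the 1.4 ratio is taken from the list above.

```python
from statistics import median


def classify_bold(stroke_widths, ratio=1.4):
    """Mark cells whose stroke width exceeds ratio x the page median."""
    valid = [w for w in stroke_widths if w is not None and w > 0]
    if not valid:
        return [False] * len(stroke_widths)
    threshold = ratio * median(valid)
    return [w is not None and w > threshold for w in stroke_widths]
```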
Save buttons:
- Fix status stuck on 'error' preventing re-click
- Better error messages with response body
- Fallback score to 0 when null
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token
for German text — fixes "iberqueren" → "überqueren" etc.
- Add _detect_bold() using OpenCV stroke-width analysis on cell crops
- Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths
- Add is_bold field to GridCell TypeScript interface
- Render bold text in StepGroundTruth reconstruction view
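
The umlaut confusion rules from the first item can be sketched as a
single-substitution search against a word list. This is an illustration
under assumptions: `fix_umlauts` and `known_words` are hypothetical stand-ins
and do not reflect the actual `_spell_fix_token` signature.

```python
UMLAUT_CONFUSIONS = {"i": "ü", "a": "ä", "o": "ö", "u": "ü"}


def fix_umlauts(token, known_words):
    """Try one umlaut substitution per position; keep it only if it yields a known word."""
    if token.lower() in known_words:
        return token
    for pos, char in enumerate(token):
        replacement = UMLAUT_CONFUSIONS.get(char.lower())
        if replacement is None:
            continue
        if char.isupper():
            replacement = replacement.upper()
        candidate = token[:pos] + replacement + token[pos + 1:]
        if candidate.lower() in known_words:
            return candidate
    return token
```

Requiring the candidate to be a known word keeps correct tokens such as
"Haus" from being rewritten.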
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prepend /klausur-api prefix to original image URL (nginx proxy)
- Remove colored column background stripes, use white background
- Change cell text color to black instead of per-column-type colors
- Calculate font size dynamically from cell bbox height via ResizeObserver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the stub StepGroundTruth with a full side-by-side Original vs
Reconstruction view. Adds VLM-based image region detection (qwen2.5vl),
mflux image generation proxy, sync scroll/zoom, manual region drawing,
and score/notes persistence.
New backend endpoints: detect-images, generate-image, validate, get validation.
New standalone mflux-service (scripts/mflux-service.py) for Metal GPU generation.
Dockerfile.base: adds fonts-liberation (Apache-2.0).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove _fix_character_confusion() from words endpoint (now only in Phase 0)
- Extend spell checker to find real OCR errors via spell.correction()
- Add field-aware dictionary selection (EN/DE) for spell corrections
- Add _normalize_page_ref() for page_ref column (p-60 → p.60)
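
As an illustration of the page_ref normalization in the last item, a
regex-based sketch might look like this. The accepted separator set and the
pass-through behavior for non-matching text are assumptions; only the
`p-60 → p.60` case comes from the commit.

```python
import re

# "p" or "P", an optional separator, then digits; nothing else on the line.
_PAGE_REF_RE = re.compile(r"^\s*[pP]\s*[-.,:_ ]?\s*(\d+)\s*$")


def normalize_page_ref(text):
    """Rewrite page references such as 'p-60' to the canonical 'p.60'."""
    match = _PAGE_REF_RE.match(text)
    return f"p.{match.group(1)}" if match else text
```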
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Genericity audit findings:
- Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field
is processed, German prefixes were unreachable dead code)
- Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants
- Document _NARROW_COL_THRESHOLD_PCT with empirical rationale
- Document _PAD=3 with DPI context
- Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching
- Reduce all diagnostic logger.info() to logger.debug() in:
_ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets
- Keep only summary-level info logging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fundamentally rearchitect build_cell_grid_v2 to combine the best of
both approaches:
- Broad columns (>15% image width): Use full-page Tesseract word
assignment. Handles IPA brackets, punctuation, sentence flow,
and ellipsis correctly. No garbled phonetics.
- Narrow columns (<15% image width): Use isolated cell-crop OCR
to prevent neighbour bleeding from adjacent broad columns.
This eliminates the need for complex phonetic bracket replacement
on broad columns since full-page Tesseract reads them correctly.
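
The per-column decision reduces to a width comparison. In this sketch the
function name and the returned labels are hypothetical, and treating a
column of exactly 15% as broad is an assumption, since the description
leaves that case open.

```python
def choose_ocr_strategy(col_width, image_width, narrow_threshold_pct=15.0):
    """Pick cell-crop OCR for narrow columns, full-page word assignment otherwise."""
    width_pct = 100.0 * col_width / image_width
    if width_pct < narrow_threshold_pct:
        return "cell_crop"
    return "full_page_words"
```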
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Only process 'english' field for IPA replacement. German and example
fields contain meaningful parenthetical content like (gefrorenes Wasser),
(sich beschweren) that must never be replaced.
- Simplify _is_grammar_bracket_content: only known grammar particles
(with, about/of, sth, etc.) are preserved. Removes the >= 4 chars
heuristic that incorrectly preserved garbled IPA like [breik], [maus].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace _is_meaningful_bracket_content with _is_grammar_bracket_content
that uses a whitelist of grammar particles (with, about/of, auf, etc.)
- Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic
- Strip orphan brackets (no word before them) that are garbled IPA
- Preserve correct IPA (contains Unicode IPA chars) and grammar info
- Fix variable name bug (result → text)
Fixes: break [breik] now correctly replaced, cross (with) preserved,
orphan [mais] and {'mani setva] stripped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract mangles IPA square brackets into curly braces or parentheses
(e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only
matched [...], missing all garbled variants.
- Match any bracket type: [...], {...}, (...) including mixed pairs
- Add _is_meaningful_bracket_content() to preserve legitimate German
prefixes like (zer)brechen and Tanz(veranstaltung)
- Trigger IPA replacement on any bracket character, not just [
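
A pattern that accepts any opening and any closing bracket, including mixed
pairs, could be written as below. This is a sketch and not necessarily
identical to the project's `_PHONETIC_BRACKET_RE`; the names are
illustrative.

```python
import re

# Any opener, then non-bracket content, then any closer (pairs may be mixed).
ANY_BRACKET_RE = re.compile(r"[\[\{\(]([^\[\]\{\}\(\)]*)[\]\}\)]")


def find_bracket_spans(text):
    """Return (full_match, inner_content) for every bracketed span."""
    return [(m.group(0), m.group(1)) for m in ANY_BRACKET_RE.finditer(text)]
```

Because this also matches legitimate parentheses such as
"Tanz(veranstaltung)", a content check is still needed before replacing
anything.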
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RapidOCR (PaddleOCR) is optimized for full-page scene text and produces
artifacts on small isolated cell crops: extra characters ("Tanz z",
"er r wollte"), missing punctuation, garbled phonetic transcriptions.
Tesseract works much better on isolated binarized crops with upscaling,
which is exactly what cell-first OCR provides. RapidOCR remains available
as explicit engine choice via the dropdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Batch OCR takes 30-60s with 3x upscaling. Without keepalive events,
proxy servers (Nginx) drop the SSE connection after their read timeout.
Now sends keepalive events every 5s to prevent timeout, with elapsed
time for debugging. Also checks for client disconnect between keepalives.
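
The keepalive pattern can be sketched with plain asyncio. The generator
name, the event names and the optional `is_disconnected` callback are
assumptions for illustration; a real endpoint would wrap this generator in
the web framework's streaming response.

```python
import asyncio
import json
import time


async def stream_with_keepalive(job, interval=5.0, is_disconnected=None):
    """Yield SSE keepalive events until job finishes, then its JSON result."""
    task = asyncio.ensure_future(job)
    started = time.monotonic()
    while True:
        done, _pending = await asyncio.wait({task}, timeout=interval)
        if done:
            break
        # Stop early if the (hypothetical) callback reports a gone client.
        if is_disconnected is not None and await is_disconnected():
            task.cancel()
            return
        payload = json.dumps({"elapsed": round(time.monotonic() - started, 1)})
        yield f"event: keepalive\ndata: {payload}\n\n"
    result = json.dumps(task.result())
    yield f"event: result\ndata: {result}\n\n"
```

Each keepalive resets the proxy's read timeout, so the connection survives
until the batch result is ready.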
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Frontend: retry /words POST once after 2s delay if it gets 400/404,
which happens when navigating via wizard after container restart
(session cache not yet warm).
Backend: log when session needs DB reload and when dewarped_bgr is missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Compliance Advisor belongs in the Compliance SDK (macmini:3007/sdk/agents),
not in Lehrer-Admin. The remaining 5 agents (TutorAgent, GraderAgent,
QualityJudge, AlertAgent, Orchestrator) are kept.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Short cell crops (<80px height) are always 3x upscaled for RapidOCR
to improve recognition of periods, ellipsis, and phonetic symbols
- Lowered Det.box_thresh from 0.6 to 0.4 to detect small characters
that were being filtered out (dots, brackets, IPA symbols)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell crops of 35-54px height were too small for RapidOCR to detect
text reliably. Uses _ensure_minimum_crop_size(min_dim=150) for
consistent upscaling across all OCR engines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 3px padding around cell crops to avoid clipping edge characters
(parentheses in "Tanz(veranstaltung)", descenders, etc.)
- Upscale small BGR crops for RapidOCR, same as Tesseract path
- Add info-level diagnostic logging to _ocr_cell_crop for debugging
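
The padding from the first item has to be clamped at the image border. A
small sketch, assuming cells are `(x, y, w, h)` boxes in pixels; the
function name is hypothetical.

```python
def padded_crop_box(x, y, w, h, img_w, img_h, pad=3):
    """Expand a cell box by pad pixels per side, clamped to the image bounds."""
    left = max(0, x - pad)
    top = max(0, y - pad)
    right = min(img_w, x + w + pad)
    bottom = min(img_h, y + h + pad)
    return left, top, right, bottom
```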
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents first content row from expanding into header area (causing
"ulary" from "VOCABULARY" to appear in DE column) and last content row
from expanding into footer area (causing page numbers to appear as content).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old per-cell streaming timed out because sequential cell OCR was
too slow to send the first event before proxy timeout. Now uses
build_cell_grid_v2 (parallel ThreadPoolExecutor) via run_in_executor,
then streams all cells at once after batch completes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.
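
The parallel part can be sketched with the standard library alone. Here
`ocr_fn` stands in for the real per-crop OCR call, and the function name,
the dict layout and `max_workers=4` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_cells_parallel(cells, ocr_fn, max_workers=4):
    """Run ocr_fn on every cell crop in parallel; cells maps (row, col) to a crop."""
    keys = list(cells)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        texts = list(pool.map(ocr_fn, (cells[key] for key in keys)))
    return dict(zip(keys, texts))
```

Threads suit this workload when the OCR call spends its time in an external
process or native code, outside the interpreter lock.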
Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>