detect_and_fix_orientation() now runs before the deskew step in the
OCR pipeline, so scans rotated by 90/180/270° are corrected
automatically. The frontend shows the orientation correction as an info banner.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If 2+ broad content columns already exist, the layout is probably
already correctly separated into EN/DE. The split now only runs
when a single broad column contains EN+DE combined.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gaps that touch the column edge (margins) are now excluded; only
internal gaps are considered as split candidates. Fixes the problem
that trailing whitespace was wrongly chosen as the largest gap.
The early return in _run_ocr_pipeline_for_page now correctly returns
([], rotation) instead of [].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
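The margin filtering described in the message above can be sketched as follows. This is a minimal illustration, not the code of _split_broad_columns: the function name largest_internal_gap, the (start, end) gap representation and the edge_tol tolerance are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (x_start, x_end) of an empty horizontal run


def largest_internal_gap(
    gaps: List[Span], col_left: int, col_right: int, edge_tol: int = 2
) -> Optional[Span]:
    """Return the widest gap that does not touch either column edge.

    edge_tol is a hypothetical tolerance in pixels: a gap that starts or
    ends within edge_tol of the column border counts as a margin.
    """
    internal = [
        (start, end)
        for start, end in gaps
        if start > col_left + edge_tol and end < col_right - edge_tol
    ]
    if not internal:
        return None
    return max(internal, key=lambda g: g[1] - g[0])


# The column spans x=100..500. The widest gap (420..500) is trailing
# whitespace that touches the right edge, so it must not be selected.
gaps = [(100, 110), (240, 290), (420, 500)]
best = largest_internal_gap(gaps, col_left=100, col_right=500)
```

Here best is (240, 290): both margin gaps are dropped before the widest remaining gap is picked.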
Columns with <=2 words and <15% width are now classified as column_marker
instead of as a content column. With 2 broad content columns, the right
one is labeled column_example instead of column_de, because the left
column contains EN+DE combined.
OSD zoom raised from 1.0 to 2.0 for more reliable orientation detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
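A rough sketch of the classification rule from the message above. The Column dataclass, the label_columns function and the labels "content" and "column_en" are hypothetical names chosen for this example; only column_marker and column_example and the two thresholds come from the message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    x: int            # left edge in pixels
    width: int        # width in pixels
    word_count: int   # number of OCR words inside the column


def label_columns(columns: List[Column], page_width: int) -> List[str]:
    """Label columns left to right using the rules from the message."""
    ordered = sorted(columns, key=lambda c: c.x)
    labels = []
    for col in ordered:
        width_pct = 100.0 * col.width / page_width
        # Narrow columns with almost no text are markers, not content.
        if col.word_count <= 2 and width_pct < 15.0:
            labels.append("column_marker")
        else:
            labels.append("content")  # placeholder label (assumption)
    content_idx = [i for i, label in enumerate(labels) if label == "content"]
    if len(content_idx) == 2:
        # With exactly two broad columns the left one holds EN+DE combined,
        # so the right one is the example column rather than column_de.
        labels[content_idx[0]] = "column_en"  # hypothetical label
        labels[content_idx[1]] = "column_example"
    return labels


page = [Column(560, 400, 80), Column(0, 40, 1), Column(50, 500, 120)]
labels = label_columns(page, page_width=1000)
```

For this page the result is ["column_marker", "column_en", "column_example"].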
Rotation is now detected in upload_pdf_get_info() so that thumbnails
are already shown the right way up during page selection.
Added debug logging for _split_broad_columns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_split_broad_columns() detects mixed EN/DE content in broad columns via
word-coverage analysis and splits them at the largest gap.
Thumbnails and page images are rotated server-side via fitz;
the frontend reloads thumbnails after OCR processing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OSD detects 0/90/180/270° rotation and corrects it
automatically before the deskew. Solves the problem with book scanners
where every 2nd page is upside down.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shared function positional_column_regions() in cv_vocab_pipeline.py,
now used by both paths (Vocab Worksheet + OCR Pipeline Admin).
classify_column_types() is kept as legacy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shows the first 8 characters of the session ID next to the subtitle
so that the session can be easily identified and communicated.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Language-based scoring (classify_column_types) caused swapped
columns on page 3 for example sentences with many English function words.
The new _positional_column_regions() assigns columns purely geometrically
(left→right). OCR Pipeline Admin remains unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Words from sub-header areas overlapped correct column gaps
and made the word validation wrongly discard gaps. Now only
words from the selected segment are used for validation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of masking full-width rows, the page is now divided into segments
at large horizontal gaps (sub-headers, chapter boundaries).
The largest segment is used for the vertical projection.
As a result, illustrations and headings no longer interfere.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
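The segmentation idea from the message above, sketched on a per-row ink count. The function names and the min_gap_rows parameter are assumptions for illustration; the real pipeline presumably works on image arrays rather than plain lists.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # (first_row, last_row + 1)


def split_into_segments(row_ink: List[int], min_gap_rows: int) -> List[Segment]:
    """Split the page into row ranges separated by large empty gaps.

    row_ink[y] is the number of ink pixels in row y. A run of at least
    min_gap_rows empty rows ends the current segment.
    """
    segments: List[Segment] = []
    seg_start = None
    last_ink = None
    for y, ink in enumerate(row_ink):
        if ink == 0:
            continue
        if seg_start is None:
            seg_start = y
        elif y - last_ink - 1 >= min_gap_rows:
            segments.append((seg_start, last_ink + 1))
            seg_start = y
        last_ink = y
    if seg_start is not None:
        segments.append((seg_start, last_ink + 1))
    return segments


def largest_segment(segments: List[Segment]) -> Optional[Segment]:
    """Pick the tallest segment; it is the one used for column detection."""
    if not segments:
        return None
    return max(segments, key=lambda s: s[1] - s[0])


# A short header (rows 1-2), a 3-row gap, then the main table (rows 6-11).
row_ink = [0, 5, 5, 0, 0, 0, 7, 7, 7, 7, 0, 6, 0]
segments = split_into_segments(row_ink, min_gap_rows=3)
main = largest_segment(segments)
```

The single empty row at index 10 is smaller than min_gap_rows, so it stays inside the main segment (6, 12).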
When pixel-based projection finds too few column gaps (e.g.
because illustrations/graphics fill the gaps), a word-based gap
detection now runs as an intermediate step before clustering.
Tesseract word bounding boxes are immune to decorative graphics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
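One way to picture the word-based gap detection from the message above: merge the horizontal extents of all word boxes and report the uncovered runs between them. The function name word_gaps, the (x_left, x_right) box format and min_gap_px are assumptions for this sketch.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def word_gaps(word_boxes: List[Span], min_gap_px: int) -> List[Span]:
    """Return x-ranges of at least min_gap_px that no word box covers.

    word_boxes holds (x_left, x_right) per OCR word. Because only text
    boxes are considered, graphics that fill the gap in the pixel
    projection have no influence here.
    """
    gaps: List[Span] = []
    covered_to = None
    for left, right in sorted(word_boxes):
        if covered_to is not None and left - covered_to >= min_gap_px:
            gaps.append((covered_to, left))
        covered_to = right if covered_to is None else max(covered_to, right)
    return gaps


boxes = [(200, 260), (10, 80), (420, 480), (60, 120), (210, 300)]
column_gaps = word_gaps(boxes, min_gap_px=50)
```

For these boxes the gaps are (120, 200) and (300, 420). Only gaps between words are reported, so page margins never appear as candidates.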
Colored full-width sub-headers (e.g. "Unit 4: Bonnie Scotland")
filled the column gaps in the vertical projection profile and
led to 11 detected columns instead of 5. Rows with >40% ink density
are now masked before the projection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
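A small sketch of the masking step from the message above, using nested lists of 0/1 pixels in place of an image array. The function names are hypothetical; the 40% threshold is the one stated in the message.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def mask_dense_rows(binary: Image, max_density: float = 0.40) -> Image:
    """Blank every row whose ink density exceeds max_density."""
    masked = []
    for row in binary:
        density = sum(row) / len(row) if row else 0.0
        masked.append([0] * len(row) if density > max_density else list(row))
    return masked


def vertical_projection(binary: Image) -> List[int]:
    """Sum ink per column; zeros in the result are column-gap candidates."""
    return [sum(col) for col in zip(*binary)]


text_row = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]   # 30% ink, two text columns
header_row = [1] * 10                        # full-width colored banner
image = [text_row, header_row, text_row]

before = vertical_projection(image)
after = vertical_projection(mask_dense_rows(image))
```

Before masking, the banner puts ink into every column, so no column sums to zero. After masking, the columns between the two text blocks sum to zero again.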
After iterative projection (pass 1) and word-alignment (pass 2), a third
pass uses Tesseract word positions + linear regression per text line to
measure and correct residual rotation. This catches cases where passes 1-2
leave significant slope (e.g. 1.7° residual on heavily skewed scans).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
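The third pass described above can be approximated like this: fit a least-squares line through the word positions of each text line and convert the slope to an angle. Combining the per-line slopes with the median, the min_words filter and the function names are assumptions of this sketch, not details taken from the commit.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) of a word, e.g. its box centre


def line_slope(points: List[Point]) -> Optional[float]:
    """Least-squares slope dy/dx through the words of one text line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all words share one x position, slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def residual_angle_deg(lines: List[List[Point]], min_words: int = 3) -> float:
    """Median slope over all usable text lines, expressed in degrees."""
    slopes = []
    for pts in lines:
        if len(pts) < min_words:
            continue
        slope = line_slope(pts)
        if slope is not None:
            slopes.append(slope)
    if not slopes:
        return 0.0
    return math.degrees(math.atan(median(slopes)))


# Synthetic page: three text lines, each tilted by 1.7 degrees.
tilt = math.tan(math.radians(1.7))
lines = [[(x, y0 + x * tilt) for x in range(0, 500, 100)] for y0 in (50, 90, 130)]
angle = residual_angle_deg(lines)
```

The sign of the angle depends on the image coordinate convention (y usually grows downward), so the rotation applied to correct it may need to be negated.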
- Increase iterative deskew coarse_range from ±2° to ±5° to handle
heavily skewed scans
- New deskew_two_pass(): runs iterative projection first, then
word-alignment on the corrected image to detect/fix residual skew
(applied when residual ≥ 0.3°)
- OCR pipeline API auto_deskew now uses deskew_two_pass by default
- Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass
- Deskew result now includes angle_residual and two_pass_debug
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
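The control flow of a two-pass deskew with a residual threshold might look like the sketch below. The estimators and the rotation are passed in as callables so the example stays self-contained; the function name two_pass, its signature and the residual_applied key are hypothetical, while angle_residual and the 0.3° threshold come from the list above.

```python
from typing import Any, Callable, Dict, Tuple


def two_pass(
    image: Any,
    estimate_first: Callable[[Any], float],
    estimate_residual: Callable[[Any], float],
    rotate: Callable[[Any, float], Any],
    min_residual: float = 0.3,
) -> Tuple[Any, Dict[str, Any]]:
    """Correct with the first estimator, then fix residual skew if large."""
    angle = estimate_first(image)
    corrected = rotate(image, angle)
    # Second pass runs on the already corrected image.
    residual = estimate_residual(corrected)
    applied = abs(residual) >= min_residual
    if applied:
        corrected = rotate(corrected, residual)
    info = {"angle": angle, "angle_residual": residual, "residual_applied": applied}
    return corrected, info


# Toy stand-ins: the "image" is just its remaining skew in degrees.
coarse = lambda skew: float(round(skew))   # first pass only finds whole degrees
exact = lambda skew: skew                  # second pass measures what is left
undo = lambda skew, a: skew - a            # rotating removes that much skew

fixed, info = two_pass(3.4, coarse, exact, undo)
kept, info_small = two_pass(2.1, coarse, exact, undo)
```

In the first call the residual of about 0.4° exceeds the threshold and is corrected; in the second call the residual of about 0.1° is left alone.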
- LLM Compare pages, configs and all references deleted
- Kommunikation category in sidebar with Video & Chat, Voice Service, Alerts
- Compliance SDK category removed from sidebar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Original pages rendered at full resolution (pdf-page-image endpoint, zoom=2.0)
instead of downscaled thumbnails
- Insert-row triangles on left margin between every row (hover to reveal)
- Dynamic extra columns: "+" button in header adds custom columns
(e.g. Aussprache, Wortart), removable via hover-x on column header
- Extra columns stored per-page (pageExtraColumns state) so different
source pages can have different column structures
- Grid template adjusts dynamically based on number of columns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove max-w-7xl constraint on content area so panels stretch to edges
- Fall back to direct API thumbnail URLs when blob URLs are empty
- Original pages now reliably show even if preloaded thumbnails failed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Swap from 3/5-2/5 grid to 1/3-2/3 flexbox (original left, table right)
- Table uses 3 equal 1fr columns for EN/DE/example instead of cramped 13-col grid
- Full viewport height minus header (calc(100vh - 240px)) for more visible rows
- Show only processed pages in original preview (filtered by selectedPages)
- Remove per-row insert buttons to reduce vertical noise
- Compact row spacing (py-1.5) to fit ~15+ rows without scrolling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
process-single-page now runs the full CV pipeline (deskew → dewarp → columns →
rows → cell-first OCR v2 → LLM review) for much better extraction quality.
Falls back to LLM vision if pipeline imports are unavailable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
handleNext() did nothing on the last step (early return). It now resets
the session and steps and navigates back to the session overview.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The side-by-side panels used calc(100vh - 380px) pushing the Speichern/
Abschliessen buttons below the viewport. Reduced to calc(100vh - 580px)
and made the action bar sticky at the bottom.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Horizontal projection of binary image is insensitive at 0.5° because
text rows look nearly identical. The real discriminator is vertical edge
alignment: at the correct angle, word left-edges and column borders
become truly vertical, producing sharp peaks in the vertical projection
of Sobel-X edges. Also: BORDER_REPLICATE + trim to avoid artifacts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Variance is insensitive to 0.5° differences. Gradient score (L2 norm of
first derivative) detects sharp text-line transitions much better.
Also: use horizontal profile in both phases, finer coarse step (0.1°).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
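To illustrate the difference between the two scores named above: variance depends only on which values occur in the profile, whereas the gradient score also depends on how abruptly neighbouring values change. The sketch below uses toy profiles and hypothetical function names; it shows one structural reason for the difference and does not reproduce the measurements behind the commit.

```python
import math
from statistics import pvariance
from typing import List


def variance_score(profile: List[float]) -> float:
    """Population variance of the projection profile."""
    return pvariance(profile)


def gradient_score(profile: List[float]) -> float:
    """L2 norm of the first derivative (differences of neighbours)."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(profile, profile[1:])))


# Same values, different arrangement: many sharp text-line transitions
# versus a single transition.
many_edges = [0, 10, 0, 10, 0, 10, 0, 10]
one_edge = [0, 0, 0, 0, 10, 10, 10, 10]

same_variance = variance_score(many_edges) == variance_score(one_edge)
sharper = gradient_score(many_edges) > gradient_score(one_edge)
```

Both profiles have identical variance, but the gradient score is clearly higher for the profile with many sharp transitions.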
Adds deskew_image_iterative() as 3rd deskew method that directly optimizes
for projection-profile sharpness instead of proxy signals (Hough/word alignment).
Coarse sweep on horizontal profile, fine sweep on vertical profile.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
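A coarse-then-fine angle search of the kind described above can be sketched as follows. The scoring functions are passed in as callables; the function names, ranges and step sizes are illustrative assumptions and are not the values used by deskew_image_iterative().

```python
from typing import Callable

Score = Callable[[float], float]  # maps a candidate angle to a sharpness score


def best_angle(score: Score, lo: float, hi: float, step: float) -> float:
    """Evaluate score on a regular grid and return the best angle."""
    n = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n + 1)]
    return max(candidates, key=score)


def coarse_to_fine(
    coarse_score: Score,
    fine_score: Score,
    coarse_range: float = 5.0,
    coarse_step: float = 0.5,
    fine_step: float = 0.05,
) -> float:
    """Sweep widely with one score, then refine around the winner."""
    coarse = best_angle(coarse_score, -coarse_range, coarse_range, coarse_step)
    return best_angle(fine_score, coarse - coarse_step, coarse + coarse_step, fine_step)


# Toy score with a single peak at 1.7 degrees, used for both phases here.
# In the pipeline the two phases would score different projection profiles.
peak = lambda a: -(a - 1.7) ** 2
found = coarse_to_fine(peak, peak)
```

The coarse sweep lands on 1.5° and the fine sweep around it recovers roughly 1.7°.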
- Add "example" to the spell correction loop; it was only correcting the
"english" and "german" fields, missing umlauts in example sentences
- Use "german" language for example field (mixed-language, umlauts needed)
- Disable cell-level bold detection; it cannot distinguish bold from
non-bold in mixed-format cells (e.g. "cookie ['kuki]")
- Keep _measure_stroke_width and _classify_bold_cells for future
word-level bold detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bold detection:
- Replace absolute threshold with page-level relative comparison
- Measure stroke width for all cells, then mark cells >1.4× median as bold
- Adapts automatically to font, DPI and scan quality
Save buttons:
- Fix status stuck on 'error' preventing re-click
- Better error messages with response body
- Fallback score to 0 when null
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
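The page-level relative comparison for bold detection could be sketched like this. The function name classify_bold and its signature are assumptions; the 1.4× median factor is taken from the message above.

```python
from statistics import median
from typing import List


def classify_bold(stroke_widths: List[float], ratio: float = 1.4) -> List[bool]:
    """Mark cells whose stroke width exceeds ratio times the page median."""
    if not stroke_widths:
        return []
    threshold = ratio * median(stroke_widths)
    return [width > threshold for width in stroke_widths]


# Five cells on one page; the fourth has noticeably thicker strokes.
widths = [2.0, 2.1, 1.9, 3.2, 2.0]
flags = classify_bold(widths)
```

With a median of 2.0 the threshold is 2.8, so only the fourth cell is flagged. Because the threshold scales with the median, the same rule works for a scan at a different DPI where all strokes are proportionally thicker.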
- Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token
for German text — fixes "iberqueren" → "überqueren" etc.
- Add _detect_bold() using OpenCV stroke-width analysis on cell crops
- Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths
- Add is_bold field to GridCell TypeScript interface
- Render bold text in StepGroundTruth reconstruction view
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
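The umlaut confusion rules from the first bullet can be illustrated with a dictionary lookup. This is only a sketch: fix_umlauts and the vocabulary set are hypothetical, and it tries single-character substitutions on lower-case tokens, which may differ from what _spell_fix_token actually does.

```python
from typing import Set

# OCR confusions listed in the commit: plain vowel read instead of umlaut.
UMLAUT_CONFUSIONS = {"i": "ü", "a": "ä", "o": "ö", "u": "ü"}


def fix_umlauts(token: str, vocabulary: Set[str]) -> str:
    """Try one umlaut substitution and accept it if it yields a known word."""
    if token in vocabulary:
        return token  # already a valid word, leave it alone
    for i, ch in enumerate(token):
        replacement = UMLAUT_CONFUSIONS.get(ch)
        if replacement is None:
            continue
        candidate = token[:i] + replacement + token[i + 1:]
        if candidate in vocabulary:
            return candidate
    return token


vocabulary = {"überqueren", "schön", "haus"}
fixed = fix_umlauts("iberqueren", vocabulary)
```

Here fixed is "überqueren". Tokens that are already in the vocabulary, or that have no valid substitution, are returned unchanged.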
- Prepend /klausur-api prefix to original image URL (nginx proxy)
- Remove colored column background stripes, use white background
- Change cell text color to black instead of per-column-type colors
- Calculate font size dynamically from cell bbox height via ResizeObserver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the stub StepGroundTruth with a full side-by-side Original vs
Reconstruction view. Adds VLM-based image region detection (qwen2.5vl),
mflux image generation proxy, sync scroll/zoom, manual region drawing,
and score/notes persistence.
New backend endpoints: detect-images, generate-image, validate, get validation.
New standalone mflux-service (scripts/mflux-service.py) for Metal GPU generation.
Dockerfile.base: adds fonts-liberation (Apache-2.0).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove _fix_character_confusion() from words endpoint (now only in Phase 0)
- Extend spell checker to find real OCR errors via spell.correction()
- Add field-aware dictionary selection (EN/DE) for spell corrections
- Add _normalize_page_ref() for page_ref column (p-60 → p.60)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
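A possible shape for the page reference normalization in the last bullet. Only the p-60 → p.60 case is stated in the commit; the other separators accepted by the pattern below, and the function name without the leading underscore, are assumptions.

```python
import re

# "p", an optional separator that OCR may have misread, then the page number.
_PAGE_REF_RE = re.compile(r"^\s*p\s*[-.,:]?\s*(\d+)\s*$", re.IGNORECASE)


def normalize_page_ref(text: str) -> str:
    """Rewrite page references such as 'p-60' to the canonical 'p.60'."""
    match = _PAGE_REF_RE.match(text)
    if match is None:
        return text  # not a page reference, leave untouched
    return f"p.{match.group(1)}"


normalized = normalize_page_ref("p-60")
```

Text that does not look like a page reference passes through unchanged, so the function can be applied to every cell of the page_ref column.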
Genericity audit findings:
- Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field
is processed, German prefixes were unreachable dead code)
- Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants
- Document _NARROW_COL_THRESHOLD_PCT with empirical rationale
- Document _PAD=3 with DPI context
- Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching
- Reduce all diagnostic logger.info() to logger.debug() in:
_ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets
- Keep only summary-level info logging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fundamentally rearchitect build_cell_grid_v2 to combine the best of
both approaches:
- Broad columns (>15% image width): Use full-page Tesseract word
assignment. Handles IPA brackets, punctuation, sentence flow,
and ellipsis correctly. No garbled phonetics.
- Narrow columns (<15% image width): Use isolated cell-crop OCR
to prevent neighbour bleeding from adjacent broad columns.
This eliminates the need for complex phonetic bracket replacement
on broad columns since full-page Tesseract reads them correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Only process 'english' field for IPA replacement. German and example
fields contain meaningful parenthetical content like (gefrorenes Wasser),
(sich beschweren) that must never be replaced.
- Simplify _is_grammar_bracket_content: only known grammar particles
(with, about/of, sth, etc.) are preserved. Removes the >= 4 chars
heuristic that incorrectly preserved garbled IPA like [breik], [maus].
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
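The whitelist check described above might look roughly like this. The particle set contains only the examples named in the message (the "etc." there is left unfilled), and the function name without the leading underscore and the splitting on whitespace and slashes are assumptions.

```python
import re

# Only the particles named in the commit message; the real list is longer.
GRAMMAR_PARTICLES = {"with", "about", "of", "sth"}


def is_grammar_bracket_content(content: str) -> bool:
    """True if every token inside the brackets is a known grammar particle."""
    tokens = [t for t in re.split(r"[\s/]+", content.strip().lower()) if t]
    return bool(tokens) and all(t in GRAMMAR_PARTICLES for t in tokens)


keep = is_grammar_bracket_content("about/of")
drop = is_grammar_bracket_content("breik")
```

Garbled IPA such as "breik" or "maus" is no longer preserved just because it has four or more characters; it has to consist of whitelisted particles.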
- Replace _is_meaningful_bracket_content with _is_grammar_bracket_content
that uses a whitelist of grammar particles (with, about/of, auf, etc.)
- Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic
- Strip orphan brackets (no word before them) that are garbled IPA
- Preserve correct IPA (contains Unicode IPA chars) and grammar info
- Fix variable name bug (result → text)
Fixes: break [breik] now correctly replaced, cross (with) preserved,
orphan [mais] and {'mani setva] stripped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>