Commit Graph

123 Commits

Author SHA1 Message Date
Benjamin Admin
02631dc4e0 feat: breite Spalten per Word-Gap splitten + gedrehte Scans im Frontend anzeigen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 15s
_split_broad_columns() erkennt EN/DE-Gemisch in breiten Spalten via
Word-Coverage-Analyse und trennt sie am groessten Luecken-Gap.
Thumbnails und Page-Images werden serverseitig per fitz rotiert,
Frontend laedt Thumbnails nach OCR-Processing neu.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 18:16:32 +01:00
Benjamin Admin
a5635e0c43 feat: automatische Orientierungserkennung fuer umgedrehte Scans
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 23s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 15s
Tesseract OSD erkennt 0/90/180/270° Rotation und korrigiert
automatisch vor dem Deskew. Loest das Problem mit Buchscannern,
bei denen jede 2. Seite auf dem Kopf steht.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 17:26:21 +01:00
Benjamin Admin
7a1bd5e82d refactor: positional_column_regions auch in OCR Pipeline verwenden
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Shared Funktion positional_column_regions() in cv_vocab_pipeline.py,
wird jetzt von beiden Pfaden (Vocab-Worksheet + OCR Pipeline Admin)
genutzt. classify_column_types() bleibt als Legacy erhalten.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 17:20:51 +01:00
Benjamin Admin
4532f68173 fix: Word-Validation auf Segment-Woerter beschraenken
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 17s
Woerter aus Sub-Header-Bereichen ueberlappten korrekte Spaltenluecken
und liessen die Word-Validation faelschlich Gaps verwerfen. Jetzt werden
nur Woerter aus dem gewaehlten Segment fuer die Validation verwendet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 23:13:19 +01:00
Benjamin Admin
391449fedf fix: Seite an Sub-Headern segmentieren, groesstes Segment fuer Projektion
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
Statt full-width Zeilen zu maskieren wird die Seite jetzt an grossen
horizontalen Luecken (Sub-Header, Kapitelgrenzen) in Segmente unterteilt.
Das groesste Segment wird fuer die vertikale Projektion verwendet.
Dadurch stoeren Illustrationen und Ueberschriften nicht mehr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 23:07:23 +01:00
Benjamin Admin
cb2b924a7b fix: word-coverage gap detection als Fallback bei Illustrationen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
Wenn pixel-basierte Projektion zu wenige Spaltenluecken findet (z.B.
durch Illustrationen/Grafiken die Luecken fuellen), wird jetzt eine
wort-basierte Gap-Detection als Zwischenschritt vor dem Clustering
ausgefuehrt. Tesseract-Wort-BBs sind immun gegen dekorative Grafiken.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 22:58:27 +01:00
Benjamin Admin
8f3a50b981 fix: full-width Zeilen vor Spaltenerkennung maskieren
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 17s
Farbige Sub-Header (z.B. "Unit 4: Bonnie Scotland") mit voller Breite
fuellten die Spaltenluecken im vertikalen Projektionsprofil auf und
fuehrten zu 11 statt 5 erkannten Spalten. Zeilen mit >40% Tintendichte
werden jetzt vor der Projektion maskiert.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 22:50:27 +01:00
Benjamin Admin
d39d249daa feat: add pass 3 text-line regression to deskew pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 15s
After iterative projection (pass 1) and word-alignment (pass 2), a third
pass uses Tesseract word positions + linear regression per text line to
measure and correct residual rotation. This catches cases where passes 1-2
leave significant slope (e.g. 1.7° residual on heavily skewed scans).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 17:53:11 +01:00
Benjamin Admin
538d5c732e feat: two-pass deskew with wider angle range and residual correction
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
- Increase iterative deskew coarse_range from ±2° to ±5° to handle
  heavily skewed scans
- New deskew_two_pass(): runs iterative projection first, then
  word-alignment on the corrected image to detect/fix residual skew
  (applied when residual ≥ 0.3°)
- OCR pipeline API auto_deskew now uses deskew_two_pass by default
- Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass
- Deskew result now includes angle_residual and two_pass_debug

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 17:34:57 +01:00
Benjamin Admin
b8a9493310 fix: deskew iterative — use vertical Sobel edges + vertical projection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Horizontal projection of binary image is insensitive at 0.5° because
text rows look nearly identical. The real discriminator is vertical edge
alignment: at the correct angle, word left-edges and column borders
become truly vertical, producing sharp peaks in the vertical projection
of Sobel-X edges. Also: BORDER_REPLICATE + trim to avoid artifacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 14:23:43 +01:00
Benjamin Admin
68a6b97654 fix: use gradient score instead of variance for iterative deskew
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
Variance is insensitive to 0.5° differences. Gradient score (L2 norm of
first derivative) detects sharp text-line transitions much better.
Also: use horizontal profile in both phases, finer coarse step (0.1°).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 14:11:19 +01:00
Benjamin Admin
af1b12c97d feat: iterative projection-profile deskew (2-phase variance optimization)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 17s
Adds deskew_image_iterative() as 3rd deskew method that directly optimizes
for projection-profile sharpness instead of proxy signals (Hough/word alignment).
Coarse sweep on horizontal profile, fine sweep on vertical profile.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 13:46:44 +01:00
Benjamin Admin
770aea611f fix: correct example field (fixes iberqueren), disable cell-level bold
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
- Add "example" to spell correction loop — was only correcting
  "english" and "german" fields, missing umlauts in example sentences
- Use "german" language for example field (mixed-language, umlauts needed)
- Disable cell-level bold detection — cannot distinguish bold from
  non-bold in mixed-format cells (e.g. "cookie ['kuki]")
- Keep _measure_stroke_width and _classify_bold_cells for future
  word-level bold detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 13:15:59 +01:00
Benjamin Admin
1a2efbf075 fix: relative bold detection (page median), fix save/finish buttons
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 21s
Bold detection:
- Replace absolute threshold with page-level relative comparison
- Measure stroke width for all cells, then mark cells >1.4× median as bold
- Adapts automatically to font, DPI and scan quality

Save buttons:
- Fix status stuck on 'error' preventing re-click
- Better error messages with response body
- Fallback score to 0 when null

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 13:02:16 +01:00
Benjamin Admin
cd12755da6 feat: OCR umlaut confusion correction + bold detection via stroke-width
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m39s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
- Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token
  for German text — fixes "iberqueren" → "überqueren" etc.
- Add _detect_bold() using OpenCV stroke-width analysis on cell crops
- Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths
- Add is_bold field to GridCell TypeScript interface
- Render bold text in StepGroundTruth reconstruction view

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 12:06:57 +01:00
Benjamin Admin
a58dfca1d8 fix: move char-confusion fix to correction step, add spell + page-ref corrections
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 30s
CI / test-nodejs-website (push) Successful in 20s
CI / nodejs-lint (push) Failing after 10m5s
- Remove _fix_character_confusion() from words endpoint (now only in Phase 0)
- Extend spell checker to find real OCR errors via spell.correction()
- Add field-aware dictionary selection (EN/DE) for spell corrections
- Add _normalize_page_ref() for page_ref column (p-60 → p.60)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 00:26:13 +01:00
Benjamin Admin
fd99d4f875 cleanup: remove sheet-specific code, reduce logging, document constants
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
Genericity audit findings:
- Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field
  is processed, German prefixes were unreachable dead code)
- Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants
- Document _NARROW_COL_THRESHOLD_PCT with empirical rationale
- Document _PAD=3 with DPI context
- Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching
- Reduce all diagnostic logger.info() to logger.debug() in:
  _ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets
- Keep only summary-level info logging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 00:04:02 +01:00
Benjamin Admin
1e0c6bb4b5 feat: hybrid OCR — full-page for broad columns, cell-crop for narrow
Fundamentally rearchitect build_cell_grid_v2 to combine the best of
both approaches:

- Broad columns (>15% image width): Use full-page Tesseract word
  assignment. Handles IPA brackets, punctuation, sentence flow,
  and ellipsis correctly. No garbled phonetics.
- Narrow columns (<15% image width): Use isolated cell-crop OCR
  to prevent neighbour bleeding from adjacent broad columns.

This eliminates the need for complex phonetic bracket replacement
on broad columns since full-page Tesseract reads them correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 23:38:44 +01:00
Benjamin Admin
e6dc3fcdd7 fix: only replace phonetics in english field, fix grammar detection
- Only process 'english' field for IPA replacement. German and example
  fields contain meaningful parenthetical content like (gefrorenes Wasser),
  (sich beschweren) that must never be replaced.
- Simplify _is_grammar_bracket_content: only known grammar particles
  (with, about/of, sth, etc.) are preserved. Removes the >= 4 chars
  heuristic that incorrectly preserved garbled IPA like [breik], [maus].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 23:19:03 +01:00
Benjamin Admin
edbdac3203 fix: improve phonetic bracket replacement logic
- Replace _is_meaningful_bracket_content with _is_grammar_bracket_content
  that uses a whitelist of grammar particles (with, about/of, auf, etc.)
- Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic
- Strip orphan brackets (no word before them) that are garbled IPA
- Preserve correct IPA (contains Unicode IPA chars) and grammar info
- Fix variable name bug (result → text)

Fixes: break [breik] now correctly replaced, cross (with) preserved,
orphan [mais] and {'mani setva] stripped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 23:13:34 +01:00
Benjamin Admin
99573a46ef debug: add phonetic bracket replacement logging 2026-03-04 23:01:01 +01:00
Benjamin Admin
6ad4b84584 fix: broaden phonetic bracket regex to catch Tesseract-garbled IPA
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Tesseract mangles IPA square brackets into curly braces or parentheses
(e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only
matched [...], missing all garbled variants.

- Match any bracket type: [...], {...}, (...) including mixed pairs
- Add _is_meaningful_bracket_content() to preserve legitimate German
  prefixes like (zer)brechen and Tanz(veranstaltung)
- Trigger IPA replacement on any bracket character, not just [

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 22:53:50 +01:00
Benjamin Admin
f94a3836f8 fix: use Tesseract as default engine for cell-first OCR instead of RapidOCR
RapidOCR (PaddleOCR) is optimized for full-page scene text and produces
artifacts on small isolated cell crops: extra characters ("Tanz z",
"er r wollte"), missing punctuation, garbled phonetic transcriptions.

Tesseract works much better on isolated binarized crops with upscaling,
which is exactly what cell-first OCR provides. RapidOCR remains available
as explicit engine choice via the dropdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 22:30:34 +01:00
Benjamin Admin
90ecb46bed fix: force 3x upscale for short RapidOCR crops + lower box_thresh
- Short cell crops (<80px height) are always 3x upscaled for RapidOCR
  to improve recognition of periods, ellipsis, and phonetic symbols
- Lowered Det.box_thresh from 0.6 to 0.4 to detect small characters
  that were being filtered out (dots, brackets, IPA symbols)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 19:47:36 +01:00
Benjamin Admin
bb0e23303c debug: log RapidOCR upscale dimensions to verify scaling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 18:18:03 +01:00
Benjamin Admin
604da26b24 fix: upscale RapidOCR crops to min 150px (was 64px), matching Tesseract
Cell crops of 35-54px height were too small for RapidOCR to detect
text reliably. Uses _ensure_minimum_crop_size(min_dim=150) for
consistent upscaling across all OCR engines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 17:38:06 +01:00
Benjamin Admin
113a1c10e5 fix: add 3px cell padding + upscale small RapidOCR crops + diagnostic logging
- Add 3px padding around cell crops to avoid clipping edge characters
  (parentheses in "Tanz(veranstaltung)", descenders, etc.)
- Upscale small BGR crops for RapidOCR, same as Tesseract path
- Add info-level diagnostic logging to _ocr_cell_crop for debugging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 16:45:59 +01:00
Benjamin Admin
e4bdb3cc24 debug: add diagnostic logging to _ocr_cell_crop for empty cell investigation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 16:35:33 +01:00
Benjamin Admin
d0e7966925 fix: use header/footer row boundaries for _heal_row_gaps in cell-first OCR
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 33s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 20s
Prevents first content row from expanding into header area (causing
"ulary" from "VOCABULARY" to appear in DE column) and last content row
from expanding into footer area (causing page numbers to appear as content).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 15:44:13 +01:00
Benjamin Admin
29c74a9962 feat: cell-first OCR + document type detection + dynamic pipeline steps
Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation,
eliminating neighbour bleeding (e.g. "to", "ps" in marker columns).
Uses ThreadPoolExecutor for parallel Tesseract calls.

Document type detection: Classifies pages as vocab_table, full_text,
or generic_table using projection profiles (<2s, no OCR needed).
Frontend dynamically skips columns/rows steps for full-text pages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 13:52:38 +01:00
Benjamin Admin
00a74b3144 revert: remove marker column OCR special handling
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
The HSV-based coloured marker detection caused false positives in
nearly every marker cell. Coloured markers like red "!" are an
extreme edge case — better handled manually in reconstruction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 11:52:59 +01:00
Benjamin Admin
489835a279 fix: detect red/coloured markers in OCR pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
Two fixes for marker column content (e.g. red "!" marks):

1. Skip _clean_cell_text() noise filter for column_marker — it
   requires 2+ consecutive letters, which drops punctuation-only
   markers like "!" or "*".

2. For marker columns, detect coloured pixels via HSV saturation
   check (S>80) in addition to grayscale darkness. Create a
   binarized image where both dark AND saturated pixels become
   black foreground, so Tesseract can see red markers that appear
   near-white in standard grayscale conversion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 11:38:12 +01:00
Benjamin Admin
f0726d9a2b fix: shrink overlapping neighbors after narrow column expansion
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
When a narrow column expands into neighbor space, the neighbor's
boundaries must be adjusted to avoid overlap. After expansion, left
neighbor's right edge and right neighbor's left edge are trimmed to
match the expanded column's new boundaries, with words re-assigned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 11:12:13 +01:00
Benjamin Admin
ae1f9f7494 fix: expand narrow columns into neighbor space, not just gaps
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Sub-column splits create adjacent columns with 0px gap between them.
The previous expansion only worked with explicit gaps. Now it looks at
where the neighbor's actual words are and claims unused space up to
MIN_WORD_MARGIN (4px) from the nearest word, even if there's no gap
in the column boundaries.

Also added debug logging for expansion input.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 10:49:10 +01:00
Benjamin Admin
e4aff2b27e fix: rewrite Method D to measure vertical column drift instead of text-line slope
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
After deskew, horizontal text lines are already straight (~0° slope).
Method D was measuring this (always ~0°) instead of the actual vertical
shear (column edge drift). This caused it to report 0.112° with 0.96
confidence, overwhelming Method A's correct detection of negative shear.

New Method D groups words by X-position into vertical columns, then
measures how left-edge X drifts with Y position via linear regression.
dx/dy = tan(shear_angle), directly measuring column tilt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 10:31:19 +01:00
Benjamin Admin
9dd77ab54a fix: move column expansion AFTER sub-column split
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
The narrow column expansion was running inside detect_column_geometry()
on the 4 main columns, but the narrowest columns (marker ~14px,
page_ref ~93px) are created AFTERWARDS by _detect_sub_columns().

Extracted expand_narrow_columns() as standalone function and call it
after sub-column splitting in the columns API endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 10:07:40 +01:00
Benjamin Admin
e426de937c fix: expand narrow columns + lower dewarp thresholds for small angles
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Two fixes for edge case where residual shear pushes content out of
narrow columns (marker, page_ref):

1. Column expansion (Step 10): After detection, narrow columns (<10%
   content width) expand into adjacent whitespace gaps, claiming up to
   40% of the gap but never past the nearest word in the neighbor
   column. This gives marker/page_ref columns breathing room.

2. Dewarp sensitivity: Lower minimum angle from 0.15° to 0.08°, lower
   ensemble min confidence from 0.5 to 0.35, lower final threshold
   from 0.5 to 0.4, and skip quality gate for small corrections
   (<0.5°) where projection variance change is negligible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 09:32:47 +01:00
Benjamin Admin
0d3f001acb fix: always include detections in dewarp response, even when no correction applied
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 19s
The detections array was empty when shear was below threshold, hiding
all 4 method results from the frontend Details panel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 09:05:43 +01:00
Benjamin Admin
ab3ecc7c08 feat: OCR pipeline v2.1 – narrow column OCR, dewarp automation, Fabric.js editor
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 15s
Proposal B: Adaptive padding, crop upscaling, PSM selection, row-strip re-OCR
for narrow columns (<15% width) – expected accuracy boost 60-70% → 85-90%.

Proposal A: New text-line straightness detector (Method D), quality gate
(rejects counterproductive corrections), 2-pass projection refinement,
higher confidence thresholds – expected manual dewarp reduction to <10%.

Proposal C: Fabric.js canvas editor with drag/drop, inline editing, undo/redo,
opacity slider, zoom, PDF/DOCX export endpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 22:44:14 +01:00
Benjamin Admin
d1c8075da2 fix: three OCR pipeline UX improvements
1. Rename Step 6 label to "Korrektur" (was "OCR-Zeichenkorrektur")
2. Move _fix_character_confusion from pipeline Step 1 into
   llm_review_entries_streaming so corrections are visible in the UI:
   char changes (| → I, 1 → I, 8 → B) are now emitted as a batch event
   right after the meta event, appearing in the corrections list
3. StepReconstruction: all cells (including empty) are now rendered as
   editable inputs — removed filter that hid empty cells from the editor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 17:31:55 +01:00
Benjamin Admin
f3d61a9394 fix: extend initial Tesseract scan to full image width for word detection
content_roi was cropped to [left_x:right_x] — the detected content boundary.
Words at the right edge of the last column (beyond right_x) were never
found in the initial scan, so they remained missing even after the column
geometry was extended to full image width (w).

Fix: crop to [left_x:w] so all words including those near the right margin
are detected and assigned correctly to the last column.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 17:08:03 +01:00
Benjamin Admin
ab2423bd10 fix: protect numbered list prefixes from 1→I confusion in char fix step
_CHAR_CONFUSION_RULES: standalone "1" → "I" now skips "1." and "1,"
Cross-language fallback rule: same lookahead (?![\d.,]) added
Fixes: "cross = 1. Kreuz" being converted to "cross = I. Kreuz" in Step 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:46:45 +01:00
Benjamin Admin
b914b6f49d fix(columns): extend rightmost column to full image width (w) not content right_x
right_x is the detected content boundary, which can still be several
pixels short of actual text near the page margin. Since the page margin
contains only white space, extending the last column's OCR crop to the
full image width (w) is always safe and prevents right-edge text cutoff.

Affects three locations in detect_column_geometry():
- Word count logging loop
- ColumnGeometry boundary building (Step 8)
- Phantom filter boundary adjustment (Step 9)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:25:07 +01:00
Benjamin Admin
123b7ada0b fix(columns): filter phantom narrow columns + rename step to OCR-Zeichenkorrektur
Phantom column fix:
Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns
(< 3% of content width) with 0 words. These are scan artefacts, not
real columns. New Step 9 in detect_column_geometry():
- Filter columns where width < max(20px, 3% content_w) AND words < 3
- After filtering, extend each remaining column to close the gap with
  its right neighbor, and re-assign words to correct column

Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px
eliminated; neighbors expanded to cover the gap)

UI rename:
- 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur'
- 'LLM-Korrektur starten' → 'Zeichenkorrektur starten'
- Error message updated accordingly
(No LLM involved anymore — spell-checker is the active engine)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:06:59 +01:00
Benjamin Admin
cb61fab77b fix(rows): filter artifact rows and heal gaps for full OCR height
Two new functions:
- _is_artifact_row(): marks rows as artifacts if all detected tokens
  are single characters (scanner shadows produce dots/dashes, not words).
  A real vocabulary row always contains at least one 2+ char word.
- _heal_row_gaps(): after removing empty/artifact rows, expands each
  remaining content row to the midpoint of adjacent gaps, so OCR crops
  are not artificially narrow. First row extends to content top_bound;
  last row to content bottom_bound.

Applied in both build_cell_grid() and build_cell_grid_streaming() after
the word_count>0 filter and before OCR.

Addresses cases like:
- Row 21: scan shadow → single-char artifacts → filtered before OCR
- Row 23: completely empty (word_count=0) → already filtered
- Row 22: real content → now expanded upward/downward to fill the space
  that rows 21 and 23 occupied, giving OCR the correct full height

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:38:58 +01:00
Benjamin Admin
6623a5d10e fix(columns): extend rightmost column to content right edge (right_x)
Previously detect_column_geometry() ended the last column at the start
of the detected right-margin gap (left_x + right_boundary), which could
cut into actual text near the right edge of the Example column.

Since only the page margin lies to the right of the last column, the
rightmost column now always extends to right_x regardless of whether
a right-margin gap was detected. This prevents OCR crops from missing
words at the right edge of wide columns like column_example.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:26:38 +01:00
Benjamin Admin
21ea458fcf feat(ocr-review): replace LLM with rule-based spell-checker (REVIEW_ENGINE=spell)
- Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup
- New spell_review_entries_sync() + spell_review_entries_streaming():
  - Dictionary-backed substitution: checks if corrected word is known
  - Structural rule: digit at pos 0 + lowercase rest → most likely letter
    (e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld")
  - Pattern rule: "|." → "1." for numbered list prefixes
  - Standalone "|" → "I" (capital I)
  - IPA entries still protected via existing _entry_needs_review filter
  - Headings/untranslated words (e.g. "Story") are untouched (no susp. chars)
- llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE
  env var ("spell" default, "llm" to restore previous behaviour)
- docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell}
- LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:04:27 +01:00
Benjamin Admin
b1f7fee284 fix(ocr-review): add pipe→1 as valid OCR correction in _is_spurious_change
Extend _OCR_CHAR_MAP to treat '|' as a possible misread of digit '1'
in addition to letters l/L/i/I. Fixes cases like 'cross = |. Kreuz'
→ 'cross = 1. Kreuz' (numbered list prefix) being rejected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:50:16 +01:00
Benjamin Admin
dc5d76ecf5 fix(llm-review): think=false und Logging in Streaming-Version fehlten
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Die UI nutzt llm_review_entries_streaming, nicht llm_review_entries.
Die Streaming-Version hatte kein think:false → qwen3:0.6b verbrachte
9 Sekunden im Denkprozess ohne Token-Budget für die eigentliche Antwort.

- think: false in Streaming-Version ergänzt
- num_predict: 4096 → 8192 (konsistent mit nicht-streaming)
- Logging für batch-Fortschritt, Response-Länge, geparste Einträge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:43:42 +01:00
Benjamin Admin
1ac47cd9b7 fix(llm-review): JSON-Parse-Fehler durch Control-Zeichen beheben
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Log zeigte: "Invalid control character at: line 28 column 27"
Das Pipe-Zeichen | in OCR-Texten (z.B. "| want" statt "I want")
bricht den JSON-Parser wenn es als Literal im LLM-Response steht.

Fixes:
- _sanitize_for_json(): entfernt ASCII Control-Chars 0x00-0x1f
  (außer Tab/LF/CR die in JSON valid sind)
- | → I als erlaubte OCR-Korrektur in _is_spurious_change und Prompt
- Reverse-Check in _is_spurious_change (l→I etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:37:16 +01:00