Commit Graph

141 Commits

Author SHA1 Message Date
Benjamin Admin
b914b6f49d fix(columns): extend rightmost column to full image width (w) not content right_x
right_x is the detected content boundary, which can still be several
pixels short of actual text near the page margin. Since the page margin
contains only white space, extending the last column's OCR crop to the
full image width (w) is always safe and prevents right-edge text cutoff.

Affects three locations in detect_column_geometry():
- Word count logging loop
- ColumnGeometry boundary building (Step 8)
- Phantom filter boundary adjustment (Step 9)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:25:07 +01:00
Benjamin Admin
123b7ada0b fix(columns): filter phantom narrow columns + rename step to OCR-Zeichenkorrektur
Phantom column fix:
Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns
(< 3% of content width) with 0 words. These are scan artefacts, not
real columns. New Step 9 in detect_column_geometry():
- Filter columns where width < max(20px, 3% content_w) AND words < 3
- After filtering, extend each remaining column to close the gap with
  its right neighbor, and re-assign words to correct column

Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px
eliminated; neighbors expanded to cover the gap)

UI rename:
- 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur'
- 'LLM-Korrektur starten' → 'Zeichenkorrektur starten'
- Error message updated accordingly
(No LLM involved anymore — spell-checker is the active engine)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:06:59 +01:00
Benjamin Admin
cb61fab77b fix(rows): filter artifact rows and heal gaps for full OCR height
Two new functions:
- _is_artifact_row(): marks rows as artifacts if all detected tokens
  are single characters (scanner shadows produce dots/dashes, not words).
  A real vocabulary row always contains at least one 2+ char word.
- _heal_row_gaps(): after removing empty/artifact rows, expands each
  remaining content row to the midpoint of adjacent gaps, so OCR crops
  are not artificially narrow. First row extends to content top_bound;
  last row to content bottom_bound.

Applied in both build_cell_grid() and build_cell_grid_streaming() after
the word_count>0 filter and before OCR.

Addresses cases like:
- Row 21: scan shadow → single-char artifacts → filtered before OCR
- Row 23: completely empty (word_count=0) → already filtered
- Row 22: real content → now expanded upward/downward to fill the space
  that rows 21 and 23 occupied, giving OCR the correct full height

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:38:58 +01:00
Benjamin Admin
6623a5d10e fix(columns): extend rightmost column to content right edge (right_x)
Previously detect_column_geometry() ended the last column at the start
of the detected right-margin gap (left_x + right_boundary), which could
cut into actual text near the right edge of the Example column.

Since only the page margin lies to the right of the last column, the
rightmost column now always extends to right_x regardless of whether
a right-margin gap was detected. This prevents OCR crops from missing
words at the right edge of wide columns like column_example.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:26:38 +01:00
Benjamin Admin
21ea458fcf feat(ocr-review): replace LLM with rule-based spell-checker (REVIEW_ENGINE=spell)
- Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup
- New spell_review_entries_sync() + spell_review_entries_streaming():
  - Dictionary-backed substitution: checks if corrected word is known
  - Structural rule: digit at pos 0 + lowercase rest → most likely letter
    (e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld")
  - Pattern rule: "|." → "1." for numbered list prefixes
  - Standalone "|" → "I" (capital I)
  - IPA entries still protected via existing _entry_needs_review filter
  - Headings/untranslated words (e.g. "Story") are untouched (no susp. chars)
- llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE
  env var ("spell" default, "llm" to restore previous behaviour)
- docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell}
- LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:04:27 +01:00
Benjamin Admin
b1f7fee284 fix(ocr-review): add pipe→1 as valid OCR correction in _is_spurious_change
Extend _OCR_CHAR_MAP to treat '|' as a possible misread of digit '1'
in addition to letters l/L/i/I. Fixes cases like 'cross = |. Kreuz'
→ 'cross = 1. Kreuz' (numbered list prefix) being rejected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:50:16 +01:00
Benjamin Admin
dc5d76ecf5 fix(llm-review): think=false und Logging in Streaming-Version fehlten
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Die UI nutzt llm_review_entries_streaming, nicht llm_review_entries.
Die Streaming-Version hatte kein think:false → qwen3:0.6b verbrachte
9 Sekunden im Denkprozess ohne Token-Budget für die eigentliche Antwort.

- think: false in Streaming-Version ergänzt
- num_predict: 4096 → 8192 (konsistent mit nicht-streaming)
- Logging für batch-Fortschritt, Response-Länge, geparste Einträge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:43:42 +01:00
Benjamin Admin
1ac47cd9b7 fix(llm-review): JSON-Parse-Fehler durch Control-Zeichen beheben
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Log zeigte: "Invalid control character at: line 28 column 27"
Das Pipe-Zeichen | in OCR-Texten (z.B. "| want" statt "I want")
bricht den JSON-Parser wenn es als Literal im LLM-Response steht.

Fixes:
- _sanitize_for_json(): entfernt ASCII Control-Chars 0x00-0x1f
  (außer Tab/LF/CR die in JSON valid sind)
- | → I als erlaubte OCR-Korrektur in _is_spurious_change und Prompt
- Reverse-Check in _is_spurious_change (l→I etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:37:16 +01:00
Benjamin Admin
fa8e38db2d fix(llm-review): Pre-Filter entfernt — alle Einträge ans LLM senden
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 20s
Der digit-in-word Pre-Filter hat alle 41 Einträge geblockt (skipped=41
im Log). OCR-Fehler können nicht im voraus erkannt werden.

Zurück zum ursprünglichen Ansatz: alle nicht-leeren Einträge ohne
IPA-Klammern werden ans LLM gesendet. Schutz gegen Übersetzungen
erfolgt ausschließlich über den strikten Prompt und _is_spurious_change().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:29:46 +01:00
Benjamin Admin
f1b6246838 fix(llm-review): Diagnose-Logging + think=false + <think>-Tag-Stripping
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
- think: false in Ollama API Request (qwen3 disables CoT nativ)
- <think>...</think> Stripping in _parse_llm_json_array (Fallback falls
  think:false nicht greift)
- INFO-Logging: wie viele Einträge gesendet werden, Response-Länge,
  Anzahl geparster Einträge
- DEBUG-Logging: erste 3 Eingabe-Einträge, ersten 500 Zeichen der Antwort
- Bessere Fehlermeldung wenn JSON-Parsing fehlschlägt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:13:08 +01:00
Benjamin Admin
2fce92d7b1 fix(llm-review): LLM übersetzt nicht mehr — nur noch OCR-Ziffernfehler
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
## Problem
qwen3:0.6b interpretierte den Prompt zu weit und versuchte:
- Englische Wörter zu übersetzen (EN-Spalte umschreiben)
- Korrekte deutsche Wörter neu zu übersetzen
- IPA-Einträge in Klammern zu 'korrigieren'

## Fixes

### 1. Strengerer Pre-Filter (entry_needs_review)
Sendet jetzt NUR Einträge ans LLM, die tatsächlich ein
Ziffer-in-Wort-Muster haben (0158 zwischen Buchstaben).
→ Korrekte Einträge werden gar nicht erst gesendet.

### 2. Viel restriktiverer Prompt
- Explizites Verbot: "du übersetzt NICHTS, weder EN→DE noch DE→EN"
- Nur die 5 Ziffer→Buchstaben-Fälle sind erlaubt
- Konkrete Beispiele für erlaubte Korrekturen
- Kein vager "Im Zweifel nicht ändern" — sondern explizites VERBOTEN

### 3. Stärkerer Spurious-Change-Filter
Verwirft LLM-Änderungen, die keine Ziffer→Buchstabe-Substitution sind.
Verhindert Übersetzungen und Neuformulierungen auch wenn der Prompt
sie nicht vollständig unterdrückt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 13:48:54 +01:00
Benjamin Admin
7eb03ca8d1 fix(ocr-pipeline): IndentationError in auto-mode deskew block
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
The try/except block for the deskew step had 4 extra spaces of
indentation from a previous edit. Python rejected the file with
IndentationError at startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 13:21:49 +01:00
Benjamin Admin
50e1c964ee feat(klausur-service): OCR-Pipeline Optimierungen (Improvements 2-4)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
## Improvement 2: VLM-basierter Dewarp
- Neuer Query-Parameter `method` für POST /sessions/{id}/dewarp
  Optionen: ensemble (default) | vlm | cv
- `_detect_shear_with_vlm()`: fragt qwen2.5vl:32b per Ollama nach
  dem Scherwinkel — gibt Zahlenwert + Konfidenz zurück
- `os`, `Query` zu ocr_pipeline_api.py Imports hinzugefügt
- `_apply_shear` aus cv_vocab_pipeline importiert

## Improvement 4: 3-Methoden Ensemble-Dewarp
- `_detect_shear_by_projection()`: Varianz-Sweep ±3° / 0.25°-Schritte
  auf horizontalen Text-Zeilen-Projektionen (~30ms)
- `_detect_shear_by_hough()`: Gewichteter Median über HoughLinesP
  auf Tabellen-Linien, Vorzeichen-Inversion (~20ms)
- `_ensemble_shear()`: Kombiniert alle 3 Methoden (conf >= 0.3),
  Ausreißer-Filter bei >1° Abweichung, Bonus bei Agreement <0.5°
- `dewarp_image()` nutzt jetzt alle 3 Methoden parallel,
  `use_ensemble: bool = True` für Rückwärtskompatibilität
- auto_dewarp Response enthält jetzt `detections`-Array

## Improvement 3: Vollautomatik-Endpoint
- POST /sessions/{id}/run-auto mit RunAutoRequest:
  from_step (1-6), ocr_engine, pronunciation,
  skip_llm_review, dewarp_method
- SSE-Streaming für alle 5+1 Schritte (deskew→dewarp→columns→rows→words→llm-review)
- Jeder Schritt: start / done / skipped / error Events
- Abschluss-Event: {steps_run, steps_skipped}
- LLM-Review-Fehler sind nicht-fatal (Pipeline läuft weiter)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 13:13:20 +01:00
Benjamin Admin
2e0f8632f8 feat(klausur): Handschrift entfernen + Klausur-HTR implementiert
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 15s
Feature 1: Handschrift entfernen via OCR-Pipeline Session
- services/handwriting_detection.py: _detect_pencil() + target_ink Parameter
  ("all" | "colored" | "pencil") für gezielte Tinten-Erkennung
- ocr_pipeline_session_store.py: clean_png + handwriting_removal_meta Spalten
  (idempotentes ALTER TABLE in init_ocr_pipeline_tables)
- ocr_pipeline_api.py: POST /sessions/{id}/remove-handwriting Endpoint
  + "clean" zu valid_types für Image-Serving hinzugefügt

Feature 2: Klausur-HTR (Hochwertige Handschriftenerkennung)
- handwriting_htr_api.py: Neuer Router /api/v1/htr/recognize + /recognize-session
  Primary: qwen2.5vl:32b via Ollama, Fallback: trocr-large-handwritten
- services/trocr_service.py: size Parameter (base | large) für get_trocr_model()
  + run_trocr_ocr() - unterstützt jetzt trocr-large-handwritten
- main.py: HTR Router registriert

Config:
- docker-compose.yml: OLLAMA_HTR_MODEL, HTR_FALLBACK_MODEL
- .env.example: HTR Env-Vars dokumentiert

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 12:04:26 +01:00
Benjamin Admin
606bef0591 fix(ocr-pipeline): overlap-based word assignment and empty row filtering
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 1m14s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
1. Word-to-column assignment now uses overlap-based matching instead of
   center-point matching. This fixes narrow page_ref columns losing
   their last digit (e.g. "p.59" → "p.5") when the digit's center
   falls slightly past the midpoint boundary into the next column.

2. Post-OCR empty row filter: rows where ALL cells have empty text
   are removed after OCR. This catches inter-row gaps that had stray
   Tesseract artifacts giving word_count > 0 but no actual content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 11:00:29 +01:00
Benjamin Admin
ccba2bb887 fix(ocr-pipeline): show sub-columns in reconstruction and LLM review steps
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 21s
- Add marker/bbox_marker fields to WordEntry type
- Add page_ref/column_marker colors to StepReconstruction
- Make StepLlmReview table dynamic based on columns_used metadata,
  showing all detected columns (EN, DE, Example, page_ref, marker)
  instead of hardcoded EN/DE/Beispiel only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 10:36:27 +01:00
Benjamin Admin
75bca1f02d fix(ocr-cells): align cell bboxes exactly to column/row coordinates
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Decouple display bbox from OCR crop region. Display bbox now uses exact
col.x/row.y/col.width/row.height (no padding), so adjacent cells touch
without gaps. OCR crop keeps 4px internal padding for edge character
detection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 09:21:56 +01:00
Benjamin Admin
4d428980c1 refactor(word-step): make table fully generic and fix marker-only row filter
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m43s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic
table driven by columns_used from backend. Labeling, confirmation, counts,
and summary badges are now all cell-based instead of branching on isVocab.

Backend: Change _cells_to_vocab_entries() entry filter from checking only
english/german/example to checking ANY mapped field. This preserves rows
with only marker or source_page content, fixing the issue where marker
sub-columns disappeared at the end of OCR processing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:45:24 +01:00
Benjamin Admin
dea3349b23 fix(ocr-pipeline): preserve sub-column data in vocab table display
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Three fixes for sub-columns disappearing at end of streaming:

1. Backend: add column_marker mapping in _cells_to_vocab_entries()
   so marker text is included in vocab entries (not silently dropped)

2. Frontend types: add source_page and bbox_ref to WordEntry interface

3. Frontend table: show page_ref column (Seite) in vocab table when
   entries have source_page data, instead of only EN/DE/Example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:06:15 +01:00
Benjamin Admin
0d72f2c836 fix(sub-columns): protect sub-columns from column_ignore pre-filter
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 23s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 15s
Add is_sub_column flag to ColumnGeometry. Sub-columns created by
_detect_sub_columns() are now exempt from the edge-column word_count<8
rule that converts them to column_ignore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 07:55:53 +01:00
Benjamin Admin
d6a8c1d821 fix(streaming): include page_ref columns in SSE metadata
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
The streaming word endpoint excluded page_ref from _skip_types,
causing sub-column splits to be lost in the meta event and final
grid_shape. Aligned _skip_types with build_cell_grid_streaming().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 07:48:07 +01:00
Benjamin Admin
6527beae03 fix(sub-columns): exclude header/footer words from alignment clustering
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Header/footer words (page numbers, chapter titles) could pollute the
left-edge alignment bins and trigger false sub-column splits. Now
_detect_header_footer_gaps() runs early and its boundaries are passed
to _detect_sub_columns() to filter those words from clustering and
the split threshold check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 07:33:54 +01:00
Benjamin Admin
3904ddb493 fix(sub-columns): convert relative word positions to absolute coords for split
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 17s
Word 'left' values in ColumnGeometry.words are relative to the content
ROI (left_x), but geo.x is in absolute image coordinates. The split
position was computed from relative word positions and then compared
against absolute geo.x, resulting in negative widths and no splits on
real data. Pass left_x through to _detect_sub_columns to bridge the
two coordinate systems.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 19:16:13 +01:00
Benjamin Admin
6e1a349eed fix(tests): adjust word counts so 10% threshold works correctly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 19:00:14 +01:00
Benjamin Admin
7252f9a956 refactor(ocr-pipeline): use left-edge alignment approach for sub-column detection
Replace gap-based splitting with alignment-bin approach: cluster word
left-edges within 8px tolerance, find the leftmost bin with >= 10% of
words as the true column start, split off any words to its left as a
sub-column. This correctly handles both page references ("p.59") and
misread exclamation marks ("!" → "I") even when the pixel gap is small.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:56:38 +01:00
Benjamin Admin
f13116345b fix(tests): use correct bbox_pct dict format in _cells_to_vocab_entries tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:26:24 +01:00
Benjamin Admin
991984d9c3 fix(tests): pass columns_meta arg to _cells_to_vocab_entries tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:23:55 +01:00
Benjamin Admin
1a246eb059 feat(ocr-pipeline): generic sub-column detection via left-edge clustering
Detects hidden sub-columns (e.g. page references like "p.59") within
already-recognized columns by clustering word left-edge positions and
splitting when a clear minority cluster exists. The sub-column is then
classified as page_ref and mapped to VocabRow.source_page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:18:02 +01:00
Benjamin Admin
0532b2a797 fix(ocr-pipeline): skip edge-touching gaps in header/footer detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Gaps that extend to the image boundary (top/bottom edge) are not valid
content separators — they typically represent dewarp padding. Only gaps
with content on both sides qualify as header/footer boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 17:54:49 +01:00
Benjamin Admin
f1fcc67357 fix(ocr-pipeline): clamp gap detection to img_h to avoid dewarp padding
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
The inverted image can be taller than img_h after dewarp shear
correction, causing footer_y to be detected outside the visible page.
Now clamps the horizontal projection to actual_h = min(inv.shape[0], img_h).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 17:06:58 +01:00
Benjamin Admin
c8981423d4 feat(ocr-pipeline): distinguish header/footer vs margin_top/margin_bottom
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Check for actual ink content in detected top/bottom regions:
- 'header'/'footer' when text is present (e.g. title, page number)
- 'margin_top'/'margin_bottom' when the region is empty page margin

Also update all skip-type sets and color maps for the new types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:55:41 +01:00
Benjamin Admin
f615c5f66d feat(ocr-pipeline): generic header/footer detection via projection gap analysis
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
Replace the trivial top_y/bottom_y threshold check with horizontal
projection gap analysis that finds large whitespace gaps separating
header/footer content from the main body. This correctly detects
headers (e.g. "VOCABULARY" banners) and footers (page numbers) even
when _find_content_bounds includes them in the content area.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:13:48 +01:00
Benjamin Admin
a052f73de3 fix(ocr-pipeline): pass left_x/right_x to classify_column_types in API path
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m45s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
The ocr_pipeline_api.py code path called classify_column_types without
left_x/right_x, so margin regions were never created. Also add logging
to _build_margin_regions for debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 15:42:39 +01:00
Benjamin Admin
34ccdd5fd1 feat(ocr-pipeline): filter scan artifacts in content bounds and add margin regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Thin black lines (1-5px) at page edges from scanning were incorrectly
detected as content, shifting content bounds and creating spurious
IGNORE columns. This filters narrow projection runs (<1% of image
dimension) and introduces explicit margin_left/margin_right regions
for downstream page reconstruction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 15:29:18 +01:00
Benjamin Admin
e718353d9f feat(ocr-pipeline): 6 systematic improvements for robustness, performance & UX
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
   cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
   (lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
   column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
   highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
   any step, with reprocess button on completed pipeline steps

Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:46:38 +01:00
Benjamin Admin
c3a924a620 fix(ocr-pipeline): merge phonetic-only rows and fix bracket noise filter
Two fixes:
1. Tokens ending with ] (e.g. "serva]") were stripped by the noise
   filter because ] was not in the allowed punctuation list.
2. Rows containing only phonetic transcription (e.g. ['mani serva])
   are now merged into the previous vocab entry instead of creating
   a separate (invalid) entry. This prevents the LLM from trying
   to "correct" phonetic fragments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:14:20 +01:00
Benjamin Admin
650f15bc1b fix(ocr-pipeline): tolerate dictionary punctuation in noise filter
The noise filter was stripping words containing hyphens, parentheses,
slashes, and dots (e.g. "money-saver", "Schild(chen)", "(Salat-)Gurke",
"Tanz(veranstaltung)"). Now strips all common dictionary punctuation
before checking for internal noise characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 13:12:40 +01:00
Benjamin Admin
40a77a82f6 fix(ocr-pipeline): use midpoint boundaries for column word assignment
Replace containment-with-padding approach with midpoint-based column
ranges. For adjacent columns, the assignment boundary is the midpoint
between them (Voronoi-style). This prevents padding overlap where words
near column borders (e.g. "We" at the start of example sentences) were
assigned to the preceding column. The last column extends generously to
capture all rightmost text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:53:56 +01:00
Benjamin Admin
87931c35e4 fix(ocr-pipeline): stop noise filter from stripping parenthesized words
_is_noise_tail_token() treated words with unbalanced parentheses like
"selbst)" or "(wir" as OCR noise because the parenthesis counted as
"internal noise". Now strips leading/trailing parentheses before the
noise check, so legitimate words in example sentences like
"We baked ... (wir ... selbst)" are preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:51:28 +01:00
Benjamin Admin
29b1d95acc fix(ocr-pipeline): improve word-column assignment and LLM review accuracy
Word assignment: Replace nearest-center-distance with containment-first
strategy. Words whose center falls within a column's bounds (+ 15% pad)
are assigned to that column before falling back to nearest-center. This
fixes long example sentences losing their rightmost words to adjacent
columns.

LLM review: Strengthen prompt to explicitly forbid changing proper nouns,
place names, and correctly-spelled words. Add _is_spurious_change()
post-filter that rejects case-only changes and hallucinated word
replacements (< 50% character overlap).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:40:26 +01:00
Benjamin Admin
dbf0db0c13 feat(ocr-pipeline): improve LLM review UI + add reconstruction step
StepLlmReview: Show full vocab table with image overlay, row-level
status tracking (pending/active/reviewed/corrected/skipped), and
auto-scroll during SSE streaming. Load previous results on mount.

StepReconstruction: New step 7 with editable text fields at original
bbox positions over dewarped image. Zoom controls, tab navigation,
color-coded columns, save to backend.

Backend: Add POST /sessions/{id}/reconstruction endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:19:21 +01:00
Benjamin Admin
2a493890b6 feat(ocr-pipeline): add SSE streaming and phonetic filter to LLM review
- Stream LLM review results batch-by-batch (8 entries per batch) via SSE
- Frontend shows live progress bar, batch log, and corrections appearing
- Skip entries with IPA phonetic transcriptions (already dictionary-corrected)
- Refactor llm_review_entries into reusable helpers for both streaming and non-streaming paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:46:06 +01:00
Benjamin Admin
e171a736e7 fix(ocr-pipeline): increase LLM timeout to 300s and disable qwen3 thinking
- Add /no_think tag to prompt (qwen3 thinking mode causes massive slowdown)
- Increase httpx timeout from 120s to 300s for large vocab tables
- Improve error logging with traceback and exception type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:31:03 +01:00
Benjamin Admin
938d1d69cf feat(ocr-pipeline): add LLM-based OCR correction step (Step 6)
Replace the placeholder "Koordinaten" step with an LLM review step that
sends vocab entries to qwen3:30b-a3b via Ollama for OCR error correction
(e.g. "8en" → "Ben"). Teachers can review, accept/reject individual
corrections in a diff table before applying them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:13:17 +01:00
Benjamin Admin
e9f368d3ec feat(ocr-pipeline): add abbreviation allowlist to noise filter
Add _KNOWN_ABBREVIATIONS set with ~150 common EN/DE abbreviations
(sth, sb, etc, eg, ie, usw, bzw, vgl, adj, adv, prep, sg, pl, ...).
Tokens matching known abbreviations are never stripped as noise.

Also handle dotted abbreviations (e.g., z.B., i.e.) that have no
2+ consecutive alpha chars by checking the abbreviation set before
the _RE_REAL_WORD filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:46:54 +01:00
Benjamin Admin
3028f421b4 feat(ocr-pipeline): add cell text noise filter for OCR artifacts
Add _clean_cell_text() with three sub-filters to remove OCR noise:
- _is_garbage_text(): vowel/consonant ratio check for phantom row garbage
- _is_noise_tail_token(): dictionary-based trailing noise detection
- _RE_REAL_WORD check for cells with no real words (just fragments)

Handles balanced parentheses "(auf)" and trailing hyphens "under-"
as legitimate tokens while stripping noise like "Es)", "3", "ee", "B".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:19:31 +01:00
Benjamin Admin
2b1c499d54 fix(ocr-pipeline): filter OCR noise from image areas and artifacts
Two generic noise filters added to _ocr_single_cell():

1. Word confidence filter (conf < 30): removes low-confidence words
   before text assembly.  Catches trailing artifacts like "Es)" after
   real text, and standalone noise from image edges.

2. Cell noise filter: clears cells whose entire text has no real
   alphabetic word (>= 2 letters).  Catches fragments like "E:", "3",
   "u", "D", "2.77", "and )" from image areas, while keeping real
   short words like "Ei", "go", "an".

Both filters apply to word-lookup AND cell-OCR fallback results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:56:54 +01:00
Benjamin Admin
72cc77dcf4 fix(ocr-pipeline): cells = result, no post-processing content shuffling
The cell grid IS the result. Each cell stays at its detected position.

Removed _split_comma_entries and _attach_example_sentences from the
pipeline — they were shuffling content between rows/columns, causing
"Mäuse" to appear in a separate row, "stand..." to move to Example,
and "Ei" to disappear.

Now: cells → _cells_to_vocab_entries (1:1 row mapping) →
_fix_character_confusion → _fix_phonetic_brackets → done.

Also lowered pixel-density threshold from 2% to 0.5% for the cell-OCR
fallback so small text like "Ei" is not filtered out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:41:30 +01:00
Benjamin Admin
e3f939a628 refactor(ocr-pipeline): make post-processing fully generic
Three non-generic solutions replaced with universal heuristics:

1. Cell-OCR fallback: instead of restricting to column_en/column_de,
   now checks pixel density (>2% dark pixels) for ANY column type.
   Truly empty cells are skipped without running Tesseract.

2. Example-sentence detection: instead of checking for example-column
   text (worksheet-specific), now uses sentence heuristics (>=4 words
   or ends with sentence punctuation). Short EN text without DE is
   kept as a vocab entry (OCR may have missed the translation).

3. Comma-split: re-enabled with singular/plural detection. Pairs like
   "mouse, mice" / "Maus, Mäuse" are kept together. Verb forms like
   "break, broke, broken" are still split into individual entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:27:30 +01:00
Benjamin Admin
6bca3370e0 fix(ocr-pipeline): fix vocab post-processing destroying correct cell results
Three bugs in the post-processing pipeline were overwriting correct
streaming results with wrong ones:

1. _split_comma_entries was splitting "Maus, Mäuse" into two separate
   entries. Disabled — word forms belong together.

2. _attach_example_sentences treated "Ei" (2 chars) as OCR noise due
   to `len(de) > 2` threshold. Lowered to `len(de) > 1`.

3. _attach_example_sentences wrongly classified rows with EN text but
   no DE (like "stand ...") as example sentences, merging them into
   the previous entry. Now only treats rows as examples if they also
   have no text in the example column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:16:50 +01:00