breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	e4bdb3cc24	debug: add diagnostic logging to _ocr_cell_crop for empty cell investigation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 16:35:33 +01:00
Benjamin Admin	d0e7966925	fix: use header/footer row boundaries for _heal_row_gaps in cell-first OCR Prevents first content row from expanding into header area (causing "ulary" from "VOCABULARY" to appear in DE column) and last content row from expanding into footer area (causing page numbers to appear as content). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 15:44:13 +01:00
Benjamin Admin	68d230c297	fix: use batch-then-stream SSE for cell-first OCR The old per-cell streaming timed out because sequential cell OCR was too slow to send the first event before proxy timeout. Now uses build_cell_grid_v2 (parallel ThreadPoolExecutor) via run_in_executor, then streams all cells at once after batch completes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 14:51:55 +01:00
Benjamin Admin	16dc77e5c2	chore: add migration 005_add_doc_type.sql Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 13:54:56 +01:00
Benjamin Admin	29c74a9962	feat: cell-first OCR + document type detection + dynamic pipeline steps Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation, eliminating neighbour bleeding (e.g. "to", "ps" in marker columns). Uses ThreadPoolExecutor for parallel Tesseract calls. Document type detection: Classifies pages as vocab_table, full_text, or generic_table using projection profiles (<2s, no OCR needed). Frontend dynamically skips columns/rows steps for full-text pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 13:52:38 +01:00
Benjamin Admin	00a74b3144	revert: remove marker column OCR special handling The HSV-based coloured marker detection caused false positives in nearly every marker cell. Coloured markers like red "!" are an extreme edge case, better handled manually in reconstruction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 11:52:59 +01:00
Benjamin Admin	489835a279	fix: detect red/coloured markers in OCR pipeline Two fixes for marker column content (e.g. red "!" marks): 1. Skip _clean_cell_text() noise filter for column_marker: it requires 2+ consecutive letters, which drops punctuation-only markers like "!" or "*". 2. For marker columns, detect coloured pixels via HSV saturation check (S>80) in addition to grayscale darkness. Create a binarized image where both dark AND saturated pixels become black foreground, so Tesseract can see red markers that appear near-white in standard grayscale conversion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 11:38:12 +01:00
Benjamin Admin	f0726d9a2b	fix: shrink overlapping neighbors after narrow column expansion When a narrow column expands into neighbor space, the neighbor's boundaries must be adjusted to avoid overlap. After expansion, left neighbor's right edge and right neighbor's left edge are trimmed to match the expanded column's new boundaries, with words re-assigned. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 11:12:13 +01:00
Benjamin Admin	ae1f9f7494	fix: expand narrow columns into neighbor space, not just gaps Sub-column splits create adjacent columns with 0px gap between them. The previous expansion only worked with explicit gaps. Now it looks at where the neighbor's actual words are and claims unused space up to MIN_WORD_MARGIN (4px) from the nearest word, even if there's no gap in the column boundaries. Also added debug logging for expansion input. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 10:49:10 +01:00
Benjamin Admin	e4aff2b27e	fix: rewrite Method D to measure vertical column drift instead of text-line slope After deskew, horizontal text lines are already straight (~0° slope). Method D was measuring this (always ~0°) instead of the actual vertical shear (column edge drift). This caused it to report 0.112° with 0.96 confidence, overwhelming Method A's correct detection of negative shear. New Method D groups words by X-position into vertical columns, then measures how left-edge X drifts with Y position via linear regression. dx/dy = tan(shear_angle), directly measuring column tilt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 10:31:19 +01:00
Benjamin Admin	9dd77ab54a	fix: move column expansion AFTER sub-column split The narrow column expansion was running inside detect_column_geometry() on the 4 main columns, but the narrowest columns (marker ~14px, page_ref ~93px) are created AFTERWARDS by _detect_sub_columns(). Extracted expand_narrow_columns() as standalone function and call it after sub-column splitting in the columns API endpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 10:07:40 +01:00
Benjamin Admin	e426de937c	fix: expand narrow columns + lower dewarp thresholds for small angles Two fixes for edge case where residual shear pushes content out of narrow columns (marker, page_ref): 1. Column expansion (Step 10): After detection, narrow columns (<10% content width) expand into adjacent whitespace gaps, claiming up to 40% of the gap but never past the nearest word in the neighbor column. This gives marker/page_ref columns breathing room. 2. Dewarp sensitivity: Lower minimum angle from 0.15° to 0.08°, lower ensemble min confidence from 0.5 to 0.35, lower final threshold from 0.5 to 0.4, and skip quality gate for small corrections (<0.5°) where projection variance change is negligible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 09:32:47 +01:00
Benjamin Admin	0d3f001acb	fix: always include detections in dewarp response, even when no correction applied The detections array was empty when shear was below threshold, hiding all 4 method results from the frontend Details panel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 09:05:43 +01:00
Benjamin Admin	c484a89b78	fix: dewarp UI shows detection details, quality gate status, confidence bars - Add DewarpDetection type with per-method results - Expand method labels for all 4 detectors (A-D) - Show green/amber banner: applied vs quality-gate-rejected - Expandable "Details" panel showing all 4 methods with confidence bars - Visual confidence bars instead of plain percentage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 08:39:55 +01:00
Benjamin Admin	d5f2ce4659	fix: Fabric.js v6 API compatibility + CLAUDE.md SSH commands - Replace setBackgroundImage() with backgroundImage property (v6 breaking change) - Replace setWidth/setHeight with Canvas constructor options - Fix opacity handler to use direct property access - Update CLAUDE.md: use git -C and docker compose -f instead of cd Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 23:01:19 +01:00
Benjamin Admin	ab3ecc7c08	feat: OCR pipeline v2.1 – narrow column OCR, dewarp automation, Fabric.js editor Proposal B: Adaptive padding, crop upscaling, PSM selection, row-strip re-OCR for narrow columns (<15% width) – expected accuracy boost 60-70% → 85-90%. Proposal A: New text-line straightness detector (Method D), quality gate (rejects counterproductive corrections), 2-pass projection refinement, higher confidence thresholds – expected manual dewarp reduction to <10%. Proposal C: Fabric.js canvas editor with drag/drop, inline editing, undo/redo, opacity slider, zoom, PDF/DOCX export endpoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 22:44:14 +01:00
Benjamin Admin	970ec1f548	docs: OCR pipeline v2.0.0 – all optimizations from 2026-03-03 documented - Steps 6–8 now fully documented (no longer "Planned") - Step 3: full-width scan, phantom filter details - Step 4: artifact rows, gap healing - Step 6: spell checker, char confusion (_fix_character_confusion), SSE protocol, env vars (REVIEW_ENGINE, OLLAMA_REVIEW_*) - Step 7: reconstruction canvas, empty cells editable - Dependencies table with pyspellchecker as a new dependency - Change history with all 2026-03-03 commits Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 18:42:25 +01:00
Benjamin Admin	a610bc75ba	fix: rename LLM-Korrektur to Korrektur in wizard stepper and types	2026-03-03 17:56:46 +01:00
Benjamin Admin	153f41358b	fix: remove stale allCells dependency in emptyCellIds memo	2026-03-03 17:39:14 +01:00
Benjamin Admin	d1c8075da2	fix: three OCR pipeline UX improvements 1. Rename Step 6 label to "Korrektur" (was "OCR-Zeichenkorrektur") 2. Move _fix_character_confusion from pipeline Step 1 into llm_review_entries_streaming so corrections are visible in the UI: char changes (| → I, 1 → I, 8 → B) are now emitted as a batch event right after the meta event, appearing in the corrections list 3. StepReconstruction: all cells (including empty) are now rendered as editable inputs; removed filter that hid empty cells from the editor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 17:31:55 +01:00
Benjamin Admin	f3d61a9394	fix: extend initial Tesseract scan to full image width for word detection content_roi was cropped to [left_x:right_x] — the detected content boundary. Words at the right edge of the last column (beyond right_x) were never found in the initial scan, so they remained missing even after the column geometry was extended to full image width (w). Fix: crop to [left_x:w] so all words including those near the right margin are detected and assigned correctly to the last column. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 17:08:03 +01:00
Benjamin Admin	ab2423bd10	fix: protect numbered list prefixes from 1→I confusion in char fix step _CHAR_CONFUSION_RULES: standalone "1" → "I" now skips "1." and "1," Cross-language fallback rule: same lookahead (?![\d.,]) added Fixes: "cross = 1. Kreuz" being converted to "cross = I. Kreuz" in Step 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 16:46:45 +01:00
Benjamin Admin	b914b6f49d	fix(columns): extend rightmost column to full image width (w) not content right_x right_x is the detected content boundary, which can still be several pixels short of actual text near the page margin. Since the page margin contains only white space, extending the last column's OCR crop to the full image width (w) is always safe and prevents right-edge text cutoff. Affects three locations in detect_column_geometry(): - Word count logging loop - ColumnGeometry boundary building (Step 8) - Phantom filter boundary adjustment (Step 9) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 16:25:07 +01:00
Benjamin Admin	123b7ada0b	fix(columns): filter phantom narrow columns + rename step to OCR-Zeichenkorrektur Phantom column fix: Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns (< 3% of content width) with 0 words. These are scan artefacts, not real columns. New Step 9 in detect_column_geometry(): - Filter columns where width < max(20px, 3% content_w) AND words < 3 - After filtering, extend each remaining column to close the gap with its right neighbor, and re-assign words to correct column Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px eliminated; neighbors expanded to cover the gap) UI rename: - 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur' - 'LLM-Korrektur starten' → 'Zeichenkorrektur starten' - Error message updated accordingly (No LLM involved anymore — spell-checker is the active engine) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 16:06:59 +01:00
Benjamin Admin	cb61fab77b	fix(rows): filter artifact rows and heal gaps for full OCR height Two new functions: - _is_artifact_row(): marks rows as artifacts if all detected tokens are single characters (scanner shadows produce dots/dashes, not words). A real vocabulary row always contains at least one 2+ char word. - _heal_row_gaps(): after removing empty/artifact rows, expands each remaining content row to the midpoint of adjacent gaps, so OCR crops are not artificially narrow. First row extends to content top_bound; last row to content bottom_bound. Applied in both build_cell_grid() and build_cell_grid_streaming() after the word_count>0 filter and before OCR. Addresses cases like: - Row 21: scan shadow → single-char artifacts → filtered before OCR - Row 23: completely empty (word_count=0) → already filtered - Row 22: real content → now expanded upward/downward to fill the space that rows 21 and 23 occupied, giving OCR the correct full height Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:38:58 +01:00
Benjamin Admin	6623a5d10e	fix(columns): extend rightmost column to content right edge (right_x) Previously detect_column_geometry() ended the last column at the start of the detected right-margin gap (left_x + right_boundary), which could cut into actual text near the right edge of the Example column. Since only the page margin lies to the right of the last column, the rightmost column now always extends to right_x regardless of whether a right-margin gap was detected. This prevents OCR crops from missing words at the right edge of wide columns like column_example. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:26:38 +01:00
Benjamin Admin	21ea458fcf	feat(ocr-review): replace LLM with rule-based spell-checker (REVIEW_ENGINE=spell) - Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup - New spell_review_entries_sync() + spell_review_entries_streaming(): - Dictionary-backed substitution: checks if corrected word is known - Structural rule: digit at pos 0 + lowercase rest → most likely letter (e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld") - Pattern rule: "|." → "1." for numbered list prefixes - Standalone "|" → "I" (capital I) - IPA entries still protected via existing _entry_needs_review filter - Headings/untranslated words (e.g. "Story") are untouched (no susp. chars) - llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE env var ("spell" default, "llm" to restore previous behaviour) - docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell} - LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:04:27 +01:00
Benjamin Admin	b1f7fee284	fix(ocr-review): add pipe→1 as valid OCR correction in _is_spurious_change Extend _OCR_CHAR_MAP to treat '|' as a possible misread of digit '1' in addition to letters l/L/i/I. Fixes cases like 'cross = |. Kreuz' → 'cross = 1. Kreuz' (numbered list prefix) being rejected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:50:16 +01:00
Benjamin Admin	dc5d76ecf5	fix(llm-review): think=false and logging were missing in the streaming version The UI uses llm_review_entries_streaming, not llm_review_entries. The streaming version had no think:false → qwen3:0.6b spent 9 seconds in its thinking process, leaving no token budget for the actual answer. - Added think: false to the streaming version - num_predict: 4096 → 8192 (consistent with non-streaming) - Logging for batch progress, response length, parsed entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:43:42 +01:00
Benjamin Admin	1ac47cd9b7	fix(llm-review): fix JSON parse errors caused by control characters The log showed: "Invalid control character at: line 28 column 27" The pipe character | in OCR text (e.g. "| want" instead of "I want") breaks the JSON parser when it appears as a literal in the LLM response. Fixes: - _sanitize_for_json(): removes ASCII control chars 0x00-0x1f (except Tab/LF/CR, which are valid in JSON) - | → I as a permitted OCR correction in _is_spurious_change and the prompt - Reverse check in _is_spurious_change (l→I etc.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:37:16 +01:00
Benjamin Admin	fa8e38db2d	fix(llm-review): remove pre-filter, send all entries to the LLM The digit-in-word pre-filter blocked all 41 entries (skipped=41 in the log). OCR errors cannot be detected in advance. Back to the original approach: all non-empty entries without IPA brackets are sent to the LLM. Protection against translations relies solely on the strict prompt and _is_spurious_change(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:29:46 +01:00
Benjamin Admin	f1b6246838	fix(llm-review): diagnostic logging + think=false + <think> tag stripping - think: false in the Ollama API request (qwen3 disables CoT natively) - <think>...</think> stripping in _parse_llm_json_array (fallback in case think:false does not take effect) - INFO logging: number of entries sent, response length, number of parsed entries - DEBUG logging: first 3 input entries, first 500 characters of the response - Better error message when JSON parsing fails Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:13:08 +01:00
Benjamin Admin	2fce92d7b1	fix(llm-review): LLM no longer translates, only fixes OCR digit errors ## Problem qwen3:0.6b interpreted the prompt too broadly and tried to: - Translate English words (rewriting the EN column) - Re-translate correct German words - 'Correct' IPA entries in brackets ## Fixes ### 1. Stricter pre-filter (entry_needs_review) Now sends ONLY entries to the LLM that actually contain a digit-in-word pattern (0158 between letters). → Correct entries are not sent at all. ### 2. Much more restrictive prompt - Explicit prohibition: "you translate NOTHING, neither EN→DE nor DE→EN" - Only the 5 digit→letter cases are allowed - Concrete examples of permitted corrections - No vague "when in doubt, do not change", but an explicit FORBIDDEN ### 3. Stronger spurious-change filter Discards LLM changes that are not a digit→letter substitution. Prevents translations and rephrasings even when the prompt does not fully suppress them. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 13:48:54 +01:00
Benjamin Admin	7eb03ca8d1	fix(ocr-pipeline): IndentationError in auto-mode deskew block The try/except block for the deskew step had 4 extra spaces of indentation from a previous edit. Python rejected the file with IndentationError at startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 13:21:49 +01:00
Benjamin Admin	50e1c964ee	feat(klausur-service): OCR pipeline optimizations (Improvements 2-4) ## Improvement 2: VLM-based dewarp - New query parameter `method` for POST /sessions/{id}/dewarp Options: ensemble (default) | vlm | cv - `_detect_shear_with_vlm()`: asks qwen2.5vl:32b via Ollama for the shear angle and returns a numeric value + confidence - Added `os`, `Query` to the ocr_pipeline_api.py imports - `_apply_shear` imported from cv_vocab_pipeline ## Improvement 4: 3-method ensemble dewarp - `_detect_shear_by_projection()`: variance sweep ±3° / 0.25° steps on horizontal text-line projections (~30ms) - `_detect_shear_by_hough()`: weighted median over HoughLinesP on table lines, sign inversion (~20ms) - `_ensemble_shear()`: combines all 3 methods (conf >= 0.3), outlier filter at >1° deviation, bonus for agreement <0.5° - `dewarp_image()` now uses all 3 methods in parallel, `use_ensemble: bool = True` for backward compatibility - auto_dewarp response now contains a `detections` array ## Improvement 3: Fully automatic endpoint - POST /sessions/{id}/run-auto with RunAutoRequest: from_step (1-6), ocr_engine, pronunciation, skip_llm_review, dewarp_method - SSE streaming for all 5+1 steps (deskew→dewarp→columns→rows→words→llm-review) - Each step: start / done / skipped / error events - Final event: {steps_run, steps_skipped} - LLM review errors are non-fatal (pipeline continues) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 13:13:20 +01:00
Benjamin Admin	2e0f8632f8	feat(klausur): handwriting removal + Klausur-HTR implemented Feature 1: Handwriting removal via OCR pipeline session - services/handwriting_detection.py: _detect_pencil() + target_ink parameter ("all" | "colored" | "pencil") for targeted ink detection - ocr_pipeline_session_store.py: clean_png + handwriting_removal_meta columns (idempotent ALTER TABLE in init_ocr_pipeline_tables) - ocr_pipeline_api.py: POST /sessions/{id}/remove-handwriting endpoint + "clean" added to valid_types for image serving Feature 2: Klausur-HTR (high-quality handwriting recognition for exams) - handwriting_htr_api.py: new router /api/v1/htr/recognize + /recognize-session Primary: qwen2.5vl:32b via Ollama, Fallback: trocr-large-handwritten - services/trocr_service.py: size parameter (base | large) for get_trocr_model() + run_trocr_ocr(), now supports trocr-large-handwritten - main.py: HTR router registered Config: - docker-compose.yml: OLLAMA_HTR_MODEL, HTR_FALLBACK_MODEL - .env.example: HTR env vars documented Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 12:04:26 +01:00
Benjamin Admin	606bef0591	fix(ocr-pipeline): overlap-based word assignment and empty row filtering 1. Word-to-column assignment now uses overlap-based matching instead of center-point matching. This fixes narrow page_ref columns losing their last digit (e.g. "p.59" → "p.5") when the digit's center falls slightly past the midpoint boundary into the next column. 2. Post-OCR empty row filter: rows where ALL cells have empty text are removed after OCR. This catches inter-row gaps that had stray Tesseract artifacts giving word_count > 0 but no actual content. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 11:00:29 +01:00
Benjamin Admin	ccba2bb887	fix(ocr-pipeline): show sub-columns in reconstruction and LLM review steps - Add marker/bbox_marker fields to WordEntry type - Add page_ref/column_marker colors to StepReconstruction - Make StepLlmReview table dynamic based on columns_used metadata, showing all detected columns (EN, DE, Example, page_ref, marker) instead of hardcoded EN/DE/Beispiel only Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 10:36:27 +01:00
Benjamin Admin	75bca1f02d	fix(ocr-cells): align cell bboxes exactly to column/row coordinates Decouple display bbox from OCR crop region. Display bbox now uses exact col.x/row.y/col.width/row.height (no padding), so adjacent cells touch without gaps. OCR crop keeps 4px internal padding for edge character detection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 09:21:56 +01:00
Benjamin Admin	4d428980c1	refactor(word-step): make table fully generic and fix marker-only row filter Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic table driven by columns_used from backend. Labeling, confirmation, counts, and summary badges are now all cell-based instead of branching on isVocab. Backend: Change _cells_to_vocab_entries() entry filter from checking only english/german/example to checking ANY mapped field. This preserves rows with only marker or source_page content, fixing the issue where marker sub-columns disappeared at the end of OCR processing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 08:45:24 +01:00
Benjamin Admin	dea3349b23	fix(ocr-pipeline): preserve sub-column data in vocab table display Three fixes for sub-columns disappearing at end of streaming: 1. Backend: add column_marker mapping in _cells_to_vocab_entries() so marker text is included in vocab entries (not silently dropped) 2. Frontend types: add source_page and bbox_ref to WordEntry interface 3. Frontend table: show page_ref column (Seite) in vocab table when entries have source_page data, instead of only EN/DE/Example Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 08:06:15 +01:00
Benjamin Admin	0d72f2c836	fix(sub-columns): protect sub-columns from column_ignore pre-filter Add is_sub_column flag to ColumnGeometry. Sub-columns created by _detect_sub_columns() are now exempt from the edge-column word_count<8 rule that converts them to column_ignore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 07:55:53 +01:00
Benjamin Admin	d6a8c1d821	fix(streaming): include page_ref columns in SSE metadata The streaming word endpoint excluded page_ref from _skip_types, causing sub-column splits to be lost in the meta event and final grid_shape. Aligned _skip_types with build_cell_grid_streaming(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 07:48:07 +01:00
Benjamin Admin	6527beae03	fix(sub-columns): exclude header/footer words from alignment clustering Header/footer words (page numbers, chapter titles) could pollute the left-edge alignment bins and trigger false sub-column splits. Now _detect_header_footer_gaps() runs early and its boundaries are passed to _detect_sub_columns() to filter those words from clustering and the split threshold check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 07:33:54 +01:00
Benjamin Admin	3904ddb493	fix(sub-columns): convert relative word positions to absolute coords for split Word 'left' values in ColumnGeometry.words are relative to the content ROI (left_x), but geo.x is in absolute image coordinates. The split position was computed from relative word positions and then compared against absolute geo.x, resulting in negative widths and no splits on real data. Pass left_x through to _detect_sub_columns to bridge the two coordinate systems. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 19:16:13 +01:00
Benjamin Admin	6e1a349eed	fix(tests): adjust word counts so 10% threshold works correctly Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 19:00:14 +01:00
Benjamin Admin	7252f9a956	refactor(ocr-pipeline): use left-edge alignment approach for sub-column detection Replace gap-based splitting with alignment-bin approach: cluster word left-edges within 8px tolerance, find the leftmost bin with >= 10% of words as the true column start, split off any words to its left as a sub-column. This correctly handles both page references ("p.59") and misread exclamation marks ("!" → "I") even when the pixel gap is small. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:56:38 +01:00
Benjamin Admin	f13116345b	fix(tests): use correct bbox_pct dict format in _cells_to_vocab_entries tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:26:24 +01:00
Benjamin Admin	991984d9c3	fix(tests): pass columns_meta arg to _cells_to_vocab_entries tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:23:55 +01:00
Benjamin Admin	1a246eb059	feat(ocr-pipeline): generic sub-column detection via left-edge clustering Detects hidden sub-columns (e.g. page references like "p.59") within already-recognized columns by clustering word left-edge positions and splitting when a clear minority cluster exists. The sub-column is then classified as page_ref and mapped to VocabRow.source_page. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:18:02 +01:00


163 Commits