breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	4e8ea77140	fix: treat empty columns as structural + label 2-column layout correctly Columns with <=2 words and <15% width are now classified as column_marker instead of as a content column. With 2 wide content columns, the right one is labeled column_example instead of column_de, because the left column contains EN+DE combined. OSD zoom increased from 1.0 to 2.0 for more reliable orientation detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 19:35:21 +01:00
Benjamin Admin	e8ba5ec073	fix: detect orientation at PDF upload instead of only at OCR Rotation is now detected in upload_pdf_get_info(), so thumbnails are already shown the right way up during page selection. Added debug logging for _split_broad_columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 19:11:45 +01:00
Benjamin Admin	02631dc4e0	feat: split wide columns by word gap + show rotated scans in the frontend _split_broad_columns() detects mixed EN/DE content in wide columns via word-coverage analysis and splits them at the largest gap. Thumbnails and page images are rotated server-side via fitz; the frontend reloads thumbnails after OCR processing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 18:16:32 +01:00
Benjamin Admin	a5635e0c43	feat: automatic orientation detection for upside-down scans Tesseract OSD detects 0/90/180/270° rotation and corrects it automatically before the deskew. Solves the problem with book scanners where every second page is upside down. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 17:26:21 +01:00
Benjamin Admin	7a1bd5e82d	refactor: also use positional_column_regions in the OCR pipeline The shared function positional_column_regions() in cv_vocab_pipeline.py is now used by both paths (Vocab-Worksheet + OCR Pipeline Admin). classify_column_types() is retained as legacy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 17:20:51 +01:00
Benjamin Admin	a5df2b6e15	fix: replace column classification in the vocab worksheet with position-based assignment Language-based scoring (classify_column_types) caused swapped columns on page 3 for example sentences with many English function words. The new _positional_column_regions() assigns columns purely geometrically (left→right). OCR Pipeline Admin remains unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 17:07:11 +01:00
Benjamin Admin	4532f68173	fix: restrict word validation to segment words Words from sub-header areas overlapped correct column gaps and caused the word validation to wrongly discard gaps. Now only words from the selected segment are used for validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 23:13:19 +01:00
Benjamin Admin	391449fedf	fix: segment page at sub-headers, use largest segment for projection Instead of masking full-width rows, the page is now split into segments at large horizontal gaps (sub-headers, chapter boundaries). The largest segment is used for the vertical projection. As a result, illustrations and headings no longer interfere. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 23:07:23 +01:00
Benjamin Admin	cb2b924a7b	fix: word-coverage gap detection as fallback for illustrations When pixel-based projection finds too few column gaps (e.g. because illustrations/graphics fill the gaps), a word-based gap detection now runs as an intermediate step before clustering. Tesseract word bounding boxes are immune to decorative graphics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 22:58:27 +01:00
Benjamin Admin	8f3a50b981	fix: mask full-width rows before column detection Coloured full-width sub-headers (e.g. "Unit 4: Bonnie Scotland") filled the column gaps in the vertical projection profile and led to 11 detected columns instead of 5. Rows with >40% ink density are now masked before the projection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 22:50:27 +01:00
Benjamin Admin	2ad391e4e4	feat: fine-tuning with 7 sliders for deskew/dewarp New collapsible panel under "Entzerrung" (rectification) with individual sliders: - 3 rotation sliders (P1 Iterative, P2 Word-Alignment, P3 Textline) - 4 shear sliders (methods A-D) with radio selection - Combined preview and ground-truth saving - Backend: POST /sessions/{id}/adjust-combined endpoint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 18:22:33 +01:00
Benjamin Admin	d39d249daa	feat: add pass 3 text-line regression to deskew pipeline After iterative projection (pass 1) and word-alignment (pass 2), a third pass uses Tesseract word positions + linear regression per text line to measure and correct residual rotation. This catches cases where passes 1-2 leave significant slope (e.g. 1.7° residual on heavily skewed scans). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 17:53:11 +01:00
Benjamin Admin	538d5c732e	feat: two-pass deskew with wider angle range and residual correction - Increase iterative deskew coarse_range from ±2° to ±5° to handle heavily skewed scans - New deskew_two_pass(): runs iterative projection first, then word-alignment on the corrected image to detect/fix residual skew (applied when residual ≥ 0.3°) - OCR pipeline API auto_deskew now uses deskew_two_pass by default - Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass - Deskew result now includes angle_residual and two_pass_debug Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 17:34:57 +01:00
Benjamin Admin	b7ae36e92b	feat: use OCR pipeline instead of LLM vision for vocab worksheet extraction process-single-page now runs the full CV pipeline (deskew → dewarp → columns → rows → cell-first OCR v2 → LLM review) for much better extraction quality. Falls back to LLM vision if pipeline imports are unavailable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 15:35:44 +01:00
Benjamin Admin	b8a9493310	fix: deskew iterative — use vertical Sobel edges + vertical projection Horizontal projection of binary image is insensitive at 0.5° because text rows look nearly identical. The real discriminator is vertical edge alignment: at the correct angle, word left-edges and column borders become truly vertical, producing sharp peaks in the vertical projection of Sobel-X edges. Also: BORDER_REPLICATE + trim to avoid artifacts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 14:23:43 +01:00
Benjamin Admin	68a6b97654	fix: use gradient score instead of variance for iterative deskew Variance is insensitive to 0.5° differences. Gradient score (L2 norm of first derivative) detects sharp text-line transitions much better. Also: use horizontal profile in both phases, finer coarse step (0.1°). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 14:11:19 +01:00
Benjamin Admin	af1b12c97d	feat: iterative projection-profile deskew (2-phase variance optimization) Adds deskew_image_iterative() as 3rd deskew method that directly optimizes for projection-profile sharpness instead of proxy signals (Hough/word alignment). Coarse sweep on horizontal profile, fine sweep on vertical profile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 13:46:44 +01:00
Benjamin Admin	770aea611f	fix: correct example field (fixes iberqueren), disable cell-level bold - Add "example" to spell correction loop — was only correcting "english" and "german" fields, missing umlauts in example sentences - Use "german" language for example field (mixed-language, umlauts needed) - Disable cell-level bold detection — cannot distinguish bold from non-bold in mixed-format cells (e.g. "cookie ['kuki]") - Keep _measure_stroke_width and _classify_bold_cells for future word-level bold detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 13:15:59 +01:00
Benjamin Admin	1a2efbf075	fix: relative bold detection (page median), fix save/finish buttons Bold detection: - Replace absolute threshold with page-level relative comparison - Measure stroke width for all cells, then mark cells >1.4× median as bold - Adapts automatically to font, DPI and scan quality Save buttons: - Fix status stuck on 'error' preventing re-click - Better error messages with response body - Fallback score to 0 when null Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 13:02:16 +01:00
Benjamin Admin	cd12755da6	feat: OCR umlaut confusion correction + bold detection via stroke-width - Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token for German text — fixes "iberqueren" → "überqueren" etc. - Add _detect_bold() using OpenCV stroke-width analysis on cell crops - Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths - Add is_bold field to GridCell TypeScript interface - Render bold text in StepGroundTruth reconstruction view Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 12:06:57 +01:00
Benjamin Admin	1cc69d6b5e	feat: OCR pipeline step 8 — validation view with image detection & generation Replaces the stub StepGroundTruth with a full side-by-side Original vs Reconstruction view. Adds VLM-based image region detection (qwen2.5vl), mflux image generation proxy, sync scroll/zoom, manual region drawing, and score/notes persistence. New backend endpoints: detect-images, generate-image, validate, get validation. New standalone mflux-service (scripts/mflux-service.py) for Metal GPU generation. Dockerfile.base: adds fonts-liberation (Apache-2.0). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 10:40:37 +01:00
Benjamin Admin	293e7914d8	feat: improved OCR pipeline session manager with categories, thumbnails, pipeline logging - Add document_category (10 types) and pipeline_log JSONB columns - Session list: thumbnails, copyable IDs, category/doc_type badges - Inline category dropdown, bulk delete, pipeline step logging - New endpoints: thumbnail, delete-all, pipeline-log, categories - Cleared all 22 old test sessions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 09:44:38 +01:00
Benjamin Admin	a58dfca1d8	fix: move char-confusion fix to correction step, add spell + page-ref corrections - Remove _fix_character_confusion() from words endpoint (now only in Phase 0) - Extend spell checker to find real OCR errors via spell.correction() - Add field-aware dictionary selection (EN/DE) for spell corrections - Add _normalize_page_ref() for page_ref column (p-60 → p.60) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 00:26:13 +01:00
Benjamin Admin	fd99d4f875	cleanup: remove sheet-specific code, reduce logging, document constants Genericity audit findings: - Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field is processed, German prefixes were unreachable dead code) - Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants - Document _NARROW_COL_THRESHOLD_PCT with empirical rationale - Document _PAD=3 with DPI context - Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching - Reduce all diagnostic logger.info() to logger.debug() in: _ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets - Keep only summary-level info logging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 00:04:02 +01:00
Benjamin Admin	1e0c6bb4b5	feat: hybrid OCR — full-page for broad columns, cell-crop for narrow Fundamentally rearchitect build_cell_grid_v2 to combine the best of both approaches: - Broad columns (>15% image width): Use full-page Tesseract word assignment. Handles IPA brackets, punctuation, sentence flow, and ellipsis correctly. No garbled phonetics. - Narrow columns (<15% image width): Use isolated cell-crop OCR to prevent neighbour bleeding from adjacent broad columns. This eliminates the need for complex phonetic bracket replacement on broad columns since full-page Tesseract reads them correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:38:44 +01:00
Benjamin Admin	e6dc3fcdd7	fix: only replace phonetics in english field, fix grammar detection - Only process 'english' field for IPA replacement. German and example fields contain meaningful parenthetical content like (gefrorenes Wasser), (sich beschweren) that must never be replaced. - Simplify _is_grammar_bracket_content: only known grammar particles (with, about/of, sth, etc.) are preserved. Removes the >= 4 chars heuristic that incorrectly preserved garbled IPA like [breik], [maus]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:19:03 +01:00
Benjamin Admin	edbdac3203	fix: improve phonetic bracket replacement logic - Replace _is_meaningful_bracket_content with _is_grammar_bracket_content that uses a whitelist of grammar particles (with, about/of, auf, etc.) - Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic - Strip orphan brackets (no word before them) that are garbled IPA - Preserve correct IPA (contains Unicode IPA chars) and grammar info - Fix variable name bug (result → text) Fixes: break [breik] now correctly replaced, cross (with) preserved, orphan [mais] and {'mani setva] stripped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:13:34 +01:00
Benjamin Admin	99573a46ef	debug: add phonetic bracket replacement logging	2026-03-04 23:01:01 +01:00
Benjamin Admin	6ad4b84584	fix: broaden phonetic bracket regex to catch Tesseract-garbled IPA Tesseract mangles IPA square brackets into curly braces or parentheses (e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only matched [...], missing all garbled variants. - Match any bracket type: [...], {...}, (...) including mixed pairs - Add _is_meaningful_bracket_content() to preserve legitimate German prefixes like (zer)brechen and Tanz(veranstaltung) - Trigger IPA replacement on any bracket character, not just [ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 22:53:50 +01:00
Benjamin Admin	f94a3836f8	fix: use Tesseract as default engine for cell-first OCR instead of RapidOCR RapidOCR (PaddleOCR) is optimized for full-page scene text and produces artifacts on small isolated cell crops: extra characters ("Tanz z", "er r wollte"), missing punctuation, garbled phonetic transcriptions. Tesseract works much better on isolated binarized crops with upscaling, which is exactly what cell-first OCR provides. RapidOCR remains available as explicit engine choice via the dropdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 22:30:34 +01:00
Benjamin Admin	34c649c8be	fix: send SSE keepalive events every 5s during batch OCR Batch OCR takes 30-60s with 3x upscaling. Without keepalive events, proxy servers (Nginx) drop the SSE connection after their read timeout. Now sends keepalive events every 5s to prevent timeout, with elapsed time for debugging. Also checks for client disconnect between keepalives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 22:21:14 +01:00
Benjamin Admin	dd16c88007	fix: retry words request on 400/404 + add backend diagnostic logging Frontend: retry /words POST once after 2s delay if it gets 400/404, which happens when navigating via wizard after container restart (session cache not yet warm). Backend: log when session needs DB reload and when dewarped_bgr is missing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 20:15:54 +01:00
Benjamin Admin	90ecb46bed	fix: force 3x upscale for short RapidOCR crops + lower box_thresh - Short cell crops (<80px height) are always 3x upscaled for RapidOCR to improve recognition of periods, ellipsis, and phonetic symbols - Lowered Det.box_thresh from 0.6 to 0.4 to detect small characters that were being filtered out (dots, brackets, IPA symbols) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 19:47:36 +01:00
Benjamin Admin	bb0e23303c	debug: log RapidOCR upscale dimensions to verify scaling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 18:18:03 +01:00
Benjamin Admin	604da26b24	fix: upscale RapidOCR crops to min 150px (was 64px), matching Tesseract Cell crops of 35-54px height were too small for RapidOCR to detect text reliably. Uses _ensure_minimum_crop_size(min_dim=150) for consistent upscaling across all OCR engines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 17:38:06 +01:00
Benjamin Admin	113a1c10e5	fix: add 3px cell padding + upscale small RapidOCR crops + diagnostic logging - Add 3px padding around cell crops to avoid clipping edge characters (parentheses in "Tanz(veranstaltung)", descenders, etc.) - Upscale small BGR crops for RapidOCR, same as Tesseract path - Add info-level diagnostic logging to _ocr_cell_crop for debugging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 16:45:59 +01:00
Benjamin Admin	e4bdb3cc24	debug: add diagnostic logging to _ocr_cell_crop for empty cell investigation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 16:35:33 +01:00
Benjamin Admin	d0e7966925	fix: use header/footer row boundaries for _heal_row_gaps in cell-first OCR Prevents first content row from expanding into header area (causing "ulary" from "VOCABULARY" to appear in DE column) and last content row from expanding into footer area (causing page numbers to appear as content). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 15:44:13 +01:00
Benjamin Admin	68d230c297	fix: use batch-then-stream SSE for cell-first OCR The old per-cell streaming timed out because sequential cell OCR was too slow to send the first event before proxy timeout. Now uses build_cell_grid_v2 (parallel ThreadPoolExecutor) via run_in_executor, then streams all cells at once after batch completes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 14:51:55 +01:00
Benjamin Admin	16dc77e5c2	chore: add migration 005_add_doc_type.sql Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 13:54:56 +01:00
Benjamin Admin	29c74a9962	feat: cell-first OCR + document type detection + dynamic pipeline steps Cell-First OCR (v2): Each cell is cropped and OCR'd in isolation, eliminating neighbour bleeding (e.g. "to", "ps" in marker columns). Uses ThreadPoolExecutor for parallel Tesseract calls. Document type detection: Classifies pages as vocab_table, full_text, or generic_table using projection profiles (<2s, no OCR needed). Frontend dynamically skips columns/rows steps for full-text pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 13:52:38 +01:00
Benjamin Admin	00a74b3144	revert: remove marker column OCR special handling The HSV-based coloured marker detection caused false positives in nearly every marker cell. Coloured markers like red "!" are an extreme edge case — better handled manually in reconstruction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 11:52:59 +01:00
Benjamin Admin	489835a279	fix: detect red/coloured markers in OCR pipeline Two fixes for marker column content (e.g. red "!" marks): 1. Skip _clean_cell_text() noise filter for column_marker — it requires 2+ consecutive letters, which drops punctuation-only markers like "!" or "*". 2. For marker columns, detect coloured pixels via HSV saturation check (S>80) in addition to grayscale darkness. Create a binarized image where both dark AND saturated pixels become black foreground, so Tesseract can see red markers that appear near-white in standard grayscale conversion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 11:38:12 +01:00
Benjamin Admin	f0726d9a2b	fix: shrink overlapping neighbors after narrow column expansion When a narrow column expands into neighbor space, the neighbor's boundaries must be adjusted to avoid overlap. After expansion, left neighbor's right edge and right neighbor's left edge are trimmed to match the expanded column's new boundaries, with words re-assigned. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 11:12:13 +01:00
Benjamin Admin	ae1f9f7494	fix: expand narrow columns into neighbor space, not just gaps Sub-column splits create adjacent columns with 0px gap between them. The previous expansion only worked with explicit gaps. Now it looks at where the neighbor's actual words are and claims unused space up to MIN_WORD_MARGIN (4px) from the nearest word, even if there's no gap in the column boundaries. Also added debug logging for expansion input. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 10:49:10 +01:00
Benjamin Admin	e4aff2b27e	fix: rewrite Method D to measure vertical column drift instead of text-line slope After deskew, horizontal text lines are already straight (~0° slope). Method D was measuring this (always ~0°) instead of the actual vertical shear (column edge drift). This caused it to report 0.112° with 0.96 confidence, overwhelming Method A's correct detection of negative shear. New Method D groups words by X-position into vertical columns, then measures how left-edge X drifts with Y position via linear regression. dx/dy = tan(shear_angle), directly measuring column tilt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 10:31:19 +01:00
Benjamin Admin	9dd77ab54a	fix: move column expansion AFTER sub-column split The narrow column expansion was running inside detect_column_geometry() on the 4 main columns, but the narrowest columns (marker ~14px, page_ref ~93px) are created AFTERWARDS by _detect_sub_columns(). Extracted expand_narrow_columns() as standalone function and call it after sub-column splitting in the columns API endpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 10:07:40 +01:00
Benjamin Admin	e426de937c	fix: expand narrow columns + lower dewarp thresholds for small angles Two fixes for edge case where residual shear pushes content out of narrow columns (marker, page_ref): 1. Column expansion (Step 10): After detection, narrow columns (<10% content width) expand into adjacent whitespace gaps, claiming up to 40% of the gap but never past the nearest word in the neighbor column. This gives marker/page_ref columns breathing room. 2. Dewarp sensitivity: Lower minimum angle from 0.15° to 0.08°, lower ensemble min confidence from 0.5 to 0.35, lower final threshold from 0.5 to 0.4, and skip quality gate for small corrections (<0.5°) where projection variance change is negligible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 09:32:47 +01:00
Benjamin Admin	0d3f001acb	fix: always include detections in dewarp response, even when no correction applied The detections array was empty when shear was below threshold, hiding all 4 method results from the frontend Details panel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 09:05:43 +01:00
Benjamin Admin	ab3ecc7c08	feat: OCR pipeline v2.1 – narrow column OCR, dewarp automation, Fabric.js editor Proposal B: Adaptive padding, crop upscaling, PSM selection, row-strip re-OCR for narrow columns (<15% width) – expected accuracy boost 60-70% → 85-90%. Proposal A: New text-line straightness detector (Method D), quality gate (rejects counterproductive corrections), 2-pass projection refinement, higher confidence thresholds – expected manual dewarp reduction to <10%. Proposal C: Fabric.js canvas editor with drag/drop, inline editing, undo/redo, opacity slider, zoom, PDF/DOCX export endpoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 22:44:14 +01:00
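
Note on commit e4aff2b27e: the rewritten Method D is described as a linear regression of word left-edge X against Y within one vertical column, with dx/dy = tan(shear_angle). The following is only a minimal sketch of that relation, not the repository's code. The function name `estimate_shear_deg`, the `(x, y)` input format, and the sign convention (positive when X grows with Y) are assumptions made for illustration.

```python
import math


def estimate_shear_deg(left_edges):
    """Estimate shear angle in degrees from (x, y) left-edge points.

    Hypothetical helper: `left_edges` holds the left-edge X and the Y
    position of each word in ONE vertical column. X is regressed on Y by
    ordinary least squares, so the slope is dx/dy = tan(shear_angle).
    """
    n = len(left_edges)
    if n < 2:
        raise ValueError("need at least two words to fit a line")

    mean_x = sum(x for x, _ in left_edges) / n
    mean_y = sum(y for _, y in left_edges) / n

    # Variance of Y and covariance of X with Y (unnormalised).
    syy = sum((y - mean_y) ** 2 for _, y in left_edges)
    if syy == 0:
        raise ValueError("all words share the same Y; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in left_edges)

    slope = sxy / syy  # dx/dy
    return math.degrees(math.atan(slope))


# Example: a column whose left edge drifts 1 px to the right per 100 px down.
column = [(100 + 0.01 * y, y) for y in range(0, 1000, 100)]
angle = estimate_shear_deg(column)
```

A perfectly vertical column yields 0°, and a column drifting left with increasing Y yields a negative angle under the assumed sign convention. How the pipeline groups words into columns or weights several columns is not stated in the commit message and is not modelled here.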

151 Commits