breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	be86a7d14d	fix: preserve pipe syllable dividers + detect alphabet sidebar columns 1. Pipe divider fix: Changed OCR char-confusion regex so | between letters (Ka|me|rad) is NOT converted to I. Only standalone/ word-boundary pipes are converted (|ch → Ich, | want → I want). 2. Alphabet sidebar detection improvements: - _filter_decorative_margin() now considers 2-char words (OCR reads "Aa", "Bb" from sidebars), lowered min strip from 8→6 - _filter_border_strip_words() lowered decorative threshold from 50%→45% - New step 4f: grid-level thin-edge-column filter as safety net — removes edge columns with <35% fill rate and >60% short text Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:52:11 +01:00
Benjamin Admin	a1e079b911	feat: Sprint 1 — IPA hardening, regression framework, ground-truth review Track A (Backend): - Compound word IPA decomposition (schoolbag→school+bag) - Trailing garbled IPA fragment removal after brackets (R21 fix) - Regression runner with DB persistence, history endpoints - Page crop determinism verified with tests Track B (Frontend): - OCR Regression dashboard (/ai/ocr-regression) - Ground Truth Review workflow (/ai/ocr-ground-truth) with split-view, confidence highlighting, inline edit, batch mark, progress tracking Track C (Docs): - OCR-Pipeline.md v5.0 (Steps 5e-5h) - Regression testing guide - mkdocs.yml nav update Track D (Infra): - TrOCR baseline benchmark script - run-regression.sh shell script - Migration 008: regression_runs table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 09:21:27 +01:00
Benjamin Admin	1f7989cfc2	Fix grammar bracket detection: split on spaces too, not just slashes _is_grammar_bracket_content now splits "no pl" into ["no", "pl"] instead of treating it as single token "no pl". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:45:35 +01:00
Benjamin Admin	ef5aed6a98	Preserve grammar annotations (pl), (no pl) and skip articles in IPA Two fixes: 1. Add pl, sg, no, also, ae, be etc. to _GRAMMAR_BRACKET_WORDS so annotations like (pl) and (no pl) are not replaced with IPA. 2. Skip articles (the, a, an) in fix_ipa_continuation_cell — they never get IPA in vocabulary books. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:42:44 +01:00
Benjamin Admin	a579c31ddb	Fix IPA continuation: skip words with inline IPA, recover emptied cells Three fixes: 1. fix_ipa_continuation_cell: when headword has inline IPA like "beat [bˈiːt] , beat, beaten", only generate IPA for uncovered words (beaten), not words already shown (beat). When bracket is at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly. 2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied the cell text (e.g. "[n, nn]" → ""). 3. Added 2 tests for inline IPA behavior (35 total). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:31:54 +01:00
Benjamin Admin	92a7b85c2d	Fix IPA continuation: only process fully-bracketed cells, keep phrasal verb particles Two fixes: 1. Step 5d now only treats cells as continuation when text is entirely inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets (e.g. "employee [im'ploi:]") are no longer overwritten. 2. fix_ipa_continuation_cell no longer skips grammar words like "down" — they are part of the headword in phrasal verbs like "close sth. down". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 00:43:51 +01:00
Benjamin Admin	6bfa9eed86	Fix garbled IPA detection for bracket-notation like [n, nn] and [1uedtX,1] - Detect bracketed text without real IPA symbols as garbled OCR phonetics - Allow IPA continuation fix even when other columns have content (for rows where EN cell is clearly garbled bracketed IPA) - Strip parenthetical grammar annotations like (no pl) from headword before IPA lookup in fix_ipa_continuation_cell Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:28:00 +01:00
Benjamin Admin	2e6ab3a646	Fix IPA marker split: walk back max 3 chars for onset cluster The walk-back was going 4 chars, eating the last letter of the headword: "schoolbag" → "schoolba". Limiting to 3 gives correct split: "schoolbag" + "[sku:lbæg]". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:57:15 +01:00
Benjamin Admin	cc5ee74921	Use OCR-recognized IPA when word not in dictionary For merged tokens like "schoolbagsku:lbæg", split at IPA marker boundary instead of prefix-matching to a shorter dictionary word. Result: "schoolbag [sku:lbæg]" instead of "school [skˈuːl]". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:55:36 +01:00
Benjamin Admin	21d37b5da1	Fix prefix matching: use alpha-only chars, min 4-char prefix Prevents false positives where punctuation (apostrophes) in merged tokens caused wrong dictionary matches (e.g. "'se" from "'sekandarr" matching as a word, breaking IPA continuation row fix). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:40:37 +01:00
Benjamin Admin	19cbbf310a	Improve garbled IPA cleanup: trailing strip, prefix match, broader guard 1. Strip trailing garbled IPA after proper [IPA] brackets (e.g. "sea [sˈiː] si:" → "sea [sˈiː]") 2. Add prefix matching for merged tokens where OCR joined headword with garbled IPA (e.g. "schoolbagsku:lbæg" → "schoolbag [skˈuːlbæɡ]") 3. Broaden guard to also trigger on trailing non-dictionary words (e.g. "scare skea" → "scare [skˈɛə]") Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:36:25 +01:00
Benjamin Admin	fc0ab84e40	Fix garbled IPA in continuation rows using headword lookup IPA continuation rows (phonetic transcription that wraps below the headword) now get proper IPA by looking up headwords from the row above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:28:14 +01:00
Benjamin Admin	038eaf783c	Only insert IPA when garbled phonetics exist in OCR text _insert_missing_ipa was adding dictionary IPA to cells that had NO phonetic transcription on the original page (e.g. "scissors" heading, "scarf - scarves" without IPA). Now guarded by _text_has_garbled_ipa() which checks for OCR-mangled phonetic markers (stress marks, length marks, IPA special chars) before allowing insertion. Rule: if a line has no phonetics, don't add any. Where garbled IPA exists, replace it with correct IPA notation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 09:59:21 +01:00
Benjamin Admin	8ef4c089cf	Remove IPA continuation rows and support hyphenated word lookup - grid_editor_api: After IPA correction, detect rows containing only garbled phonetics in the English column (no German translation, no IPA brackets inserted). These are wrap-around lines where printed IPA extends to the line below the headword. Remove them since the headword row already has correct IPA. - cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form as fallback (e.g. "second-hand" → "secondhand") for dictionary lookup, fixing IPA insertion for compound words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 12:05:38 +01:00
Benjamin Admin	b98ea33a3a	Strip garbled OCR phonetics after IPA insertion _insert_missing_ipa now removes garbled phonetic text (e.g. "skea", "sku:l", "'sizaz") that follows the inserted IPA bracket. Keeps delimiters (–, -), uppercase words (German), and known English words. Fixes: "scare [skˈɛə] skea" → "scare [skˈɛə]" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 11:15:14 +01:00
Benjamin Admin	b83b38e7f2	feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 08:26:04 +01:00
Benjamin Admin	685d135be5	fix: downscale large images before PaddleOCR (Traefik 60s limit) Images larger than 1500px are downscaled before upload. Coordinates are scaled back afterwards. JPEG instead of PNG for faster upload. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:28:58 +01:00
Benjamin Admin	a6069631cc	feat: PaddleOCR remote engine (PP-OCRv5 Latin on Hetzner x86_64) PaddleOCR as a new engine=paddle option in the OCR pipeline. Microservice on Hetzner (paddleocr-service/), async HTTP client (paddleocr_remote.py), frontend dropdown, automatically words_first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 09:31:22 +01:00
Benjamin Admin	2e21a4b6d0	fix: insert IPA only when word_boxes show a gap >80px (no false IPA) _has_ipa_gap() checks whether Tesseract missed an IPA bracket, based on the physical distance between the headword and the next word. Without a gap (e.g. "be good at sth.", "Focus on language"), no IPA is inserted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:40:18 +01:00
Benjamin Admin	d98dba9098	fix: insert headword IPA in long column_text lines too _insert_missing_ipa skips texts with >6 words or brackets. New _insert_headword_ipa for column_text: checks only the first word of the line, regardless of text length or existing brackets. Also fixed _sync_word_boxes_after_ipa_insert: token comparison now walks both sequences in parallel instead of using zip (shifted positions). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:25:38 +01:00
Benjamin Admin	cd13eca290	fix: IPA insertion for column_text with word_boxes synchronization For column_text, missing IPA transcriptions (challenge, profit, film, badge) are inserted again, but a synthetic word_box is created at the same time so that the 1:1 token-to-box mapping in the overlay is preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:15:26 +01:00
Benjamin Admin	aa7db43f02	fix: column_text: only replace garbled IPA, no insertion/removal For column_text (full-page overlay with mixed EN+DE text): - Do not insert IPA (would change the token count and break overlay positions) - Do not remove orphan brackets (they are often German meanings such as (probieren)) - Only replace garbled IPA (e.g. [teıst] -> [tˈeɪst]) column_en keeps full processing (replace + strip + insert). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:05:37 +01:00
Benjamin Admin	4afd5bd8e8	fix: no longer remove bracketed words like (probieren), (Profit) as garbled IPA _strip_orphan_bracket removed German meanings given in parentheses because they were recognized neither as grammar particles nor as IPA. Fix: bracket contents with real words (>=4 letters) are kept. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 22:47:01 +01:00
Benjamin Admin	2f51ac617f	feat: insert IPA transcription into cell texts (for overlay mode) fix_cell_phonetics() replaces faulty IPA brackets AND inserts missing transcriptions for English words (e.g. badge, film, challenge, profit). It is applied to all cells with col_type column_en/column_text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 15:47:26 +01:00
Benjamin Admin	23b7840ea7	feat: full-row OCR with spacing for box sub-sessions Sub-sessions skip column detection and instead use a pseudo-column spanning the full width. Text is reconstructed with proportional spacing from word positions to preserve the spatial layout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 08:28:29 +01:00
Benjamin Admin	cf9dde9876	fix: move _group_words_into_lines to cv_ocr_engines.py The function was defined only in cv_review.py but was also used in cv_ocr_engines.py and cv_layout.py, causing a NameError at runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 15:24:56 +01:00
Benjamin Admin	9a5a35bff1	refactor: split cv_vocab_pipeline.py into 6 modules (8163 → 6 + facade) Monolithic 8163-line file split into focused modules: - cv_vocab_types.py (156 lines): dataclasses, constants, IPA, feature flags - cv_preprocessing.py (1166 lines): image I/O, orientation, deskew, dewarp - cv_layout.py (3036 lines): document type, columns, rows, classification - cv_ocr_engines.py (1282 lines): OCR engines, vocab postprocessing, text cleaning - cv_cell_grid.py (1510 lines): cell grid v2+legacy, vocab conversion - cv_review.py (1184 lines): LLM/spell review, pipeline orchestration cv_vocab_pipeline.py is now a re-export facade (35 lines); all existing imports remain unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 23:46:47 +01:00

27 Commits
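
The pipe-divider rule from commit be86a7d14d can be sketched roughly as follows. This is only an illustration of the behavior described in the commit message (pipes between letters are kept, standalone or word-boundary pipes become "I"). The pattern and the function name `fix_pipe_confusion` are assumptions, not the repository's actual code.

```python
import re

# Hypothetical pattern (not taken from the repository): match a pipe unless it
# sits directly between two letters. [^\W\d_] matches any Unicode letter.
_PIPE_NOT_BETWEEN_LETTERS = re.compile(r"(?<![^\W\d_])\||\|(?![^\W\d_])")


def fix_pipe_confusion(text: str) -> str:
    """Replace OCR pipes with 'I', keeping syllable dividers like Ka|me|rad."""
    return _PIPE_NOT_BETWEEN_LETTERS.sub("I", text)


if __name__ == "__main__":
    # The three examples named in the commit message.
    for sample in ("Ka|me|rad", "|ch", "| want"):
        print(sample, "->", fix_pipe_confusion(sample))
```

Using a negated lookaround on both sides keeps the rule symmetric: a pipe is left alone only when a letter stands on each side of it.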