breakpilot-lehrer/klausur-service
Benjamin Admin ab294d5a6f feat(ocr-pipeline): deterministic post-processing pipeline
Add four post-processing steps that run after OCR (no LLM needed):

1. Character confusion fix: correct I/1/l/| confusions using
   cross-language context (if DE has "Ich", an EN "1" becomes "I")
2. IPA dictionary replacement: detect [phonetics] brackets and look up
   the correct IPA in eng_to_ipa (MIT, 134k words), replacing the OCR'd
   phonetic symbols with the dictionary transcription
3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen"
   → 3 individual entries when part counts match
4. Example sentence attachment: rows with EN but no DE translation
   get attached as examples to the preceding vocab entry
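
Steps 3 and 4 could be sketched roughly as follows. This is a minimal
illustration, not the code from this commit: the Entry class, its field
names, and both function names are hypothetical, and it assumes the
split runs before example attachment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    # Hypothetical row shape; the real pipeline's data model is not shown here.
    en: str
    de: str = ""
    examples: list[str] = field(default_factory=list)


def split_comma_forms(entry: Entry) -> list[Entry]:
    """Step 3: split comma-separated forms when EN and DE part counts match."""
    en_parts = [p.strip() for p in entry.en.split(",") if p.strip()]
    de_parts = [p.strip() for p in entry.de.split(",") if p.strip()]
    if len(en_parts) > 1 and len(en_parts) == len(de_parts):
        parts = [Entry(en=e, de=d) for e, d in zip(en_parts, de_parts)]
        # Keep any examples already collected on the first resulting entry.
        parts[0].examples = list(entry.examples)
        return parts
    # Counts differ (or nothing to split): leave the row untouched.
    return [entry]


def attach_examples(rows: list[Entry]) -> list[Entry]:
    """Step 4: fold rows with EN text but no DE into the preceding entry."""
    result: list[Entry] = []
    for row in rows:
        if row.en and not row.de.strip() and result:
            result[-1].examples.append(row.en)
        else:
            result.append(row)
    return result


rows = [
    Entry("break, broke, broken", "brechen, brach, gebrochen"),
    Entry("He broke the window."),
]
split_rows = [e for row in rows for e in split_comma_forms(row)]
entries = attach_examples(split_rows)
```

With this ordering, the example sentence ends up on the last entry
produced by the split, because that is the row directly preceding it.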

All fixes are deterministic and generic, with no hardcoded word lists.
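
As an illustration of a rule that works without a word list, step 1
might look something like the sketch below. The function name and the
regular expressions are assumptions for illustration only; the actual
rule in this commit may use more context.

```python
import re

# A standalone 1, l or |, i.e. one that is not part of a longer word or number.
STANDALONE = re.compile(r"(?<![\w|])[1l|](?![\w|])")
# The German pronoun "ich" as a whole word, in any capitalisation.
GERMAN_ICH = re.compile(r"\bich\b", flags=re.IGNORECASE)


def fix_pronoun_i(en: str, de: str) -> str:
    """Rewrite standalone 1/l/| in the EN text to "I" if DE contains "ich"."""
    if not GERMAN_ICH.search(de):
        # No cross-language evidence for a pronoun: leave the EN text alone.
        return en
    return STANDALONE.sub("I", en)


fixed = fix_pronoun_i("1 am tired", "Ich bin müde")
```

One known limitation of this simplified rule: a standalone "1" that is
really meant as a number would also be rewritten whenever the German
side contains "ich".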

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:00:09 +01:00