breakpilot-lehrer/klausur-service/backend/cv_vocab_pipeline.py at e3aa8e899ee912243a6aa90d8a2b40301f8e9f75

Files

Benjamin Admin ab294d5a6f feat(ocr-pipeline): deterministic post-processing pipeline

Add 4 post-processing steps after OCR (no LLM needed):

1. Character confusion fix: I/1/l/| correction using cross-language
   context (if DE has "Ich", EN "1" → "I")
2. IPA dictionary replacement: detect [phonetics] brackets, look up
   correct IPA from eng_to_ipa (MIT, 134k words) — replaces OCR'd
   phonetic symbols with dictionary-correct transcription
3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen"
   → 3 individual entries when part counts match
4. Example sentence attachment: rows with EN but no DE translation
   get attached as examples to the preceding vocab entry

All fixes are deterministic and generic — no hardcoded word lists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-28 21:00:09 +01:00

120 KiB

Raw Blame History

View Raw

120 KiB Raw Blame History

120 KiB

Raw Blame History