refactor(ocr-pipeline): make post-processing fully generic

Three non-generic solutions replaced with universal heuristics:

1. Cell-OCR fallback: instead of restricting to column_en/column_de,
   now checks pixel density (>2% dark pixels) for ANY column type.
   Truly empty cells are skipped without running Tesseract.

2. Example-sentence detection: instead of checking for example-column
   text (worksheet-specific), now uses sentence heuristics (>=4 words
   or ends with sentence punctuation). Short EN text without DE is
   kept as a vocab entry (OCR may have missed the translation).

3. Comma-split: re-enabled with singular/plural detection. Pairs like
   "mouse, mice" / "Maus, Mäuse" are kept together. Verb forms like
   "break, broke, broken" are still split into individual entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
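
For reviewers, the three heuristics described in the message can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the function names (has_ink, looks_like_sentence, split_comma_entry), the grayscale cutoff of 128, the exact set of sentence punctuation, and the singular/plural rule (exactly two single words sharing a first letter) are all assumptions made for the example. Only the 2% density threshold and the ">=4 words or sentence punctuation" rule come from the message itself.

```python
# Hypothetical sketch of the three heuristics; names and details are assumptions.

MIN_DARK_RATIO = 0.02   # from the commit message: more than 2% dark pixels
DARK_CUTOFF = 128       # assumption: grayscale values below this count as "dark"
SENTENCE_END = (".", "!", "?")  # assumption: what counts as sentence punctuation


def has_ink(cell_pixels, dark_cutoff=DARK_CUTOFF, min_ratio=MIN_DARK_RATIO):
    """Return True if a cell (2D list of 0-255 grayscale values) is worth OCR-ing."""
    total = sum(len(row) for row in cell_pixels)
    if total == 0:
        return False
    dark = sum(1 for row in cell_pixels for px in row if px < dark_cutoff)
    return dark / total > min_ratio


def looks_like_sentence(text):
    """Sentence heuristic: at least 4 words, or ends with sentence punctuation."""
    text = text.strip()
    if not text:
        return False
    return len(text.split()) >= 4 or text.endswith(SENTENCE_END)


def split_comma_entry(text):
    """Split a comma-separated entry unless it looks like a singular/plural pair.

    Stand-in rule (assumption): exactly two single-word parts that share the
    same first letter are treated as a pair and kept together.
    """
    parts = [p.strip() for p in text.split(",") if p.strip()]
    is_pair = (
        len(parts) == 2
        and all(len(p.split()) == 1 for p in parts)
        and parts[0][0].lower() == parts[1][0].lower()
    )
    if is_pair:
        return [text.strip()]
    return parts
```

With these stand-in rules, "mouse, mice" and "Maus, Mäuse" stay as single entries, while "break, broke, broken" is split into three. The real singular/plural detection in _split_comma_entries may use different criteria.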
2026-03-02 09:27:30 +01:00
parent 6bca3370e0
commit e3f939a628
2 changed files with 76 additions and 23 deletions
--- a/klausur-service/backend/ocr_pipeline_api.py
+++ b/klausur-service/backend/ocr_pipeline_api.py
@@ -1179,9 +1179,7 @@ async def detect_words(
        entries = _cells_to_vocab_entries(cells, columns_meta)
        entries = _fix_character_confusion(entries)
        entries = _fix_phonetic_brackets(entries, pronunciation=pronunciation)
-        # NOTE: _split_comma_entries disabled — word forms like "mouse, mice"
-        # / "Maus, Mäuse" belong together in one entry.
-        # entries = _split_comma_entries(entries)
+        entries = _split_comma_entries(entries)
        entries = _attach_example_sentences(entries)
        word_result["vocab_entries"] = entries
        # Also keep "entries" key for backwards compatibility
@@ -1310,9 +1308,7 @@ async def _word_stream_generator(
        entries = _cells_to_vocab_entries(all_cells, columns_meta)
        entries = _fix_character_confusion(entries)
        entries = _fix_phonetic_brackets(entries, pronunciation=pronunciation)
-        # NOTE: _split_comma_entries disabled — word forms like "mouse, mice"
-        # / "Maus, Mäuse" belong together in one entry.
-        # entries = _split_comma_entries(entries)
+        entries = _split_comma_entries(entries)
        entries = _attach_example_sentences(entries)
        word_result["vocab_entries"] = entries
        word_result["entries"] = entries