fix(ocr-pipeline): fix vocab post-processing destroying correct cell results

Three bugs in the post-processing pipeline were overwriting correct
streaming results with wrong ones:

1. _split_comma_entries was splitting "Maus, Mäuse" into two separate
   entries. Disabled — word forms belong together.

2. _attach_example_sentences treated "Ei" (2 chars) as OCR noise because
   of the `len(de) > 2` threshold. Lowered to `len(de) > 1`.

3. _attach_example_sentences wrongly classified rows with EN text but
   no DE (like "stand ...") as example sentences, merging them into
   the previous entry. Now only treats rows as examples if they also
   have no text in the example column.
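
Fixes 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `_attach_example_sentences` code: the entry keys `en`, `de`, and `example` and the function name `attach_examples_sketch` are hypothetical and were chosen only for this example.

```python
from typing import Dict, List

# Hypothetical entry shape: one dict per table row with "en", "de", "example".
Entry = Dict[str, str]


def attach_examples_sketch(entries: List[Entry]) -> List[Entry]:
    """Merge example-sentence rows into the preceding vocab entry (sketch)."""
    result: List[Entry] = []
    for entry in entries:
        en = entry.get("en", "").strip()
        de = entry.get("de", "").strip()
        example = entry.get("example", "").strip()

        # Fix 2: two-character words such as "Ei" count as real DE text,
        # so the threshold is len(de) > 1 instead of len(de) > 2.
        has_de = len(de) > 1

        # Fix 3: a row is an example sentence only if it has EN text,
        # no DE text, and nothing in the example column either.
        looks_like_example = bool(en) and not has_de and not example

        if looks_like_example and result:
            prev = result[-1]
            prev["example"] = f'{prev.get("example", "")} {en}'.strip()
            continue

        # Copy the row so the caller's input is not mutated.
        result.append(dict(entry))
    return result


rows = [
    {"en": "egg", "de": "Ei", "example": ""},
    {"en": "I like eggs.", "de": "", "example": ""},
    {"en": "stand ...", "de": "", "example": "Stand up, please."},
]
merged = attach_examples_sketch(rows)
```

In this sketch the "Ei" row survives as its own entry, the sentence row is attached to it, and the "stand ..." row is kept separate because its example column is not empty.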

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-02 09:16:50 +01:00
parent befc44d2dd
commit 6bca3370e0
2 changed files with 17 additions and 7 deletions

@@ -1179,7 +1179,9 @@ async def detect_words(
 entries = _cells_to_vocab_entries(cells, columns_meta)
 entries = _fix_character_confusion(entries)
 entries = _fix_phonetic_brackets(entries, pronunciation=pronunciation)
-entries = _split_comma_entries(entries)
+# NOTE: _split_comma_entries disabled — word forms like "mouse, mice"
+# / "Maus, Mäuse" belong together in one entry.
+# entries = _split_comma_entries(entries)
 entries = _attach_example_sentences(entries)
 word_result["vocab_entries"] = entries
 # Also keep "entries" key for backwards compatibility
@@ -1308,7 +1310,9 @@ async def _word_stream_generator(
 entries = _cells_to_vocab_entries(all_cells, columns_meta)
 entries = _fix_character_confusion(entries)
 entries = _fix_phonetic_brackets(entries, pronunciation=pronunciation)
-entries = _split_comma_entries(entries)
+# NOTE: _split_comma_entries disabled — word forms like "mouse, mice"
+# / "Maus, Mäuse" belong together in one entry.
+# entries = _split_comma_entries(entries)
 entries = _attach_example_sentences(entries)
 word_result["vocab_entries"] = entries
 word_result["entries"] = entries