fix(ocr-pipeline): cells = result, no post-processing content shuffling
The cell grid IS the result. Each cell stays at its detected position.

Removed _split_comma_entries and _attach_example_sentences from the pipeline. They were shuffling content between rows/columns, causing "Mäuse" to appear in a separate row, "stand..." to move to Example, and "Ei" to disappear.

Now: cells → _cells_to_vocab_entries (1:1 row mapping) → _fix_character_confusion → _fix_phonetic_brackets → done.

Also lowered pixel-density threshold from 2% to 0.5% for the cell-OCR fallback so small text like "Ei" is not filtered out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
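To illustrate why the lower pixel-density threshold matters for short words, here is a minimal sketch of such a filter. The function names, the grayscale cutoff of 128, and the nested-list image representation are assumptions made for this example; the actual check in the pipeline is not shown in this diff and may differ.

```python
from typing import Sequence

# Hypothetical helpers: names and the binarization cutoff are assumptions,
# not taken from cv_vocab_pipeline.


def ink_density(cell: Sequence[Sequence[int]], dark_below: int = 128) -> float:
    """Return the fraction of pixels in a grayscale cell crop that count as ink."""
    total = sum(len(row) for row in cell)
    if total == 0:
        return 0.0
    dark = sum(1 for row in cell for px in row if px < dark_below)
    return dark / total


def should_run_fallback_ocr(
    cell: Sequence[Sequence[int]], min_density: float = 0.005
) -> bool:
    """Return True if the cell holds enough ink to be worth OCR-ing."""
    return ink_density(cell) >= min_density


# A 20x50 cell (1000 pixels) with only 12 dark pixels, as a very short
# word in a wide column might produce: density is 1.2%.
cell = [[255] * 50 for _ in range(20)]
for x in range(12):
    cell[10][x] = 0

passes_new_threshold = should_run_fallback_ocr(cell, min_density=0.005)
passes_old_threshold = should_run_fallback_ocr(cell, min_density=0.02)
```

In this toy case the cell passes at 0.5% but would have been skipped at 2%, which is the kind of loss the message describes for "Ei".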
@@ -35,8 +35,6 @@ from cv_vocab_pipeline import (
     _cells_to_vocab_entries,
     _fix_character_confusion,
     _fix_phonetic_brackets,
-    _split_comma_entries,
-    _attach_example_sentences,
     analyze_layout,
     analyze_layout_by_words,
     build_cell_grid,
@@ -1174,15 +1172,13 @@ async def detect_words(
         },
     }
 
-    # For vocab layout: add post-processed vocab_entries (backwards compat)
+    # For vocab layout: map cells 1:1 to vocab entries (row→entry).
+    # No content shuffling — each cell stays at its detected position.
     if is_vocab:
         entries = _cells_to_vocab_entries(cells, columns_meta)
         entries = _fix_character_confusion(entries)
         entries = _fix_phonetic_brackets(entries, pronunciation=pronunciation)
-        entries = _split_comma_entries(entries)
-        entries = _attach_example_sentences(entries)
         word_result["vocab_entries"] = entries
         # Also keep "entries" key for backwards compatibility
         word_result["entries"] = entries
         word_result["entry_count"] = len(entries)
         word_result["summary"]["total_entries"] = len(entries)
@@ -1302,14 +1298,13 @@ async def _word_stream_generator(
         },
     }
 
-    # Vocab post-processing
+    # For vocab layout: map cells 1:1 to vocab entries (row→entry).
+    # No content shuffling — each cell stays at its detected position.
     vocab_entries = None
     if is_vocab:
         entries = _cells_to_vocab_entries(all_cells, columns_meta)
         entries = _fix_character_confusion(entries)
         entries = _fix_phonetic_brackets(entries, pronunciation=pronunciation)
-        entries = _split_comma_entries(entries)
-        entries = _attach_example_sentences(entries)
         word_result["vocab_entries"] = entries
         word_result["entries"] = entries
         word_result["entry_count"] = len(entries)
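For readers unfamiliar with the 1:1 row mapping that both call sites now rely on, the idea can be sketched as follows. This is not the real _cells_to_vocab_entries: the cell dictionary keys ("row", "column", "text") and the column names are hypothetical, chosen only to show that one grid row yields exactly one entry and that no text moves between rows or columns.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical cell shape: {"row": int, "column": str, "text": str}.
# The real cell and columns_meta structures are not shown in this diff.


def cells_to_entries(
    cells: List[dict],
    columns: Sequence[str] = ("english", "german", "example"),
) -> List[Dict[str, str]]:
    """Map grid cells to entries, one entry per row, without moving content."""
    rows: Dict[int, Dict[str, str]] = defaultdict(dict)
    for cell in cells:
        rows[cell["row"]][cell["column"]] = cell["text"]
    # Missing cells become empty strings; existing text is never split or merged.
    return [
        {name: cols.get(name, "") for name in columns}
        for _, cols in sorted(rows.items())
    ]


cells = [
    {"row": 0, "column": "english", "text": "mouse, mice"},
    {"row": 0, "column": "german", "text": "Maus, Mäuse"},
    {"row": 1, "column": "english", "text": "egg"},
    {"row": 1, "column": "german", "text": "Ei"},
]
entries = cells_to_entries(cells)
```

Under these assumptions a comma-separated cell such as "Maus, Mäuse" stays in its row, whereas a splitting step would have produced an extra row for it.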