Improve syllable divider insertion for dictionary pages

Rewrite cv_syllable_detect.py with a pyphen-first approach:
- Remove unreliable CV gate (morphological pipe detection)
- Strip existing pipes and re-syllabify via pyphen (DE then EN)
- Merge pipe-gap spaces where OCR split words at divider positions
- Guard merges with function word blacklist and punctuation checks
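
The steps listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in cv_syllable_detect.py: the names `resyllabify`, `merge_pipe_gaps`, and `FUNCTION_WORDS`, the blacklist contents, and the rule "merge only when the gap coincides with a syllable break of the joined word" are all hypothetical. Hyphenators are passed in as plain callables so the logic runs without pyphen; `pyphen_hyphenators` shows how real pyphen dictionaries (DE, then EN) could be plugged in.

```python
from typing import Callable, Iterable, List, Set

Hyphenator = Callable[[str], str]  # "Apfel" -> "Ap|fel"

# Hypothetical blacklist; the real list in the commit is not shown.
FUNCTION_WORDS = frozenset({"der", "die", "das", "und", "in", "the", "a", "of", "to"})


def pyphen_hyphenators() -> List[Hyphenator]:
    """Build DE-then-EN hyphenators from pyphen (third-party, imported lazily)."""
    import pyphen
    return [
        lambda word, d=pyphen.Pyphen(lang=lang): d.inserted(word, hyphen="|")
        for lang in ("de_DE", "en_US")
    ]


def resyllabify(word: str, hyphenators: Iterable[Hyphenator]) -> str:
    """Strip existing pipes, then use the first hyphenator that finds a break."""
    bare = word.replace("|", "")
    for hyphenate in hyphenators:
        result = hyphenate(bare)
        if "|" in result:
            return result
    return bare  # no dictionary knew the word: leave it undivided


def _break_positions(syllabified: str) -> Set[int]:
    """Return character offsets (in the bare word) where a pipe sits."""
    positions, offset = set(), 0
    for ch in syllabified:
        if ch == "|":
            positions.add(offset)
        else:
            offset += 1
    return positions


def _can_merge(left: str, right: str, hyphenate: Hyphenator) -> bool:
    # Punctuation check: only purely alphabetic fragments are merge candidates.
    if not (left.isalpha() and right.isalpha()):
        return False
    # Function-word guard: never glue an article or preposition onto a noun.
    if left.lower() in FUNCTION_WORDS or right.lower() in FUNCTION_WORDS:
        return False
    # Assumed rule: the gap must sit exactly on a syllable break of the joined word.
    return len(left) in _break_positions(hyphenate(left + right))


def merge_pipe_gaps(text: str, hyphenate: Hyphenator) -> str:
    """Rejoin fragments that OCR split at a divider position ("Ap fel" -> "Apfel")."""
    merged: List[str] = []
    for token in text.replace("|", "").split():
        if merged and _can_merge(merged[-1], token, hyphenate):
            merged[-1] += token
        else:
            merged.append(token)
    return " ".join(merged)
```

With pyphen installed, `resyllabify(word, pyphen_hyphenators())` would try the German dictionary before the English one; the exact breaks depend on pyphen's dictionaries.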

Add false-positive prevention:
- Pre-check: skip the page if fewer than 5% of cells already contain a "|" from OCR
- Call-site check: require article_col_index to be set (a der/die/das column was detected)
- Prevents syllabification of synonym dictionaries and word lists
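
A minimal sketch of the two checks above, assuming cells arrive as plain strings and that empty cells are excluded from the ratio (the commit does not say how they are counted). The function names `has_enough_ocr_pipes` and `should_syllabify` are hypothetical; only the 5% threshold, the `is_dictionary` flag, and `article_col_index` come from the commit.

```python
from typing import Iterable, Mapping, Optional


def has_enough_ocr_pipes(cell_texts: Iterable[Optional[str]],
                         min_ratio: float = 0.05) -> bool:
    """Pre-check: True if at least `min_ratio` of non-empty cells contain "|"."""
    cells = [t for t in cell_texts if t and t.strip()]
    if not cells:
        return False
    with_pipe = sum(1 for t in cells if "|" in t)
    return with_pipe / len(cells) >= min_ratio


def should_syllabify(dict_detection: Mapping[str, object],
                     cell_texts: Iterable[Optional[str]]) -> bool:
    """Call-site check plus pre-check, combined here for illustration only."""
    if not dict_detection.get("is_dictionary"):
        return False
    # A synonym list or frequency table has no der/die/das column.
    if dict_detection.get("article_col_index") is None:
        return False
    return has_enough_ocr_pipes(cell_texts)
```

Note that `article_col_index` is compared against `None` rather than tested for truthiness, so a column index of 0 still counts as present.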

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-24 19:44:29 +01:00
parent 7fbcae954b
commit ed7fc99fc4
2 changed files with 221 additions and 112 deletions


@@ -1456,10 +1456,15 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
 logger.warning("Dictionary detection failed: %s", e)
 # --- Syllable divider insertion for dictionary pages ---
-# CV-validated: only inserts "|" where image shows thin vertical lines.
-# See cv_syllable_detect.py for the detection + insertion logic.
+# Only on confirmed dictionary pages with article columns (der/die/das).
+# The article_col_index check avoids false positives on synonym lists,
+# word frequency tables, and other alphabetically sorted non-dictionary pages.
+# Additionally, insert_syllable_dividers has its own pre-check for existing
+# pipe characters in cells (OCR must have already found some).
 syllable_insertions = 0
-if dict_detection.get("is_dictionary") and img_bgr is not None:
+if (dict_detection.get("is_dictionary")
+and dict_detection.get("article_col_index") is not None
+and img_bgr is not None):
 try:
 from cv_syllable_detect import insert_syllable_dividers
 syllable_insertions = insert_syllable_dividers(