Improve syllable divider insertion for dictionary pages
Rewrite cv_syllable_detect.py with pyphen-first approach:
- Remove unreliable CV gate (morphological pipe detection)
- Strip existing pipes and re-syllabify via pyphen (DE then EN)
- Merge pipe-gap spaces where OCR split words at divider positions
- Guard merges with function word blacklist and punctuation checks

Add false-positive prevention:
- Pre-check: skip if <5% of cells have existing | from OCR
- Call-site check: require article_col_index (der/die/das column)
- Prevents syllabification of synonym dictionaries and word lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
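The merge step named in the commit message can be illustrated with a minimal sketch. This is not the code from cv_syllable_detect.py: the function name `merge_pipe_gaps`, the contents of `FUNCTION_WORDS` and `PUNCTUATION`, and the rule that a gap only qualifies when it touches a `|` are all assumptions made for illustration.

```python
# Hypothetical blacklist and punctuation set; the real lists are not shown
# in this commit.
FUNCTION_WORDS = {"der", "die", "das", "ein", "eine", "und",
                  "the", "a", "an", "and", "of", "to"}
PUNCTUATION = ",.;:!?()[]"


def merge_pipe_gaps(text: str) -> str:
    """Join fragments that OCR split at a divider, e.g. 'Ap| fel' -> 'Ap|fel'.

    Assumption: a space is a "pipe gap" only if the token on its left ends
    with '|' or the token on its right starts with '|'.
    """
    tokens = text.split()
    if not tokens:
        return text

    merged = [tokens[0]]
    for tok in tokens[1:]:
        left = merged[-1]
        touches_pipe = left.endswith("|") or tok.startswith("|")

        # Compare the bare words (pipes stripped) against the blacklist.
        left_word = left.strip("|").lower()
        right_word = tok.strip("|").lower()

        blocked = (
            not left_word
            or not right_word
            or left_word in FUNCTION_WORDS
            or right_word in FUNCTION_WORDS
            or left[-1] in PUNCTUATION   # e.g. "Haus," must stay separate
            or tok[0] in PUNCTUATION
        )

        if touches_pipe and not blocked:
            merged[-1] = left + tok      # drop the space, keep the pipe
        else:
            merged.append(tok)

    return " ".join(merged)
```

With these assumptions, `merge_pipe_gaps("die| Katze")` leaves the text unchanged because `die` is on the blacklist, while `merge_pipe_gaps("Ap| fel")` joins the two fragments.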
@@ -1456,10 +1456,15 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
         logger.warning("Dictionary detection failed: %s", e)
 
     # --- Syllable divider insertion for dictionary pages ---
-    # CV-validated: only inserts "|" where image shows thin vertical lines.
-    # See cv_syllable_detect.py for the detection + insertion logic.
+    # Only on confirmed dictionary pages with article columns (der/die/das).
+    # The article_col_index check avoids false positives on synonym lists,
+    # word frequency tables, and other alphabetically sorted non-dictionary pages.
+    # Additionally, insert_syllable_dividers has its own pre-check for existing
+    # pipe characters in cells (OCR must have already found some).
     syllable_insertions = 0
-    if dict_detection.get("is_dictionary") and img_bgr is not None:
+    if (dict_detection.get("is_dictionary")
+            and dict_detection.get("article_col_index") is not None
+            and img_bgr is not None):
         try:
             from cv_syllable_detect import insert_syllable_dividers
             syllable_insertions = insert_syllable_dividers(
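The two false-positive guards described in this change (the call-site `article_col_index` check and the pre-check for pipes already found by OCR) can be sketched as below. This is an illustrative sketch under assumptions, not the project's implementation: the helper names `has_enough_ocr_pipes` and `should_insert_dividers`, the flat `cell_texts` list, and the choice to count only non-empty cells in the denominator are all hypothetical.

```python
from typing import Iterable, Mapping, Any


def has_enough_ocr_pipes(cell_texts: Iterable[str], min_ratio: float = 0.05) -> bool:
    """Pre-check: proceed only if at least `min_ratio` of cells contain '|'.

    Assumption: empty cells are ignored when computing the ratio.
    """
    non_empty = [t for t in cell_texts if t and t.strip()]
    if not non_empty:
        return False
    with_pipe = sum(1 for t in non_empty if "|" in t)
    return with_pipe / len(non_empty) >= min_ratio


def should_insert_dividers(
    dict_detection: Mapping[str, Any],
    cell_texts: Iterable[str],
    image_available: bool = True,
) -> bool:
    """Combine the call-site check with the pre-check.

    `article_col_index` is compared against None on purpose, so that a
    valid column index of 0 is not treated as missing.
    """
    return (
        bool(dict_detection.get("is_dictionary"))
        and dict_detection.get("article_col_index") is not None
        and image_available
        and has_enough_ocr_pipes(cell_texts)
    )
```

Under these assumptions, a synonym list that was flagged as a dictionary but has no article column is rejected at the call site, and a page whose OCR output contains almost no `|` characters is rejected by the pre-check.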