Add language-specific IPA and syllable modes (de/en)

Extend ipa_mode and syllable_mode toggles with language options:
- auto: smart detection (default)
- en: only English headword column
- de: only German definition columns
- all: all content columns
- none: skip entirely
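
As a rough illustration of how the five modes above could translate into the col_filter argument, here is a minimal sketch. The helper name resolve_col_filter, the column labels, and the rule that "de" means every content column except the English one are assumptions made for this example, not code from the commit. The commit message does not spell out the "auto" heuristic, so the sketch simply falls back to all columns for it.

```python
from typing import Optional, Set, Tuple

VALID_MODES = {"auto", "en", "de", "all", "none"}


def resolve_col_filter(
    mode: str,
    en_col: Optional[str],
    content_cols: Set[str],
) -> Tuple[bool, Optional[Set[str]]]:
    """Hypothetical helper: map a mode string to (run, col_filter).

    run is False when the step should be skipped entirely.
    col_filter is None when every content column should be processed.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "none":
        return False, None
    if mode == "all":
        return True, None
    if mode == "en":
        # Only the detected English headword column (empty set if none found).
        return True, ({en_col} if en_col else set())
    if mode == "de":
        # Assumption: German columns are all content columns except the English one.
        return True, {c for c in content_cols if c != en_col}
    # "auto": the real detection logic is not shown in the commit message;
    # this placeholder processes all content columns.
    return True, None


cols = {"column_1", "column_2", "column_3"}
run, col_filter = resolve_col_filter("de", "column_1", cols)
```

Returning a separate run flag keeps "none" distinct from an empty filter, since col_filter=None already means "process all content columns".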

Also improve English column auto-detection: match garbled IPA patterns
(stray apostrophes, colons) in addition to bracketed IPA. This lets
detection identify English dictionary pages where OCR produces garbled
ASCII instead of bracketed IPA.
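
The detection idea described above can be sketched roughly as follows. The regular expressions, the function name guess_english_column, and the min_hits threshold are all hypothetical and chosen only for illustration; the actual patterns used in the commit are not shown here.

```python
import re
from collections import Counter
from typing import Iterable, Mapping, Optional

# Bracketed IPA such as "[haus]" (illustrative pattern, not from the commit).
BRACKET_IPA = re.compile(r"\[[^\[\]]{2,}\]")
# Garbled ASCII IPA: a stress-mark apostrophe before a word ("'haus")
# or a length-mark colon between letters ("bi:t"). Illustrative only.
GARBLED_IPA = re.compile(r"(?:^|\s)'[a-z]{2,}|[a-z]:[a-z]", re.IGNORECASE)


def guess_english_column(
    cells: Iterable[Mapping[str, str]],
    min_hits: int = 2,
) -> Optional[str]:
    """Return the col_type with the most IPA-like cells, or None."""
    hits: Counter = Counter()
    for cell in cells:
        ct = cell.get("col_type", "")
        text = cell.get("text", "")
        if not ct.startswith("column_") or not text:
            continue
        if BRACKET_IPA.search(text) or GARBLED_IPA.search(text):
            hits[ct] += 1
    if not hits:
        return None
    col, count = hits.most_common(1)[0]
    return col if count >= min_hits else None


sample_cells = [
    {"col_type": "column_1", "text": "house 'haus"},
    {"col_type": "column_1", "text": "beat bi:t"},
    {"col_type": "column_2", "text": "das Haus"},
    {"col_type": "column_2", "text": "schlagen"},
]
english_col = guess_english_column(sample_cells)
```

Counting hits per column and requiring a minimum keeps a single stray colon or apostrophe in a German column from flipping the result.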

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-25 08:16:29 +01:00
parent 34680732f8
commit 83c058e400
4 changed files with 68 additions and 31 deletions

@@ -196,6 +196,7 @@ def insert_syllable_dividers(
     session_id: str,
     *,
     force: bool = False,
+    col_filter: Optional[set] = None,
 ) -> int:
     """Insert pipe syllable dividers into dictionary cells.
@@ -209,6 +210,8 @@ def insert_syllable_dividers(
     Args:
         force: If True, skip the pipe-ratio pre-check and syllabify all
             content words regardless of whether the original has pipe dividers.
+        col_filter: If set, only process cells whose col_type is in this set.
+            None means process all content columns.
     Returns the number of cells modified.
     """
@@ -247,6 +250,8 @@ def insert_syllable_dividers(
         ct = cell.get("col_type", "")
         if not ct.startswith("column_"):
             continue
+        if col_filter is not None and ct not in col_filter:
+            continue
         text = cell.get("text", "")
         if not text:
             continue
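
To make the effect of the new guard concrete, here is a small standalone sketch of the same filtering logic. The function select_cells and the sample cells are hypothetical and exist only for this example; the real function modifies cells in place instead of collecting them.

```python
from typing import Dict, Iterable, List, Optional, Set


def select_cells(
    cells: Iterable[Dict[str, str]],
    col_filter: Optional[Set[str]] = None,
) -> List[Dict[str, str]]:
    """Return the cells that would pass the guards shown in the hunk above."""
    selected = []
    for cell in cells:
        ct = cell.get("col_type", "")
        if not ct.startswith("column_"):
            continue  # not a content column
        if col_filter is not None and ct not in col_filter:
            continue  # excluded by the caller's filter
        if not cell.get("text", ""):
            continue  # nothing to process
        selected.append(cell)
    return selected


demo_cells = [
    {"col_type": "column_1", "text": "house"},
    {"col_type": "column_2", "text": "Haus"},
    {"col_type": "page_ref", "text": "12"},
    {"col_type": "column_2", "text": ""},
]
only_first = select_cells(demo_cells, col_filter={"column_1"})
```

Note that an empty set and None behave differently under this guard: None lets every content column through, while an empty set excludes all of them.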