Rule (a2) in Step 5i removed word_boxes with no letters/digits as
"graphic OCR artifacts". This incorrectly removed = signs used as
definition markers in textbooks ("film = 1. Film; 2. filmen").
Added exception list _KEEP_SYMBOLS for meaningful punctuation:
= (= =) ; : - – — / + • · ( ) & * → ← ↔
The root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one
block, which gets split into word_boxes ["film", "=", "1.", ...].
The "=" word_box had no alphanumeric chars and was removed as artifact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
= signs are used as definition markers in textbooks ("film = 1. Film").
They were incorrectly removed by two filters:
1. grid_build_core.py Step 5j-pre: _PURE_JUNK_RE matched "=" as
artifact noise. Now exempts =, (=, ;, :, - and similar meaningful
punctuation tokens.
2. cv_ocr_engines.py _is_noise_tail_token: "pure non-alpha" check
removed trailing = tokens. Now exempts meaningful punctuation.
Fixes: "film = 1. Film; 2. filmen" losing the = sign,
"(= I won and he lost.)" losing the (=.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Text like "Betonung auf der 1. Silbe: profit ['profit]" was
incorrectly detected as garbled IPA and replaced with generated
IPA transcription of the previous row's example sentence.
Added guard: if the cell text contains >=3 recognizable words
(3+ letter alpha tokens), it's normal text, not garbled IPA.
Garbled IPA is typically short and has no real dictionary words.
Fixes: Row 13 C3 showing IPA instead of pronunciation hint text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>