Only insert IPA when garbled phonetics exist in OCR text

_insert_missing_ipa was adding dictionary IPA to cells that had NO
phonetic transcription on the original page (e.g. "scissors" heading,
"scarf - scarves" without IPA). Now guarded by _text_has_garbled_ipa()
which checks for OCR-mangled phonetic markers (stress marks, length
marks, IPA special chars) before allowing insertion.

Rule: if a line has no phonetics, don't add any. Where garbled IPA
exists, replace it with correct IPA notation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-19 09:59:21 +01:00
parent 432eee3694
commit 038eaf783c

@@ -984,12 +984,48 @@ def _replace_phonetics_in_text(
     return text
 
 
+def _text_has_garbled_ipa(text: str) -> bool:
+    """Check if text contains garbled IPA-like fragments from OCR.
+
+    Returns True if there is evidence of OCR-mangled phonetic
+    transcription, e.g. stress marks, length marks, or IPA special chars.
+    This is used to decide whether ``_insert_missing_ipa`` should run:
+    it must only insert IPA to *replace* garbled phonetics that are already
+    in the text — never to ADD phonetics where none existed on the page.
+    """
+    for w in text.strip().split():
+        # Skip delimiters and very short tokens
+        if len(w) <= 1 or w in ('', '', '-', '/', '|', ',', ';'):
+            continue
+        # Starts with stress mark (OCR read IPA stress ' as apostrophe)
+        if w.startswith("'") and len(w) > 1 and not w[1:].istitle():
+            return True
+        if w.startswith("\u02c8") or w.startswith("\u02cc"):  # ˈ ˌ
+            return True
+        # Contains IPA length mark ':' in a short non-word fragment
+        if ':' in w and len(w) < 12:
+            # But not things like "3:00" (time) or common words
+            stripped = re.sub(r'[^a-zA-Z:]', '', w)
+            if ':' in stripped and not stripped.replace(':', '').isalpha():
+                continue
+            return True
+        # Contains IPA special characters
+        if any(c in w for c in 'əɪɛɒʊʌæɑɔʃʒθðŋ'):
+            return True
+    return False
+
+
 def _insert_missing_ipa(text: str, pronunciation: str = 'british') -> str:
     """Insert IPA pronunciation for English words that have no brackets at all.
 
-    OCR sometimes drops the phonetic transcription entirely (e.g. "challenge"
-    instead of "challenge [ˈtʃælɪndʒ]"). This scans the text for lone English
-    words that have a dictionary IPA entry and appends [ipa] after them.
+    OCR sometimes garbles the phonetic transcription into plain-text fragments
+    (e.g. "scare skea" where "skea" is garbled /skɛə/). This scans the text
+    for the headword, inserts correct [IPA], and strips the garbled fragments.
+
+    IMPORTANT: This function must only be called when ``_text_has_garbled_ipa``
+    confirms that the text actually contains garbled phonetics. If the text
+    is clean (e.g. just "scissors"), IPA must NOT be inserted — the original
+    page had no phonetics on that line.
 
     Only inserts for words that:
     - are standalone (not already followed by a bracket)
@@ -1136,10 +1172,12 @@ def fix_cell_phonetics(
             continue
 
         if col_type == 'column_en':
-            # Full processing: replace garbled IPA, strip orphan brackets,
-            # insert missing IPA
+            # Full processing: replace garbled IPA, strip orphan brackets.
             new_text = _replace_phonetics_in_text(text, pronunciation, strip_orphans=True)
-            if new_text == text:
+            if new_text == text and _text_has_garbled_ipa(text):
+                # Only insert IPA when there IS garbled phonetics in the
+                # text — never add IPA to clean text that had none on the
+                # original page.
                 new_text = _insert_missing_ipa(text, pronunciation)
         else:
             # column_text: replace garbled IPA, no orphan stripping