Fix IPA correction persistence and false-positive prefix matching
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 21s
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 21s
Step 5i was overwriting IPA-corrected text from Step 5c when reconstructing cells from word_boxes. Added _ipa_corrected flag to preserve corrections. Also tightened merged-token prefix matching (min prefix 4 chars, min suffix 3 chars) to prevent false positives like "sis" being extracted from "si:said". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -1194,9 +1194,11 @@ def _insert_missing_ipa(text: str, pronunciation: str = 'british') -> str:
|
||||
break
|
||||
# Merged token: dictionary word + garbled IPA stuck together.
|
||||
# E.g. "fictionsalans'fIkfn" starts with "fiction".
|
||||
# Extract the dictionary prefix and add it with IPA.
|
||||
if clean_j and len(clean_j) >= 5:
|
||||
for pend in range(min(len(clean_j), 15), 2, -1):
|
||||
# Extract the dictionary prefix (≥4 chars) and add it with
|
||||
# IPA, but only if enough chars remain after the prefix (≥3)
|
||||
# to look like garbled IPA, not just a plural 's'.
|
||||
if clean_j and len(clean_j) >= 7:
|
||||
for pend in range(min(len(clean_j) - 3, 15), 3, -1):
|
||||
prefix_j = clean_j[:pend]
|
||||
prefix_ipa = _lookup_ipa(prefix_j, pronunciation)
|
||||
if prefix_ipa:
|
||||
|
||||
Reference in New Issue
Block a user