Refine garbled IPA filter: skip only pure-ASCII garbled text, not text with real IPA
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
"Theme [θˈiːm]" contains real IPA symbols (θ, ˈ) and should NOT be filtered. Only filter text that has garbled IPA markers (:, ') but no real Unicode IPA chars. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -737,7 +737,9 @@ def _detect_heading_rows_by_single_cell(
|
||||
if not text or text.startswith("["):
|
||||
continue
|
||||
# Skip garbled IPA without brackets (e.g. "ska:f – ska:vz")
|
||||
if _text_has_garbled_ipa(text):
|
||||
# but NOT text with real IPA symbols (e.g. "Theme [θˈiːm]")
|
||||
_REAL_IPA_CHARS = set("ˈˌəɪɛɒʊʌæɑɔʃʒθðŋ")
|
||||
if _text_has_garbled_ipa(text) and not any(c in _REAL_IPA_CHARS for c in text):
|
||||
continue
|
||||
heading_row_indices.append(ri)
|
||||
|
||||
|
||||
Reference in New Issue
Block a user