Fix false header detection: skip continuation lines and mid-column cells
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 54s
CI / test-go-edu-search (push) Successful in 57s
CI / test-python-klausur (push) Failing after 2m57s
CI / test-python-agent-core (push) Successful in 28s
CI / test-nodejs-website (push) Successful in 34s
Single-cell rows were incorrectly detected as headings when they were
actually continuation lines. Two new guards:
1. Text starting with "(" is a continuation (e.g. "(usw.)", "(TV-Serie)")
2. Single cells beyond the first two content columns are overflow lines,
not headings. Real headings appear in the first or second content column.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
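
The two guards described above can be sketched as a standalone predicate. This is a minimal illustration, not the repository's code: the function name `is_heading_candidate` and the sample cells are hypothetical, and the cell dictionary shape (`text`, `col_index`) and the `col_indices` list are assumed from the diff context.

```python
from typing import Any, Dict, List


def is_heading_candidate(cell: Dict[str, Any], col_indices: List[int]) -> bool:
    """Return False for single cells that the guards would skip (hypothetical helper)."""
    text = (cell.get("text") or "").strip()
    # Existing check from the diff context: empty or "["-prefixed text is skipped.
    if not text or text.startswith("["):
        return False
    # Guard 1: text starting with "(" is treated as a continuation line.
    if text.startswith("("):
        return False
    # Guard 2: a lone cell past the first two content columns is overflow.
    first_content_col = col_indices[0] if col_indices else 0
    if cell.get("col_index", 0) > first_content_col + 1:
        return False
    return True


cols = [1, 2, 3, 4]  # assumed content-column indices for the example
print(is_heading_candidate({"text": "Unit 3: Travel", "col_index": 1}, cols))
print(is_heading_candidate({"text": "(usw.)", "col_index": 1}, cols))
print(is_heading_candidate({"text": "overflow text", "col_index": 3}, cols))
```

Note that comparing against `first_content_col + 1` only matches "the second content column" when the column indices are contiguous. With gaps in `col_indices`, a variant might compare against `col_indices[1]` instead; that is an assumption about the data, not something the commit states.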
@@ -1053,6 +1053,16 @@ def _detect_heading_rows_by_single_cell(
         text = (cell.get("text") or "").strip()
         if not text or text.startswith("["):
             continue
+        # Continuation lines start with "(" — e.g. "(usw.)", "(TV-Serie)"
+        if text.startswith("("):
+            continue
+        # Single cell NOT in the first content column is likely a
+        # continuation/overflow line, not a heading. Real headings
+        # ("Theme 1", "Unit 3: ...") appear in the first or second
+        # content column.
+        first_content_col = col_indices[0] if col_indices else 0
+        if cell.get("col_index", 0) > first_content_col + 1:
+            continue
         # Skip garbled IPA without brackets (e.g. "ska:f – ska:vz")
         # but NOT text with real IPA symbols (e.g. "Theme [θˈiːm]")
         _REAL_IPA_CHARS = set("ˈˌəɪɛɒʊʌæɑɔʃʒθðŋ")