fix: lower secondary column threshold + strip pipe chars from word_boxes
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 18s
Dictionary pages have two dictionary columns, each with article and headword sub-columns. The right article column (die/der at x≈626) had only 14.3% row coverage, which is below the 20% secondary threshold. Lowered the threshold to 12% so that dictionary article columns qualify.

Also strip pipe characters from individual word_box text (not just cell text) to remove OCR syllable-separation marks (e.g. "zu|trau|en" → "zutrauen").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -207,7 +207,7 @@ def _cluster_columns_by_alignment(
     # text (random inter-word gaps) while still detecting real columns in
     # vocabulary worksheets (which typically have >80% row coverage).
     MIN_COVERAGE_PRIMARY = 0.35
-    MIN_COVERAGE_SECONDARY = 0.20
+    MIN_COVERAGE_SECONDARY = 0.12
     MIN_WORDS_SECONDARY = 4
     MIN_DISTINCT_ROWS = 3

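To illustrate why 14.3% coverage fails a 20% threshold but passes a 12% one, here is a minimal sketch of a coverage check. The body of `_cluster_columns_by_alignment` is not shown in this diff, so the helper `classify_column`, its parameters, and the exact order of the checks are assumptions for illustration only; only the four constant names and values come from the hunk above. The row counts (5 of 35) are made-up numbers chosen to give roughly 14.3%.

```python
# Constants as they appear after this commit.
MIN_COVERAGE_PRIMARY = 0.35
MIN_COVERAGE_SECONDARY = 0.12
MIN_WORDS_SECONDARY = 4
MIN_DISTINCT_ROWS = 3


def classify_column(word_count, distinct_rows, total_rows,
                    min_secondary=MIN_COVERAGE_SECONDARY):
    """Hypothetical helper: classify a candidate column by row coverage.

    This is NOT the project's actual logic, only a plausible reading of
    how the thresholds might interact.
    """
    if total_rows <= 0 or distinct_rows < MIN_DISTINCT_ROWS:
        return "rejected"
    coverage = distinct_rows / total_rows
    if coverage >= MIN_COVERAGE_PRIMARY:
        return "primary"
    if coverage >= min_secondary and word_count >= MIN_WORDS_SECONDARY:
        return "secondary"
    return "rejected"


# Illustrative numbers: 5 article words on 5 of 35 rows is about 14.3% coverage.
new_result = classify_column(5, 5, 35)                      # threshold 0.12
old_result = classify_column(5, 5, 35, min_secondary=0.20)  # threshold 0.20
```

Under these assumptions the same column is rejected with the old threshold and accepted as secondary with the new one, while `MIN_WORDS_SECONDARY` and `MIN_DISTINCT_ROWS` still guard against very sparse alignments.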
@@ -1956,10 +1956,14 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
                 removed_pipes, z.get("zone_index", 0),
             )

-    # Also strip leading/trailing pipe chars from cell text that may remain
-    # from word_boxes that contained mixed text like "word|" or "|word".
+    # Also strip pipe chars from word_box text and cell text that may remain
+    # from OCR reading syllable-separation marks (e.g. "zu|trau|en" → "zutrauen").
     for z in zones_data:
         for cell in z.get("cells", []):
+            for wb in cell.get("word_boxes", []):
+                wbt = wb.get("text", "")
+                if "|" in wbt:
+                    wb["text"] = wbt.replace("|", "")
             text = cell.get("text", "")
             if "|" in text:
                 cleaned = text.replace("|", "").strip()
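The pipe-stripping pass can be tried in isolation with a small standalone sketch. The function name `strip_pipes`, the returned change counter, and the final assignment back to `cell["text"]` are assumptions: the hunk is cut off after `cleaned = ...`, so what the real code does with `cleaned` is not visible here.

```python
def strip_pipes(zones_data):
    """Hypothetical standalone version of the cleanup loop in the diff.

    Removes "|" from every word_box text and every cell text in place and
    returns how many strings were changed.
    """
    changed = 0
    for zone in zones_data:
        for cell in zone.get("cells", []):
            # Word boxes first: OCR may emit syllable marks like "zu|trau|en".
            for wb in cell.get("word_boxes", []):
                wbt = wb.get("text", "")
                if "|" in wbt:
                    wb["text"] = wbt.replace("|", "")
                    changed += 1
            # Then the aggregated cell text.
            text = cell.get("text", "")
            if "|" in text:
                # Assumption: the cleaned text is written back to the cell.
                cell["text"] = text.replace("|", "").strip()
                changed += 1
    return changed


zones = [{
    "zone_index": 0,
    "cells": [{
        "text": "zu|trau|en ",
        "word_boxes": [{"text": "zu|trau|en"}, {"text": "die"}],
    }],
}]
count = strip_pipes(zones)
```

Note that `str.replace("|", "")` removes every pipe, not only leading or trailing ones, which is what makes it suitable for syllable marks inside a word.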