Protect page references from junk-row removal

Rows containing only a page reference (p.55, S.12) were removed as
"oversized stubs" (Rule 2) when their word-box height exceeded 1.8× the
median. Rule 2 is now skipped if any word in the row matches the
page-reference pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-11 22:40:37 +02:00
parent f23aaaea51
commit 9ceee4e07c


@@ -615,10 +615,15 @@ async def _build_grid_core(
 # Rule 2: oversized stub — ≤3 words, short total text,
 # and word height > 1.8× median (page numbers, stray marks,
 # OCR from illustration labels like "SEA &")
+# Skip if any word looks like a page reference (p.55, S.12).
 if len(row_wbs) <= 3:
     total_text = "".join((wb.get("text") or "").strip() for wb in row_wbs)
     max_h = max((wb.get("height", 0) for wb in row_wbs), default=0)
-    if len(total_text) <= 5 and max_h > median_wb_h * 1.8:
+    has_page_ref = any(
+        re.match(r'^[pPsS]\.?\s*\d+$', (wb.get("text") or "").strip())
+        for wb in row_wbs
+    )
+    if len(total_text) <= 5 and max_h > median_wb_h * 1.8 and not has_page_ref:
         junk_row_indices.add(ri)
         continue
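
To make the effect of the change easier to check in isolation, the Rule 2 condition can be pulled out into a small standalone function. This is only a sketch: `is_oversized_stub` and `PAGE_REF_RE` are hypothetical names that do not appear in the commit, and the word boxes are assumed to be plain dicts with `text` and `height` keys, as the diff suggests. The regex and the thresholds are copied from the hunk above.

```python
import re

# Same pattern as in the diff: a p/P/s/S prefix, an optional dot,
# optional whitespace, then one or more digits.
PAGE_REF_RE = re.compile(r'^[pPsS]\.?\s*\d+$')


def is_oversized_stub(row_wbs, median_wb_h):
    """Hypothetical helper: True if Rule 2 would mark this row as junk."""
    if len(row_wbs) > 3:
        return False
    total_text = "".join((wb.get("text") or "").strip() for wb in row_wbs)
    max_h = max((wb.get("height", 0) for wb in row_wbs), default=0)
    # New in this commit: a single word that looks like a page reference
    # protects the whole row from Rule 2.
    has_page_ref = any(
        PAGE_REF_RE.match((wb.get("text") or "").strip()) for wb in row_wbs
    )
    return len(total_text) <= 5 and max_h > median_wb_h * 1.8 and not has_page_ref


if __name__ == "__main__":
    median = 20
    # Tall page reference: kept after this change.
    print(is_oversized_stub([{"text": "p.55", "height": 40}], median))
    # Tall OCR fragment from an illustration label: still removed.
    print(is_oversized_stub(
        [{"text": "SEA", "height": 40}, {"text": "&", "height": 38}], median
    ))
```

One consequence of matching per word box: if OCR splits a reference into two boxes such as `p.` and `55`, neither box matches the pattern on its own, so the row would still be treated as a stub.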