Protect page references from junk-row removal
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Failing after 11s
CI / test-go-edu-search (push) Successful in 57s
CI / test-python-klausur (push) Failing after 2m49s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
Rows containing only a page reference (e.g. p.55, S.12) were being removed as "oversized stubs" (Rule 2) whenever their word-box height exceeded 1.8× the median. Rule 2 is now skipped if any word in the row matches the page-reference pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -615,10 +615,15 @@ async def _build_grid_core(
             # Rule 2: oversized stub — ≤3 words, short total text,
             # and word height > 1.8× median (page numbers, stray marks,
             # OCR from illustration labels like "SEA &")
+            # Skip if any word looks like a page reference (p.55, S.12).
             if len(row_wbs) <= 3:
                 total_text = "".join((wb.get("text") or "").strip() for wb in row_wbs)
                 max_h = max((wb.get("height", 0) for wb in row_wbs), default=0)
-                if len(total_text) <= 5 and max_h > median_wb_h * 1.8:
+                has_page_ref = any(
+                    re.match(r'^[pPsS]\.?\s*\d+$', (wb.get("text") or "").strip())
+                    for wb in row_wbs
+                )
+                if len(total_text) <= 5 and max_h > median_wb_h * 1.8 and not has_page_ref:
                     junk_row_indices.add(ri)
                     continue
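The guarded rule from the hunk above can be exercised in isolation. The following is a minimal sketch, not the project's code: the helper name `is_oversized_stub`, the constant `PAGE_REF_RE`, and the sample word boxes are hypothetical, and it assumes word boxes are dicts with `text` and `height` keys, as the diff suggests.

```python
import re

# Same pattern as in the diff: a p/P/s/S prefix, an optional dot,
# optional whitespace, then digits only.
PAGE_REF_RE = re.compile(r'^[pPsS]\.?\s*\d+$')


def is_oversized_stub(row_wbs, median_wb_h):
    """Hypothetical helper: True if Rule 2 would mark this row as junk."""
    if len(row_wbs) > 3:
        return False
    total_text = "".join((wb.get("text") or "").strip() for wb in row_wbs)
    max_h = max((wb.get("height", 0) for wb in row_wbs), default=0)
    # Each word box is tested on its own, as in the diff.
    has_page_ref = any(
        PAGE_REF_RE.match((wb.get("text") or "").strip()) for wb in row_wbs
    )
    return len(total_text) <= 5 and max_h > median_wb_h * 1.8 and not has_page_ref


# Illustrative data only: a tall page reference is kept, a tall stray mark is dropped.
kept = is_oversized_stub([{"text": "p.55", "height": 40}], median_wb_h=20)
dropped = is_oversized_stub([{"text": "&", "height": 40}], median_wb_h=20)
# A reference split by OCR into two word boxes is not protected,
# because neither "p." nor "55" matches the pattern by itself.
split = is_oversized_stub(
    [{"text": "p.", "height": 40}, {"text": "55", "height": 40}], median_wb_h=20
)
```

Because the pattern is applied per word box, a reference that OCR splits into a separate prefix and number would still be treated as a stub under this sketch; whether that case occurs in practice depends on the OCR output and is not stated in the commit.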