Fix marker column detection: remove min-rows requirement
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m55s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 22s

Words to the left of the first detected column boundary must always
form their own column, regardless of how few rows they appear in.
Previously, tertiary (margin) columns required 4+ distinct rows,
which missed page references like p.62, p.63, p.64 (only 3 rows).

Now any cluster at the left/right margin with a clear gap to the
nearest significant column qualifies as its own column. The minimum
row coverage for such clusters also drops from 5% to 2%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-11 21:24:25 +02:00
parent 8c482ce8dd
commit 7263328edb

@@ -375,13 +375,17 @@ def _cluster_columns_by_alignment(
     used_ids = {id(c) for c in primary} | {id(c) for c in secondary}
     sig_xs = [c["mean_x"] for c in primary + secondary]
-    MIN_DISTINCT_ROWS_TERTIARY = max(MIN_DISTINCT_ROWS + 1, 4)
-    MIN_COVERAGE_TERTIARY = 0.05  # at least 5% of rows
+    # Tertiary: clusters that are clearly to the LEFT of the first
+    # significant column (or RIGHT of the last). If words consistently
+    # start at a position left of the established first column boundary,
+    # they MUST be a separate column — regardless of how few rows they
+    # cover. The only requirement is a clear spatial gap.
+    MIN_COVERAGE_TERTIARY = 0.02  # at least 1 row effectively
     tertiary = []
     for c in clusters:
         if id(c) in used_ids:
             continue
-        if c["distinct_rows"] < MIN_DISTINCT_ROWS_TERTIARY:
+        if c["distinct_rows"] < 1:
             continue
         if c["row_coverage"] < MIN_COVERAGE_TERTIARY:
             continue
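
To illustrate the effect of the relaxed filter, here is a minimal standalone sketch of the selection logic. It is not the code from `_cluster_columns_by_alignment`: the function name `select_margin_clusters`, the `MIN_GAP_PX` threshold, and the sample clusters are all hypothetical, and the real gap test is not visible in this hunk. Only the cluster keys (`mean_x`, `distinct_rows`, `row_coverage`) and the `0.02` coverage value come from the diff.

```python
MIN_COVERAGE_TERTIARY = 0.02  # value taken from the diff
MIN_GAP_PX = 40  # hypothetical: the actual gap criterion is not shown in this hunk


def select_margin_clusters(clusters, significant, min_gap=MIN_GAP_PX):
    """Return unused clusters lying clearly left or right of all significant columns."""
    used_ids = {id(c) for c in significant}
    sig_xs = [c["mean_x"] for c in significant]
    if not sig_xs:
        return []
    leftmost, rightmost = min(sig_xs), max(sig_xs)

    tertiary = []
    for c in clusters:
        if id(c) in used_ids:
            continue
        if c["distinct_rows"] < 1:
            continue
        if c["row_coverage"] < MIN_COVERAGE_TERTIARY:
            continue
        # Assumed gap test: the cluster must sit outside the span of the
        # significant columns by at least min_gap.
        left_of_all = c["mean_x"] < leftmost - min_gap
        right_of_all = c["mean_x"] > rightmost + min_gap
        if left_of_all or right_of_all:
            tertiary.append(c)
    return tertiary


# Page references such as p.62, p.63, p.64 occupy only 3 of 40 rows.
body = {"mean_x": 200, "distinct_rows": 40, "row_coverage": 1.0}
translation = {"mean_x": 600, "distinct_rows": 38, "row_coverage": 0.95}
page_refs = {"mean_x": 50, "distinct_rows": 3, "row_coverage": 3 / 40}
stray = {"mean_x": 210, "distinct_rows": 2, "row_coverage": 0.05}  # too close to a column

clusters = [page_refs, body, stray, translation]
margin = select_margin_clusters(clusters, [body, translation])
```

In this sketch the three-row `page_refs` cluster is kept because it sits well left of the first significant column, while `stray` is rejected by the assumed gap test. Under the old 4-row minimum, `page_refs` would have been skipped before any gap check ran.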