Cap gap_threshold at 25% of zone_w for column detection

In small zones (boxes), intra-phrase gaps inflate the median gap,
causing gap_threshold to become too large to detect real column
boundaries. Cap at 25% of zone width to prevent this.

Example: Box "Pounds and euros" has 4 columns at x≈148,534,751,1137
but gap_threshold was 531 (larger than the column gaps themselves).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-12 23:58:15 +02:00
parent 8b29d20940
commit 7b294f9150

@@ -282,6 +282,12 @@ def _cluster_columns_by_alignment(
         median_h = sorted(heights)[len(heights) // 2] if heights else 25
         # Column boundary: gap > 3× median gap or > 1.5× median word height
         gap_threshold = max(median_gap * 3, median_h * 1.5, 30)
+        # Cap at 25% of zone width — prevents over-merging in small zones (boxes)
+        # where intra-phrase gaps can inflate the median
+        max_gap = zone_w * 0.25
+        if gap_threshold > max_gap > 30:
+            logger.info("alignment columns: capping gap_threshold %.0f → %.0f (25%% of zone_w=%d)", gap_threshold, max_gap, zone_w)
+            gap_threshold = max_gap
     else:
         gap_threshold = 50
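
The capping rule in this hunk can be tried in isolation. The following is a minimal sketch, not the repository's code: the helper name `capped_gap_threshold` is hypothetical, and the inputs `median_gap=177` and `zone_w=1200` are assumed values chosen only to reproduce the threshold of 531 quoted in the commit message (the real zone width of the "Pounds and euros" box is not given).

```python
def capped_gap_threshold(median_gap, median_h, zone_w):
    """Hypothetical standalone version of the threshold logic in the hunk above."""
    # Base rule from the diff: 3x median gap, 1.5x median word height, floor of 30.
    gap_threshold = max(median_gap * 3, median_h * 1.5, 30)
    # Cap at 25% of the zone width, but only when the cap itself exceeds 30.
    max_gap = zone_w * 0.25
    if gap_threshold > max_gap > 30:
        gap_threshold = max_gap
    return gap_threshold


# Assumed inputs: 177 * 3 = 531, matching the uncapped threshold in the commit message.
print(capped_gap_threshold(median_gap=177, median_h=25, zone_w=1200))  # capped
print(capped_gap_threshold(median_gap=177, median_h=25, zone_w=4000))  # cap not reached
print(capped_gap_threshold(median_gap=177, median_h=25, zone_w=100))   # cap skipped
```

Note that the chained comparison `gap_threshold > max_gap > 30` skips the cap entirely when 25% of the zone width is 30 or less, so in very narrow zones the uncapped threshold is kept.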