Fix box column detection: use low gap_threshold for small zones

PaddleOCR returns multi-word blocks (whole phrases), so in small zones
(boxes, ≤60 words) every inter-word gap is a potential column boundary.
The previous 3x-median-gap approach produced thresholds too high to
detect real columns.

New approach for small zones: gap_threshold = max(median_h * 1.0, 25).
This correctly detects 4 columns in the "Pounds and euros" box, where
gaps range from 50 to 297 px and word height is ~31 px.
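
A minimal sketch of why the old rule missed columns in this box. The function names, the 800 px zone width, and the three sample gaps are assumptions for illustration. Only the formulas, the 60-word cutoff, the 50-297 px gap range, and the ~31 px word height come from this commit.

```python
def old_gap_threshold(median_gap, median_h, zone_w):
    """Previous rule: 3x median gap or 1.5x median height, capped at 25% of zone width."""
    threshold = max(median_gap * 3, median_h * 1.5, 30)
    max_gap = zone_w * 0.25
    if threshold > max_gap > 30:
        threshold = max_gap
    return threshold


def new_gap_threshold(median_gap, median_h, zone_w, word_count):
    """New rule: small zones use a low, height-based threshold."""
    if word_count <= 60:
        return max(median_h * 1.0, 25)
    return old_gap_threshold(median_gap, median_h, zone_w)


# Hypothetical gaps (px) inside the reported 50-297 px range.
gaps = [50, 120, 297]
median_gap = sorted(gaps)[len(gaps) // 2]  # 120
median_h = 31                               # word height from the commit message
zone_w = 800                                # assumed zone width

old = old_gap_threshold(median_gap, median_h, zone_w)                 # 360 capped to 200.0
new = new_gap_threshold(median_gap, median_h, zone_w, word_count=12)  # 31.0

# Assumes a gap counts as a column boundary when it exceeds the threshold.
old_boundaries = [g for g in gaps if g > old]  # -> [297]
new_boundaries = [g for g in gaps if g > new]  # -> [50, 120, 297]
```

Under these assumed numbers the old rule finds one boundary (two columns), while the new rule finds three boundaries (four columns).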

Also includes SmartSpellChecker fixes from previous commits:
- Frequency-based scoring, IPA protection, slash→l, rare-word threshold

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-13 07:55:29 +02:00
parent 693803fb7c
commit 5fa5767c9a


@@ -280,14 +280,27 @@ def _cluster_columns_by_alignment(
         median_gap = sorted_gaps[len(sorted_gaps) // 2]
         heights = [w["height"] for w in words if w.get("height", 0) > 0]
         median_h = sorted(heights)[len(heights) // 2] if heights else 25
-        # Column boundary: gap > 3× median gap or > 1.5× median word height
-        gap_threshold = max(median_gap * 3, median_h * 1.5, 30)
-        # Cap at 25% of zone width — prevents over-merging in small zones (boxes)
-        # where intra-phrase gaps can inflate the median
-        max_gap = zone_w * 0.25
-        if gap_threshold > max_gap > 30:
-            logger.info("alignment columns: capping gap_threshold %.0f%.0f (25%% of zone_w=%d)", gap_threshold, max_gap, zone_w)
-            gap_threshold = max_gap
+        # For small word counts (boxes, sub-zones): PaddleOCR returns
+        # multi-word blocks, so ALL inter-word gaps are potential column
+        # boundaries. Use a low threshold based on word height — any gap
+        # wider than ~1x median word height is a column separator.
+        if len(words) <= 60:
+            gap_threshold = max(median_h * 1.0, 25)
+            logger.info(
+                "alignment columns (small zone): gap_threshold=%.0f "
+                "(median_h=%.0f, %d words, %d gaps: %s)",
+                gap_threshold, median_h, len(words), len(sorted_gaps),
+                [int(g) for g in sorted_gaps[:10]],
+            )
+        else:
+            # Standard approach for large zones (full pages)
+            gap_threshold = max(median_gap * 3, median_h * 1.5, 30)
+            # Cap at 25% of zone width
+            max_gap = zone_w * 0.25
+            if gap_threshold > max_gap > 30:
+                logger.info("alignment columns: capping gap_threshold %.0f%.0f (25%% of zone_w=%d)", gap_threshold, max_gap, zone_w)
+                gap_threshold = max_gap
     else:
         gap_threshold = 50
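
The hunk above only computes `gap_threshold`; how it is consumed is not shown. As a rough sketch of the idea, a threshold like this can split one row of OCR blocks into columns. The helper `split_row_into_columns` and the `left`/`width` keys are hypothetical and not taken from `_cluster_columns_by_alignment`.

```python
def split_row_into_columns(words, gap_threshold):
    """Group blocks of one text row into columns (hypothetical helper).

    A new column starts whenever the horizontal gap between the right edge
    of the previous block and the left edge of the next exceeds gap_threshold.
    """
    ordered = sorted(words, key=lambda w: w["left"])
    columns = []
    current = []
    prev_right = None
    for w in ordered:
        if prev_right is not None and w["left"] - prev_right > gap_threshold:
            columns.append(current)
            current = []
        current.append(w)
        prev_right = w["left"] + w["width"]
    if current:
        columns.append(current)
    return columns


# Assumed row with gaps of 50, 120 and 297 px between consecutive blocks.
row = [
    {"left": 0, "width": 100},
    {"left": 150, "width": 100},
    {"left": 370, "width": 100},
    {"left": 767, "width": 80},
]

cols_low = split_row_into_columns(row, gap_threshold=31)    # 4 columns
cols_high = split_row_into_columns(row, gap_threshold=200)  # 2 columns
```

With the low, height-based threshold each block lands in its own column. With the higher threshold the first three blocks merge into one.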