Fix box column detection: use low gap_threshold for small zones
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 42s
CI / test-go-edu-search (push) Successful in 39s
CI / test-python-klausur (push) Failing after 2m48s
CI / test-python-agent-core (push) Successful in 38s
CI / test-nodejs-website (push) Successful in 30s
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 42s
CI / test-go-edu-search (push) Successful in 39s
CI / test-python-klausur (push) Failing after 2m48s
CI / test-python-agent-core (push) Successful in 38s
CI / test-nodejs-website (push) Successful in 30s
PaddleOCR returns multi-word blocks (whole phrases), so ALL inter-word gaps in small zones (boxes, ≤60 words) are column boundaries. Previous 3x-median approach produced thresholds too high to detect real columns. New approach for small zones: gap_threshold = max(median_h * 1.0, 25). This correctly detects 4 columns in "Pounds and euros" box where gaps range from 50-297px and word height is ~31px. Also includes SmartSpellChecker fixes from previous commits: - Frequency-based scoring, IPA protection, slash→l, rare-word threshold Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -280,14 +280,27 @@ def _cluster_columns_by_alignment(
|
||||
median_gap = sorted_gaps[len(sorted_gaps) // 2]
|
||||
heights = [w["height"] for w in words if w.get("height", 0) > 0]
|
||||
median_h = sorted(heights)[len(heights) // 2] if heights else 25
|
||||
# Column boundary: gap > 3× median gap or > 1.5× median word height
|
||||
gap_threshold = max(median_gap * 3, median_h * 1.5, 30)
|
||||
# Cap at 25% of zone width — prevents over-merging in small zones (boxes)
|
||||
# where intra-phrase gaps can inflate the median
|
||||
max_gap = zone_w * 0.25
|
||||
if gap_threshold > max_gap > 30:
|
||||
logger.info("alignment columns: capping gap_threshold %.0f → %.0f (25%% of zone_w=%d)", gap_threshold, max_gap, zone_w)
|
||||
gap_threshold = max_gap
|
||||
|
||||
# For small word counts (boxes, sub-zones): PaddleOCR returns
|
||||
# multi-word blocks, so ALL inter-word gaps are potential column
|
||||
# boundaries. Use a low threshold based on word height — any gap
|
||||
# wider than ~1x median word height is a column separator.
|
||||
if len(words) <= 60:
|
||||
gap_threshold = max(median_h * 1.0, 25)
|
||||
logger.info(
|
||||
"alignment columns (small zone): gap_threshold=%.0f "
|
||||
"(median_h=%.0f, %d words, %d gaps: %s)",
|
||||
gap_threshold, median_h, len(words), len(sorted_gaps),
|
||||
[int(g) for g in sorted_gaps[:10]],
|
||||
)
|
||||
else:
|
||||
# Standard approach for large zones (full pages)
|
||||
gap_threshold = max(median_gap * 3, median_h * 1.5, 30)
|
||||
# Cap at 25% of zone width
|
||||
max_gap = zone_w * 0.25
|
||||
if gap_threshold > max_gap > 30:
|
||||
logger.info("alignment columns: capping gap_threshold %.0f → %.0f (25%% of zone_w=%d)", gap_threshold, max_gap, zone_w)
|
||||
gap_threshold = max_gap
|
||||
else:
|
||||
gap_threshold = 50
|
||||
|
||||
|
||||
Reference in New Issue
Block a user