Fix colspan: use original words before split_cross_column_words

_split_cross_column_words destroyed the colspan information by cutting
word-blocks at column boundaries before _detect_colspan_cells could
analyze them. _build_zone_grid now passes the original (pre-split) words
to colspan detection and still uses the split words for cell building.
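
The ordering problem can be reproduced with a minimal sketch. Everything in it is hypothetical: `Word`, `split_at_boundaries`, and `spanning_words` are simplified stand-ins written for illustration, not the repository's `_split_cross_column_words` or `_detect_colspan_cells`, and the pixel coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int   # x-coordinate of the left edge, in pixels
    right: int  # x-coordinate of the right edge, in pixels


def split_at_boundaries(words, boundaries):
    """Toy stand-in for the split step: cut each box at every boundary it crosses.

    Assumes `boundaries` is sorted ascending. The text is copied unchanged
    into each fragment to keep the sketch short.
    """
    out = []
    for w in words:
        left = w.left
        for b in boundaries:
            if left < b < w.right:
                out.append(Word(w.text, left, b))
                left = b
        out.append(Word(w.text, left, w.right))
    return out


def spanning_words(words, boundaries):
    """Toy stand-in for colspan detection: boxes that cross a column boundary."""
    return [w for w in words if any(w.left < b < w.right for b in boundaries)]


boundaries = [100, 200]  # x-positions separating three columns
original = [Word("merged header", 20, 180), Word("ja", 210, 240)]
split = split_at_boundaries(original, boundaries)

# Detection on the original words still sees the box crossing x=100.
spans_before_split = spanning_words(original, boundaries)
# After splitting, no fragment crosses a boundary, so nothing is detected.
spans_after_split = spanning_words(split, boundaries)
```

Under these assumptions the spanning box is found only when detection runs on the pre-split list, which is why the fix keeps a reference to the original words instead of reordering the two steps.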

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-13 11:58:32 +02:00
parent c62ff7cd31
commit dc25f243a4

@@ -1424,6 +1424,8 @@ def _build_zone_grid(
     # Split word boxes that straddle column boundaries (e.g. "sichzie"
     # spanning Col 1 + Col 2). Must happen after column detection and
     # before cell assignment.
+    # Keep original words for colspan detection (split destroys span info).
+    original_zone_words = zone_words
     if len(columns) >= 2:
         zone_words = _split_cross_column_words(zone_words, columns)
@@ -1431,11 +1433,11 @@ def _build_zone_grid(
     cells = _build_cells(zone_words, columns, rows, img_w, img_h)
     # --- Detect colspan (merged cells spanning multiple columns) ---
-    # A word-block that extends across column boundaries indicates a merged
-    # cell (like Excel cell-merge). Detect these and replace the split
-    # cells with a single spanning cell.
+    # Uses the ORIGINAL (pre-split) words to detect word-blocks that span
+    # multiple columns. _split_cross_column_words would have destroyed
+    # this information by cutting words at column boundaries.
     if len(columns) >= 2:
-        cells = _detect_colspan_cells(zone_words, columns, rows, cells, img_w, img_h)
+        cells = _detect_colspan_cells(original_zone_words, columns, rows, cells, img_w, img_h)
     # Prefix cell IDs with zone index
     for cell in cells: