Fix colspan: use original words before _split_cross_column_words

_split_cross_column_words destroyed the colspan information by cutting word-blocks at column boundaries BEFORE _detect_colspan_cells could analyze them. _build_zone_grid now passes the original (pre-split) words to colspan detection while still using the split words for cell building.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
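To illustrate why the order matters, here is a minimal sketch of the failure mode, not the repository's actual code. The helpers `spanned_columns` and `split_at_boundaries`, the word keys (`text`, `left`, `width`), and the column keys (`x_min`, `x_max`) are all hypothetical stand-ins chosen for this example; the real `_split_cross_column_words` and `_detect_colspan_cells` may work differently.

```python
def spanned_columns(word: dict, columns: list) -> list:
    """Return indices of the columns that the word box overlaps horizontally."""
    left = word["left"]
    right = left + word["width"]
    return [
        i for i, col in enumerate(columns)
        if left < col["x_max"] and right > col["x_min"]
    ]


def split_at_boundaries(word: dict, columns: list) -> list:
    """Cut a word box at every column boundary it crosses.

    Hypothetical stand-in for the splitting step: only the geometry is
    divided here; the text is copied unchanged into each piece.
    """
    left = word["left"]
    right = left + word["width"]
    pieces = []
    for col in columns:
        lo = max(left, col["x_min"])
        hi = min(right, col["x_max"])
        if hi > lo:  # the word overlaps this column
            pieces.append({"text": word["text"], "left": lo, "width": hi - lo})
    return pieces


# Two adjacent columns and one word-block that straddles both (assumed data).
columns = [{"x_min": 0, "x_max": 100}, {"x_min": 100, "x_max": 200}]
original_words = [{"text": "merged header", "left": 20, "width": 150}]

split_words = [
    piece for w in original_words for piece in split_at_boundaries(w, columns)
]

# Number of columns each word covers, before and after splitting.
spans_original = [len(spanned_columns(w, columns)) for w in original_words]
spans_split = [len(spanned_columns(w, columns)) for w in split_words]

print(spans_original)  # the original block covers two columns
print(spans_split)     # every split piece covers exactly one column
```

In this sketch, a span detector that only sees `split_words` can never find a block covering more than one column, which is the situation the commit avoids by handing the pre-split words to colspan detection.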
2026-04-13 11:58:32 +02:00
parent c62ff7cd31
commit dc25f243a4
1 changed file with 6 additions and 4 deletions
@@ -1424,6 +1424,8 @@ def _build_zone_grid(
    # Split word boxes that straddle column boundaries (e.g. "sichzie"
    # spanning Col 1 + Col 2).  Must happen after column detection and
    # before cell assignment.
    # Keep original words for colspan detection (split destroys span info).
    original_zone_words = zone_words
    if len(columns) >= 2:
        zone_words = _split_cross_column_words(zone_words, columns)
@@ -1431,11 +1433,11 @@ def _build_zone_grid(
    cells = _build_cells(zone_words, columns, rows, img_w, img_h)
    # --- Detect colspan (merged cells spanning multiple columns) ---
-    # A word-block that extends across column boundaries indicates a merged
+    # Uses the ORIGINAL (pre-split) words to detect word-blocks that span
-    # cell (like Excel cell-merge).  Detect these and replace the split
+    # multiple columns.  _split_cross_column_words would have destroyed
-    # cells with a single spanning cell.
+    # this information by cutting words at column boundaries.
    if len(columns) >= 2:
-        cells = _detect_colspan_cells(zone_words, columns, rows, cells, img_w, img_h)
+        cells = _detect_colspan_cells(original_zone_words, columns, rows, cells, img_w, img_h)
    # Prefix cell IDs with zone index
    for cell in cells: