fix(sub-columns): convert relative word positions to absolute coords for split

Word 'left' values in ColumnGeometry.words are relative to the content ROI (left_x), but geo.x is in absolute image coordinates. The split position was computed from relative word positions and then compared against absolute geo.x, resulting in negative widths and no splits on real data. Pass left_x through to _detect_sub_columns to bridge the two coordinate systems. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(tests): adjust word counts so 10% threshold works correctly
2026-03-02 19:16:13 +01:00 · 2026-03-02 19:00:14 +01:00 · 2026-03-02 18:56:38 +01:00 · 2026-03-02 18:26:24 +01:00 · 2026-03-02 18:23:55 +01:00 · 2026-03-02 18:18:02 +01:00
3 changed files with 369 additions and 2 deletions
--- a/klausur-service/backend/cv_vocab_pipeline.py
+++ b/klausur-service/backend/cv_vocab_pipeline.py
@@ -140,6 +140,7 @@ class VocabRow:
    english: str = ""
    german: str = ""
    example: str = ""
+    source_page: str = ""
    confidence: float = 0.0
    y_position: int = 0

@@ -1033,6 +1034,132 @@ def _detect_columns_by_clustering(
    )


+def _detect_sub_columns(
+    geometries: List[ColumnGeometry],
+    content_w: int,
+    left_x: int = 0,
+    _edge_tolerance: int = 8,
+    _min_col_start_ratio: float = 0.10,
+) -> List[ColumnGeometry]:
+    """Split columns that contain internal sub-columns based on left-edge alignment.
+
+    For each column, clusters word left-edges into alignment bins (within
+    ``_edge_tolerance`` px).  The leftmost bin whose word count reaches
+    ``_min_col_start_ratio`` of the column total is treated as the true column
+    start.  Any words to the left of that bin form a sub-column, provided they
+    number >= 2 and < 35 % of total.
+
+    Word ``left`` values are relative to the content ROI (offset by *left_x*),
+    while ``ColumnGeometry.x`` is in absolute image coordinates.  *left_x*
+    bridges the two coordinate systems.
+
+    Returns a new list of ColumnGeometry — potentially longer than the input.
+    """
+    if content_w <= 0:
+        return geometries
+
+    result: List[ColumnGeometry] = []
+    for geo in geometries:
+        # Only consider wide-enough columns with enough words
+        if geo.width_ratio < 0.15 or geo.word_count < 5:
+            result.append(geo)
+            continue
+
+        # Collect left-edges of confident words
+        confident = [w for w in geo.words if w.get('conf', 0) >= 30]
+        if len(confident) < 3:
+            result.append(geo)
+            continue
+
+        # --- Cluster left-edges into alignment bins ---
+        sorted_edges = sorted(w['left'] for w in confident)
+        bins: List[Tuple[int, int, int, int]] = []  # (center, count, min_edge, max_edge)
+        cur = [sorted_edges[0]]
+        for i in range(1, len(sorted_edges)):
+            if sorted_edges[i] - cur[-1] <= _edge_tolerance:
+                cur.append(sorted_edges[i])
+            else:
+                bins.append((sum(cur) // len(cur), len(cur), min(cur), max(cur)))
+                cur = [sorted_edges[i]]
+        bins.append((sum(cur) // len(cur), len(cur), min(cur), max(cur)))
+
+        # --- Find the leftmost bin qualifying as a real column start ---
+        total = len(confident)
+        min_count = max(3, int(total * _min_col_start_ratio))
+        col_start_bin = None
+        for b in bins:
+            if b[1] >= min_count:
+                col_start_bin = b
+                break
+
+        if col_start_bin is None:
+            result.append(geo)
+            continue
+
+        # Words to the left of the column-start bin are sub-column candidates
+        split_threshold = col_start_bin[2] - _edge_tolerance
+        sub_words = [w for w in geo.words if w['left'] < split_threshold]
+        main_words = [w for w in geo.words if w['left'] >= split_threshold]
+
+        if len(sub_words) < 2 or len(sub_words) / len(geo.words) >= 0.35:
+            result.append(geo)
+            continue
+
+        # --- Build two sub-column geometries ---
+        # Word 'left' values are relative to left_x; geo.x is absolute.
+        # Convert the split position from relative to absolute coordinates.
+        max_sub_left = max(w['left'] for w in sub_words)
+        split_rel = (max_sub_left + col_start_bin[2]) // 2
+        split_abs = split_rel + left_x
+
+        sub_x = geo.x
+        sub_width = split_abs - geo.x
+        main_x = split_abs
+        main_width = (geo.x + geo.width) - split_abs
+
+        if sub_width <= 0 or main_width <= 0:
+            result.append(geo)
+            continue
+
+        sub_geo = ColumnGeometry(
+            index=0,
+            x=sub_x,
+            y=geo.y,
+            width=sub_width,
+            height=geo.height,
+            word_count=len(sub_words),
+            words=sub_words,
+            width_ratio=sub_width / content_w if content_w > 0 else 0.0,
+        )
+        main_geo = ColumnGeometry(
+            index=0,
+            x=main_x,
+            y=geo.y,
+            width=main_width,
+            height=geo.height,
+            word_count=len(main_words),
+            words=main_words,
+            width_ratio=main_width / content_w if content_w > 0 else 0.0,
+        )
+
+        result.append(sub_geo)
+        result.append(main_geo)
+
+        logger.info(
+            f"SubColumnSplit: column idx={geo.index} split at abs_x={split_abs} "
+            f"(rel={split_rel}), sub={len(sub_words)} words, "
+            f"main={len(main_words)} words, "
+            f"col_start_bin=({col_start_bin[0]}, n={col_start_bin[1]})"
+        )
+
+    # Re-index by left-to-right order
+    result.sort(key=lambda g: g.x)
+    for i, g in enumerate(result):
+        g.index = i
+
+    return result
+
+
 def _build_geometries_from_starts(
    col_starts: List[Tuple[int, int]],
    word_dicts: List[Dict],
@@ -2727,6 +2854,9 @@ def analyze_layout_by_words(ocr_img: np.ndarray, dewarped_bgr: np.ndarray) -> Li
    geometries, left_x, right_x, top_y, bottom_y, _word_dicts, _inv = result
    content_w = right_x - left_x

+    # Split sub-columns (e.g. page references) before classification
+    geometries = _detect_sub_columns(geometries, content_w, left_x=left_x)
+
    # Phase B: Content-based classification
    regions = classify_column_types(geometries, content_w, top_y, w, h, bottom_y,
                                    left_x=left_x, right_x=right_x, inv=_inv)
@@ -3841,7 +3971,7 @@ def build_cell_grid(
        return [], []

    # Use columns only — skip ignore, header, footer, page_ref
-    _skip_types = {'column_ignore', 'header', 'footer', 'margin_top', 'margin_bottom', 'page_ref', 'margin_left', 'margin_right'}
+    _skip_types = {'column_ignore', 'header', 'footer', 'margin_top', 'margin_bottom', 'margin_left', 'margin_right'}
    relevant_cols = [c for c in column_regions if c.type not in _skip_types]
    if not relevant_cols:
        logger.warning("build_cell_grid: no usable columns found")
@@ -4003,7 +4133,7 @@ def build_cell_grid_streaming(
    if not content_rows:
        return

-    _skip_types = {'column_ignore', 'header', 'footer', 'margin_top', 'margin_bottom', 'page_ref', 'margin_left', 'margin_right'}
+    _skip_types = {'column_ignore', 'header', 'footer', 'margin_top', 'margin_bottom', 'margin_left', 'margin_right'}
    relevant_cols = [c for c in column_regions if c.type not in _skip_types]
    if not relevant_cols:
        return
@@ -4055,11 +4185,13 @@ def _cells_to_vocab_entries(
        'column_en': 'english',
        'column_de': 'german',
        'column_example': 'example',
+        'page_ref': 'source_page',
    }
    bbox_key_map = {
        'column_en': 'bbox_en',
        'column_de': 'bbox_de',
        'column_example': 'bbox_ex',
+        'page_ref': 'bbox_ref',
    }

    # Group cells by row_index
@@ -4076,11 +4208,13 @@ def _cells_to_vocab_entries(
            'english': '',
            'german': '',
            'example': '',
+            'source_page': '',
            'confidence': 0.0,
            'bbox': None,
            'bbox_en': None,
            'bbox_de': None,
            'bbox_ex': None,
+            'bbox_ref': None,
            'ocr_engine': row_cells[0].get('ocr_engine', '') if row_cells else '',
        }

--- a/klausur-service/backend/ocr_pipeline_api.py
+++ b/klausur-service/backend/ocr_pipeline_api.py
@@ -34,6 +34,7 @@ from cv_vocab_pipeline import (
    PageRegion,
    RowGeometry,
    _cells_to_vocab_entries,
+    _detect_sub_columns,
    _fix_character_confusion,
    _fix_phonetic_brackets,
    analyze_layout,
@@ -698,6 +699,9 @@ async def detect_columns(session_id: str):
        cached["_inv"] = inv
        cached["_content_bounds"] = (left_x, right_x, top_y, bottom_y)

+        # Split sub-columns (e.g. page references) before classification
+        geometries = _detect_sub_columns(geometries, content_w, left_x=left_x)
+
        # Phase B: Content-based classification
        regions = classify_column_types(geometries, content_w, top_y, w, h, bottom_y,
                                        left_x=left_x, right_x=right_x, inv=inv)
--- a/klausur-service/backend/tests/test_cv_vocab_pipeline.py
+++ b/klausur-service/backend/tests/test_cv_vocab_pipeline.py
@@ -24,6 +24,7 @@ from dataclasses import asdict

 # Import module under test
 from cv_vocab_pipeline import (
+    ColumnGeometry,
    PageRegion,
    VocabRow,
    PipelineResult,
@@ -35,6 +36,7 @@ from cv_vocab_pipeline import (
    _filter_narrow_runs,
    _build_margin_regions,
    _detect_header_footer_gaps,
+    _detect_sub_columns,
    _region_has_content,
    _add_header_footer,
    analyze_layout,
@@ -1170,6 +1172,233 @@ class TestRegionContentCheck:
        assert bottom_regions[0].type == 'footer'


+# =============================================
+# Sub-Column Detection Tests
+# =============================================
+
+class TestSubColumnDetection:
+    """Tests for _detect_sub_columns() left-edge alignment detection."""
+
+    def _make_word(self, left: int, text: str = "word", conf: int = 90) -> dict:
+        return {'left': left, 'top': 100, 'width': 50, 'height': 20,
+                'text': text, 'conf': conf}
+
+    def _make_geo(self, x: int, width: int, words: list, content_w: int = 1000) -> ColumnGeometry:
+        return ColumnGeometry(
+            index=0, x=x, y=50, width=width, height=500,
+            word_count=len(words), words=words,
+            width_ratio=width / content_w,
+        )
+
+    def test_sub_column_split_page_refs(self):
+        """3 page-refs left + 40 vocab words right → split into 2.
+
+        The leftmost bin with >= 10% of words (>= 5) is the vocab bin
+        at left=250, so the 3 page-refs are outliers.
+        """
+        content_w = 1000
+        page_words = [self._make_word(100, f"p.{59+i}") for i in range(3)]
+        vocab_words = [self._make_word(250, f"word{i}") for i in range(40)]
+        all_words = page_words + vocab_words
+        geo = self._make_geo(x=80, width=300, words=all_words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 2, f"Expected 2 columns, got {len(result)}"
+        left_col = result[0]
+        right_col = result[1]
+        assert left_col.x < right_col.x
+        assert left_col.word_count == 3
+        assert right_col.word_count == 40
+        assert left_col.index == 0
+        assert right_col.index == 1
+
+    def test_sub_column_split_exclamation_marks(self):
+        """5 '!' (misread as I/|) left + 80 example words → split into 2.
+
+        Mirrors the real-world case where red ! marks are OCR'd as I, |, B, 1
+        at a position slightly left of the example sentence start.
+        """
+        content_w = 1500
+        bang_words = [self._make_word(950 + i, chr(ord('I')), conf=60) for i in range(5)]
+        example_words = [self._make_word(975 + (i * 3), f"word{i}") for i in range(80)]
+        all_words = bang_words + example_words
+        geo = self._make_geo(x=940, width=530, words=all_words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 2
+        assert result[0].word_count == 5
+        assert result[1].word_count == 80
+
+    def test_no_split_uniform_alignment(self):
+        """All words aligned at same position → no change."""
+        content_w = 1000
+        words = [self._make_word(200, f"word{i}") for i in range(15)]
+        geo = self._make_geo(x=180, width=300, words=words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 1
+        assert result[0].word_count == 15
+
+    def test_no_split_narrow_column(self):
+        """Narrow column (width_ratio < 0.15) → no split attempted."""
+        content_w = 1000
+        words = [self._make_word(50, "a")] * 3 + [self._make_word(120, "b")] * 10
+        geo = self._make_geo(x=40, width=140, words=words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 1
+
+    def test_no_split_balanced_clusters(self):
+        """Both clusters similarly sized (ratio >= 0.35) → no split."""
+        content_w = 1000
+        left_words = [self._make_word(100, f"a{i}") for i in range(8)]
+        right_words = [self._make_word(300, f"b{i}") for i in range(12)]
+        all_words = left_words + right_words
+        geo = self._make_geo(x=80, width=400, words=all_words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 1
+
+    def test_sub_column_reindexing(self):
+        """After split, indices are correctly 0, 1, 2 across all columns."""
+        content_w = 1000
+        # First column: no split (all words at same alignment)
+        words1 = [self._make_word(50, f"de{i}") for i in range(10)]
+        geo1 = ColumnGeometry(index=0, x=30, y=50, width=200, height=500,
+                              word_count=10, words=words1, width_ratio=0.2)
+        # Second column: will split (3 outliers + 40 main)
+        page_words = [self._make_word(400, f"p.{i}") for i in range(3)]
+        en_words = [self._make_word(550, f"en{i}") for i in range(40)]
+        geo2 = ColumnGeometry(index=1, x=380, y=50, width=300, height=500,
+                              word_count=43, words=page_words + en_words, width_ratio=0.3)
+
+        result = _detect_sub_columns([geo1, geo2], content_w)
+
+        assert len(result) == 3
+        assert [g.index for g in result] == [0, 1, 2]
+        assert result[0].word_count == 10
+        assert result[1].word_count == 3
+        assert result[2].word_count == 40
+
+    def test_no_split_too_few_words(self):
+        """Column with fewer than 5 words → no split attempted."""
+        content_w = 1000
+        words = [self._make_word(100, "a"), self._make_word(300, "b"),
+                 self._make_word(300, "c"), self._make_word(300, "d")]
+        geo = self._make_geo(x=80, width=300, words=words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 1
+
+    def test_no_split_single_minority_word(self):
+        """Only 1 word left of column start → no split (need >= 2)."""
+        content_w = 1000
+        minority = [self._make_word(100, "p.59")]
+        majority = [self._make_word(300, f"w{i}") for i in range(30)]
+        geo = self._make_geo(x=80, width=350, words=minority + majority, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w)
+
+        assert len(result) == 1
+
+    def test_sub_column_split_with_left_x_offset(self):
+        """Word 'left' values are relative to left_x; geo.x is absolute.
+
+        Real-world scenario: left_x=195, EN column at geo.x=310.
+        Page refs at relative left=115-157, vocab words at relative left=216.
+        Without left_x, split_x would be ~202 (< geo.x=310) → negative width → no split.
+        With left_x=195, split_abs = 202 + 195 = 397, which is between geo.x(310)
+        and geo.x+geo.width(748) → valid split.
+        """
+        content_w = 1469
+        left_x = 195
+        page_refs = [self._make_word(115, "p.59"), self._make_word(157, "p.60"),
+                     self._make_word(157, "p.61")]
+        vocab = [self._make_word(216, f"word{i}") for i in range(40)]
+        all_words = page_refs + vocab
+        geo = self._make_geo(x=310, width=438, words=all_words, content_w=content_w)
+
+        result = _detect_sub_columns([geo], content_w, left_x=left_x)
+
+        assert len(result) == 2, f"Expected 2 columns, got {len(result)}"
+        assert result[0].word_count == 3
+        assert result[1].word_count == 40
+
+
+class TestCellsToVocabEntriesPageRef:
+    """Test that page_ref cells are mapped to source_page field."""
+
+    def test_page_ref_mapped_to_source_page(self):
+        """Cell with col_type='page_ref' → source_page field populated."""
+        from cv_vocab_pipeline import _cells_to_vocab_entries
+
+        cells = [
+            {
+                'row_index': 0,
+                'col_type': 'column_en',
+                'text': 'hello',
+                'bbox_pct': {'x': 10, 'y': 10, 'w': 30, 'h': 5},
+                'confidence': 95.0,
+                'ocr_engine': 'tesseract',
+            },
+            {
+                'row_index': 0,
+                'col_type': 'column_de',
+                'text': 'hallo',
+                'bbox_pct': {'x': 40, 'y': 10, 'w': 30, 'h': 5},
+                'confidence': 90.0,
+                'ocr_engine': 'tesseract',
+            },
+            {
+                'row_index': 0,
+                'col_type': 'page_ref',
+                'text': 'p.59',
+                'bbox_pct': {'x': 5, 'y': 10, 'w': 5, 'h': 5},
+                'confidence': 80.0,
+                'ocr_engine': 'tesseract',
+            },
+        ]
+        columns_meta = [
+            {'type': 'column_en'}, {'type': 'column_de'}, {'type': 'page_ref'},
+        ]
+
+        entries = _cells_to_vocab_entries(cells, columns_meta)
+
+        assert len(entries) == 1
+        assert entries[0]['english'] == 'hello'
+        assert entries[0]['german'] == 'hallo'
+        assert entries[0]['source_page'] == 'p.59'
+        assert entries[0]['bbox_ref'] == {'x': 5, 'y': 10, 'w': 5, 'h': 5}
+
+    def test_no_page_ref_defaults_empty(self):
+        """Without page_ref cell, source_page defaults to empty string."""
+        from cv_vocab_pipeline import _cells_to_vocab_entries
+
+        cells = [
+            {
+                'row_index': 0,
+                'col_type': 'column_en',
+                'text': 'world',
+                'bbox_pct': {'x': 10, 'y': 10, 'w': 30, 'h': 5},
+                'confidence': 95.0,
+                'ocr_engine': 'tesseract',
+            },
+        ]
+        columns_meta = [{'type': 'column_en'}]
+
+        entries = _cells_to_vocab_entries(cells, columns_meta)
+
+        assert len(entries) == 1
+        assert entries[0]['source_page'] == ''
+        assert entries[0]['bbox_ref'] is None
+
+
 # =============================================
 # RUN TESTS
 # =============================================
Author	SHA1	Message	Date
Benjamin Admin	3904ddb493	fix(sub-columns): convert relative word positions to absolute coords for split Some checks failed CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 24s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m51s Details CI / test-python-agent-core (push) Successful in 14s Details CI / test-nodejs-website (push) Successful in 17s Details Word 'left' values in ColumnGeometry.words are relative to the content ROI (left_x), but geo.x is in absolute image coordinates. The split position was computed from relative word positions and then compared against absolute geo.x, resulting in negative widths and no splits on real data. Pass left_x through to _detect_sub_columns to bridge the two coordinate systems. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 19:16:13 +01:00
Benjamin Admin	6e1a349eed	fix(tests): adjust word counts so 10% threshold works correctly Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 19:00:14 +01:00
Benjamin Admin	7252f9a956	refactor(ocr-pipeline): use left-edge alignment approach for sub-column detection Replace gap-based splitting with alignment-bin approach: cluster word left-edges within 8px tolerance, find the leftmost bin with >= 10% of words as the true column start, split off any words to its left as a sub-column. This correctly handles both page references ("p.59") and misread exclamation marks ("!" → "I") even when the pixel gap is small. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:56:38 +01:00
Benjamin Admin	f13116345b	fix(tests): use correct bbox_pct dict format in _cells_to_vocab_entries tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:26:24 +01:00
Benjamin Admin	991984d9c3	fix(tests): pass columns_meta arg to _cells_to_vocab_entries tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:23:55 +01:00
Benjamin Admin	1a246eb059	feat(ocr-pipeline): generic sub-column detection via left-edge clustering Detects hidden sub-columns (e.g. page references like "p.59") within already-recognized columns by clustering word left-edge positions and splitting when a clear minority cluster exists. The sub-column is then classified as page_ref and mapped to VocabRow.source_page. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 18:18:02 +01:00