fix(ocr-pipeline): use gap-based row height for cluster tolerance

The y_tolerance for word-center clustering was derived from the median word height (21px → 12px tolerance), which was too small: words on the same line can have centers 15-20px apart because their heights differ. The tolerance is now 40% of the gap-based median row height (e.g. 40px row → 16px tolerance), and the merge threshold is 30% of that row height. This produces cluster counts that match the actual text lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:34:15 +01:00
parent 4970ca903e
commit 8e861e5a4d
1 changed file with 15 additions and 8 deletions
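
The arithmetic in the commit message can be checked with a small standalone sketch. The helper names (`upper_median`, `old_y_tol`, `new_tolerances`, `cluster_centers`) are hypothetical and do not appear in the pipeline. The grouping rule used here (join a cluster when within `y_tol` of its running mean) is an assumption, since the actual grouping loop is not part of this diff.

```python
from typing import List, Tuple


def upper_median(values: List[float]) -> float:
    """Middle element of the sorted values (upper middle for even counts).

    Mirrors the sorted(...)[len(...) // 2] pattern used in the diff.
    """
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def old_y_tol(word_heights: List[float]) -> int:
    """Previous rule: 60% of the median word height, floor of 10px."""
    return max(10, int(upper_median(word_heights) * 0.6))


def new_tolerances(row_heights: List[float]) -> Tuple[int, float]:
    """New rule: (y_tol, merge_threshold) from gap-based row heights."""
    median_row_h = upper_median(row_heights)
    y_tol = max(10, int(median_row_h * 0.4))      # 40% of row height
    merge_threshold = max(8, median_row_h * 0.3)  # 30% of row height
    return y_tol, merge_threshold


def cluster_centers(centers: List[float], y_tol: float) -> List[List[float]]:
    """Group center_y values by proximity.

    Assumed rule: a value joins the current cluster when it lies within
    y_tol of that cluster's running mean; otherwise it starts a new one.
    """
    clusters: List[List[float]] = []
    for c in sorted(centers):
        if clusters and abs(c - sum(clusters[-1]) / len(clusters[-1])) <= y_tol:
            clusters[-1].append(c)
        else:
            clusters.append([c])
    return clusters


if __name__ == "__main__":
    # Made-up center_y values for two text lines with uneven word heights.
    centers = [100, 115, 118, 140, 156]
    tol_old = old_y_tol([19, 21, 23])
    tol_new, merge_thr = new_tolerances([38, 40, 42])
    print("old y_tol:", tol_old, "clusters:", len(cluster_centers(centers, tol_old)))
    print("new y_tol:", tol_new, "clusters:", len(cluster_centers(centers, tol_new)))
    print("merge threshold:", merge_thr)
```

With these made-up centers, the word-height tolerance splits two text lines into extra clusters, while the row-height tolerance keeps each line together. That is the failure mode the commit message describes.
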
@@ -1602,11 +1602,18 @@ def _regularize_row_grid(
    word_heights = sorted(w['height'] for w in content_words)
    median_wh = word_heights[len(word_heights) // 2]

-    # Group by VERTICAL CENTER, not by top.  Tall characters (brackets,
-    # phonetic symbols) have a much lower top but the same center_y as
-    # normal text on the same line.  Grouping by top would split them
-    # into separate clusters → halved pitch → halved row heights.
-    y_tol = max(10, int(median_wh * 0.6))
+    # Compute median gap-based row height — this is the actual line height
+    # as detected by the horizontal projection.  We use 40% of this as
+    # grouping tolerance.  This is much more reliable than using word height
+    # alone, because words on the same line can have very different heights
+    # (e.g. lowercase vs uppercase, brackets, phonetic symbols).
+    gap_row_heights = sorted(r.height for r in content_rows)
+    median_row_h = gap_row_heights[len(gap_row_heights) // 2]
+
+    # Tolerance: 40% of row height.  Words on the same line should have
+    # centers within this range.  Even if a word's bbox is taller/shorter,
+    # its center should stay within half a row height of the line center.
+    y_tol = max(10, int(median_row_h * 0.4))

    # Sort by center_y, then group by proximity
    words_by_center = sorted(content_words,
@@ -1658,8 +1665,8 @@ def _regularize_row_grid(
    # --- Step B2: Merge clusters that are too close together ---
    # Even with center-based grouping, some edge cases can produce
    # spurious clusters.  Merge any pair whose centers are closer
-    # than 0.4× median_wh (they're definitely the same text line).
-    merge_threshold = max(5, median_wh * 0.4)
+    # than 30% of the row height (they're definitely the same text line).
+    merge_threshold = max(8, median_row_h * 0.3)
    merged: List[Dict] = [cluster_info[0]]
    for cl in cluster_info[1:]:
        prev = merged[-1]
@@ -1832,7 +1839,7 @@ def _regularize_row_grid(
    min_h = min(row_heights) if row_heights else 0
    max_h = max(row_heights) if row_heights else 0
    logger.info(f"RowGrid: word-center grid applied "
-                f"(median_pitch={median_pitch:.0f}px, median_wh={median_wh}px, "
+                f"(median_pitch={median_pitch:.0f}px, median_row_h={median_row_h}px, median_wh={median_wh}px, "
                f"y_tol={y_tol}px, {len(line_clusters)} clusters→{len(cluster_info)} merged, "
                f"{len(sections)} sections, "
                f"{len(grid_rows)} grid rows [h={min_h}-{max_h}px], "