fix(ocr-pipeline): prevent grid from producing more rows than gap-based

Two fixes:

1. Grid validation: reject word-center grid if it produces MORE rows
   than gap-based detection (more rows = lines were split = worse).
   Falls back to gap-based rows in that case.
2. Words overlay: draw clean grid cells (column × row intersections)
   instead of padded entry bboxes. Eliminates confusing double lines.
   OCR text labels are placed inside the grid cells directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:52:41 +01:00
parent 8e861e5a4d
commit c4f2e6554e
2 changed files with 57 additions and 43 deletions
--- a/klausur-service/backend/cv_vocab_pipeline.py
+++ b/klausur-service/backend/cv_vocab_pipeline.py
@@ -1829,6 +1829,13 @@ def _regularize_row_grid(
    # Remove empty grid rows (no words assigned)
    grid_rows = [gr for gr in grid_rows if gr.word_count > 0]

+    # The grid must not produce MORE rows than gap-based detection.
+    # More rows means the clustering split actual lines — that's worse.
+    if len(grid_rows) > len(content_rows):
+        logger.info(f"RowGrid: grid produced {len(grid_rows)} rows > "
+                    f"{len(content_rows)} gap-based → keeping gap-based rows")
+        return rows
+
    # --- Step H: Merge header/footer + re-index ---
    result = list(non_content) + grid_rows
    result.sort(key=lambda r: r.y)
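
The fallback rule in the hunk above can be tried in isolation with a small sketch. It is not the pipeline's code: the `Row` dataclass and the `choose_rows` helper are hypothetical stand-ins, assuming only that each row carries a `y` position and a `word_count`, as the diff suggests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    """Hypothetical stand-in for the pipeline's row objects."""
    y: int
    word_count: int


def choose_rows(gap_rows: List[Row], grid_rows: List[Row]) -> List[Row]:
    """Return the grid rows unless they outnumber the gap-based rows."""
    # Drop grid rows that received no words, as the hunk does first.
    non_empty = [r for r in grid_rows if r.word_count > 0]
    # More grid rows than gap-based rows suggests a real line was split,
    # so the gap-based rows are returned unchanged.
    if len(non_empty) > len(gap_rows):
        return gap_rows
    # Otherwise accept the grid, ordered top to bottom.
    return sorted(non_empty, key=lambda r: r.y)


gap = [Row(y=10, word_count=3), Row(y=50, word_count=2)]
split_grid = [Row(y=10, word_count=2), Row(y=30, word_count=1), Row(y=50, word_count=2)]
print(choose_rows(gap, split_grid) is gap)
```

Note that the comparison runs after empty grid rows are removed, so a grid that only adds empty rows is still accepted.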