fix(ocr-pipeline): overlap-based word assignment and empty row filtering

1. Word-to-column assignment now uses overlap-based matching instead of
   center-point matching. This fixes narrow page_ref columns losing
   their last digit (e.g. "p.59" → "p.5") when the digit's center falls
   slightly past the midpoint boundary into the next column.

2. Post-OCR empty row filter: rows where ALL cells have empty text are
   removed after OCR. This catches inter-row gaps that had stray
   Tesseract artifacts giving word_count > 0 but no actual content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 11:00:29 +01:00
parent ccba2bb887
commit 606bef0591
2 changed files with 61 additions and 19 deletions
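
The overlap-based assignment from item 1 of the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit. The dict keys (`left`, `width`, `x`), the function names, and the strict-containment center rule are all assumptions made for the example, and the real implementation may define column boundaries and score overlap differently.

```python
from typing import Dict, List


def horizontal_overlap(a_left: float, a_right: float,
                       b_left: float, b_right: float) -> float:
    """Length of the shared horizontal span of two intervals (0 if disjoint)."""
    return max(0.0, min(a_right, b_right) - max(a_left, b_left))


def assign_by_center(word: Dict, columns: List[Dict]) -> int:
    """Hypothetical center-point rule: the column whose extent contains
    the word's center, or -1 if no column contains it."""
    center = word["left"] + word["width"] / 2
    for idx, col in enumerate(columns):
        if col["x"] <= center < col["x"] + col["width"]:
            return idx
    return -1


def assign_by_overlap(word: Dict, columns: List[Dict]) -> int:
    """Hypothetical overlap rule: the column sharing the largest horizontal
    span with the word's bounding box, or -1 if nothing overlaps.
    Ties go to the leftmost column."""
    w_left = word["left"]
    w_right = w_left + word["width"]
    best_idx, best_overlap = -1, 0.0
    for idx, col in enumerate(columns):
        ov = horizontal_overlap(w_left, w_right,
                                col["x"], col["x"] + col["width"])
        if ov > best_overlap:
            best_idx, best_overlap = idx, ov
    return best_idx


# Assumed layout: a narrow page_ref column followed by a wide text column.
columns = [
    {"x": 100, "width": 40},   # page_ref column, spans 100..140
    {"x": 150, "width": 200},  # next column, spans 150..350
]

# Trailing digit whose box (133..149) starts inside page_ref,
# but whose center (141) lies just past the column's right edge.
digit = {"text": "9", "left": 133, "width": 16}

print(assign_by_center(digit, columns))   # center 141 is in neither column
print(assign_by_overlap(digit, columns))  # 7 px overlap with page_ref, 0 with the next
```

In this simplified layout the center-point rule fails to place the digit in page_ref because its center lies just outside the column. The overlap rule still attaches it to the column that covers most of its bounding box.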
--- a/klausur-service/backend/ocr_pipeline_api.py
+++ b/klausur-service/backend/ocr_pipeline_api.py
@@ -1291,6 +1291,18 @@ async def _word_stream_generator(
    if columns_meta is None:
        columns_meta = []

+    # Post-OCR: remove rows where ALL cells are empty (inter-row gaps
+    # that had stray Tesseract artifacts giving word_count > 0).
+    rows_with_text: set = set()
+    for c in all_cells:
+        if c.get("text", "").strip():
+            rows_with_text.add(c["row_index"])
+    before_filter = len(all_cells)
+    all_cells = [c for c in all_cells if c["row_index"] in rows_with_text]
+    empty_rows_removed = (before_filter - len(all_cells)) // max(n_cols, 1)
+    if empty_rows_removed > 0:
+        logger.info(f"SSE: removed {empty_rows_removed} all-empty rows after OCR")
+
    used_engine = all_cells[0].get("ocr_engine", "tesseract") if all_cells else engine

    word_result = {