feat(ocr-pipeline): generic header/footer detection via projection gap analysis

Replace the trivial top_y/bottom_y threshold check with horizontal projection gap analysis that finds large whitespace gaps separating header/footer content from the main body. This correctly detects headers (e.g. "VOCABULARY" banners) and footers (page numbers) even when _find_content_bounds includes them in the content area.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:13:48 +01:00
parent a052f73de3
commit f615c5f66d
3 changed files with 233 additions and 23 deletions
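
For readers unfamiliar with the technique named in the commit message, the following is a minimal sketch of horizontal projection gap analysis. It is not the code from this commit. It assumes `inv` is an inverted binary page image (ink pixels non-zero) given as a list of rows, and that `top_y`/`bottom_y` are the content bounds. The function names and the `min_gap_frac` and `zone_frac` parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def row_projection(inv: List[List[int]]) -> List[int]:
    """Horizontal projection profile: number of ink pixels in each row."""
    return [sum(1 for px in row if px) for row in inv]


def find_gaps(profile: List[int], top_y: int, bottom_y: int) -> List[Tuple[int, int]]:
    """Return (start, end) runs of ink-free rows in [top_y, bottom_y); end is exclusive."""
    gaps = []
    start = None
    for y in range(top_y, bottom_y):
        if profile[y] == 0:
            if start is None:
                start = y  # a new whitespace run begins here
        elif start is not None:
            gaps.append((start, y))  # run ended at the first inked row
            start = None
    if start is not None:
        gaps.append((start, bottom_y))  # run reaches the bottom bound
    return gaps


def detect_header_footer(
    inv: List[List[int]],
    top_y: int,
    bottom_y: int,
    min_gap_frac: float = 0.05,  # hypothetical: minimum gap height as a fraction of content height
    zone_frac: float = 0.25,     # hypothetical: how far from each edge a header/footer may extend
) -> Tuple[Optional[int], Optional[int]]:
    """Return (header_end, footer_start); either is None when no large gap is found.

    The header occupies rows [top_y, header_end) and the footer rows
    [footer_start, bottom_y).
    """
    profile = row_projection(inv)
    content_h = bottom_y - top_y
    min_gap = max(1, int(content_h * min_gap_frac))
    header_limit = top_y + content_h * zone_frac
    footer_limit = bottom_y - content_h * zone_frac

    header_end = None
    footer_start = None
    for start, end in find_gaps(profile, top_y, bottom_y):
        if end - start < min_gap:
            continue  # ordinary line spacing, not a separating gap
        # Header: first large gap that has ink above it and starts in the top zone.
        if header_end is None and top_y < start <= header_limit:
            header_end = start
        # Footer: last large gap that has ink below it and ends in the bottom zone.
        if footer_limit <= end < bottom_y:
            footer_start = end
    return header_end, footer_start


# Synthetic 100-row page: 5 header rows, a 15-row gap, 60 body rows,
# a 12-row gap, 3 footer rows, then 5 blank rows at the bottom.
INK, BLANK = [1] * 10, [0] * 10
page = [INK] * 5 + [BLANK] * 15 + [INK] * 60 + [BLANK] * 12 + [INK] * 3 + [BLANK] * 5
result = detect_header_footer(page, top_y=0, bottom_y=100)  # (5, 92)
```

In this sketch the trailing blank rows are ignored because no ink follows them, so they cannot separate a footer from the body. A real implementation would probably compute the profile with a vectorised row sum over an image array and tune the thresholds to the scan resolution.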
--- a/klausur-service/backend/ocr_pipeline_api.py
+++ b/klausur-service/backend/ocr_pipeline_api.py
@@ -700,7 +700,7 @@ async def detect_columns(session_id: str):

        # Phase B: Content-based classification
        regions = classify_column_types(geometries, content_w, top_y, w, h, bottom_y,
-                                        left_x=left_x, right_x=right_x)
+                                        left_x=left_x, right_x=right_x, inv=inv)

    duration = time.time() - t0