fix(ocr-pipeline): skip edge-touching gaps in header/footer detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s

Gaps that extend to the image boundary (top/bottom edge) are not valid
content separators — they typically represent dewarp padding. Only gaps
with content on both sides qualify as header/footer boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-03-02 17:54:49 +01:00
parent f1fcc67357
commit 0532b2a797
2 changed files with 18 additions and 0 deletions

View File

@@ -2593,6 +2593,9 @@ def _detect_header_footer_gaps(
large_gap_threshold = median_gap * GAP_MULTIPLIER
# Step 6: Find largest qualifying gap in header / footer zones
# A separator gap must have content on BOTH sides — edge-touching gaps
# (e.g. dewarp padding at bottom) are not valid separators.
EDGE_MARGIN = max(5, actual_h // 400)
header_zone_limit = int(actual_h * HEADER_FOOTER_ZONE)
footer_zone_start = int(actual_h * (1.0 - HEADER_FOOTER_ZONE))
@@ -2601,6 +2604,8 @@ def _detect_header_footer_gaps(
best_header_size = 0
for gs, ge in raw_gaps:
if gs <= EDGE_MARGIN:
continue # skip gaps touching the top edge
gap_mid = (gs + ge) / 2
gap_size = ge - gs
if gap_mid < header_zone_limit and gap_size > large_gap_threshold:
@@ -2610,6 +2615,8 @@ def _detect_header_footer_gaps(
best_footer_size = 0
for gs, ge in raw_gaps:
if ge >= actual_h - EDGE_MARGIN:
continue # skip gaps touching the bottom edge
gap_mid = (gs + ge) / 2
gap_size = ge - gs
if gap_mid > footer_zone_start and gap_size > large_gap_threshold: