fix: border ghost filter + row overlap fix for box zones

1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts
   like | sitting on box borders before row/column clustering.
   The tall | (h=55) was inflating row 0's y_max, causing row overlap.

2. Fix _assign_word_to_row() to prefer closest y_center when rows
   overlap, instead of always returning the first matching row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-17 09:54:50 +01:00
parent 43b1f8be58
commit febd0a2f84
2 changed files with 71 additions and 5 deletions
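
Item 1 of the commit message describes a border-ghost filter, but its implementation is not part of the hunk shown here. The following is only a minimal sketch of how such a filter might work, not the actual `_filter_border_ghosts()`. The function name `filter_border_ghosts_sketch`, the `GHOST_CHARS` set, the `tol` parameter, the box keys `x`/`y`/`w`/`h`, and the word keys `text`/`left`/`width` are all assumptions made for illustration; only `top` and `height` appear in the diff.

```python
from typing import Dict, List

# Characters OCR tends to emit for box borders (assumed set, not from the commit).
GHOST_CHARS = set("|-_=[]")


def filter_border_ghosts_sketch(
    words: List[Dict], box: Dict, tol: float = 5.0
) -> List[Dict]:
    """Drop words made only of border-like characters that sit on a box edge.

    Hypothetical stand-in for _filter_border_ghosts(); the key names and the
    tolerance are illustrative assumptions.
    """
    left, top = box['x'], box['y']
    right, bottom = left + box['w'], top + box['h']

    kept = []
    for w in words:
        text = w['text'].strip()
        # Only pure border glyphs qualify; real text is never dropped.
        is_ghost_text = bool(text) and all(ch in GHOST_CHARS for ch in text)

        x_center = w['left'] + w['width'] / 2
        y_center = w['top'] + w['height'] / 2
        # The word's center lies within `tol` pixels of one of the four edges.
        near_edge = (
            abs(x_center - left) <= tol
            or abs(x_center - right) <= tol
            or abs(y_center - top) <= tol
            or abs(y_center - bottom) <= tol
        )

        if is_ghost_text and near_edge:
            continue
        kept.append(w)
    return kept


box = {'x': 100, 'y': 5, 'w': 400, 'h': 200}
words = [
    {'text': '|', 'left': 98, 'top': 10, 'width': 4, 'height': 55},      # tall ghost
    {'text': 'Hello', 'left': 120, 'top': 20, 'width': 60, 'height': 18},
]
filtered = filter_border_ghosts_sketch(words, box)
```

Running the filter before row clustering means the tall `|` (h=55) never contributes to a row's `y_max`, which is the inflation the commit message describes.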


@@ -134,12 +134,16 @@ def _assign_word_to_column(word: Dict, columns: List[Dict]) -> int:
 def _assign_word_to_row(word: Dict, rows: List[Dict]) -> int:
-    """Return row index for a word based on its Y-center."""
+    """Return row index for a word based on its Y-center.
+
+    When rows overlap (e.g. due to tall border-ghost characters inflating
+    a row's y_max), prefer the row whose y_center is closest.
+    """
     y_center = word['top'] + word['height'] / 2
-    # Find the row whose y_range contains this word's center
-    for row in rows:
-        if row['y_min'] <= y_center <= row['y_max']:
-            return row['index']
+    # Find all rows whose y_range contains this word's center
+    matching = [r for r in rows if r['y_min'] <= y_center <= r['y_max']]
+    if matching:
+        return min(matching, key=lambda r: abs(r['y_center'] - y_center))['index']
     # Fallback: nearest row by Y-center
     return min(rows, key=lambda r: abs(r['y_center'] - y_center))['index']
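
To illustrate the behavior change in this hunk, here is a small self-contained comparison of the old first-match logic and the new closest-center logic. The function names `assign_first_match` and `assign_closest` and the sample row values are made up for the example. It also assumes that a row's `y_center` is not simply the midpoint of `y_min` and `y_max`, so a single tall glyph can stretch `y_max` without moving `y_center` much.

```python
from typing import Dict, List


def assign_first_match(word: Dict, rows: List[Dict]) -> int:
    """Old behavior: return the first row whose y-range contains the word center."""
    y_center = word['top'] + word['height'] / 2
    for row in rows:
        if row['y_min'] <= y_center <= row['y_max']:
            return row['index']
    # Fallback: nearest row by Y-center
    return min(rows, key=lambda r: abs(r['y_center'] - y_center))['index']


def assign_closest(word: Dict, rows: List[Dict]) -> int:
    """New behavior: among all containing rows, pick the closest y_center."""
    y_center = word['top'] + word['height'] / 2
    matching = [r for r in rows if r['y_min'] <= y_center <= r['y_max']]
    if matching:
        return min(matching, key=lambda r: abs(r['y_center'] - y_center))['index']
    # Fallback: nearest row by Y-center
    return min(rows, key=lambda r: abs(r['y_center'] - y_center))['index']


# Illustrative rows: row 0's y_max is inflated to 65 by a tall ghost glyph,
# so its range overlaps row 1 (40..60).
rows = [
    {'index': 0, 'y_min': 10, 'y_max': 65, 'y_center': 24},
    {'index': 1, 'y_min': 40, 'y_max': 60, 'y_center': 50},
]
word = {'top': 42, 'height': 16}  # word center at y = 50

old_row = assign_first_match(word, rows)
new_row = assign_closest(word, rows)
```

With these sample values the word's center falls inside both ranges. The old logic stops at row 0 because it comes first, while the new logic compares distances to each `y_center` and picks row 1.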