fix(ocr-pipeline): improve page crop spine detection and cell assignment
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s

1. page_crop: Score all dark runs by center-proximity × darkness ×
   narrowness instead of picking the widest. Fixes ad810209 where a
   wide dark area at 35% was chosen over the actual spine at 50%.

2. cv_words_first: Replace x-center-only word→column assignment with
   overlap-based three-pass strategy (overlap → midpoint-range → nearest).
   Fixes truncated German translations like "Schal" instead of
   "Schal - die Schals" in session 079cd0d9.
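
The scoring described in item 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Run` tuple layout, the names `score_run` and `pick_spine`, and the linear form of each factor are hypothetical and not taken from page_crop. Only the idea of multiplying center-proximity, darkness and narrowness comes from the commit message.

```python
from typing import List, Tuple

# Hypothetical run layout: (x_start, x_end, mean darkness in [0, 1]).
Run = Tuple[int, int, float]


def score_run(run: Run, page_width: int) -> float:
    """Score a dark run; a higher score means more spine-like."""
    x_start, x_end, darkness = run
    center = (x_start + x_end) / 2
    half_width = page_width / 2
    # 1.0 at the page center, falling linearly to 0.0 at either edge.
    center_proximity = 1.0 - abs(center - half_width) / half_width
    # Close to 1.0 for thin runs, close to 0.0 for runs spanning the page.
    narrowness = 1.0 - (x_end - x_start) / page_width
    return center_proximity * darkness * narrowness


def pick_spine(runs: List[Run], page_width: int) -> Run:
    """Return the run with the highest combined score."""
    return max(runs, key=lambda r: score_run(r, page_width))


# Illustrative data: a wide dark area around 35% and a thin spine at 50%.
page_width = 1000
wide_area: Run = (250, 450, 0.8)
spine: Run = (490, 510, 0.9)
runs = [wide_area, spine]

widest = max(runs, key=lambda r: r[1] - r[0])  # old behaviour: pick the widest
chosen = pick_spine(runs, page_width)          # new behaviour: pick the best score
```

With these made-up numbers the widest-run rule selects the area at 35%, while the product score prefers the thin, centered run, which is the failure mode the commit message describes.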

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-24 09:23:30 +01:00
parent 9d34c5201e
commit 2a21127f01
3 changed files with 193 additions and 15 deletions


@@ -124,13 +124,43 @@ def _cluster_rows(
 # ---------------------------------------------------------------------------

 def _assign_word_to_column(word: Dict, columns: List[Dict]) -> int:
-    """Return column index for a word based on its X-center."""
-    x_center = word['left'] + word['width'] / 2
+    """Return column index for a word based on overlap, then center, then nearest.
+
+    Three-pass strategy (consistent with _assign_row_words_to_columns):
+      1. Overlap-based: assign to column with maximum horizontal overlap.
+      2. Midpoint-range: if no overlap, use midpoints between adjacent columns.
+      3. Nearest center: last resort fallback.
+    """
+    w_left = word['left']
+    w_right = w_left + word['width']
+    w_center = w_left + word['width'] / 2
+
+    # Pass 1: overlap-based
+    best_col = -1
+    best_overlap = 0
     for col in columns:
-        if col['x_min'] <= x_center < col['x_max']:
+        overlap = max(0, min(w_right, col['x_max']) - max(w_left, col['x_min']))
+        if overlap > best_overlap:
+            best_overlap = overlap
+            best_col = col['index']
+    if best_col >= 0 and best_overlap > 0:
+        return best_col
+
+    # Pass 2: midpoint-range (non-overlapping assignment zones)
+    for ci, col in enumerate(columns):
+        if ci == 0:
+            assign_left = 0
+        else:
+            assign_left = (columns[ci - 1]['x_max'] + col['x_min']) / 2
+        if ci == len(columns) - 1:
+            assign_right = float('inf')
+        else:
+            assign_right = (col['x_max'] + columns[ci + 1]['x_min']) / 2
+        if assign_left <= w_center < assign_right:
             return col['index']
-    # Fallback: nearest column
-    return min(columns, key=lambda c: abs((c['x_min'] + c['x_max']) / 2 - x_center))['index']
+
+    # Pass 3: nearest column center
+    return min(columns, key=lambda c: abs((c['x_min'] + c['x_max']) / 2 - w_center))['index']


 def _assign_word_to_row(word: Dict, rows: List[Dict]) -> int: