fix(ocr-pipeline): improve page crop spine detection and cell assignment
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
1. page_crop: Score all dark runs by center-proximity × darkness × narrowness
   instead of picking the widest. Fixes ad810209, where a wide dark area at 35%
   was chosen over the actual spine at 50%.

2. cv_words_first: Replace x-center-only word→column assignment with an
   overlap-based three-pass strategy (overlap → midpoint-range → nearest).
   Fixes truncated German translations like "Schal" instead of
   "Schal - die Schals" in session 079cd0d9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
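The page_crop change itself is not part of the hunk shown here, so the following is only a minimal sketch of what scoring by center-proximity × darkness × narrowness could look like. The names `score_run`, `pick_spine`, and `pick_widest`, the run dictionary keys, the linear formulas, and the sample numbers are all assumptions made for illustration; they are not taken from the repository.

```python
from typing import Dict, List


def score_run(run: Dict, page_width: int) -> float:
    """Score one dark run; higher means more spine-like (hypothetical formula)."""
    width = run['end'] - run['start']
    center = (run['start'] + run['end']) / 2
    # 1.0 at the page center, falling linearly to 0.0 at either page edge.
    center_proximity = 1.0 - abs(center / page_width - 0.5) / 0.5
    # Close to 1.0 for a thin run, 0.0 for a run spanning the whole page.
    narrowness = 1.0 - width / page_width
    return center_proximity * run['darkness'] * narrowness


def pick_spine(runs: List[Dict], page_width: int) -> Dict:
    """Return the run with the highest combined score."""
    return max(runs, key=lambda r: score_run(r, page_width))


def pick_widest(runs: List[Dict]) -> Dict:
    """Widest-run selection, the strategy the commit message says was replaced."""
    return max(runs, key=lambda r: r['end'] - r['start'])


# Made-up runs on a 1000 px wide scan: a wide dark area centered at 35%
# and a narrow spine centered at 50%. 'darkness' is assumed to be in [0, 1].
wide_area = {'start': 250, 'end': 450, 'darkness': 0.8}
spine = {'start': 490, 'end': 510, 'darkness': 0.7}
runs = [wide_area, spine]

print(pick_widest(runs) is wide_area)    # widest-run selection picks the wide area
print(pick_spine(runs, 1000) is spine)   # combined score prefers the central, narrow run
```

With these sample numbers the wide area scores 0.7 × 0.8 × 0.8 and the spine scores 1.0 × 0.7 × 0.98, so the narrow central run wins even though it is slightly lighter.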
@@ -124,13 +124,43 @@ def _cluster_rows(
 # ---------------------------------------------------------------------------
 
 def _assign_word_to_column(word: Dict, columns: List[Dict]) -> int:
-    """Return column index for a word based on its X-center."""
-    x_center = word['left'] + word['width'] / 2
+    """Return column index for a word based on overlap, then center, then nearest.
+
+    Three-pass strategy (consistent with _assign_row_words_to_columns):
+    1. Overlap-based: assign to column with maximum horizontal overlap.
+    2. Midpoint-range: if no overlap, use midpoints between adjacent columns.
+    3. Nearest center: last resort fallback.
+    """
+    w_left = word['left']
+    w_right = w_left + word['width']
+    w_center = w_left + word['width'] / 2
+
+    # Pass 1: overlap-based
+    best_col = -1
+    best_overlap = 0
     for col in columns:
-        if col['x_min'] <= x_center < col['x_max']:
+        overlap = max(0, min(w_right, col['x_max']) - max(w_left, col['x_min']))
+        if overlap > best_overlap:
+            best_overlap = overlap
+            best_col = col['index']
+    if best_col >= 0 and best_overlap > 0:
+        return best_col
+
+    # Pass 2: midpoint-range (non-overlapping assignment zones)
+    for ci, col in enumerate(columns):
+        if ci == 0:
+            assign_left = 0
+        else:
+            assign_left = (columns[ci - 1]['x_max'] + col['x_min']) / 2
+        if ci == len(columns) - 1:
+            assign_right = float('inf')
+        else:
+            assign_right = (col['x_max'] + columns[ci + 1]['x_min']) / 2
+        if assign_left <= w_center < assign_right:
             return col['index']
-    # Fallback: nearest column
-    return min(columns, key=lambda c: abs((c['x_min'] + c['x_max']) / 2 - x_center))['index']
+
+    # Pass 3: nearest column center
+    return min(columns, key=lambda c: abs((c['x_min'] + c['x_max']) / 2 - w_center))['index']
 
 
 def _assign_word_to_row(word: Dict, rows: List[Dict]) -> int:
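To make the behavioral difference in the hunk above concrete, here is a small self-contained comparison of the old x-center rule and the new three-pass rule. The function bodies follow the diff; the names `assign_by_center` and `assign_by_overlap`, the column layout, and the word boxes are hypothetical values chosen for illustration, not data from the sessions named in the commit message.

```python
from typing import Dict, List


def assign_by_center(word: Dict, columns: List[Dict]) -> int:
    """Pre-change rule: x-center inside a column, else nearest column center."""
    x_center = word['left'] + word['width'] / 2
    for col in columns:
        if col['x_min'] <= x_center < col['x_max']:
            return col['index']
    return min(columns, key=lambda c: abs((c['x_min'] + c['x_max']) / 2 - x_center))['index']


def assign_by_overlap(word: Dict, columns: List[Dict]) -> int:
    """Post-change rule: overlap, then midpoint range, then nearest center."""
    w_left = word['left']
    w_right = w_left + word['width']
    w_center = w_left + word['width'] / 2

    # Pass 1: column with the largest horizontal overlap.
    best_col, best_overlap = -1, 0
    for col in columns:
        overlap = max(0, min(w_right, col['x_max']) - max(w_left, col['x_min']))
        if overlap > best_overlap:
            best_overlap, best_col = overlap, col['index']
    if best_col >= 0:
        return best_col

    # Pass 2: zones bounded by the midpoints of the gaps between columns.
    for ci, col in enumerate(columns):
        left = 0 if ci == 0 else (columns[ci - 1]['x_max'] + col['x_min']) / 2
        last = ci == len(columns) - 1
        right = float('inf') if last else (col['x_max'] + columns[ci + 1]['x_min']) / 2
        if left <= w_center < right:
            return col['index']

    # Pass 3: nearest column center.
    return min(columns, key=lambda c: abs((c['x_min'] + c['x_max']) / 2 - w_center))['index']


# Hypothetical layout: a narrow column 0 and a wide column 1 separated by a gap.
columns = [
    {'index': 0, 'x_min': 50, 'x_max': 150},
    {'index': 1, 'x_min': 300, 'x_max': 700},
]
# A word that starts in the gap and reaches 10 px into column 1.
straddling = {'left': 180, 'width': 130}
# A word that lies entirely in the gap and overlaps neither column.
in_gap = {'left': 160, 'width': 40}

print(assign_by_center(straddling, columns))   # 0: nearest column center wins
print(assign_by_overlap(straddling, columns))  # 1: the only column it overlaps
print(assign_by_overlap(in_gap, columns))      # 0: pass 2, center 180 is left of midpoint 225
```

The straddling word is the kind of case where a center-only rule can send text to the wrong column. Its center (245) lies in the gap and is closer to the narrow column's center, yet the box physically overlaps only the wide column.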