fix(ocr-pipeline): use PSM 6 (block) for multi-line cell OCR in word grid

PSM 7 (single line) dropped the second line in two-line cells.
PSM 6 (single block) handles multi-line content. Also sort words by
Y, then X, so the joined text follows reading order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-02-28 09:40:04 +01:00
parent 491df4e1b0
commit 356d39d6ee

@@ -2264,10 +2264,10 @@ def build_word_grid(
 )
 cell_lang = lang_map.get(col.type, lang)
-words = ocr_region(ocr_img, cell_region, lang=cell_lang, psm=7)
-# Sort words by x position, join to text
-words.sort(key=lambda w: w['left'])
+words = ocr_region(ocr_img, cell_region, lang=cell_lang, psm=6)
+# Sort words by Y then X (reading order for multi-line cells)
+words.sort(key=lambda w: (w['top'], w['left']))
 text = ' '.join(w['text'] for w in words)
 if words:
     avg_conf = sum(w['conf'] for w in words) / len(words)
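
Note on the Y-then-X sort: sorting on the exact `top` value can misorder words that sit on the same visual line when their `top` values differ by a pixel or two. A minimal sketch of a tolerance-based alternative follows. It is not part of this commit: the helper name `sort_reading_order` and the `line_tol_px` default are hypothetical, and it assumes only the `top`, `left`, and `text` keys visible in the diff.

```python
def sort_reading_order(words, line_tol_px=8):
    """Return word dicts in reading order, tolerating small vertical jitter.

    Hypothetical helper, not from the commit. A word joins the current
    line when its 'top' is within line_tol_px of that line's first word.
    """
    lines = []
    for w in sorted(words, key=lambda w: w['top']):
        if lines and abs(w['top'] - lines[-1][0]['top']) <= line_tol_px:
            lines[-1].append(w)   # same visual line
        else:
            lines.append([w])     # start a new line
    # Within each line, order words left to right.
    return [w for line in lines for w in sorted(line, key=lambda w: w['left'])]


# Illustrative data: 'world' sits one pixel higher than 'hello'.
words = [
    {'text': 'world', 'top': 9, 'left': 60},
    {'text': 'hello', 'top': 10, 'left': 5},
    {'text': 'line', 'top': 40, 'left': 50},
    {'text': 'second', 'top': 41, 'left': 5},
]
ordered_text = ' '.join(w['text'] for w in sort_reading_order(words))
naive_text = ' '.join(
    w['text'] for w in sorted(words, key=lambda w: (w['top'], w['left']))
)
```

With this sample data the plain `(top, left)` key puts `world` before `hello`, while the grouped sort keeps each line left to right. The tolerance would need tuning to the cell height and image resolution in the real pipeline.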