fix: cache word_result in paddle_kombi/rapid_kombi for detect-structure

Both kombi OCR functions wrote word_result to the DB but not to the
in-memory cache. When detect-structure ran next, it found no words in
the cache and passed an empty list to graphic detection, which made
every word-overlap heuristic ineffective. As a result, green text
words were wrongly classified as graphic regions.

Also adds a fallback in detect-structure that uses the raw OCR word
lists (raw_paddle_words_split, raw_tesseract_words, raw_paddle_words,
in that order) when the cell word_boxes are empty.
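
The fallback order can be sketched as a standalone helper. This is an
illustration under assumptions, not the code in this commit: the
function name collect_words and its (words, source) return value are
hypothetical; only the three raw_* keys and the cells/word_boxes layout
come from the diff.

```python
# Raw OCR word lists, tried in this order when no cell has word boxes.
RAW_WORD_KEYS = (
    "raw_paddle_words_split",
    "raw_tesseract_words",
    "raw_paddle_words",
)


def collect_words(word_result):
    """Return (words, source): cell word boxes first, raw lists as fallback."""
    if not word_result:
        return [], None

    words = []
    for cell in word_result.get("cells") or []:
        words.extend(cell.get("word_boxes") or [])
    if words:
        return words, "cells"

    # Fallback: first non-empty raw list wins.
    for key in RAW_WORD_KEYS:
        raw = word_result.get(key) or []
        if raw:
            return list(raw), key

    return [], None
```

Returning the source key alongside the words makes it easy to log which
list was used, as the added logger.info call does.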

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-18 07:29:02 +01:00
parent a25214126d
commit 5359a4cc2b

@@ -1377,6 +1377,14 @@ async def detect_structure(session_id: str):
         for cell in word_result["cells"]:
             for wb in (cell.get("word_boxes") or []):
                 words.append(wb)
+    # Fallback: use raw OCR words if cell word_boxes are empty
+    if not words and word_result:
+        for key in ("raw_paddle_words_split", "raw_tesseract_words", "raw_paddle_words"):
+            raw = word_result.get(key, [])
+            if raw:
+                words = raw
+                logger.info("detect-structure: using %d words from %s (no cell word_boxes)", len(words), key)
+                break
     # If no words yet, use image dimensions with small margin
     if words:
         content_x = max(0, min(int(wb["left"]) for wb in words))
@@ -3529,6 +3537,7 @@ async def paddle_kombi(session_id: str):
         cropped_png=img_png,
         current_step=8,
     )
+    cached["word_result"] = word_result
     logger.info(
         "paddle_kombi session %s: %d cells (%d rows, %d cols) in %.2fs "
@@ -3665,6 +3674,7 @@ async def rapid_kombi(session_id: str):
         cropped_png=img_png,
         current_step=8,
     )
+    cached["word_result"] = word_result
     logger.info(
         "rapid_kombi session %s: %d cells (%d rows, %d cols) in %.2fs "