debug: Logging for sub-session word detection

Logs low-confidence words (conf<30) and the cell contents per row,
to diagnose missing euro/pound amounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-03-10 09:31:34 +01:00
parent f65bd11919
commit 3a791179af
2 changed files with 23 additions and 7 deletions

@@ -1248,7 +1248,17 @@ async def detect_columns(session_id: str):
                     'width': int(data['width'][i]),
                     'height': int(data['height'][i]),
                 })
-        logger.info(f"OCR Pipeline: sub-session {session_id}: Tesseract found {len(word_dicts)} words")
+        # Log all words including low-confidence ones for debugging
+        all_count = sum(1 for i in range(len(data['text']))
+                        if str(data['text'][i]).strip())
+        low_conf = [(str(data['text'][i]).strip(), int(data['conf'][i]) if str(data['conf'][i]).lstrip('-').isdigit() else -1)
+                    for i in range(len(data['text']))
+                    if str(data['text'][i]).strip()
+                    and (int(data['conf'][i]) if str(data['conf'][i]).lstrip('-').isdigit() else -1) < 30
+                    and (int(data['conf'][i]) if str(data['conf'][i]).lstrip('-').isdigit() else -1) >= 0]
+        if low_conf:
+            logger.info(f"OCR Pipeline: sub-session {session_id}: {len(low_conf)} words below conf 30: {low_conf[:20]}")
+        logger.info(f"OCR Pipeline: sub-session {session_id}: Tesseract found {len(word_dicts)}/{all_count} words (conf>=30)")
     except Exception as e:
         logger.warning(f"OCR Pipeline: sub-session {session_id}: Tesseract failed: {e}")
         word_dicts = []
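
The hunk above parses each confidence value three times inside one comprehension. A minimal sketch of the same counting logic with a single parse per word might look like the following. It assumes `data` is a dict of parallel lists with `'text'` and `'conf'` keys, as the diff suggests; the names `parse_conf` and `summarize_words` are hypothetical and do not appear in the commit.

```python
from typing import Any, Dict, List, Tuple


def parse_conf(raw: Any) -> int:
    """Return the confidence as an int, or -1 if it is not integer-like.

    Mirrors the check used in the diff: anything that is not an optional
    minus sign followed by digits (for example a float such as 29.5)
    is treated as -1.
    """
    text = str(raw)
    return int(text) if text.lstrip('-').isdigit() else -1


def summarize_words(
    data: Dict[str, List[Any]], threshold: int = 30
) -> Tuple[int, List[Tuple[str, int]]]:
    """Count non-empty words and collect those with 0 <= conf < threshold."""
    all_count = 0
    low_conf: List[Tuple[str, int]] = []
    for raw_text, raw_conf in zip(data['text'], data['conf']):
        word = str(raw_text).strip()
        if not word:
            continue  # skip empty or whitespace-only entries
        all_count += 1
        conf = parse_conf(raw_conf)
        if 0 <= conf < threshold:
            low_conf.append((word, conf))
    return all_count, low_conf


# Hypothetical sample data, for illustration only
sample = {
    'text': ['Hello', ' ', '£12', 'x', 'y'],
    'conf': ['95', '-1', '12', '-1', 29.5],
}
all_count, low_conf = summarize_words(sample)
```

With a helper like this, the two log calls in the hunk could use `all_count` and `low_conf[:20]` unchanged. Words whose confidence cannot be parsed are excluded from the low-confidence list, the same as in the committed code.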