fix: sub-session row detection (cache Tesseract + inv in the column step)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 20s
Until now, _word_dicts, _inv and _content_bounds were not cached for sub-sessions, so detect_rows fell back to detect_column_geometry(). That fallback could fail on small box images with fewer than 5 words. Tesseract and binarization now run directly in the pseudo-column block, and the intermediates are cached. This commit also adds detailed comments on row detection (detect_row_geometry, _regularize_row_grid).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
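
The cache-then-fallback behaviour described in the message can be sketched as follows. This is a minimal illustration, not the project's detect_rows: the helper name `row_detection_inputs`, the `fallback` callable, and the reading of `_content_bounds` as (left, right, top, bottom) are assumptions. Only the three cache keys come from the commit.

```python
from typing import Any, Callable, Dict, List, Tuple

# Assumed layout of _content_bounds; the diff only shows the tuple (0, w, 0, h).
Bounds = Tuple[int, int, int, int]
RowInputs = Tuple[List[Dict[str, Any]], Any, Bounds]

# Cache keys named in the commit message.
_KEYS = ("_word_dicts", "_inv", "_content_bounds")


def row_detection_inputs(
    cached: Dict[str, Any],
    fallback: Callable[[], RowInputs],
) -> RowInputs:
    """Return (word_dicts, inv, content_bounds), preferring cached values.

    Hypothetical helper: `fallback` stands in for recomputing the
    intermediates (for example via column geometry detection).
    """
    # Test for key presence, not truthiness: an empty word list is still a
    # valid cached result (Tesseract ran but found nothing usable).
    if all(key in cached for key in _KEYS):
        return cached["_word_dicts"], cached["_inv"], cached["_content_bounds"]
    return fallback()
```

Checking key presence means a sub-session whose OCR returned no words still counts as a cache hit and never reaches the fallback that needs several words to succeed.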
@@ -1209,10 +1209,55 @@ async def detect_columns(session_id: str):
     if img_bgr is None:
         raise HTTPException(status_code=400, detail="Crop or dewarp must be completed before column detection")
 
-    # Sub-sessions: skip column detection, create single pseudo-column
+    # -----------------------------------------------------------------------
+    # Sub-sessions (box crops): skip column detection entirely.
+    # Instead, create a single pseudo-column spanning the full image width.
+    # Also run Tesseract + binarization here so that the row detection step
+    # can reuse the cached intermediates (_word_dicts, _inv, _content_bounds)
+    # instead of falling back to detect_column_geometry() which may fail
+    # on small box images with < 5 words.
+    # -----------------------------------------------------------------------
     session = await get_session_db(session_id)
     if session and session.get("parent_session_id"):
         h, w = img_bgr.shape[:2]
+
+        # Binarize + invert for row detection (horizontal projection profile)
+        ocr_img = create_ocr_image(img_bgr)
+        inv = cv2.bitwise_not(ocr_img)
+
+        # Run Tesseract to get word bounding boxes.
+        # Word positions are relative to the full image (no ROI crop needed
+        # because the sub-session image IS the cropped box already).
+        # detect_row_geometry expects word positions relative to content ROI,
+        # so with content_bounds = (0, w, 0, h) the coordinates are correct.
+        try:
+            from PIL import Image as PILImage
+            pil_img = PILImage.fromarray(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
+            import pytesseract
+            data = pytesseract.image_to_data(pil_img, lang='eng+deu', output_type=pytesseract.Output.DICT)
+            word_dicts = []
+            for i in range(len(data['text'])):
+                conf = int(data['conf'][i]) if str(data['conf'][i]).lstrip('-').isdigit() else -1
+                text = str(data['text'][i]).strip()
+                if conf < 30 or not text:
+                    continue
+                word_dicts.append({
+                    'text': text, 'conf': conf,
+                    'left': int(data['left'][i]),
+                    'top': int(data['top'][i]),
+                    'width': int(data['width'][i]),
+                    'height': int(data['height'][i]),
+                })
+            logger.info(f"OCR Pipeline: sub-session {session_id}: Tesseract found {len(word_dicts)} words")
+        except Exception as e:
+            logger.warning(f"OCR Pipeline: sub-session {session_id}: Tesseract failed: {e}")
+            word_dicts = []
+
+        # Cache intermediates for row detection (detect_rows reuses these)
+        cached["_word_dicts"] = word_dicts
+        cached["_inv"] = inv
+        cached["_content_bounds"] = (0, w, 0, h)
+
         column_result = {
             "columns": [{
                 "type": "column_text",
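
The hunk mentions that the inverted binary image feeds a horizontal projection profile for row detection. The sketch below illustrates that general technique only. It is not the repository's detect_row_geometry, whose body is not shown here, and the function name `find_text_rows` and the `min_ink_ratio` threshold are hypothetical. It accepts any 2D sequence of pixel rows (nested lists or a NumPy array) in which text pixels are non-zero, as they would be after inversion.

```python
from typing import List, Sequence, Tuple


def find_text_rows(
    inv: Sequence[Sequence[int]],
    min_ink_ratio: float = 0.01,
) -> List[Tuple[int, int]]:
    """Return (top, bottom) bands of image rows that contain ink.

    `bottom` is exclusive. A row counts as text when the fraction of
    non-zero pixels in it reaches `min_ink_ratio` (an assumed threshold).
    """
    bands: List[Tuple[int, int]] = []
    start = None  # top of the band currently being tracked, if any
    for y, row in enumerate(inv):
        # Horizontal projection: share of ink pixels in this image row.
        ink = sum(1 for px in row if px) / max(len(row), 1)
        if ink >= min_ink_ratio:
            if start is None:
                start = y  # rising edge: a text band begins
        elif start is not None:
            bands.append((start, y))  # falling edge: the band ends
            start = None
    if start is not None:
        # The last band touches the bottom edge of the image.
        bands.append((start, len(inv)))
    return bands
```

For a box crop with two lines of text, this returns two bands whose gaps correspond to the whitespace between lines, and it needs no word boxes at all.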