fix(ocr-pipeline): cells = result, no post-processing content shuffling

The cell grid IS the result. Each cell stays at its detected position.

Removed _split_comma_entries and _attach_example_sentences from the pipeline. They were shuffling content between rows/columns, causing "Mäuse" to appear in a separate row, "stand..." to move to Example, and "Ei" to disappear.

Now: cells → _cells_to_vocab_entries (1:1 row mapping) → _fix_character_confusion → _fix_phonetic_brackets → done.

Also lowered the pixel-density threshold from 2% to 0.5% for the cell-OCR fallback so small text like "Ei" is not filtered out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
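For illustration, the effect of the threshold change can be sketched in isolation. This is a minimal sketch, not the pipeline code: the helper names `dark_ratio` and `should_run_fallback`, the constant names, and the synthetic cell image are assumptions made for this example. Only the grayscale cutoff of 180 and the two ratios (0.02 and 0.005) come from the diff.

```python
import numpy as np

DARK_PIXEL_LEVEL = 180   # grayscale cutoff used in the diff
OLD_MIN_RATIO = 0.02     # previous threshold (2%)
NEW_MIN_RATIO = 0.005    # threshold after this commit (0.5%)


def dark_ratio(crop: np.ndarray) -> float:
    """Fraction of pixels darker than DARK_PIXEL_LEVEL (hypothetical helper)."""
    if crop.size == 0:
        return 0.0
    return float(np.count_nonzero(crop < DARK_PIXEL_LEVEL)) / crop.size


def should_run_fallback(crop: np.ndarray, min_ratio: float = NEW_MIN_RATIO) -> bool:
    """Decide whether the cell has enough ink to justify the OCR fallback."""
    return dark_ratio(crop) > min_ratio


# Synthetic 40x200 white cell with a small dark patch standing in for
# a short word: 6 * 12 = 72 dark pixels out of 8000.
cell = np.full((40, 200), 255, dtype=np.uint8)
cell[17:23, 10:22] = 0

ratio = dark_ratio(cell)
passes_old = should_run_fallback(cell, OLD_MIN_RATIO)
passes_new = should_run_fallback(cell, NEW_MIN_RATIO)
```

In this synthetic case the ink covers under 2% of the cell but more than 0.5%, so the old threshold would skip the fallback while the new one runs it. Real cells will differ in size and ink coverage, so the numbers here are only indicative.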
2026-03-02 09:41:30 +01:00
parent e3f939a628
commit 72cc77dcf4
2 changed files with 8 additions and 11 deletions
--- a/klausur-service/backend/cv_vocab_pipeline.py
+++ b/klausur-service/backend/cv_vocab_pipeline.py
@@ -3186,9 +3186,11 @@ def _ocr_single_cell(
        if ocr_img is not None:
            crop = ocr_img[cell_y:cell_y + cell_h, cell_x:cell_x + cell_w]
            if crop.size > 0:
-                # Threshold: pixels darker than 180 (on 0-255 grayscale)
+                # Threshold: pixels darker than 180 (on 0-255 grayscale).
+                # Use 0.5% to catch even small text like "Ei" (2 chars)
+                # in an otherwise empty cell.
                dark_ratio = float(np.count_nonzero(crop < 180)) / crop.size
-                _run_fallback = dark_ratio > 0.02
+                _run_fallback = dark_ratio > 0.005
    if _run_fallback:
        cell_region = PageRegion(
            type=col.type,