Fix IPA correction persistence and false-positive prefix matching

Step 5i was overwriting IPA-corrected text from Step 5c when reconstructing cells from word_boxes. Added an _ipa_corrected flag to preserve corrections.

Also tightened merged-token prefix matching (min prefix 4 chars, min suffix 3 chars) to prevent false positives like "sis" being extracted from "si:said".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 07:26:32 +01:00
parent 9ea217bdfc
commit c42924a94a
2 changed files with 20 additions and 4 deletions
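
The tightened bounds from the commit message can be illustrated with a small standalone sketch. It only mirrors the two `range(...)` expressions from the hunk. The helper names `candidate_prefix_lengths` and `extract_prefix`, the `TOY_IPA` dictionary (a stand-in for `_lookup_ipa`), and the cleaned token "sisaid" are assumptions made for illustration and are not part of cv_ocr_engines.py.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the real IPA dictionary behind _lookup_ipa.
TOY_IPA: Dict[str, str] = {"fiction": "ˈfɪkʃn", "sis": "sɪs"}


def candidate_prefix_lengths(token_len: int, tightened: bool = True) -> List[int]:
    """Prefix lengths tried for a cleaned token, longest first."""
    if tightened:
        # New rule: token >= 7 chars, prefix >= 4 chars, suffix >= 3 chars.
        if token_len < 7:
            return []
        return list(range(min(token_len - 3, 15), 3, -1))
    # Old rule: token >= 5 chars, prefix >= 3 chars, suffix may be empty.
    if token_len < 5:
        return []
    return list(range(min(token_len, 15), 2, -1))


def extract_prefix(
    clean: str, lookup: Dict[str, str], tightened: bool = True
) -> Optional[Tuple[str, str]]:
    """Return (prefix, ipa) for the longest dictionary prefix, or None."""
    for pend in candidate_prefix_lengths(len(clean), tightened):
        prefix = clean[:pend]
        ipa = lookup.get(prefix)
        if ipa:
            return prefix, ipa
    return None


if __name__ == "__main__":
    # Assumed cleaned form of "si:said"; the old bounds reach the 3-char prefix.
    print(extract_prefix("sisaid", TOY_IPA, tightened=False))
    print(extract_prefix("sisaid", TOY_IPA, tightened=True))
    # A merged token with a long garbled tail still matches under the new bounds.
    print(extract_prefix("fictionsalansfikfn", TOY_IPA, tightened=True))
```

Under these assumptions, a 6-character token is skipped entirely by the new length check, and a plain plural such as "fictions" no longer yields "fiction" because only one character would remain after the prefix.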
--- a/klausur-service/backend/cv_ocr_engines.py
+++ b/klausur-service/backend/cv_ocr_engines.py
@@ -1194,9 +1194,11 @@ def _insert_missing_ipa(text: str, pronunciation: str = 'british') -> str:
                        break
                # Merged token: dictionary word + garbled IPA stuck together.
                # E.g. "fictionsalans'fIkfn" starts with "fiction".
-                # Extract the dictionary prefix and add it with IPA.
-                if clean_j and len(clean_j) >= 5:
-                    for pend in range(min(len(clean_j), 15), 2, -1):
+                # Extract the dictionary prefix (≥4 chars) and add it with
+                # IPA, but only if enough chars remain after the prefix (≥3)
+                # to look like garbled IPA, not just a plural 's'.
+                if clean_j and len(clean_j) >= 7:
+                    for pend in range(min(len(clean_j) - 3, 15), 3, -1):
                        prefix_j = clean_j[:pend]
                        prefix_ipa = _lookup_ipa(prefix_j, pronunciation)
                        if prefix_ipa: