Fix IPA continuation: only process fully-bracketed cells, keep phrasal verb particles

Two fixes:

1. Step 5d now only treats cells as continuation when text is entirely inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets (e.g. "employee [im'ploi:]") are no longer overwritten.
2. fix_ipa_continuation_cell no longer skips grammar words like "down"; they are part of the headword in phrasal verbs like "close sth. down".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 00:43:51 +01:00
parent 5f89913a9a
commit 92a7b85c2d
3 changed files with 32 additions and 5 deletions
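
The Step 5d change is not part of the hunk shown here, so the following is only a minimal sketch of what an "entirely inside brackets" check could look like. The name `is_fully_bracketed` and the regular expression are assumptions made for illustration, not code from cv_ocr_engines.py.

```python
import re

# Hypothetical pattern (not from the commit): the whole cell must be a
# single [...] group, with optional surrounding whitespace and no
# nested or additional brackets.
_FULLY_BRACKETED = re.compile(r'^\s*\[[^\[\]]*\]\s*$')


def is_fully_bracketed(cell_text: str) -> bool:
    """Return True only if the cell text is one bracketed group and nothing else."""
    return bool(_FULLY_BRACKETED.match(cell_text))


# A garbled continuation cell qualifies; a cell with a headword outside
# the brackets does not, so it would be left untouched.
print(is_fully_bracketed("[n, nn]"))
print(is_fully_bracketed("employee [im'ploi:]"))
```

Under this assumption, only cells like "[n, nn]" would be handed to fix_ipa_continuation_cell, while "employee [im'ploi:]" keeps its original text.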
--- a/klausur-service/backend/cv_ocr_engines.py
+++ b/klausur-service/backend/cv_ocr_engines.py
@@ -1266,7 +1266,10 @@ def fix_ipa_continuation_cell(
    if not parts:
        return garbled_text

-    # Look up IPA for each headword part
+    # Look up IPA for each headword part.
+    # Do NOT skip grammar words here — they are integral parts of the
+    # headword (e.g. "close down", "the United Kingdom").  Grammar
+    # annotations like "(sth)", "(no pl)" are already stripped above.
    ipa_parts: List[str] = []
    for part in parts:
        # A part may be multi-word like "secondary school"
@@ -1276,9 +1279,6 @@ def fix_ipa_continuation_cell(
            clean_w = re.sub(r'[^a-zA-Z\'-]', '', w)
            if not clean_w or len(clean_w) < 2:
                continue
-            # Skip grammar words like "to" at the start
-            if clean_w.lower() in _GRAMMAR_BRACKET_WORDS:
-                continue
            ipa = _lookup_ipa(clean_w, pronunciation)
            if ipa:
                word_ipas.append(ipa)
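
To illustrate the effect of removing the grammar-word check, here is a simplified, self-contained sketch of the per-word loop. `GRAMMAR_WORDS`, `TOY_IPA`, and `collect_word_ipas` are hypothetical stand-ins for `_GRAMMAR_BRACKET_WORDS`, the pronunciation lookup, and the loop body; the real set contents and the `_lookup_ipa` behavior are not visible in this diff.

```python
import re
from typing import Dict, List

# Hypothetical stand-ins; the real values live in cv_ocr_engines.py.
GRAMMAR_WORDS = {"to", "down", "the"}
TOY_IPA: Dict[str, str] = {"close": "kləʊz", "down": "daʊn"}


def collect_word_ipas(part: str, skip_grammar_words: bool) -> List[str]:
    """Collect IPA for each word of a headword part.

    skip_grammar_words=True mimics the old behavior (grammar words dropped),
    False mimics the behavior after this commit.
    """
    result: List[str] = []
    for w in part.split():
        clean_w = re.sub(r"[^a-zA-Z'-]", "", w)
        if len(clean_w) < 2:
            continue
        if skip_grammar_words and clean_w.lower() in GRAMMAR_WORDS:
            continue
        ipa = TOY_IPA.get(clean_w.lower())  # toy lookup, not _lookup_ipa
        if ipa:
            result.append(ipa)
    return result


print(collect_word_ipas("close down", skip_grammar_words=True))   # particle lost
print(collect_word_ipas("close down", skip_grammar_words=False))  # particle kept
```

With the toy data above, the old behavior yields IPA for "close" only, while the new behavior also keeps the particle "down".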