Lower word-split threshold from 5 to 4 chars
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 50s
CI / test-go-edu-search (push) Successful in 46s
CI / test-python-klausur (push) Failing after 2m48s
CI / test-python-agent-core (push) Successful in 37s
CI / test-nodejs-website (push) Successful in 38s
Short merged words like "anew" (a new), "Imadea" (I made a), and "makeadecision" (make a decision) were missed because the split threshold was too high. Tokens >= 4 chars are now processed. English single-letter words (a, I) are already handled by the DP algorithm, which allows them as valid split points.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -729,7 +729,7 @@ def _try_split_merged_word(token: str) -> Optional[str]:
 
     Preserves original capitalisation by mapping back to the input string.
     """
-    if not _SPELL_AVAILABLE or len(token) < 5:
+    if not _SPELL_AVAILABLE or len(token) < 4:
         return None
 
     lower = token.lower()

@@ -835,7 +835,7 @@ def _spell_fix_token(token: str, field: str = "") -> Optional[str]:
 
     # 5. Merged-word split: OCR often merges adjacent words when spacing
     # is too tight, e.g. "atmyschool" → "at my school"
-    if len(token) >= 5 and token.isalpha():
+    if len(token) >= 4 and token.isalpha():
         split = _try_split_merged_word(token)
         if split:
             return split
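The DP split the commit message refers to can be sketched roughly as below. This is an illustration, not the repository's actual implementation: `_WORDS` is a hypothetical stand-in for the spell-checker's dictionary (the real code guards this with `_SPELL_AVAILABLE`), and the capitalisation mapping mentioned in the docstring is omitted.

```python
from typing import Optional

# Hypothetical mini-dictionary; the real code consults a spell-checker's
# word list. Note the single-letter entries "a" and "i" — this is why the
# DP can split "anew" and "Imadea" once the length gate allows them in.
_WORDS = {"a", "i", "new", "made", "make", "decision", "at", "my", "school"}

def try_split_merged_word(token: str, min_len: int = 4) -> Optional[str]:
    """DP word segmentation: best[i] holds a word split of token[:i], or None."""
    if len(token) < min_len or not token.isalpha():
        return None
    lower = token.lower()
    n = len(lower)
    # best[i] is a list of dictionary words covering lower[:i], if one exists.
    best: list[Optional[list]] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and lower[j:i] in _WORDS:
                best[i] = best[j] + [lower[j:i]]
                break
    # Require at least two words, so a token that is itself one
    # dictionary word (e.g. "school") is left alone.
    if best[n] is None or len(best[n]) < 2:
        return None
    return " ".join(best[n])
```

With the gate at 4, `try_split_merged_word("anew")` yields `"a new"`; with the old gate of 5 the function would have returned `None` before the DP ever ran.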