fix: split PaddleOCR boxes at leading ! for overlay word positioning

When PaddleOCR returns "!Betonung" as a single word box, the overlay positions text starting at the "!" instead of the actual word. Split such boxes into ["!", "Betonung"] with proportional position splitting, matching the existing IPA bracket splitting logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:46:17 +01:00
parent 8349c28f54
commit 1f527fcd49
1 changed file with 3 additions and 2 deletions
@@ -190,10 +190,11 @@ def _build_cells(
        word_boxes = []
        for w in sorted(cell_words, key=lambda ww: (ww['top'], ww['left'])):
            raw_text = w.get('text', '').strip()
-            # Split by whitespace AND at "[" boundaries (IPA without space)
+            # Split by whitespace, at "[" boundaries (IPA), and after leading "!"
            # e.g. "badge[bxd3]" → ["badge", "[bxd3]"]
            # e.g. "profit['proft]" → ["profit", "['proft]"]
-            tokens = re.split(r'\s+|(?=\[)', raw_text)
+            # e.g. "!Betonung" → ["!", "Betonung"]
+            tokens = re.split(r'\s+|(?=\[)|(?<=!)(?=[A-Za-z\u00c0-\u024f])', raw_text)
            tokens = [t for t in tokens if t]  # remove empty strings
            if len(tokens) <= 1:
                # Single word — keep as-is
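
The tokenization in this hunk can be tried in isolation. The sketch below reuses the regex from the diff unchanged; everything else is an assumption for illustration. `tokenize` and `split_box` are hypothetical names, the `width` key is assumed to exist on a word box next to the `left`, `top` and `text` keys visible above, and dividing the width by character count is only one plausible reading of "proportional position splitting", not necessarily what `_build_cells` does.

```python
import re

# Same pattern as in the diff: whitespace, before "[", and between "!" and a letter.
TOKEN_SPLIT = re.compile(r'\s+|(?=\[)|(?<=!)(?=[A-Za-z\u00c0-\u024f])')


def tokenize(raw_text):
    """Split OCR text and drop the empty strings re.split can produce."""
    return [t for t in TOKEN_SPLIT.split(raw_text.strip()) if t]


def split_box(word):
    """Hypothetical helper: share a box's width among tokens by character count."""
    tokens = tokenize(word.get('text', ''))
    if len(tokens) <= 1:
        # Single word: keep the box as-is.
        return [dict(word)]
    total_chars = sum(len(t) for t in tokens)
    boxes = []
    consumed = 0
    for t in tokens:
        # Left edge is the original left plus the share used by earlier tokens.
        left = word['left'] + round(word['width'] * consumed / total_chars)
        consumed += len(t)
        right = word['left'] + round(word['width'] * consumed / total_chars)
        boxes.append({**word, 'text': t, 'left': left, 'width': right - left})
    return boxes


if __name__ == '__main__':
    print(tokenize('!Betonung'))
    print(tokenize('badge[bxd3]'))
    print(split_box({'text': '!Betonung', 'left': 100, 'top': 50, 'width': 90}))
```

One thing the sketch makes visible: the lookbehind does not check that the "!" is at the start of the text, so a string such as "Hallo!Welt" is also split, into "Hallo!" and "Welt". If only a leading "!" should be separated, the pattern would need an anchor in addition to the lookbehind.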