fix: split PaddleOCR boxes at IPA brackets for overlay positioning
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer produces "badge [bˈædʒ]" with space, creating a token count mismatch between cell.text and word_boxes. Now also split at "[" boundaries so each IPA bracket gets its own sub-box. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -15,6 +15,7 @@ DATENSCHUTZ: Alle Verarbeitung erfolgt lokal.
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
|
import re
|
||||||
import statistics
|
import statistics
|
||||||
from typing import Any, Dict, List, Tuple
|
from typing import Any, Dict, List, Tuple
|
||||||
|
|
||||||
@@ -185,10 +186,15 @@ def _build_cells(
|
|||||||
# PaddleOCR returns phrase-level boxes (e.g. "competition [kompa'tifn]"),
|
# PaddleOCR returns phrase-level boxes (e.g. "competition [kompa'tifn]"),
|
||||||
# but the overlay slide mechanism expects one box per word. Split multi-word
|
# but the overlay slide mechanism expects one box per word. Split multi-word
|
||||||
# boxes into individual word positions proportional to character length.
|
# boxes into individual word positions proportional to character length.
|
||||||
|
# Also split at "[" boundaries (IPA patterns like "badge[bxd3]").
|
||||||
word_boxes = []
|
word_boxes = []
|
||||||
for w in sorted(cell_words, key=lambda ww: (ww['top'], ww['left'])):
|
for w in sorted(cell_words, key=lambda ww: (ww['top'], ww['left'])):
|
||||||
raw_text = w.get('text', '').strip()
|
raw_text = w.get('text', '').strip()
|
||||||
tokens = raw_text.split()
|
# Split by whitespace AND at "[" boundaries (IPA without space)
|
||||||
|
# e.g. "badge[bxd3]" → ["badge", "[bxd3]"]
|
||||||
|
# e.g. "profit['proft]" → ["profit", "['proft]"]
|
||||||
|
tokens = re.split(r'\s+|(?=\[)', raw_text)
|
||||||
|
tokens = [t for t in tokens if t] # remove empty strings
|
||||||
if len(tokens) <= 1:
|
if len(tokens) <= 1:
|
||||||
# Single word — keep as-is
|
# Single word — keep as-is
|
||||||
word_boxes.append({
|
word_boxes.append({
|
||||||
|
|||||||
Reference in New Issue
Block a user