fix: split PaddleOCR boxes at leading ! for overlay word positioning
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
When PaddleOCR returns "!Betonung" as a single word box, the overlay positions text starting at the "!" instead of at the actual word. Split such boxes into ["!", "Betonung"] with proportional position splitting, matching the existing IPA bracket splitting logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
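The proportional position splitting mentioned above is not part of the visible hunk, so the following is only a minimal sketch of one way it could work: each token receives a share of the original box width proportional to its character count. The helper name `split_box_proportionally` and the `width` and `height` keys are assumptions for illustration; only `text`, `top`, and `left` appear in the diff.

```python
def split_box_proportionally(word, tokens):
    """Split one OCR word box into one box per token.

    Hypothetical helper: the horizontal extent of `word` is divided
    among `tokens` in proportion to their character counts.
    Assumes `word` has 'left', 'top', 'width', 'height' keys
    ('width' and 'height' are assumed, not confirmed by the diff).
    """
    total_chars = sum(len(t) for t in tokens)
    if total_chars == 0:
        return []

    boxes = []
    chars_before = 0  # characters consumed by earlier tokens
    for token in tokens:
        boxes.append({
            'text': token,
            # Shift right by the share of width used by earlier tokens.
            'left': word['left'] + word['width'] * chars_before / total_chars,
            'top': word['top'],
            'width': word['width'] * len(token) / total_chars,
            'height': word['height'],
        })
        chars_before += len(token)
    return boxes


word = {'text': '!Betonung', 'left': 100, 'top': 20, 'width': 90, 'height': 12}
for box in split_box_proportionally(word, ['!', 'Betonung']):
    print(box['text'], box['left'], box['width'])
```

With this scheme the "!" box takes one ninth of the width and "Betonung" starts right after it, so an overlay that anchors text at each box's `left` would place the word itself at its own position. Character-count proportions are only an approximation for proportional fonts, where "!" is narrower than an average letter.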
@@ -190,10 +190,11 @@ def _build_cells(
 word_boxes = []
 for w in sorted(cell_words, key=lambda ww: (ww['top'], ww['left'])):
     raw_text = w.get('text', '').strip()
-    # Split by whitespace AND at "[" boundaries (IPA without space)
+    # Split by whitespace, at "[" boundaries (IPA), and after leading "!"
     # e.g. "badge[bxd3]" → ["badge", "[bxd3]"]
     # e.g. "profit['proft]" → ["profit", "['proft]"]
-    tokens = re.split(r'\s+|(?=\[)', raw_text)
+    # e.g. "!Betonung" → ["!", "Betonung"]
+    tokens = re.split(r'\s+|(?=\[)|(?<=!)(?=[A-Za-z\u00c0-\u024f])', raw_text)
     tokens = [t for t in tokens if t]  # remove empty strings
     if len(tokens) <= 1:
         # Single word — keep as-is
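To see what the new pattern does in isolation, the sketch below applies it to a few sample strings. The regular expression is copied from the hunk above; the `tokenize` wrapper and the sample inputs other than the ones in the code comments are hypothetical. Splitting on zero-width matches with `re.split` requires Python 3.7 or later.

```python
import re

# Pattern copied from the changed line in this commit.
TOKEN_SPLIT = re.compile(r'\s+|(?=\[)|(?<=!)(?=[A-Za-z\u00c0-\u024f])')


def tokenize(raw_text):
    """Hypothetical wrapper: split and drop empty strings, as the hunk does."""
    return [t for t in TOKEN_SPLIT.split(raw_text.strip()) if t]


samples = ["!Betonung", "badge[bxd3]", "profit['proft]", "Hallo!", "Achtung!Gefahr"]
for sample in samples:
    print(repr(sample), tokenize(sample))
```

Note that the lookbehind `(?<=!)` is not anchored to the start of the string, so the pattern splits after any "!" that is immediately followed by a letter, not only a leading one. A trailing "!" with nothing after it, as in "Hallo!", is left attached to its word.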