Fix word-gap merge: add missing pronouns to stop words, reduce threshold

- Add du/dich/dir/mich/mir/uns/euch/ihm/ihn to _STOP_WORDS to prevent
  false merges like "du" + "zerlegst" → "duzerlegst"
- Reduce max_short threshold from 6 to 5 to prevent merging multi-word
  phrases like "ziehen lassen" → "ziehenlassen"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-27 15:35:12 +01:00
parent a8773d5b00
commit 96ea23164d

@@ -34,7 +34,8 @@ _STOP_WORDS = frozenset([
     'der', 'die', 'das', 'dem', 'den', 'des',
     'ein', 'eine', 'einem', 'einen', 'einer',
     # Pronouns
-    'er', 'es', 'sie', 'wir', 'ihr', 'ich', 'man', 'sich',
+    'du', 'er', 'es', 'sie', 'wir', 'ihr', 'ich', 'man', 'sich',
+    'dich', 'dir', 'mich', 'mir', 'uns', 'euch', 'ihm', 'ihn',
     # Prepositions
     'mit', 'von', 'zu', 'für', 'auf', 'in', 'an', 'um', 'am', 'im',
     'aus', 'bei', 'nach', 'vor', 'bis', 'durch', 'über', 'unter',
@@ -146,7 +147,7 @@ def merge_word_gaps_in_zones(zones_data: List[Dict], session_id: str) -> int:
     producing text like "zerknit tert" instead of "zerknittert". This
     function tries to merge adjacent fragments in every content cell.
-    More permissive than ``_try_merge_pipe_gaps`` (threshold 6 instead of 3)
+    More permissive than ``_try_merge_pipe_gaps`` (threshold 5 instead of 3)
     but still guarded by pyphen dictionary lookup and stop-word exclusion.
     Returns the number of cells modified.
@@ -186,8 +187,9 @@ def merge_word_gaps_in_zones(zones_data: List[Dict], session_id: str) -> int:
 def _try_merge_word_gaps(text: str, hyph_de) -> str:
     """Merge OCR word fragments with relaxed threshold (max_short=6).
-    Similar to ``_try_merge_pipe_gaps`` but allows longer fragments to be
-    merged. Still requires pyphen to recognize the merged word.
+    Similar to ``_try_merge_pipe_gaps`` but allows slightly longer fragments
+    (max_short=5 instead of 3). Still requires pyphen to recognize the
+    merged word.
     """
     parts = text.split(' ')
     if len(parts) < 2:
@@ -207,7 +209,7 @@ def _try_merge_word_gaps(text: str, hyph_de) -> str:
         and prev_alpha and curr_alpha
         and prev_alpha.lower() not in _STOP_WORDS
         and curr_alpha.lower() not in _STOP_WORDS
-        and min(len(prev_alpha), len(curr_alpha)) <= 6
+        and min(len(prev_alpha), len(curr_alpha)) <= 5
         and len(prev_alpha) + len(curr_alpha) >= 4
     )
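
The effect of both changes can be reproduced in isolation with a small sketch. It is not the code from this commit: `alpha_only`, `may_merge`, and `merge_fragments` are hypothetical names, the stop-word list is only an excerpt, and the pyphen check is replaced by a stand-in (`is_word`) that accepts every candidate, so that only the stop-word and length guards from the last hunk decide the outcome.

```python
from typing import Callable, FrozenSet

# Excerpt of the stop-word list shown in the diff (the real list is longer).
STOP_WORDS: FrozenSet[str] = frozenset([
    'der', 'die', 'das', 'dem', 'den', 'des',
    'du', 'er', 'es', 'sie', 'wir', 'ihr', 'ich', 'man', 'sich',
    'dich', 'dir', 'mich', 'mir', 'uns', 'euch', 'ihm', 'ihn',
])


def alpha_only(fragment: str) -> str:
    """Keep alphabetic characters only (assumed meaning of prev_alpha/curr_alpha)."""
    return ''.join(ch for ch in fragment if ch.isalpha())


def may_merge(prev: str, curr: str,
              stop_words: FrozenSet[str] = STOP_WORDS,
              max_short: int = 5) -> bool:
    """Guard conditions mirroring the last hunk of the diff."""
    prev_alpha, curr_alpha = alpha_only(prev), alpha_only(curr)
    return bool(
        prev_alpha and curr_alpha
        and prev_alpha.lower() not in stop_words
        and curr_alpha.lower() not in stop_words
        and min(len(prev_alpha), len(curr_alpha)) <= max_short
        and len(prev_alpha) + len(curr_alpha) >= 4
    )


def merge_fragments(text: str,
                    is_word: Callable[[str], bool] = lambda word: True,
                    **guards) -> str:
    """Merge adjacent space-separated fragments that pass the guards.

    `is_word` is a hypothetical stand-in for the pyphen-based check;
    by default it accepts everything.
    """
    parts = text.split(' ')
    if len(parts) < 2:
        return text
    merged = [parts[0]]
    for curr in parts[1:]:
        prev = merged[-1]
        if may_merge(prev, curr, **guards) and is_word(prev + curr):
            merged[-1] = prev + curr   # join the two fragments
        else:
            merged.append(curr)        # keep them as separate words
    return ' '.join(merged)


if __name__ == "__main__":
    print(merge_fragments("zerknit tert"))                 # zerknittert
    print(merge_fragments("du zerlegst"))                  # du zerlegst
    print(merge_fragments("ziehen lassen"))                # ziehen lassen
    print(merge_fragments("ziehen lassen", max_short=6))   # ziehenlassen
```

With the old threshold (`max_short=6`), "ziehen" and "lassen" both count as short fragments because each has exactly six letters; lowering the limit to 5 excludes them while still allowing genuine fragments such as "tert".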