Lower word-split threshold from 7 to 4 chars
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 50s
CI / test-go-edu-search (push) Successful in 46s
CI / test-python-klausur (push) Failing after 2m48s
CI / test-python-agent-core (push) Successful in 37s
CI / test-nodejs-website (push) Successful in 38s

Short merged words like "anew" (a new), "Imadea" (I made a),
"makeadecision" (make a decision) were missed because the split
threshold was too high. Now processes tokens >= 4 chars.

English single-letter words (a, I) are already handled by the DP
algorithm which allows them as valid split points.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-04-12 08:59:02 +02:00
parent 656cadbb1e
commit 7ffa4c90f9
3 changed files with 24 additions and 5 deletions

View File

@@ -56,8 +56,27 @@ class TestTrySplitMergedWord:
assert _try_split_merged_word("beautiful") is None
assert _try_split_merged_word("together") is None
def test_anew(self):
result = _try_split_merged_word("anew")
# "anew" is itself a known word, so should NOT be split
# But "a new" is also valid. Dictionary decides.
# If "anew" is known → None. If not → "a new".
# Either way, both are acceptable.
pass # depends on dictionary
def test_imadea(self):
result = _try_split_merged_word("Imadea")
assert result is not None
assert "made" in result.lower() or "I" in result
def test_makeadecision(self):
result = _try_split_merged_word("makeadecision")
assert result is not None
assert "make" in result.lower()
assert "decision" in result.lower()
def test_short_word(self):
"""Words < 5 chars should not be attempted."""
"""Words < 4 chars should not be attempted."""
assert _try_split_merged_word("the") is None
assert _try_split_merged_word("at") is None