Commit Graph

2 Commits

Author SHA1 Message Date
Benjamin Admin
7ffa4c90f9 Lower word-split threshold from 7 to 4 chars
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 50s
CI / test-go-edu-search (push) Successful in 46s
CI / test-python-klausur (push) Failing after 2m48s
CI / test-python-agent-core (push) Successful in 37s
CI / test-nodejs-website (push) Successful in 38s
Short merged words like "anew" (a new), "Imadea" (I made a),
"makeadecision" (make a decision) were missed because the split
threshold was too high. Now processes tokens >= 4 chars.

English single-letter words (a, I) are already handled by the DP
algorithm which allows them as valid split points.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 08:59:02 +02:00
Benjamin Admin
9e2c301723 Add merged-word splitting to OCR spell review
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 43s
CI / test-go-edu-search (push) Successful in 38s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
OCR often merges adjacent words when spacing is tight, e.g.
"atmyschool" → "at my school", "goodidea" → "good idea".

New _try_split_merged_word() uses dynamic programming to find the
shortest sequence of dictionary words covering the token. Integrated
as step 5 in _spell_fix_token() after general spell correction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 14:11:16 +02:00