Previously, boundary repair was skipped when both words were valid
dictionary words (e.g., "Pound sand", "wit hit", "done euro").
Boundary repair now uses word-frequency scoring (product of bigram
frequencies) to decide whether the repair produces a more common word pair.
Threshold: a repair is accepted when the new pair is more than 5x as
frequent as the original, or when it produces a known abbreviation.
New fixes: "Pound sand" → "Pounds and" (2000x), "wit hit" → "with it"
(100000x), "done euro" → "one euro" (7x).
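
A minimal sketch of the acceptance rule described above, not the actual implementation. The names (`WORD_FREQ`, `pair_score`, `accept_repair`), the counts in the table, and the reading of "product of frequencies" as the product of the two words' individual counts are all assumptions made for illustration. How a zero-frequency original is handled is also a guess.

```python
# Made-up counts for illustration only; the real frequency source is not
# stated in this commit message.
WORD_FREQ = {
    "pound": 40, "sand": 50,
    "pounds": 400, "and": 10000,
}
KNOWN_ABBREVIATIONS = {"sth", "sb", "adj"}  # small sample, hypothetical
MIN_RATIO = 5.0  # repair must be more than 5x as frequent


def pair_score(left, right, freq=WORD_FREQ):
    """Score a word pair as the product of the two words' frequencies."""
    return freq.get(left.lower(), 0) * freq.get(right.lower(), 0)


def accept_repair(original, repaired, freq=WORD_FREQ,
                  abbreviations=KNOWN_ABBREVIATIONS):
    """Return True if the repaired pair should replace the original pair."""
    # A known abbreviation in the result is accepted outright.
    if any(w.lower().rstrip(".") in abbreviations for w in repaired):
        return True
    old = pair_score(original[0], original[1], freq)
    new = pair_score(repaired[0], repaired[1], freq)
    if old == 0:
        # Assumption: any scored pair beats an unscored one.
        return new > 0
    return new > MIN_RATIO * old
```

With the made-up counts above, `accept_repair(("Pound", "sand"), ("Pounds", "and"))` accepts the repair, while the reverse direction is rejected.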
43 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New features:
- Boundary repair: "ats th." → "at sth." (shifted OCR word boundaries)
  Tries shifting 1-2 chars between adjacent words, accepts if result
  includes a known abbreviation or produces better dictionary matches
- Context split: "anew book" → "a new book" (ambiguous word merges)
  Explicit allow/deny list for article+word patterns (alive, alone, etc.)
- Abbreviation awareness: 120+ known abbreviations (sth, sb, adj, etc.)
  are now recognized as valid words, preventing false corrections
- Quality gate: boundary repairs only accepted when result scores
  higher than original (known words + abbreviations)
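
The features above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code in this commit: the function names, the tiny word lists, and the scoring (one point per known word or abbreviation) are hypothetical, and the context split shows only a deny list.

```python
# Tiny hypothetical vocabularies; the real lists are much larger.
KNOWN_WORDS = {"a", "at", "new", "book", "live", "alive", "alone"}
KNOWN_ABBREVIATIONS = {"sth", "sb", "adj"}
NO_SPLIT = {"alive", "alone"}  # deny list: never split these into "a" + word


def shift_candidates(left, right, max_shift=2):
    """Yield word pairs with 1..max_shift chars moved across the boundary."""
    for n in range(1, max_shift + 1):
        if len(left) > n:
            yield left[:-n], left[-n:] + right  # tail of left moves right
        if len(right) > n:
            yield left + right[:n], right[n:]   # head of right moves left


def quality(words):
    """Count words that are known words or known abbreviations."""
    score = 0
    for w in words:
        core = w.lower().rstrip(".,;:")
        if core in KNOWN_WORDS or core in KNOWN_ABBREVIATIONS:
            score += 1
    return score


def repair_boundary(left, right):
    """Quality gate: keep the original unless a candidate scores higher."""
    best = (left, right)
    best_score = quality(best)
    for cand in shift_candidates(left, right):
        cand_score = quality(cand)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best


def split_article(word):
    """Split a merged article ("anew" -> "a", "new") unless deny-listed."""
    lower = word.lower()
    if lower in NO_SPLIT:
        return [word]
    if lower.startswith("a") and lower[1:] in KNOWN_WORDS:
        return [word[0], word[1:]]
    return [word]
```

Under these assumptions, `repair_boundary("ats", "th.")` moves the "s" across the boundary because "at" and "sth" both score, while `repair_boundary("new", "book")` is left unchanged since no shifted pair scores higher.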
40 tests passing, all edge cases covered.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>