feat(ocr-review): replace LLM with rule-based spell-checker (REVIEW_ENGINE=spell)
- Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup
- New spell_review_entries_sync() + spell_review_entries_streaming():
  - Dictionary-backed substitution: checks if corrected word is known
  - Structural rule: digit at pos 0 + lowercase rest → most likely letter
    (e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld")
  - Pattern rule: "|." → "1." for numbered list prefixes
  - Standalone "|" → "I" (capital I)
- IPA entries still protected via existing _entry_needs_review filter
- Headings/untranslated words (e.g. "Story") are untouched (no suspicious characters)
- llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE
  env var ("spell" default, "llm" to restore previous behaviour)
- docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell}
- LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env)
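
The substitution rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `fix_token`, `fix_text`, `DIGIT_TO_LETTER` and the tiny `DEMO_WORDS` set are hypothetical names, the digit map contains only the confusions named in the examples, and the word set stands in for the pyspellchecker dictionaries so that the sketch runs on its own.

```python
from typing import Callable

# Digit-to-letter confusions taken from the commit's examples only;
# the real table is not shown in the commit and may be larger.
DIGIT_TO_LETTER = {"8": "B", "5": "S", "6": "G"}

# Stand-in dictionary (assumption) so the sketch runs without pyspellchecker.
DEMO_WORDS = {"ben", "buch", "see", "geld"}


def demo_is_known(word: str) -> bool:
    """Case-insensitive lookup in the stand-in dictionary."""
    return word.lower() in DEMO_WORDS


def fix_token(token: str, is_known: Callable[[str], bool] = demo_is_known) -> str:
    """Apply the three rules to a single whitespace-delimited token."""
    # Standalone pipe is read as a capital I.
    if token == "|":
        return "I"
    # "|." at the start of a token is read as the list prefix "1.".
    if token.startswith("|."):
        return "1." + token[2:]
    # Digit at position 0 followed by lowercase letters: substitute the
    # likely letter, but only if the result is a known word.
    rest = token[1:]
    if rest and token[0] in DIGIT_TO_LETTER and rest.isalpha() and rest.islower():
        candidate = DIGIT_TO_LETTER[token[0]] + rest
        if is_known(candidate):
            return candidate
    return token


def fix_text(text: str, is_known: Callable[[str], bool] = demo_is_known) -> str:
    """Fix every token; note that runs of whitespace collapse to one space."""
    return " ".join(fix_token(t, is_known) for t in text.split())
```

Because the dictionary check gates the digit rule, a token such as "5th" stays as it is: the candidate "Sth" is not a known word.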
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
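
The REVIEW_ENGINE routing described in the message could look roughly like this. It is a sketch under assumptions: the function bodies are placeholders, the `entries` argument and the `env` parameter are hypothetical, and treating any value other than "llm" as "spell" is a guess at the fallback behaviour rather than something the commit states.

```python
import os
from typing import Any, Dict, List, Mapping, Optional


def spell_review_entries_sync(entries: List[Any]) -> Dict[str, Any]:
    """Placeholder for the rule-based reviewer added in this commit."""
    return {"engine": "spell", "entries": entries}


def _llm_review_entries(entries: List[Any]) -> Dict[str, Any]:
    """Placeholder for the preserved LLM code path (hypothetical name)."""
    return {"engine": "llm", "entries": entries}


def llm_review_entries(
    entries: List[Any], env: Optional[Mapping[str, str]] = None
) -> Dict[str, Any]:
    """Dispatch on REVIEW_ENGINE; 'spell' is the default."""
    env = os.environ if env is None else env
    engine = env.get("REVIEW_ENGINE", "spell").strip().lower()
    if engine == "llm":
        return _llm_review_entries(entries)
    # Assumption: unknown values fall back to the rule-based engine.
    return spell_review_entries_sync(entries)
```

Passing `env` explicitly keeps the dispatch testable without touching the process environment.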
@@ -35,6 +35,9 @@ onnxruntime
 # IPA pronunciation dictionary lookup (MIT license, bundled CMU dict ~134k words)
 eng-to-ipa
 
+# Spell-checker for rule-based OCR correction (MIT license)
+pyspellchecker>=0.8.1
+
 # PostgreSQL (for metrics storage)
 psycopg2-binary>=2.9.0
 asyncpg>=0.29.0
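
One plausible way to use the new dependency for the EN+DE lookup is sketched below. `load_checkers` and `is_known` are hypothetical helper names, not functions from this commit; `SpellChecker(language=...)` and `SpellChecker.known()` are the pyspellchecker calls assumed here. The import sits inside `load_checkers` so the rest of the sketch can be exercised without the package installed.

```python
from typing import Iterable, List, Protocol, Set


class Checker(Protocol):
    """Minimal interface this sketch needs from a spell-checker object."""

    def known(self, words: Iterable[str]) -> Set[str]: ...


def load_checkers() -> List[Checker]:
    """Build one checker per language (requires pyspellchecker)."""
    from spellchecker import SpellChecker  # module provided by pyspellchecker

    return [SpellChecker(language="en"), SpellChecker(language="de")]


def is_known(word: str, checkers: Iterable[Checker]) -> bool:
    """True if any of the given dictionaries recognises the word."""
    return any(checker.known([word]) for checker in checkers)
```

With the package installed, `is_known("Buch", load_checkers())` would consult both dictionaries; whether a particular word is found depends on the bundled word lists.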