feat(ocr-review): replace LLM with rule-based spell-checker (REVIEW_ENGINE=spell)
- Add pyspellchecker (MIT) to requirements for EN+DE dictionary lookup
- New spell_review_entries_sync() + spell_review_entries_streaming():
- Dictionary-backed substitution: checks if corrected word is known
- Structural rule: digit at pos 0 + lowercase rest → most likely letter
(e.g. "8en"→"Ben", "8uch"→"Buch", "5ee"→"See", "6eld"→"Geld")
- Pattern rule: "|." → "1." for numbered list prefixes
- Standalone "|" → "I" (capital I)
- IPA entries still protected via existing _entry_needs_review filter
- Headings/untranslated words (e.g. "Story") are untouched (no susp. chars)
- llm_review_entries + llm_review_entries_streaming: route via REVIEW_ENGINE
env var ("spell" default, "llm" to restore previous behaviour)
- docker-compose.yml: REVIEW_ENGINE=${REVIEW_ENGINE:-spell}
- LLM code preserved for fallback (set REVIEW_ENGINE=llm in .env)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -235,6 +235,7 @@ services:
|
||||
OLLAMA_CORRECTION_MODEL: ${OLLAMA_CORRECTION_MODEL:-llama3.2}
|
||||
OLLAMA_REVIEW_MODEL: ${OLLAMA_REVIEW_MODEL:-qwen3:0.6b}
|
||||
OLLAMA_REVIEW_BATCH_SIZE: ${OLLAMA_REVIEW_BATCH_SIZE:-20}
|
||||
REVIEW_ENGINE: ${REVIEW_ENGINE:-spell}
|
||||
OCR_ENGINE: ${OCR_ENGINE:-auto}
|
||||
OLLAMA_HTR_MODEL: ${OLLAMA_HTR_MODEL:-qwen2.5vl:32b}
|
||||
HTR_FALLBACK_MODEL: ${HTR_FALLBACK_MODEL:-trocr-large}
|
||||
|
||||
Reference in New Issue
Block a user