fix(llm-review): LLM no longer translates, only fixes OCR digit errors
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
## Problem

qwen3:0.6b interpreted the prompt too broadly and tried to:

- translate English words (rewriting the EN column)
- re-translate correct German words
- "correct" IPA entries in brackets

## Fixes

### 1. Stricter pre-filter (`_entry_needs_review`)

Now sends ONLY entries to the LLM that actually contain a digit-in-word pattern (one of the digits 0, 1, 5, 6, 8 next to a letter). → Correct entries are never sent in the first place.

### 2. Much more restrictive prompt

- Explicit prohibition: "du uebersetzt NICHTS, weder EN→DE noch DE→EN" ("you translate NOTHING, neither EN→DE nor DE→EN")
- Only the 5 digit→letter cases are allowed
- Concrete examples of permitted corrections
- The vague "Im Zweifel: NICHT aendern!" ("when in doubt, do not change") is gone, replaced by an explicit VERBOTEN (forbidden) list

### 3. Stronger spurious-change filter

Discards LLM changes that are not a digit→letter substitution. Prevents translations and rephrasings even when the prompt does not fully suppress them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
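As a rough illustration of fix 1, the sketch below re-declares the two regexes from the diff and runs the pre-filter logic over a few made-up vocabulary entries. The helper name `needs_review` and the sample entries are assumptions for illustration only; the patterns and the field names `english`, `german` and `example` are taken from the commit.

```python
import re
from typing import Dict

# Patterns copied from the diff in this commit.
HAS_PHONETIC_RE = re.compile(r'\[.*?[ˈˌːʃʒθðŋɑɒɔəɜɪʊʌæ].*?\]')
OCR_DIGIT_IN_WORD_RE = re.compile(
    r'(?<=[A-Za-zÄÖÜäöüß])[01568]|[01568](?=[A-Za-zÄÖÜäöüß])'
)


def needs_review(entry: Dict) -> bool:
    """Hypothetical stand-alone copy of the pre-filter logic."""
    en = entry.get("english", "") or ""
    de = entry.get("german", "") or ""
    ex = entry.get("example", "") or ""
    # Empty rows are never reviewed.
    if not en.strip() and not de.strip():
        return False
    # Rows with IPA brackets were already dictionary-corrected.
    if HAS_PHONETIC_RE.search(en) or HAS_PHONETIC_RE.search(de):
        return False
    # Review only if some field has a digit touching a letter.
    return bool(OCR_DIGIT_IN_WORD_RE.search(f"{en} {de} {ex}"))


# Made-up sample rows (not from the real data set).
samples = [
    {"english": "L0ndon", "german": "London"},          # digit inside a word
    {"english": "old", "german": "alt"},                # clean row
    {"english": "dance [dɑːns]", "german": "tanzen"},   # IPA brackets
    {"english": "page 5", "german": "Seite 5"},         # free-standing digit
    {"english": "room", "german": "Zimmer",
     "example": "Room 8b is upstairs."},                # legitimate "8b"
]
results = [needs_review(e) for e in samples]
print(results)
```

The last sample shows a limitation worth keeping in mind: a legitimate token such as `8b` also matches the digit-in-word pattern, so the row is still sent to the LLM and the prompt and the spurious-change filter have to keep it intact.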
@@ -5390,46 +5390,61 @@ logger.info("LLM review model: %s (batch=%d)", OLLAMA_REVIEW_MODEL, _REVIEW_BATC
 # Regex: entry contains IPA phonetic brackets like "dance [dɑːns]"
 _HAS_PHONETIC_RE = _re.compile(r'\[.*?[ˈˌːʃʒθðŋɑɒɔəɜɪʊʌæ].*?\]')
 
+# Regex: digit adjacent to a letter — the hallmark of OCR digit↔letter confusion.
+# Matches digits 0,1,5,6,8 (common OCR confusions: 0→O, 1→l/I, 5→S, 6→G, 8→B)
+# when they appear inside or next to a word character.
+_OCR_DIGIT_IN_WORD_RE = _re.compile(r'(?<=[A-Za-zÄÖÜäöüß])[01568]|[01568](?=[A-Za-zÄÖÜäöüß])')
+
 
 def _entry_needs_review(entry: Dict) -> bool:
     """Check if an entry should be sent to the LLM for review.
 
-    Skip entries that are empty or contain IPA phonetic transcriptions
-    (those were already corrected by the word dictionary lookup).
+    Only sends entries that actually contain OCR digit↔letter confusion
+    patterns (e.g. "8en" instead of "Ben", "L0ndon" instead of "London").
+    This prevents the LLM from touching correct entries.
     """
     en = entry.get("english", "") or ""
     de = entry.get("german", "") or ""
+    ex = entry.get("example", "") or ""
 
     # Skip completely empty entries
     if not en.strip() and not de.strip():
         return False
-    # Skip entries with phonetic/IPA brackets — these are dictionary-corrected
-    if _HAS_PHONETIC_RE.search(en):
+    # Skip entries with IPA/phonetic brackets — dictionary-corrected, no OCR digits expected
+    if _HAS_PHONETIC_RE.search(en) or _HAS_PHONETIC_RE.search(de):
         return False
-    return True
+    # Only review if at least one field has a digit-in-word pattern
+    combined = f"{en} {de} {ex}"
+    return bool(_OCR_DIGIT_IN_WORD_RE.search(combined))
 
 
 def _build_llm_prompt(table_lines: List[Dict]) -> str:
     """Build the LLM correction prompt for a batch of entries."""
-    return f"""Du bist ein Korrekturleser fuer OCR-erkannte Vokabeltabellen (Englisch-Deutsch).
-Die Tabelle wurde per OCR aus einem Schulbuch-Scan extrahiert. Korrigiere NUR offensichtliche OCR-Fehler.
+    return f"""Du bist ein OCR-Zeichenkorrektur-Werkzeug fuer Vokabeltabellen (Englisch-Deutsch).
 
-Haeufige OCR-Fehler die du korrigieren sollst:
-- Ziffern statt Buchstaben: 8→B, 0→O, 1→l/I, 5→S, 6→G
-- Fehlende oder falsche Satzzeichen
-- Offensichtliche Tippfehler die durch OCR entstanden sind
+DEINE EINZIGE AUFGABE: Einzelne Zeichen korrigieren, die vom OCR-Scanner als Ziffer statt als Buchstabe erkannt wurden.
 
-WICHTIG — Aendere NICHTS in diesen Faellen:
-- Woerter die korrekt geschrieben sind (auch wenn sie ungewoehnlich aussehen)
-- Eigennamen, Laendernamen, Staedtenamen (z.B. China, Japan, London, Africa)
-- Abkuerzungen wie sth., sb., etc., e.g., i.e.
-- Lautschrift und phonetische Zeichen in eckigen Klammern [...]
-- Fachbegriffe und Fremdwoerter die korrekt sind
-- Im Zweifel: NICHT aendern!
+NUR diese Korrekturen sind erlaubt:
+- Ziffer 8 statt B: "8en" → "Ben", "8uch" → "Buch", "8all" → "Ball"
+- Ziffer 0 statt O oder o: "L0ndon" → "London", "0ld" → "Old"
+- Ziffer 1 statt l oder I: "1ong" → "long", "Ber1in" → "Berlin"
+- Ziffer 5 statt S oder s: "5tadt" → "Stadt", "5ee" → "See"
+- Ziffer 6 statt G oder g: "6eld" → "Geld"
 
-Antworte NUR mit dem korrigierten JSON-Array. Kein erklaerener Text.
-Fuer jeden Eintrag den du aenderst, setze "corrected": true.
-Fuer unveraenderte Eintraege setze "corrected": false.
-Behalte die exakte Struktur (gleiche Anzahl Eintraege).
+ABSOLUT VERBOTEN — aendere NIEMALS:
+- Woerter die korrekt geschrieben sind — auch wenn du eine andere Schreibweise kennst
+- Uebersetzungen — du uebersetzt NICHTS, weder EN→DE noch DE→EN
+- Korrekte englische Woerter (en-Spalte) — auch wenn du eine Bedeutung kennst
+- Korrekte deutsche Woerter (de-Spalte) — auch wenn du sie anders sagen wuerdest
+- Eigennamen: Ben, London, China, Africa, Shakespeare usw.
+- Abkuerzungen: sth., sb., etc., e.g., i.e., v.t., smb. usw.
+- Lautschrift in eckigen Klammern [...] — diese NIEMALS beruehren
+- Beispielsaetze in der ex-Spalte — NIEMALS aendern
+
+Wenn ein Wort keinen Ziffer-Buchstaben-Fehler enthaelt: gib es UNVERAENDERT zurueck und setze "corrected": false.
+
+Antworte NUR mit dem JSON-Array. Kein Text davor oder danach.
+Behalte die exakte Struktur (gleiche Anzahl Eintraege, gleiche Reihenfolge).
 
 /no_think
 
@@ -5440,31 +5455,54 @@ Eingabe:
 def _is_spurious_change(old_val: str, new_val: str) -> bool:
     """Detect LLM changes that are likely wrong and should be discarded.
 
+    Only digit↔letter substitutions (0→O, 1→l, 5→S, 6→G, 8→B) are
+    legitimate OCR corrections. Everything else is rejected.
+
     Filters out:
-    - Case-only changes (OCR doesn't typically swap case)
-    - Completely different words (LLM hallucinating a replacement)
-    - Changes where the old value is a valid proper noun / place name
+    - Case-only changes
+    - Changes that don't contain any digit→letter fix
+    - Completely different words (LLM translating or hallucinating)
+    - Additions or removals of whole words (count changed)
     """
     if not old_val or not new_val:
         return False
 
-    # Case-only change — almost never a real OCR error
+    # Case-only change — never a real OCR error
     if old_val.lower() == new_val.lower():
         return True
 
-    # If old value starts with uppercase and new is totally different word,
-    # it's likely a proper noun the LLM "corrected"
+    # If the word count changed significantly, the LLM rewrote rather than fixed
     old_words = old_val.split()
     new_words = new_val.split()
-    if len(old_words) == 1 and len(new_words) == 1:
-        ow, nw = old_words[0], new_words[0]
-        # Both are single words but share very few characters — likely hallucination
-        if len(ow) > 2 and len(nw) > 2:
-            # Levenshtein-like quick check: if < 50% chars overlap, reject
-            common = sum(1 for c in ow.lower() if c in nw.lower())
-            max_len = max(len(ow), len(nw))
-            if common / max_len < 0.5:
-                return True
+    if abs(len(old_words) - len(new_words)) > 1:
+        return True
+
+    # Core rule: a legitimate correction replaces a digit with the corresponding
+    # letter. If the change doesn't include such a substitution, reject it.
+    # Build a set of (old_char, new_char) pairs that differ between old and new.
+    # Use character-level diff heuristic: if lengths are close, zip and compare.
+    _DIGIT_TO_LETTER = {
+        '0': set('oOgG'),
+        '1': set('lLiI'),
+        '5': set('sS'),
+        '6': set('gG'),
+        '8': set('bB'),
+    }
+    has_valid_digit_fix = False
+    if len(old_val) == len(new_val):
+        for oc, nc in zip(old_val, new_val):
+            if oc != nc:
+                if oc in _DIGIT_TO_LETTER and nc in _DIGIT_TO_LETTER[oc]:
+                    has_valid_digit_fix = True
+                # Any other single-char change is suspicious (could be translation)
+    else:
+        # Length changed: only accept if the difference is one char and
+        # the old contained a digit where new has a letter
+        if abs(len(old_val) - len(new_val)) <= 1 and _OCR_DIGIT_IN_WORD_RE.search(old_val):
+            has_valid_digit_fix = True
+
+    if not has_valid_digit_fix:
+        return True  # Reject — looks like translation or hallucination
 
     return False
 
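One detail of the same-length branch above: it marks a change as valid as soon as one position is a digit→letter fix, even if other characters changed as well (the "suspicious" comment has no effect on the result). A stricter variant, sketched here purely as an assumption and not part of this commit, would require every differing position to be such a substitution. The function name `is_pure_digit_fix` is hypothetical; the mapping is copied from the diff.

```python
# Mapping copied from the diff; everything else in this block is illustrative.
DIGIT_TO_LETTER = {
    "0": set("oOgG"),
    "1": set("lLiI"),
    "5": set("sS"),
    "6": set("gG"),
    "8": set("bB"),
}


def is_pure_digit_fix(old_val: str, new_val: str) -> bool:
    """Return True only if new_val differs from old_val solely by
    digit→letter substitutions listed in DIGIT_TO_LETTER."""
    # A pure substitution never changes the length, and must change something.
    if len(old_val) != len(new_val) or old_val == new_val:
        return False
    for oc, nc in zip(old_val, new_val):
        if oc == nc:
            continue
        # Any differing position must map an OCR digit to an allowed letter.
        if nc not in DIGIT_TO_LETTER.get(oc, ()):
            return False
    return True


checks = {
    ("L0ndon", "London"): is_pure_digit_fix("L0ndon", "London"),
    ("Ber1in", "Berlin"): is_pure_digit_fix("Ber1in", "Berlin"),
    ("8en Haus", "Ben Maus"): is_pure_digit_fix("8en Haus", "Ben Maus"),
    ("London", "london"): is_pure_digit_fix("London", "london"),
    ("Stadt", "city"): is_pure_digit_fix("Stadt", "city"),
}
for pair, ok in checks.items():
    print(pair, "accept" if ok else "reject")
```

With this rule, `"8en Haus"` → `"Ben Maus"` is rejected because the `H`→`M` change is not a digit fix, whereas the committed same-length check would let it through on the strength of the `8`→`B` substitution alone. The trade-off is that corrections which change the string length are always rejected.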