fix(llm-review): remove pre-filter, send all entries to the LLM

The digit-in-word pre-filter blocked all 41 entries (skipped=41
in the log). OCR errors cannot be detected in advance.

Back to the original approach: all non-empty entries without
IPA brackets are sent to the LLM. Protection against translations
relies solely on the strict prompt and _is_spurious_change().
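
A minimal sketch of why a digit-in-word pre-filter cannot detect OCR errors in advance. The pattern below is a hypothetical stand-in (the real _OCR_DIGIT_IN_WORD_RE is truncated in the diff and may differ), and the sample strings are made up for illustration:

```python
import re

# Hypothetical stand-in, NOT the project's actual pattern: a digit from
# [01568] that directly follows or directly precedes a letter.
DIGIT_IN_WORD_RE = re.compile(
    r"(?<=[A-Za-zÄÖÜäöüß])[01568]|[01568](?=[A-Za-zÄÖÜäöüß])"
)

# Made-up samples: two digit/letter confusions, two OCR errors that
# contain no digit at all, and one correct word.
samples = ["L0ndon", "8en", "tbe house", "rnother", "London"]

flagged = [s for s in samples if DIGIT_IN_WORD_RE.search(s)]
missed = [s for s in samples if not DIGIT_IN_WORD_RE.search(s)]

print("flagged:", flagged)
print("missed: ", missed)
```

Under these assumptions the filter cannot tell "tbe house" (an OCR error) from "London" (correct), since neither contains a digit. If the input has no digit confusions at all, every entry is skipped.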

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-03 14:29:46 +01:00
parent f1b6246838
commit fa8e38db2d


@@ -5399,23 +5399,19 @@ _OCR_DIGIT_IN_WORD_RE = _re.compile(r'(?<=[A-Za-zÄÖÜäöüß])[01568]|[01568]
 def _entry_needs_review(entry: Dict) -> bool:
     """Check if an entry should be sent to the LLM for review.
 
-    Only sends entries that actually contain OCR digit↔letter confusion
-    patterns (e.g. "8en" instead of "Ben", "L0ndon" instead of "London").
-    This prevents the LLM from touching correct entries.
+    Sends all non-empty entries that don't have IPA phonetic transcriptions.
+    The LLM prompt and _is_spurious_change() guard against unwanted changes.
     """
     en = entry.get("english", "") or ""
     de = entry.get("german", "") or ""
-    ex = entry.get("example", "") or ""
 
     # Skip completely empty entries
     if not en.strip() and not de.strip():
         return False
-    # Skip entries with IPA/phonetic brackets — dictionary-corrected, no OCR digits expected
+    # Skip entries with IPA/phonetic brackets — dictionary-corrected, LLM must not touch them
     if _HAS_PHONETIC_RE.search(en) or _HAS_PHONETIC_RE.search(de):
         return False
-    # Only review if at least one field has a digit-in-word pattern
-    combined = f"{en} {de} {ex}"
-    return bool(_OCR_DIGIT_IN_WORD_RE.search(combined))
+    return True
 
 
 def _build_llm_prompt(table_lines: List[Dict]) -> str:
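
The behaviour of the simplified filter can be sketched in isolation as follows. The definition of _HAS_PHONETIC_RE is not part of this diff, so the pattern used here (any square-bracketed text) is an assumption, and the sample entries are invented for illustration:

```python
import re
from typing import Dict, List

# Assumption: the real _HAS_PHONETIC_RE is not shown in the diff. This
# stand-in treats any square-bracketed text as an IPA transcription.
_HAS_PHONETIC_RE = re.compile(r"\[[^\]]+\]")


def entry_needs_review(entry: Dict) -> bool:
    """Mirror of the post-change logic: review everything non-empty without IPA."""
    en = entry.get("english", "") or ""
    de = entry.get("german", "") or ""
    # Skip entries where both language fields are empty or whitespace
    if not en.strip() and not de.strip():
        return False
    # Skip entries that carry a phonetic transcription
    if _HAS_PHONETIC_RE.search(en) or _HAS_PHONETIC_RE.search(de):
        return False
    return True


# Invented sample rows
entries: List[Dict] = [
    {"english": "house", "german": "Haus"},
    {"english": "tbe dog", "german": "der Hund"},
    {"english": "water [ˈwɔːtə]", "german": "Wasser"},
    {"english": "", "german": None},
    {"english": "   ", "german": "Katze"},
]

to_review = [e for e in entries if entry_needs_review(e)]
print([e["german"] for e in to_review])
```

Note that an entry is skipped as empty only when both fields are blank. A row with just one populated field is still sent for review.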