fix(llm-review): remove pre-filter, send all entries to the LLM
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 20s
The digit-in-word pre-filter blocked all 41 entries (skipped=41 in the log). OCR errors cannot be detected in advance. Back to the original approach: all non-empty entries without IPA brackets are sent to the LLM. Protection against unwanted translations relies solely on the strict prompt and _is_spurious_change().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -5399,23 +5399,19 @@ _OCR_DIGIT_IN_WORD_RE = _re.compile(r'(?<=[A-Za-zÄÖÜäöüß])[01568]|[01568]
 def _entry_needs_review(entry: Dict) -> bool:
     """Check if an entry should be sent to the LLM for review.
 
-    Only sends entries that actually contain OCR digit↔letter confusion
-    patterns (e.g. "8en" instead of "Ben", "L0ndon" instead of "London").
-    This prevents the LLM from touching correct entries.
+    Sends all non-empty entries that don't have IPA phonetic transcriptions.
+    The LLM prompt and _is_spurious_change() guard against unwanted changes.
     """
     en = entry.get("english", "") or ""
     de = entry.get("german", "") or ""
-    ex = entry.get("example", "") or ""
 
     # Skip completely empty entries
     if not en.strip() and not de.strip():
         return False
-    # Skip entries with IPA/phonetic brackets — dictionary-corrected, no OCR digits expected
+    # Skip entries with IPA/phonetic brackets — dictionary-corrected, LLM must not touch them
     if _HAS_PHONETIC_RE.search(en) or _HAS_PHONETIC_RE.search(de):
         return False
-    # Only review if at least one field has a digit-in-word pattern
-    combined = f"{en} {de} {ex}"
-    return bool(_OCR_DIGIT_IN_WORD_RE.search(combined))
+    return True
 
 
 def _build_llm_prompt(table_lines: List[Dict]) -> str: