Add pipe auto-correction and graphic artifact filter for grid builder
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m10s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s

- autocorrect_pipe_artifacts(): strips OCR pipe artifacts from printed
  syllable dividers, validates with pyphen, tries char-deletion near
  pipe positions for garbled words (e.g. "Ze|plpe|lin" → "Zeppelin")
- Rule (a2): filters isolated non-alphanumeric word boxes (≤2 chars,
  no letters/digits) — catches small icons OCR'd as ">", "<" etc.
- Both fixes are generic: pyphen-validated, no session-specific logic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-03-27 16:33:38 +01:00
parent 0685fb12da
commit cc4cb3bc2f
2 changed files with 159 additions and 1 deletions

View File

@@ -1323,6 +1323,14 @@ async def _build_grid_core(
and wb.get("conf", 100) < 85):
to_remove.add(i)
# Rule (a2): isolated non-alphanumeric symbols (graphic OCR artifacts)
# Small images/icons next to words get OCR'd as ">", "<", "~", etc.
# Remove word boxes that contain NO letters or digits.
for i, wb in enumerate(wbs):
t = (wb.get("text") or "").strip()
if t and not re.search(r'[a-zA-Z0-9äöüÄÖÜß]', t) and len(t) <= 2:
to_remove.add(i)
# Rule (b) + (c): overlap and duplicate detection
# Sort by x for pairwise comparison
_ALPHA_WORD_RE = re.compile(r'^[A-Za-z\u00c0-\u024f\-]+[.,;:!?]*$')
@@ -1619,6 +1627,15 @@ async def _build_grid_core(
except Exception as e:
logger.warning("Word-gap merge failed: %s", e)
# --- Pipe auto-correction: fix OCR artifacts from printed syllable dividers ---
# Strips | from words, validates with pyphen, tries char-deletion for garbled
# words like "Ze|plpe|lin" → "Zeppelin".
try:
from cv_syllable_detect import autocorrect_pipe_artifacts
autocorrect_pipe_artifacts(zones_data, session_id)
except Exception as e:
logger.warning("Pipe autocorrect failed: %s", e)
# --- Syllable divider insertion for dictionary pages ---
# syllable_mode: "auto" = only when original has pipe dividers (1% threshold),
# "all" = force on all content words, "en" = English column only,