Fix: preserve = ; : - and other meaningful symbols in word_boxes

Rule (a2) in Step 5i removed short word_boxes (at most 2 characters)
with no letters/digits as "graphic OCR artifacts". This incorrectly
removed = signs used as definition markers in textbooks
("film = 1. Film; 2. filmen").

Added exception list _KEEP_SYMBOLS for meaningful punctuation:
= (= =) ; : - – — / + • · ( ) & * → ← ↔

Root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one
block, which gets split into word_boxes ["film", "=", "1.", ...].
The "=" word_box had no alphanumeric chars, so it was removed as an
artifact.
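
The filter with the exception list can be sketched in isolation as below. This is a minimal illustration, not the code from `_build_grid_core`: the function name `symbol_artifact_indices`, the module-level constants, and the exact way the sample sentence is split into word boxes are assumptions made for the example.

```python
import re

# Symbols that carry meaning in textbook entries and must survive rule (a2).
# Names here are hypothetical; the commit defines _KEEP_SYMBOLS inline.
KEEP_SYMBOLS = {'=', '(=', '=)', ';', ':', '-', '–', '—', '/', '+',
                '•', '·', '(', ')', '&', '*', '→', '←', '↔'}

# Same character class as the diff: ASCII letters, digits, German umlauts and ß.
ALNUM = re.compile(r'[a-zA-Z0-9äöüÄÖÜß]')


def symbol_artifact_indices(word_boxes):
    """Return indices of word boxes that look like graphic OCR artifacts."""
    to_remove = set()
    for i, wb in enumerate(word_boxes):
        t = (wb.get("text") or "").strip()
        # Candidate artifact: non-empty, at most 2 chars, no letters or digits.
        if t and not ALNUM.search(t) and len(t) <= 2:
            # Exception: meaningful punctuation is kept.
            if t not in KEEP_SYMBOLS:
                to_remove.add(i)
    return to_remove


# Assumed split of the example block, plus two typical icon artifacts.
boxes = [{"text": w} for w in
         ["film", "=", "1.", "Film;", "2.", "filmen", ">", "~"]]
removed = symbol_artifact_indices(boxes)
kept = [wb["text"] for i, wb in enumerate(boxes) if i not in removed]
print(kept)
```

In this sketch the "=" box survives while ">" and "~" are still dropped, which is the behavior the fix is aiming for.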

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-15 23:18:35 +02:00
parent ba0f659d1e
commit c8027eb7f9


@@ -1407,10 +1407,14 @@ async def _build_grid_core(
 # Rule (a2): isolated non-alphanumeric symbols (graphic OCR artifacts)
 # Small images/icons next to words get OCR'd as ">", "<", "~", etc.
 # Remove word boxes that contain NO letters or digits.
+# Exception: meaningful punctuation used in textbooks (=, ;, :, -, etc.)
+_KEEP_SYMBOLS = {'=', '(=', '=)', ';', ':', '-', '–', '—', '/', '+',
+                 '•', '·', '(', ')', '&', '*', '→', '←', '↔'}
 for i, wb in enumerate(wbs):
     t = (wb.get("text") or "").strip()
     if t and not re.search(r'[a-zA-Z0-9äöüÄÖÜß]', t) and len(t) <= 2:
-        to_remove.add(i)
+        if t not in _KEEP_SYMBOLS:
+            to_remove.add(i)
 # Rule (b) + (c): overlap and duplicate detection
 # Sort by x for pairwise comparison