c8027eb7f94b1ec47628c94e07d9ca7c2ffe2a4e
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 40s
CI / test-go-edu-search (push) Successful in 43s
CI / test-python-klausur (push) Failing after 2m38s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
Rule (a2) in Step 5i removed word_boxes with no letters/digits as
"graphic OCR artifacts". This incorrectly removed = signs used as
definition markers in textbooks ("film = 1. Film; 2. filmen").
Added exception list _KEEP_SYMBOLS for meaningful punctuation:
= (= =) ; : - – — / + • · ( ) & * → ← ↔
The root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one
block, which gets split into word_boxes ["film", "=", "1.", ...].
The "=" word_box had no alphanumeric chars and was removed as artifact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
TypeScript
59.7%
Python
33.6%
Go
5.3%
C#
0.8%
CSS
0.2%
Other
0.3%