Rule (a2): switch from allow-list to block-list for symbol removal
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 47s
CI / test-go-edu-search (push) Successful in 47s
CI / test-python-klausur (push) Failing after 2m42s
CI / test-python-agent-core (push) Successful in 34s
CI / test-nodejs-website (push) Successful in 36s
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 47s
CI / test-go-edu-search (push) Successful in 47s
CI / test-python-klausur (push) Failing after 2m42s
CI / test-python-agent-core (push) Successful in 34s
CI / test-nodejs-website (push) Successful in 36s
Instead of keeping only specific symbols (_KEEP_SYMBOLS), now only removes explicitly decorative symbols (_REMOVE_SYMBOLS: > < ~ \ ^ etc). All other punctuation (= ( ) ; : - etc.) is preserved by default. This is more robust: any new symbol used in textbooks will be kept unless it's in the small block-list of known decorative artifacts. Fixes: (= token still being removed on page 5 despite being in the allow-list (possibly due to Unicode variants or whitespace). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1407,13 +1407,14 @@ async def _build_grid_core(
|
||||
# Rule (a2): isolated non-alphanumeric symbols (graphic OCR artifacts)
|
||||
# Small images/icons next to words get OCR'd as ">", "<", "~", etc.
|
||||
# Remove word boxes that contain NO letters or digits.
|
||||
# Exception: meaningful punctuation used in textbooks (=, ;, :, -, etc.)
|
||||
_KEEP_SYMBOLS = {'=', '(=', '=)', ';', ':', '-', '–', '—', '/', '+',
|
||||
'•', '·', '(', ')', '&', '*', '→', '←', '↔'}
|
||||
# Exception: keep standard punctuation and symbols commonly used
|
||||
# in textbooks (=, ;, :, -, (, ), etc.). Only remove truly
|
||||
# decorative symbols like >, <, ~, \, ^, `, #.
|
||||
_REMOVE_SYMBOLS = {'>', '<', '~', '\\', '^', '`', '#', '|', '¬', '¦'}
|
||||
for i, wb in enumerate(wbs):
|
||||
t = (wb.get("text") or "").strip()
|
||||
if t and not re.search(r'[a-zA-Z0-9äöüÄÖÜß]', t) and len(t) <= 2:
|
||||
if t not in _KEEP_SYMBOLS:
|
||||
if t in _REMOVE_SYMBOLS:
|
||||
to_remove.add(i)
|
||||
|
||||
# Rule (b) + (c): overlap and duplicate detection
|
||||
|
||||
Reference in New Issue
Block a user