Benjamin Admin e9f368d3ec feat(ocr-pipeline): add abbreviation allowlist to noise filter
Add _KNOWN_ABBREVIATIONS set with ~150 common EN/DE abbreviations
(sth, sb, etc, eg, ie, usw, bzw, vgl, adj, adv, prep, sg, pl, ...).
Tokens matching known abbreviations are never stripped as noise.

Also handle dotted abbreviations (such as z.B. and i.e.), which
contain no run of 2+ consecutive alpha chars, by checking the
abbreviation set before the _RE_REAL_WORD filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:46:54 +01:00
42 MiB
Languages
TypeScript 60.2%
Python 32.9%
Go 5.5%
C# 0.8%
CSS 0.2%
Other 0.3%