Add _KNOWN_ABBREVIATIONS set with ~150 common EN/DE abbreviations

The set covers entries such as sth, sb, etc, eg, ie, usw, bzw, vgl, adj,
adv, prep, sg, pl. Tokens matching a known abbreviation are never
stripped as noise.

Also handle dotted abbreviations (e.g. z.B., i.e.) that have no 2+
consecutive alpha chars by checking the abbreviation set before the
_RE_REAL_WORD filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>