Benjamin Admin c3a924a620 fix(ocr-pipeline): merge phonetic-only rows and fix bracket noise filter
Two fixes:
1. The noise filter no longer strips tokens ending with ] (e.g.
   "serva]"). They were dropped because ] was missing from the
   allowed punctuation list.
2. Rows containing only phonetic transcription (e.g. ['mani serva])
   are now merged into the previous vocab entry instead of creating
   a separate (invalid) entry. This prevents the LLM from trying
   to "correct" phonetic fragments.
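
The two fixes can be illustrated with a minimal sketch. This is not the
pipeline's actual code: the names ALLOWED_PUNCT, clean_token,
PHONETIC_ONLY and merge_phonetic_rows, the "text"/"phonetic" row keys,
and the exact punctuation set are all assumptions made for illustration.

```python
import re

# Assumed set of punctuation a token may contain without being treated
# as noise. The point of fix 1 is that "]" has to be in this set.
ALLOWED_PUNCT = ".,;:!?'\"()[]-"

# Assumed shape of a phonetic-only row: one bracketed group, nothing else.
PHONETIC_ONLY = re.compile(r"^\[[^\[\]]+\]$")


def clean_token(token):
    """Return the token if it looks like text, or None if it is noise."""
    core = token.strip(ALLOWED_PUNCT)
    if core and all(ch.isalnum() or ch in ALLOWED_PUNCT for ch in token):
        return token
    return None


def merge_phonetic_rows(rows):
    """Attach phonetic-only rows to the previous entry (fix 2)."""
    merged = []
    for row in rows:
        text = row["text"].strip()
        if merged and PHONETIC_ONLY.match(text):
            prev = merged[-1]
            prev["phonetic"] = (prev.get("phonetic", "") + " " + text).strip()
        else:
            # A phonetic row with no previous entry is kept unchanged.
            merged.append(dict(row))
    return merged


rows = [
    {"text": "money server"},
    {"text": "['mani serva]"},
    {"text": "apple"},
]
result = merge_phonetic_rows(rows)
```

In this sketch the phonetic row becomes a "phonetic" field on the
preceding entry, so no standalone entry is created for it and a later
LLM correction step never sees it as a word to fix.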

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:14:20 +01:00
Repository size: 42 MiB
Languages: TypeScript 60.1%, Python 33.1%, Go 5.4%, C# 0.8%, CSS 0.2%, Other 0.3%