Two fixes:
1. Tokens ending with ] (e.g. "serva]") were stripped by the noise
filter because ] was missing from the allowed punctuation list.
It is now allowed.
2. Rows containing only phonetic transcription (e.g. ['mani serva])
are now merged into the previous vocab entry instead of creating
a separate (invalid) entry. This prevents the LLM from trying
to "correct" phonetic fragments.
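
For reviewers, a minimal sketch of both behaviours. It assumes rows are plain strings and that a phonetic-only row is a single bracketed chunk. The names ALLOWED_TRAILING, is_noise, PHONETIC_ONLY and merge_phonetic_rows are hypothetical and do not come from this codebase.

```python
import re

# Hypothetical allow-list of trailing punctuation; "]" is the character
# that fix 1 adds.
ALLOWED_TRAILING = set(".,;:!?)'\"]")


def is_noise(token: str) -> bool:
    """Treat a token as noise if it is empty or ends in a character that is
    neither alphanumeric nor in the allow-list."""
    if not token:
        return True
    last = token[-1]
    return not (last.isalnum() or last in ALLOWED_TRAILING)


# Assumed shape of a phonetic-only row: one bracketed chunk, e.g. ['mani serva]
PHONETIC_ONLY = re.compile(r"^\[[^\[\]]+\]$")


def merge_phonetic_rows(rows):
    """Append phonetic-only rows to the previous entry instead of emitting
    them as separate entries (fix 2)."""
    merged = []
    for row in rows:
        text = row.strip()
        if PHONETIC_ONLY.match(text) and merged:
            # Attach the transcription to the entry it belongs to.
            merged[-1] = f"{merged[-1]} {text}"
        elif text:
            merged.append(text)
    return merged
```

In this sketch a phonetic-only row with no preceding entry is kept as-is, since there is nothing to merge it into. How the real code handles that case is not stated here.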
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>