Benjamin Admin 97d4355aa9 fix(ocr-pipeline): group words by vertical center, merge close clusters
Fix half-height rows caused by tall special characters (brackets, IPA
symbols) being split into separate line clusters:

- Group words by vertical CENTER instead of TOP position, so tall
  characters on the same line stay in one cluster
- Filter outlier-height words (>2× median) when computing letter_h
  so brackets/IPA don't skew the row height
- Merge clusters closer than 0.4× median word height (definitely
  same text line despite slight center differences)
- Increased y_tolerance from 0.5× to 0.6× median word height
- Enhanced logging with cluster merge count and row height range

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:14:42 +01:00
Description
No description provided
42 MiB
Languages
TypeScript 60.2%
Python 32.9%
Go 5.5%
C# 0.8%
CSS 0.2%
Other 0.3%