The y_tolerance for word-center clustering was derived from the median word height (21px → 12px tolerance), which was too small: words on the same line can have centers 15-20px apart because their heights differ.

The tolerance is now 40% of the gap-based median row height (e.g. 40px row → 16px tolerance), with 30% of that row height used as the merge threshold. This produces cluster counts that match the actual text lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
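For illustration, here is a minimal sketch of how a tolerance derived from row height changes the cluster count. It is not the code from this change: the helper names `tolerances_from_row_height` and `cluster_rows`, the greedy assignment against the running cluster mean, and the sample y-centers are all assumptions. The row height is passed in directly, since the gap-based median estimate itself is not shown in this message.

```python
from statistics import mean


def tolerances_from_row_height(row_height, tol_frac=0.4, merge_frac=0.3):
    """Return (y_tolerance, merge_threshold) as fractions of the row height."""
    return row_height * tol_frac, row_height * merge_frac


def cluster_rows(y_centers, tolerance, merge_threshold):
    """Group word y-centers into rows (hypothetical greedy scheme).

    A center joins the current cluster when it lies within `tolerance`
    of that cluster's mean. Adjacent clusters whose means are closer
    than `merge_threshold` are then merged.
    """
    clusters = []
    for y in sorted(y_centers):
        if clusters and abs(y - mean(clusters[-1])) <= tolerance:
            clusters[-1].append(y)
        else:
            clusters.append([y])

    merged = []
    for cluster in clusters:
        if merged and abs(mean(cluster) - mean(merged[-1])) < merge_threshold:
            merged[-1].extend(cluster)
        else:
            merged.append(list(cluster))
    return merged


# Made-up sample: two text lines whose word centers differ by 14px each.
centers = [100, 114, 140, 154]

# New behaviour: 40px row -> 16px tolerance, 12px merge threshold.
tol, merge = tolerances_from_row_height(40)
new_rows = cluster_rows(centers, tol, merge)   # 2 clusters

# Old behaviour: a 12px tolerance splits each line in two.
old_rows = cluster_rows(centers, 12, merge)    # 4 clusters
```

With the 12px tolerance, the 14px spread inside each line exceeds the tolerance, so every word ends up in its own cluster. The 16px tolerance absorbs that spread while staying well below the 40px distance between rows.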
132 KiB