Benjamin Admin db8327f039 fix(ocr-pipeline): tune column detection based on GT comparison
Address 5 weaknesses found via ground-truth comparison on session df3548d1:
- Add column_ignore for edge columns with < 3 words (margin detection)
- Absorb tiny clusters (< 5% width) into neighbors post-merge
- Restrict page_ref to left 35% of content area across all 3 levels
- Loosen marker thresholds (width < 6%, words <= 15) and add strong
  marker score for very narrow non-edge columns (< 4%)
- Add EN/DE position tiebreaker when language signals are both weak
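
The thresholds listed above can be illustrated with a small sketch. It is not the pipeline's actual code: the Column type, the function names, the numeric score values, and the choice to measure positions as fractions of the content width (and page_ref by column center) are all assumptions made for illustration. The commit does not state the order in which these steps run, so each rule is shown as an independent helper.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical column record; x0/x1 are fractions of content width (0..1)."""
    x0: float
    x1: float
    words: int

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def is_ignorable_edge(col: Column, is_edge: bool, min_words: int = 3) -> bool:
    # Edge columns with fewer than 3 words are treated as margin noise.
    return is_edge and col.words < min_words


def absorb_tiny(cols: list[Column], min_width: float = 0.05) -> list[Column]:
    # Fold clusters narrower than 5% of the width into a neighbor.
    # Assumption: prefer the left neighbor, fall back to the right one.
    out: list[Column] = []
    for c in sorted(cols, key=lambda c: c.x0):
        if out and c.width < min_width:
            prev = out[-1]
            out[-1] = Column(prev.x0, max(prev.x1, c.x1), prev.words + c.words)
        else:
            out.append(c)
    if len(out) > 1 and out[0].width < min_width:
        first, second = out[0], out[1]
        out = [Column(first.x0, second.x1, first.words + second.words)] + out[2:]
    return out


def page_ref_allowed(col: Column, max_left: float = 0.35) -> bool:
    # page_ref is only considered in the left 35% of the content area.
    # Assumption: the column center decides which side it is on.
    return (col.x0 + col.x1) / 2 <= max_left


def marker_score(col: Column, is_edge: bool) -> float:
    # Score values (1.0 each) are placeholders, not the real weights.
    score = 0.0
    if col.width < 0.06 and col.words <= 15:
        score += 1.0  # loosened marker thresholds
    if not is_edge and col.width < 0.04:
        score += 1.0  # strong marker: very narrow non-edge column
    return score
```

For example, under these assumptions a 3%-wide interior column with 10 words scores as a strong marker, while a 2-word column at the page edge is ignored.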

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:16:31 +01:00
42 MiB
Languages
TypeScript 60.2%
Python 32.9%
Go 5.5%
C# 0.8%
CSS 0.2%
Other 0.3%