The 25x25 dilation kernel merges nearby green words into large regions,
so pixel-overlap with OCR word boxes drops below 50%. Previous density
checks alone weren't sufficient.
New multi-layered approach:
- Count OCR word CENTROIDS inside each colored region
- ≥2 centroids → definitely text (images don't produce multiple words)
- 1 centroid + 10%+ pixel overlap → likely text
- Lower pixel overlap threshold from 50% to 40%
- Raise density+height thresholds for text-line detection
- Use INFO logging to diagnose remaining false positives
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>