The border strip filter (Step 4e) used the LARGEST x-gap which incorrectly removed base words along with edge artifacts. Now uses a two-stage approach: 1. _filter_border_strip_words() pre-filters raw words BEFORE column detection, scanning from the page edge inward to find the FIRST significant gap (>30px) 2. Step 4e runs as fallback only when pre-filter didn't apply Session 4233 now correctly detects 3 columns (base word | oder | synonyms) instead of 2. Threshold raised from 15% to 20% to handle pages with many edge artifacts. All 4 ground-truth sessions pass regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
105 KiB
105 KiB