Thin black lines (1-5px) at page edges from scanning were incorrectly
detected as content, shifting content bounds and creating spurious
IGNORE columns. This filters narrow projection runs (<1% of image
dimension) and introduces explicit margin_left/margin_right regions
for downstream page reconstruction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
(lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
any step, with reprocess button on completed pipeline steps
Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>