Preamble controls that duplicate article controls (same regulation,
Jaccard title similarity >= 0.40) are marked as duplicate.
Article controls always take priority.
Result: 6,183 active controls (down from 6,373); 648 unique preamble controls remain.
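
A minimal sketch of the dedup rule described above, for illustration only.
Word-level tokenization and the dict keys (id, regulation, source_type,
title) are assumptions made here, not the actual schema or code; only the
0.40 threshold and the "article wins" rule come from this message.

```python
import re


def title_tokens(title: str) -> set[str]:
    # Assumption: titles are compared as sets of lowercase word tokens.
    return set(re.findall(r"\w+", title.lower()))


def jaccard(a: str, b: str) -> float:
    # Jaccard similarity: |intersection| / |union| of the two token sets.
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mark_preamble_duplicates(controls, threshold=0.40):
    """Return ids of preamble controls that duplicate an article control.

    Article controls are never marked; they always take priority.
    """
    articles = [c for c in controls if c["source_type"] == "article"]
    duplicates = set()
    for c in controls:
        if c["source_type"] != "preamble":
            continue
        for a in articles:
            same_regulation = a["regulation"] == c["regulation"]
            if same_regulation and jaccard(a["title"], c["title"]) >= threshold:
                duplicates.add(c["id"])
                break  # one matching article control is enough
    return duplicates
```

With word tokens, "Encrypt personal data at rest" and "Encrypt personal
data in transit" share 3 of 7 distinct tokens (about 0.43), so the preamble
variant would be marked as duplicate if both belong to the same regulation.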
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
QA pipeline that matches control source_original_text directly against
original PDF documents to verify article/paragraph assignments. Covers
backfill, dedup, source normalization, Qdrant cleanup, and prod sync.
Key results (2026-03-20):
- 4,110/7,943 controls matched to PDF (100% for major EU regs)
- 3,366 article corrections, 705 new assignments
- 1,290 controls from recitals (Erwägungsgründe, the preamble) identified
- 779 controls from annexes (Anhänge) identified
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>