- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.
Reduces Phase 1 from ~96h to ~2h.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stale UUIDs in the Qdrant dedup collection can reference controls
that were deprecated in earlier batches. Log warning and continue
instead of raising and killing the entire job.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Control-Pipeline (Pass 0a/0b, BatchDedup, Generator) als eigenstaendiger
Service in Core, damit Compliance-Repo unabhaengig refakturiert werden kann.
Schreibt weiterhin ins compliance-Schema der shared PostgreSQL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>