fix: adapt batch dedup to NULL pattern_id — group by merge_group_hint

All Pass 0b controls have pattern_id=NULL, so batch dedup can no longer group by pattern_id. Rewritten to:
- Phase 1: Group by merge_group_hint (action:object:trigger), 52k groups
- Phase 2: Cross-group embedding search for semantically similar masters
- Qdrant search uses unfiltered cross-regulation endpoint
- API param changed: pattern_id → hint_filter
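
The two phases above can be sketched roughly as follows. This is a minimal illustration, not the code in batch_dedup_runner: the `Control` record, its `quality` score, the 0.95 similarity threshold, and the helper names are all assumptions, and a brute-force cosine comparison stands in for the Qdrant search.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List, Optional, Tuple


@dataclass
class Control:
    # Hypothetical record shape; the real Pass 0b schema is not shown here.
    id: str
    merge_group_hint: str  # "action:object:trigger"
    quality: float         # assumed score, higher is better
    embedding: List[float] = field(default_factory=list)


def phase1_group(
    controls: List[Control], hint_filter: Optional[str] = None
) -> Dict[str, Tuple[Control, List[Control]]]:
    """Group by merge_group_hint and return {hint: (master, duplicates)}."""
    groups: Dict[str, List[Control]] = defaultdict(list)
    for c in controls:
        # hint_filter is treated as a prefix match on the hint.
        if hint_filter is None or c.merge_group_hint.startswith(hint_filter):
            groups[c.merge_group_hint].append(c)
    result = {}
    for hint, members in groups.items():
        ranked = sorted(members, key=lambda c: c.quality, reverse=True)
        result[hint] = (ranked[0], ranked[1:])  # best quality becomes master
    return result


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def phase2_cross_group(
    masters: List[Control], threshold: float = 0.95
) -> List[Tuple[str, str]]:
    """Return (kept_id, merged_id) pairs for masters with similar embeddings."""
    pairs: List[Tuple[str, str]] = []
    merged = set()
    ordered = sorted(masters, key=lambda c: c.quality, reverse=True)
    for i, keep in enumerate(ordered):
        if keep.id in merged:
            continue
        for other in ordered[i + 1:]:
            if other.id in merged:
                continue
            if cosine(keep.embedding, other.embedding) >= threshold:
                pairs.append((keep.id, other.id))
                merged.add(other.id)
    return pairs


# Made-up sample data for illustration only.
controls = [
    Control("c1", "encrypt:backup:on_create", 0.9, [1.0, 0.0]),
    Control("c2", "encrypt:backup:on_create", 0.6, [1.0, 0.1]),
    Control("c3", "encrypt:archive:on_create", 0.8, [0.99, 0.05]),
    Control("c4", "review:access:quarterly", 0.7, [0.0, 1.0]),
]
groups = phase1_group(controls)
masters = [master for master, _ in groups.values()]
links = phase2_cross_group(masters)
```

In this sample, Phase 1 links c2 under c1 because they share a hint, and Phase 2 then links c3 under c1 because the two masters sit in different groups but have nearly identical embeddings. Passing `hint_filter="encrypt:"` would restrict Phase 1 to the two `encrypt:` groups.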

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-24 07:24:02 +01:00
parent 35784c35eb
commit 770f0b5ab0
3 changed files with 318 additions and 280 deletions

@@ -776,12 +776,12 @@ _batch_dedup_runner = None
 @router.post("/migrate/batch-dedup", response_model=MigrationResponse)
 async def migrate_batch_dedup(
     dry_run: bool = Query(False, description="Preview mode — no DB changes"),
-    pattern_id: Optional[str] = Query(None, description="Only process this pattern"),
+    hint_filter: Optional[str] = Query(None, description="Only process hints matching this prefix"),
 ):
     """Batch dedup: reduce ~85k Pass 0b controls to ~18-25k masters.
-    Groups controls by pattern_id + merge_group_hint, picks the best
-    quality master, and links duplicates via control_parent_links.
+    Phase 1: Groups by merge_group_hint, picks best quality master, links rest.
+    Phase 2: Cross-group embedding search for semantically similar masters.
     """
     global _batch_dedup_runner
     from compliance.services.batch_dedup_runner import BatchDedupRunner
@@ -790,7 +790,7 @@ async def migrate_batch_dedup(
     try:
         runner = BatchDedupRunner(db=db)
         _batch_dedup_runner = runner
-        stats = await runner.run(dry_run=dry_run, pattern_filter=pattern_id)
+        stats = await runner.run(dry_run=dry_run, hint_filter=hint_filter)
         return MigrationResponse(status="completed", stats=stats)
     except Exception as e:
         logger.error("Batch dedup failed: %s", e)