feat(control-pipeline): incremental dedup + ENISA CRA ingestion

BatchDedup since parameter (services/batch_dedup_runner.py + api):
- New 'since: datetime' param scopes the Phase 1 + Phase 2 SQL to created_at >= since.
- The Phase 2 checkpoint is deleted on a scoped run (prevents skipping new atomics
  whose control_id sorts alphabetically below the stale last_id).
- 6-13x faster for late-added documents (19k instead of 172k atomics).
- Docs: control-pipeline/docs/incremental-dedup.md.
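
A minimal sketch of the scoping and checkpoint behaviour described above, not the actual runner code. The table name `controls`, the helper `select_control_ids`, and the use of an in-memory SQLite database are assumptions made for illustration; only `since`, `created_at`, `control_id` and `last_id` come from this commit message.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional


def select_control_ids(
    conn: sqlite3.Connection,
    since: Optional[datetime] = None,
    last_id: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: pick the control_ids a dedup phase would process."""
    clauses, params = [], []
    if since is not None:
        # Scoped run: restrict to rows created at/after `since` ...
        clauses.append("created_at >= ?")
        params.append(since.isoformat())
        # ... and ignore the stale checkpoint so new rows are not skipped.
        last_id = None
    if last_id is not None:
        # Unscoped resume: continue after the checkpointed control_id.
        clauses.append("control_id > ?")
        params.append(last_id)
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = f"SELECT control_id FROM controls {where} ORDER BY control_id"
    return [row[0] for row in conn.execute(sql, params)]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE controls (control_id TEXT PRIMARY KEY, created_at TEXT)")
    conn.executemany(
        "INSERT INTO controls VALUES (?, ?)",
        [
            ("NIS2-001", "2026-01-10T00:00:00+00:00"),
            ("ZZZ-900", "2026-01-10T00:00:00+00:00"),
            ("CRA-001", "2026-05-18T00:00:00+00:00"),  # newly ingested
        ],
    )
    since = datetime(2026, 5, 1, tzinfo=timezone.utc)
    # Resuming from the stale checkpoint misses CRA-001 ("CRA..." < "NIS2...").
    stale = select_control_ids(conn, last_id="NIS2-001")
    # The scoped run drops the checkpoint and sees only the new row.
    scoped = select_control_ids(conn, since=since, last_id="NIS2-001")
    return stale, scoped
```

In this toy data the checkpointed resume never reaches `CRA-001`, which is the skip that deleting the Phase 2 checkpoint is meant to prevent.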

New scripts:
- gpre1_object_groups_incremental.py: appends new objects to object_groups via
  bge-m3 nearest-neighbor (threshold default 0.85; 0.78 is recommended for broader
  synonym matching). Pure INSERT/UPDATE, no DELETE.
- gpre2_master_controls_incremental.py: non-destructive master-controls update.
  Existing MCs are left untouched (UUIDs + master_control_id are preserved); only new
  members are appended, plus new MCs for object groups that now reach min-phases.
- ingest_enisa_cra.py: ingestion of the 8 CRA-relevant ENISA documents
  (Standards Mapping, EUCC-Implementation, NIS2 TIG, SRP FAQ, EUCC Eval Methodology,
  CVD Policies, Threat Landscape 2025). chunk_strategy=legal,
  requirement_strength=guidance|consultation_draft|evidentiary.
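
A rough sketch of threshold-based nearest-neighbor matching as used conceptually by the object-group append, under stated assumptions. The function names, the two-dimensional toy vectors (standing in for bge-m3 embeddings), and the choice to return `None` below the threshold are all hypothetical; the script's real behaviour for unmatched objects is not described in this commit.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_group(
    vec: List[float],
    group_vectors: Dict[str, List[float]],
    threshold: float = 0.85,
) -> Tuple[Optional[str], float]:
    """Return (group_id, similarity) of the closest group, or (None, similarity)
    when the best match falls below the threshold."""
    best_id, best_sim = None, -1.0
    for group_id, group_vec in group_vectors.items():
        sim = cosine(vec, group_vec)
        if sim > best_sim:
            best_id, best_sim = group_id, sim
    if best_sim >= threshold:
        return best_id, best_sim
    return None, best_sim


# Toy stand-ins for embeddings of existing object groups.
groups = {"firewall": [1.0, 0.0], "backup": [0.0, 1.0]}
new_object = [0.8, 0.6]  # similarity to "firewall" is about 0.8

strict_match, _ = nearest_group(new_object, groups, threshold=0.85)
broad_match, _ = nearest_group(new_object, groups, threshold=0.78)
```

With these toy values the object is left unmatched at 0.85 but joins `firewall` at 0.78, which illustrates why a lower threshold gives broader synonym matching at the cost of looser groups.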

Source data: legal-sources/enisa/enisa_cra_single_reporting_platform_faq.html
(PDFs are excluded via .gitignore).

Result of this pipeline iteration:
- 1,296 new CRA controls + 19,652 atomic children
- +362 new master controls, 10,017 existing ones extended
- Total: 13,950 MCs, 620 CRA MCs (previously 566), 1,304 CRA atomics (previously 841)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-18 18:21:46 +02:00
parent 47d7beeb52
commit 9783657da3
7 changed files with 1895 additions and 15 deletions
@@ -1553,6 +1553,7 @@ async def get_repair_backfill_status(backfill_id: str):
 class BatchDedupRequest(BaseModel):
     dry_run: bool = True
     hint_filter: Optional[str] = None  # Only process groups matching this hint prefix
+    since: Optional[str] = None  # ISO datetime — scope to controls created at/after this
 _batch_dedup_status: dict = {}
@@ -1567,7 +1568,15 @@ async def _run_batch_dedup(req: BatchDedupRequest, dedup_id: str):
     runner = BatchDedupRunner(db)
     _batch_dedup_status[dedup_id] = {"status": "running", "phase": "starting"}
-    stats = await runner.run(dry_run=req.dry_run, hint_filter=req.hint_filter)
+    since_dt = None
+    if req.since:
+        from datetime import datetime
+        since_dt = datetime.fromisoformat(req.since.replace("Z", "+00:00"))
+    stats = await runner.run(
+        dry_run=req.dry_run,
+        hint_filter=req.hint_filter,
+        since=since_dt,
+    )
     _batch_dedup_status[dedup_id] = {
         "status": "completed",