feat(control-pipeline): incremental dedup + ENISA CRA ingestion
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 43s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 37s
BatchDedup since parameter (services/batch_dedup_runner.py + api):
- New 'since: datetime' param scopes the Phase 1 + Phase 2 SQL to created_at >= since.
- The Phase 2 checkpoint is deleted on a scoped run (prevents skipping new atomics whose control_id sorts alphabetically below the stale last_id).
- 6-13x faster for late-arriving documents (19k instead of 172k atomics).
- Docs: control-pipeline/docs/incremental-dedup.md.

New scripts:
- gpre1_object_groups_incremental.py: appends new objects to object_groups via bge-m3 nearest-neighbor (threshold default 0.85; 0.78 is recommended for broader synonym matching). Pure INSERT/UPDATE, no DELETE.
- gpre2_master_controls_incremental.py: non-destructive master-controls update. Existing MCs are left untouched (UUIDs + master_control_id are preserved); only new members are appended, plus new MCs for object groups that now reach min-phases.
- ingest_enisa_cra.py: ingestion of the 8 CRA-relevant ENISA documents (Standards Mapping, EUCC-Implementation, NIS2 TIG, SRP FAQ, EUCC Eval Methodology, CVD Policies, Threat Landscape 2025). chunk_strategy=legal, requirement_strength=guidance|consultation_draft|evidentiary.

Source data: legal-sources/enisa/enisa_cra_single_reporting_platform_faq.html (PDFs are excluded via .gitignore).

Result of this pipeline iteration:
- 1,296 new CRA controls + 19,652 atomic children
- +362 new master controls, 10,017 existing ones extended
- Total: 13,950 MCs, 620 CRA MCs (previously 566), 1,304 CRA atomics (previously 841)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
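The checkpoint reset described in the commit message can be illustrated with a small in-memory sketch. It is not the runner's actual code: the real implementation works through SQL, and the names `Atomic`, `select_for_phase2`, and the sample control IDs are hypothetical. It only assumes that Phase 2 resumes from a `last_id` checkpoint in `control_id` order, as the message implies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Atomic:
    # Hypothetical stand-in for one atomic control row.
    control_id: str
    created_at: datetime


def select_for_phase2(
    rows: List[Atomic],
    since: Optional[datetime] = None,
    last_id: Optional[str] = None,
) -> List[str]:
    """Return the control_ids a Phase 2 pass would visit, in control_id order."""
    selected = []
    for row in sorted(rows, key=lambda r: r.control_id):
        # Scoped run: ignore everything created before `since`.
        if since is not None and row.created_at < since:
            continue
        # Checkpoint resume: ignore everything at or below the last processed id.
        if last_id is not None and row.control_id <= last_id:
            continue
        selected.append(row.control_id)
    return selected


rows = [
    Atomic("AUTH-001", datetime(2025, 6, 1)),  # late-arriving document
    Atomic("NET-010", datetime(2025, 1, 1)),
    Atomic("SEC-200", datetime(2025, 1, 1)),
]
since = datetime(2025, 5, 1)

# A stale checkpoint from the previous full run hides the new atomic.
stale = select_for_phase2(rows, since=since, last_id="SEC-200")
# Clearing the checkpoint for the scoped run makes it visible again.
fresh = select_for_phase2(rows, since=since, last_id=None)
print(stale, fresh)
```

In this sketch the new atomic `AUTH-001` sorts below the stale checkpoint `SEC-200`, so it is only picked up once the checkpoint is cleared, which is the situation the commit message describes.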
@@ -1553,6 +1553,7 @@ async def get_repair_backfill_status(backfill_id: str):
 class BatchDedupRequest(BaseModel):
     dry_run: bool = True
     hint_filter: Optional[str] = None  # Only process groups matching this hint prefix
+    since: Optional[str] = None  # ISO datetime — scope to controls created at/after this


 _batch_dedup_status: dict = {}
@@ -1567,7 +1568,15 @@ async def _run_batch_dedup(req: BatchDedupRequest, dedup_id: str):
     runner = BatchDedupRunner(db)
     _batch_dedup_status[dedup_id] = {"status": "running", "phase": "starting"}

-    stats = await runner.run(dry_run=req.dry_run, hint_filter=req.hint_filter)
+    since_dt = None
+    if req.since:
+        from datetime import datetime
+        since_dt = datetime.fromisoformat(req.since.replace("Z", "+00:00"))
+    stats = await runner.run(
+        dry_run=req.dry_run,
+        hint_filter=req.hint_filter,
+        since=since_dt,
+    )

     _batch_dedup_status[dedup_id] = {
         "status": "completed",
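The `replace("Z", "+00:00")` step in the diff is worth a note: `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so the substitution keeps the endpoint working on older interpreters. A minimal standalone sketch of the same parsing follows; the helper name `parse_since` is hypothetical and not part of the commit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_since(value: Optional[str]) -> Optional[datetime]:
    """Parse an optional ISO-8601 string the way the diff does.

    Returns None when no value is given, so the caller can run unscoped.
    """
    if not value:
        return None
    # Map the UTC designator "Z" to an explicit offset for Python < 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


utc_value = parse_since("2025-05-01T00:00:00Z")
offset_value = parse_since("2025-05-01T00:00:00+02:00")
naive_value = parse_since("2025-05-01")
print(utc_value, offset_value, naive_value)
```

A string without an offset yields a naive datetime, so if `created_at` is stored timezone-aware, callers would probably want to pass an explicit offset or `Z` in `since`.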