Benjamin_Boenisch/breakpilot-compliance

Benjamin Admin 643b26618f

feat: Control Library UI, dedup migration, QA tooling, docs

- Control Library: parent control display, ObligationTypeBadge,
  GenerationStrategyBadge variants, evidence string fallback
- API: expose parent_control_uuid/id/title in canonical controls
- Fix: DSFA SQLAlchemy 2.0 Row._mapping compatibility
- Migration 074: control_parent_links + control_dedup_reviews tables
- QA scripts: benchmark, gap analysis, OSCAL import, OWASP cleanup,
  phase5 normalize, phase74 gap fill, sync_db, run_job
- Docs: dedup engine, RAG benchmark, lessons learned, pipeline docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-21 11:56:08 +01:00


Deduplication Engine (Control Dedup)

Four-stage dedup pipeline that prevents duplicate atomic controls during Pass 0b composition. Core USP: "1 control satisfies 5 laws" through multi-parent linking.

Backend: backend-compliance/compliance/services/control_dedup.py
Migration: backend-compliance/migrations/074_control_dedup.sql
Tests: backend-compliance/tests/test_control_dedup.py (56 tests)

Motivation

Roughly 6,800 technical controls x ~10 obligations per control yield ~68,000 atomic candidates. Target: ~18,000 unique master controls. Many obligations from different laws lead to the same technical control (e.g. "implement MFA" in GDPR, NIS2, and the AI Act).

Problem: embedding-only deduplication is DANGEROUS for compliance.

!!! danger "False-positive example"
    - "Admin access must use MFA" vs. "Remote access must use MFA"
    - The embedding scores them as > 0.9 similar
    - But they are TWO different controls (different objects!)

Four-Stage Decision Tree

flowchart TD
    A[Candidate control] --> B{Pattern gate}
    B -->|pattern_id differs| N1[NEW CONTROL]
    B -->|pattern_id matches| C{Action check}
    C -->|action differs| N2[NEW CONTROL]
    C -->|action matches| D{Object normalization}
    D -->|object differs| E{Similarity > 0.95?}
    E -->|Yes| L1[LINK]
    E -->|No| N3[NEW CONTROL]
    D -->|object matches| F{Tiered thresholds}
    F -->|> 0.92| L2[LINK]
    F -->|0.85 - 0.92| R[REVIEW QUEUE]
    F -->|< 0.85| N4[NEW CONTROL]

Stage 1: Pattern Gate (hard)

The pattern_id must match. This gate prevents ~80% of false positives.

if pattern_id != existing.pattern_id:
    verdict = "NEW"  # Different control patterns = different controls

Stage 2: Action Check (hard)

The normalized action verbs must match. "Implementieren" (implement) vs. "testen" (test) means different controls, even when the object is the same.

if normalize_action("implementieren") != normalize_action("testen"):
    verdict = "NEW"  # "implement" != "test"

Action normalization (German → English):

| German verbs | Canonical form |
|---|---|
| implementieren, umsetzen, einrichten, aktivieren | `implement` |
| testen, pruefen, ueberpruefen, verifizieren | `test` |
| ueberwachen, monitoring, beobachten | `monitor` |
| verschluesseln | `encrypt` |
| protokollieren, aufzeichnen, loggen | `log` |
| beschraenken, einschraenken, begrenzen | `restrict` |
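
The verb table can be read as a plain lookup from German verb to canonical form. The following is a minimal sketch under that assumption; the actual `normalize_action` in `control_dedup.py` may work differently, and the fallback for unknown verbs (returning the lowercased input) is an assumption made here for illustration.

```python
# Hypothetical lookup table built from the documented verb groups.
_ACTION_GROUPS = {
    "implement": ["implementieren", "umsetzen", "einrichten", "aktivieren"],
    "test": ["testen", "pruefen", "ueberpruefen", "verifizieren"],
    "monitor": ["ueberwachen", "monitoring", "beobachten"],
    "encrypt": ["verschluesseln"],
    "log": ["protokollieren", "aufzeichnen", "loggen"],
    "restrict": ["beschraenken", "einschraenken", "begrenzen"],
}

# Invert the groups into a flat verb -> canonical form mapping.
_ACTION_MAP = {
    verb: canonical
    for canonical, verbs in _ACTION_GROUPS.items()
    for verb in verbs
}


def normalize_action(verb: str) -> str:
    """Map a German action verb to its canonical English form.

    Assumption: unknown verbs fall back to the stripped, lowercased input.
    """
    key = verb.strip().lower()
    return _ACTION_MAP.get(key, key)
```

With this sketch, `normalize_action("umsetzen")` and `normalize_action("einrichten")` both return `"implement"`, so the two verbs pass the Stage 2 check against each other, while `"testen"` does not.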

Stage 3: Object Normalization (soft)

Compliance objects are normalized to canonical tokens.

normalize_object("Admin-Konten")    # → "privileged_access"
normalize_object("Remote-Zugriff")  # → "remote_access"
normalize_object("MFA")             # → "multi_factor_auth"

When the objects differ, a higher threshold applies (0.95 instead of 0.92).

Object normalization:

| Input | Canonical token |
|---|---|
| MFA, 2FA, Multi-Faktor-Authentifizierung | `multi_factor_auth` |
| Admin-Konten, privilegierte Zugriffe | `privileged_access` |
| Verschluesselung, Kryptografie | `encryption` |
| Schluessel, Key Management | `key_management` |
| TLS, SSL, HTTPS | `transport_encryption` |
| Firewall | `firewall` |
| Audit-Log, Protokoll, Logging | `audit_logging` |
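
Because Stage 3 is soft, its result does not reject a candidate on its own; it only decides which threshold Stage 4 uses. A minimal sketch of that idea, assuming exact, case-insensitive alias matching; the alias table, the slug fallback for unknown objects, and the helper `same_object` are hypothetical and not taken from `control_dedup.py`.

```python
# Hypothetical alias table; entries come from the documented table and examples.
_OBJECT_ALIASES = {
    "multi_factor_auth": ["mfa", "2fa", "multi-faktor-authentifizierung"],
    "privileged_access": ["admin-konten", "privilegierte zugriffe"],
    "remote_access": ["remote-zugriff"],
    "encryption": ["verschluesselung", "kryptografie"],
    "key_management": ["schluessel", "key management"],
    "transport_encryption": ["tls", "ssl", "https"],
    "firewall": ["firewall"],
    "audit_logging": ["audit-log", "protokoll", "logging"],
}

_OBJECT_MAP = {
    alias: token
    for token, aliases in _OBJECT_ALIASES.items()
    for alias in aliases
}


def normalize_object(text: str) -> str:
    """Return the canonical token for a compliance object.

    Assumption: unknown objects become a lowercased snake_case slug.
    """
    key = text.strip().lower()
    if key in _OBJECT_MAP:
        return _OBJECT_MAP[key]
    return key.replace("-", "_").replace(" ", "_")


def same_object(a: str, b: str) -> bool:
    """True if both inputs normalize to the same canonical token."""
    return normalize_object(a) == normalize_object(b)
```

In this sketch, "Admin-Konten" and "Remote-Zugriff" normalize to different tokens, so the pair from the false-positive example would be compared against the stricter 0.95 threshold.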

Stage 4: Embedding Similarity (Qdrant)

Tiered thresholds based on cosine similarity:

| Score | Verdict | Action |
|---|---|---|
| > 0.95 | LINK | Objects differ: add parent link (below this score: NEW) |
| > 0.92 | LINK | Objects match: add parent link |
| 0.85 - 0.92 | REVIEW | Objects match: send to review queue for manual review |
| < 0.85 | NEW | Create new control |
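
The tiers combine with the Stage 3 result as sketched below. The threshold values are the documented defaults; the function name `verdict_for_score` is hypothetical, and the boundary handling (exactly 0.92 and exactly 0.85 go to REVIEW) is an assumption, since the source does not state which side the boundaries fall on.

```python
LINK_THRESHOLD = 0.92           # objects match: link above this score
REVIEW_THRESHOLD = 0.85         # objects match: review from this score upward
LINK_THRESHOLD_DIFF_OBJ = 0.95  # objects differ: link only above this score


def verdict_for_score(score: float, same_object: bool) -> str:
    """Return "LINK", "REVIEW", or "NEW" for a cosine similarity score."""
    if not same_object:
        # Different objects never reach the review queue in the decision tree.
        return "LINK" if score > LINK_THRESHOLD_DIFF_OBJ else "NEW"
    if score > LINK_THRESHOLD:
        return "LINK"
    if score >= REVIEW_THRESHOLD:  # assumption: boundaries are inclusive
        return "REVIEW"
    return "NEW"
```

A score of 0.93 therefore links when the objects match but creates a new control when they differ, which is the behavior the false-positive example calls for.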

Canonicalization Layer

Before embedding, the German compliance text is transformed into normalized English:

"Administratoren muessen MFA verwenden"
→ "implement multi_factor_auth for administratoren verwenden"
→ Better matches, less embedding noise

This reduces the noise caused by synonymous wording across different laws.

Multi-Parent Linking (M:N)

An atomic control can have multiple parent controls from different regulations:

{
  "control_id": "AUTH-1072-A01",
  "parent_links": [
    {"parent_control_id": "AUTH-1001", "source": "NIST IA-02(01)", "link_type": "decomposition"},
    {"parent_control_id": "NIS2-045", "source": "NIS2 Art. 21", "link_type": "dedup_merge"}
  ]
}

Database Schema

-- Migration 074: control_parent_links (M:N)
CREATE TABLE control_parent_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    control_uuid UUID NOT NULL REFERENCES canonical_controls(id),
    parent_control_uuid UUID NOT NULL REFERENCES canonical_controls(id),
    link_type VARCHAR(30) NOT NULL DEFAULT 'decomposition',
    confidence NUMERIC(3,2) DEFAULT 1.0,
    source_regulation VARCHAR(100),
    source_article VARCHAR(100),
    obligation_candidate_id UUID REFERENCES obligation_candidates(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    CONSTRAINT uq_parent_link UNIQUE (control_uuid, parent_control_uuid)
);
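
The `uq_parent_link` constraint means a given (control, parent) pair can be stored only once, so linking can be made idempotent. The sketch below demonstrates this with an in-memory SQLite database and a reduced version of the table; SQLite, the trimmed column set, and the helper `add_parent_link` are assumptions for illustration, not the project's actual persistence code.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE control_parent_links (
        id TEXT PRIMARY KEY,
        control_uuid TEXT NOT NULL,
        parent_control_uuid TEXT NOT NULL,
        link_type TEXT NOT NULL DEFAULT 'decomposition',
        UNIQUE (control_uuid, parent_control_uuid)
    )
    """
)


def add_parent_link(control_uuid: str, parent_uuid: str, link_type: str) -> bool:
    """Insert a link; return False if the (control, parent) pair already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO control_parent_links "
        "(id, control_uuid, parent_control_uuid, link_type) VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), control_uuid, parent_uuid, link_type),
    )
    return cur.rowcount == 1


control = str(uuid.uuid4())
parent_a = str(uuid.uuid4())
parent_b = str(uuid.uuid4())

first = add_parent_link(control, parent_a, "decomposition")
second = add_parent_link(control, parent_b, "dedup_merge")
repeat = add_parent_link(control, parent_b, "dedup_merge")  # duplicate pair, ignored
link_count = conn.execute("SELECT COUNT(*) FROM control_parent_links").fetchone()[0]
```

On PostgreSQL the comparable statement would be an `INSERT ... ON CONFLICT ON CONSTRAINT uq_parent_link DO NOTHING`.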

Link types:

| Type | Meaning |
|---|---|
| `decomposition` | Created by the Pass 0b decomposition |
| `dedup_merge` | Detected as a duplicate by the dedup engine |
| `manual` | Linked manually by a reviewer |
| `crosswalk` | Taken from the crosswalk matrix |

Review Queue

Borderline matches (similarity 0.85-0.92) are written to the review queue:

-- Migration 074: control_dedup_reviews
CREATE TABLE control_dedup_reviews (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    candidate_control_id VARCHAR(30) NOT NULL,
    candidate_title TEXT NOT NULL,
    candidate_objective TEXT,
    matched_control_uuid UUID REFERENCES canonical_controls(id),
    matched_control_id VARCHAR(30),
    similarity_score NUMERIC(4,3),
    dedup_stage VARCHAR(40) NOT NULL,
    review_status VARCHAR(20) DEFAULT 'pending',
    -- pending → accepted_link | accepted_new | rejected
    created_at TIMESTAMPTZ DEFAULT NOW()
);
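
The comment on `review_status` describes a small state machine: an entry starts as `pending` and is resolved to one of three outcomes. A minimal sketch of a guard for those transitions, assuming the three outcomes are final; the function name `resolve_review` and the finality of resolved states are assumptions, not documented behavior.

```python
# Allowed transitions, taken from the schema comment.
# Assumption: resolved states are terminal.
ALLOWED_TRANSITIONS = {
    "pending": {"accepted_link", "accepted_new", "rejected"},
    "accepted_link": set(),
    "accepted_new": set(),
    "rejected": set(),
}


def resolve_review(current: str, new: str) -> str:
    """Return the new status, or raise ValueError for an invalid transition."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid review transition: {current!r} -> {new!r}")
    return new
```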

Qdrant Collection

Collection:  atomic_controls
Dimension:   1024 (bge-m3)
Distance:    COSINE
Payload:     pattern_id, action_normalized, object_normalized, control_id, canonical_text
Index:       pattern_id (keyword), action_normalized (keyword), object_normalized (keyword)
Query:       ALWAYS with filter: pattern_id == X (drastically narrows the search)
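
To illustrate the filtered query, the sketch below builds a request body in the shape accepted by Qdrant's REST search endpoint (`POST /collections/atomic_controls/points/search`). It only constructs the JSON and sends nothing; the function name, the default `limit`, and the pattern ID in the usage lines are hypothetical, and the engine may well use a client library instead.

```python
import json

EMBEDDING_DIM = 1024  # bge-m3, as documented for the collection


def build_search_request(vector, pattern_id: str, limit: int = 5) -> dict:
    """Build a search body that always filters on pattern_id."""
    vector = list(vector)
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return {
        "vector": vector,
        "filter": {
            "must": [{"key": "pattern_id", "match": {"value": pattern_id}}]
        },
        "limit": limit,
        "with_payload": True,  # return the payload fields with each hit
    }


# Hypothetical pattern ID and a dummy zero vector, for illustration only.
request_body = build_search_request([0.0] * EMBEDDING_DIM, "example-pattern")
request_json = json.dumps(request_body)
```

Filtering on an indexed keyword field such as `pattern_id` restricts the vector search to points that already passed the Stage 1 gate.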

Integration into Pass 0b

The dedup engine is optionally integrated into DecompositionPass:

decomp = DecompositionPass(db=session, dedup_enabled=True)
stats = await decomp.run_pass0b(limit=100, use_anthropic=True)

# Stats include dedup metrics:
# stats["dedup_linked"] = 15   (duplicates → parent link)
# stats["dedup_review"] = 3    (borderline → review queue)
# stats["controls_created"] = 82  (new controls)

Pass 0b flow with dedup:

1. The LLM generates an atomic control
2. The dedup engine runs the 4 stages
    - LINK: no new control; a parent link to the existing control is added
    - REVIEW: no new control; an entry is written to the review queue
    - NEW: the control is created and indexed in Qdrant
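
The flow above can be sketched as a loop that maps each verdict to one of the documented stats keys. This is an illustration only: `run_dedup_flow` and the `decide` callback are hypothetical, and the side effects (writing the parent link, the review entry, or the Qdrant point) are left out.

```python
from collections import Counter
from typing import Callable, Iterable

# Verdict -> stats key, using the key names shown in the stats example.
_STATS_KEY = {
    "LINK": "dedup_linked",
    "REVIEW": "dedup_review",
    "NEW": "controls_created",
}


def run_dedup_flow(candidates: Iterable[dict], decide: Callable[[dict], str]) -> dict:
    """Count verdicts per candidate; persistence steps are omitted in this sketch."""
    stats = Counter(dedup_linked=0, dedup_review=0, controls_created=0)
    for candidate in candidates:
        verdict = decide(candidate)  # expected: "LINK", "REVIEW", or "NEW"
        stats[_STATS_KEY[verdict]] += 1
    return dict(stats)


def demo_decide(candidate: dict) -> str:
    """Toy decision based only on a precomputed score (same-object case)."""
    score = candidate["score"]
    if score > 0.92:
        return "LINK"
    if score >= 0.85:
        return "REVIEW"
    return "NEW"


demo_stats = run_dedup_flow(
    [{"score": 0.97}, {"score": 0.88}, {"score": 0.40}, {"score": 0.10}],
    demo_decide,
)
```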

Configuration

| Environment variable | Default | Description |
|---|---|---|
| `DEDUP_ENABLED` | `true` | Enable or disable the dedup engine |
| `DEDUP_LINK_THRESHOLD` | `0.92` | Threshold for automatic linking |
| `DEDUP_REVIEW_THRESHOLD` | `0.85` | Threshold for the review queue |
| `DEDUP_LINK_THRESHOLD_DIFF_OBJ` | `0.95` | Threshold when objects differ |
| `DEDUP_QDRANT_COLLECTION` | `atomic_controls` | Qdrant collection for the dedup index |
| `QDRANT_URL` | `http://host.docker.internal:6333` | Qdrant URL |
| `EMBEDDING_URL` | `http://embedding-service:8087` | Embedding service URL |
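
A minimal sketch of reading these variables with their documented defaults. The `DedupConfig` dataclass, the `load_dedup_config` helper, and the set of strings treated as "true" are assumptions for illustration; the engine's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DedupConfig:
    enabled: bool
    link_threshold: float
    review_threshold: float
    link_threshold_diff_obj: float
    qdrant_collection: str
    qdrant_url: str
    embedding_url: str


def load_dedup_config(env: Mapping[str, str] = os.environ) -> DedupConfig:
    """Read dedup settings from a mapping, falling back to documented defaults."""
    # Assumption: "1", "true", and "yes" (any case) count as enabled.
    enabled = env.get("DEDUP_ENABLED", "true").strip().lower() in {"1", "true", "yes"}
    return DedupConfig(
        enabled=enabled,
        link_threshold=float(env.get("DEDUP_LINK_THRESHOLD", "0.92")),
        review_threshold=float(env.get("DEDUP_REVIEW_THRESHOLD", "0.85")),
        link_threshold_diff_obj=float(env.get("DEDUP_LINK_THRESHOLD_DIFF_OBJ", "0.95")),
        qdrant_collection=env.get("DEDUP_QDRANT_COLLECTION", "atomic_controls"),
        qdrant_url=env.get("QDRANT_URL", "http://host.docker.internal:6333"),
        embedding_url=env.get("EMBEDDING_URL", "http://embedding-service:8087"),
    )
```

Passing an explicit mapping, for example `load_dedup_config({"DEDUP_LINK_THRESHOLD": "0.94"})`, makes the function easy to exercise in tests without touching the process environment.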

Source Files

| File | Description |
|---|---|
| `compliance/services/control_dedup.py` | Four-stage dedup engine |
| `compliance/services/decomposition_pass.py` | Pass 0a/0b with dedup integration |
| `migrations/074_control_dedup.sql` | DB schema (parent_links, review_queue) |
| `tests/test_control_dedup.py` | 56 unit tests |
