feat: Control Library UI, dedup migration, QA tooling, docs
Some checks failed
CI/CD / go-lint (push) Has been skipped
CI/CD / python-lint (push) Has been skipped
CI/CD / nodejs-lint (push) Has been skipped
CI/CD / test-go-ai-compliance (push) Failing after 31s
CI/CD / test-python-backend-compliance (push) Successful in 1m35s
CI/CD / test-python-document-crawler (push) Successful in 20s
CI/CD / test-python-dsms-gateway (push) Successful in 17s
CI/CD / validate-canonical-controls (push) Successful in 10s
CI/CD / Deploy (push) Has been skipped
- Control Library: parent control display, ObligationTypeBadge, GenerationStrategyBadge variants, evidence string fallback
- API: expose parent_control_uuid/id/title in canonical controls
- Fix: DSFA SQLAlchemy 2.0 Row._mapping compatibility
- Migration 074: control_parent_links + control_dedup_reviews tables
- QA scripts: benchmark, gap analysis, OSCAL import, OWASP cleanup, phase5 normalize, phase74 gap fill, sync_db, run_job
- Docs: dedup engine, RAG benchmark, lessons learned, pipeline docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
253
docs-src/services/sdk-modules/dedup-engine.md
Normal file
@@ -0,0 +1,253 @@

# Deduplication Engine (Control Dedup)

Four-stage dedup pipeline that prevents duplicate atomic controls during Pass 0b composition. Core USP: **"1 control satisfies 5 laws"** through multi-parent linking.

**Backend:** `backend-compliance/compliance/services/control_dedup.py`
**Migration:** `backend-compliance/migrations/074_control_dedup.sql`
**Tests:** `backend-compliance/tests/test_control_dedup.py` (56 tests)

---

## Motivation

Roughly 6,800 technical controls × ~10 obligations per control yield ~68,000 atomic candidates. Target: ~18,000 unique master controls. Many obligations from different laws lead to the same technical control (e.g. "implement MFA" in GDPR, NIS2, and the AI Act).

**Problem:** Embedding-only deduplication is DANGEROUS for compliance.

!!! danger "False-positive example"
    - "Admin access must use MFA" vs. "Remote access must use MFA"
    - The embedding reports a similarity > 0.9
    - But these are **TWO different controls** (different objects!)

---

## 4-Stage Decision Tree

```mermaid
flowchart TD
    A[Candidate control] --> B{Pattern gate}
    B -->|pattern_id differs| N1[NEW CONTROL]
    B -->|pattern_id matches| C{Action check}
    C -->|Action differs| N2[NEW CONTROL]
    C -->|Action matches| D{Object normalization}
    D -->|Object differs| E{Similarity > 0.95?}
    E -->|Yes| L1[LINK]
    E -->|No| N3[NEW CONTROL]
    D -->|Object matches| F{Tiered thresholds}
    F -->|> 0.92| L2[LINK]
    F -->|0.85 - 0.92| R[REVIEW QUEUE]
    F -->|< 0.85| N4[NEW CONTROL]
```

### Stage 1: Pattern Gate (hard)

`pattern_id` must match. This prevents ~80% of false positives.

```python
if pattern_id != existing.pattern_id:
    verdict = "NEW CONTROL"  # different control patterns = different controls
```

### Stage 2: Action Check (hard)

Normalized action verbs must match. "Implementieren" (implement) vs. "Testen" (test) are different controls, even when the object is the same.

```python
if normalize_action("implementieren") != normalize_action("testen"):
    verdict = "NEW CONTROL"  # "implement" != "test"
```

**Action normalization (German → English):**

| German verbs | Canonical form |
|----------------|-----------------|
| implementieren, umsetzen, einrichten, aktivieren | `implement` |
| testen, pruefen, ueberpruefen, verifizieren | `test` |
| ueberwachen, monitoring, beobachten | `monitor` |
| verschluesseln | `encrypt` |
| protokollieren, aufzeichnen, loggen | `log` |
| beschraenken, einschraenken, begrenzen | `restrict` |

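
The action table can be read as a plain synonym lookup. The following is a minimal sketch under that assumption: the dictionary layout, the lowercase/strip cleanup, and the fallback for unknown verbs are illustrative choices, not the behavior of the real `normalize_action` in `control_dedup.py`. `same_action` is a hypothetical helper.

```python
# Hypothetical lookup table copied from the action-normalization table.
ACTION_SYNONYMS = {
    "implement": ["implementieren", "umsetzen", "einrichten", "aktivieren"],
    "test": ["testen", "pruefen", "ueberpruefen", "verifizieren"],
    "monitor": ["ueberwachen", "monitoring", "beobachten"],
    "encrypt": ["verschluesseln"],
    "log": ["protokollieren", "aufzeichnen", "loggen"],
    "restrict": ["beschraenken", "einschraenken", "begrenzen"],
}

# Invert the table once: German verb -> canonical English form.
VERB_TO_CANONICAL = {
    verb: canonical
    for canonical, verbs in ACTION_SYNONYMS.items()
    for verb in verbs
}


def normalize_action(verb: str) -> str:
    """Return the canonical action, or the cleaned verb if it is unknown (assumed fallback)."""
    cleaned = verb.strip().lower()
    return VERB_TO_CANONICAL.get(cleaned, cleaned)


def same_action(a: str, b: str) -> bool:
    """Stage 2 check: two verbs pass only if their canonical forms are equal."""
    return normalize_action(a) == normalize_action(b)
```

With this table, `same_action("umsetzen", "aktivieren")` holds because both map to `implement`, while `same_action("implementieren", "testen")` does not.
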
### Stage 3: Object Normalization (soft)

Compliance objects are normalized to canonical tokens.

```python
assert normalize_object("Admin-Konten") == "privileged_access"
assert normalize_object("Remote-Zugriff") == "remote_access"
assert normalize_object("MFA") == "multi_factor_auth"
```

If the objects differ, a higher threshold applies (0.95 instead of 0.92).

**Object normalization:**

| Input | Canonical token |
|---------|------------------|
| MFA, 2FA, Multi-Faktor-Authentifizierung | `multi_factor_auth` |
| Admin-Konten, privilegierte Zugriffe | `privileged_access` |
| Verschluesselung, Kryptografie | `encryption` |
| Schluessel, Key Management | `key_management` |
| TLS, SSL, HTTPS | `transport_encryption` |
| Firewall | `firewall` |
| Audit-Log, Protokoll, Logging | `audit_logging` |

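
One way to picture the object table is as a keyword match over the lowercased input. The sketch below assumes substring matching with the longest keyword tried first, so that "Verschluesselung" is not captured by its substring "Schluessel". The keyword table, the matching strategy, and the slug fallback are assumptions for illustration; the real `normalize_object` may work differently.

```python
import re

# Hypothetical keyword table derived from the object-normalization table
# and the normalize_object examples; the engine's real table may differ.
OBJECT_KEYWORDS = {
    "multi_factor_auth": ["mfa", "2fa", "multi-faktor-authentifizierung"],
    "privileged_access": ["admin-konten", "privilegierte zugriffe"],
    "remote_access": ["remote-zugriff"],
    "encryption": ["verschluesselung", "kryptografie"],
    "key_management": ["schluessel", "key management"],
    "transport_encryption": ["tls", "ssl", "https"],
    "firewall": ["firewall"],
    "audit_logging": ["audit-log", "protokoll", "logging"],
}

# Longest keywords first, so "verschluesselung" wins over "schluessel".
_KEYWORDS_BY_LENGTH = sorted(
    ((kw, token) for token, kws in OBJECT_KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def normalize_object(text: str) -> str:
    """Map a compliance object to a canonical token (assumed substring matching)."""
    cleaned = text.strip().lower()
    for keyword, token in _KEYWORDS_BY_LENGTH:
        if keyword in cleaned:
            return token
    # Assumed fallback: turn unknown objects into a snake_case slug.
    return re.sub(r"[^a-z0-9]+", "_", cleaned).strip("_")
```

The ordering matters for compliance: merging "Schluessel" (key management) with "Verschluesselung" (encryption) would be exactly the kind of false positive the pipeline is meant to avoid.
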
### Stage 4: Embedding Similarity (Qdrant)

Tiered thresholds based on cosine similarity:

| Score | Verdict | Action |
|-------|---------|--------|
| > 0.95 | **LINK** | Objects differ: add parent link |
| > 0.92 | **LINK** | Objects match: add parent link |
| 0.85 - 0.92 | **REVIEW** | Objects match: send to review queue for manual check |
| < 0.85 | **NEW** | Create new control |

The last three rows apply when the normalized objects match. When the objects differ, only the 0.95 row applies, and any score at or below it results in a new control, as shown in the decision tree.

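
Taken together, the four stages reduce to a short decision function. The sketch below follows the decision tree and the threshold table; `ControlKey`, `Verdict`, and `decide` are hypothetical names, and the handling of scores exactly on a boundary (0.92 and 0.85 count as REVIEW) is an assumption read off the table ranges.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    NEW = "new"
    LINK = "link"
    REVIEW = "review"


@dataclass(frozen=True)
class ControlKey:
    """Hypothetical container for the already-normalized dedup fields."""
    pattern_id: str
    action: str
    obj: str


LINK_THRESHOLD = 0.92           # same object: link above this score
REVIEW_THRESHOLD = 0.85         # same object: review from this score upward
LINK_THRESHOLD_DIFF_OBJ = 0.95  # different object: link only above this score


def decide(candidate: ControlKey, existing: ControlKey, similarity: float) -> Verdict:
    # Stage 1: pattern gate (hard).
    if candidate.pattern_id != existing.pattern_id:
        return Verdict.NEW
    # Stage 2: action check (hard).
    if candidate.action != existing.action:
        return Verdict.NEW
    # Stage 3: differing objects only link above the stricter threshold.
    if candidate.obj != existing.obj:
        return Verdict.LINK if similarity > LINK_THRESHOLD_DIFF_OBJ else Verdict.NEW
    # Stage 4: tiered thresholds for matching objects.
    if similarity > LINK_THRESHOLD:
        return Verdict.LINK
    if similarity >= REVIEW_THRESHOLD:
        return Verdict.REVIEW
    return Verdict.NEW
```
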

---

## Canonicalization Layer

Before embedding, the German compliance text is transformed into normalized English:

```
"Administratoren muessen MFA verwenden"
→ "implement multi_factor_auth for administratoren verwenden"
→ Better matches, less embedding noise
```

This reduces the noise caused by synonymous wording across different laws.

---

## Multi-Parent Linking (M:N)

An atomic control can have multiple parent controls from different regulations:

```json
{
  "control_id": "AUTH-1072-A01",
  "parent_links": [
    {"parent_control_id": "AUTH-1001", "source": "NIST IA-02(01)", "link_type": "decomposition"},
    {"parent_control_id": "NIS2-045", "source": "NIS2 Art. 21", "link_type": "dedup_merge"}
  ]
}
```

### Database Schema

```sql
-- Migration 074: control_parent_links (M:N)
CREATE TABLE control_parent_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    control_uuid UUID NOT NULL REFERENCES canonical_controls(id),
    parent_control_uuid UUID NOT NULL REFERENCES canonical_controls(id),
    link_type VARCHAR(30) NOT NULL DEFAULT 'decomposition',
    confidence NUMERIC(3,2) DEFAULT 1.0,
    source_regulation VARCHAR(100),
    source_article VARCHAR(100),
    obligation_candidate_id UUID REFERENCES obligation_candidates(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    CONSTRAINT uq_parent_link UNIQUE (control_uuid, parent_control_uuid)
);
```

**Link types:**

| Type | Meaning |
|-----|-----------|
| `decomposition` | Created by the Pass 0b decomposition |
| `dedup_merge` | Identified as a duplicate by the dedup engine |
| `manual` | Linked manually by a reviewer |
| `crosswalk` | Taken from the crosswalk matrix |

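
The `uq_parent_link` constraint means a repeated LINK verdict for the same control/parent pair should not create a second row. The sketch below illustrates that with an in-memory SQLite database and a simplified copy of the schema (TEXT instead of UUID, no foreign keys); `add_parent_link` is a hypothetical helper, not the engine's API. The `ON CONFLICT ... DO NOTHING` clause is supported by SQLite 3.24+ and by PostgreSQL, so the same idea should carry over to the real table.

```python
import sqlite3
import uuid

# Simplified stand-in for migration 074 (assumption: SQLite types, no foreign keys).
SCHEMA = """
CREATE TABLE control_parent_links (
    id TEXT PRIMARY KEY,
    control_uuid TEXT NOT NULL,
    parent_control_uuid TEXT NOT NULL,
    link_type TEXT NOT NULL DEFAULT 'decomposition',
    confidence REAL DEFAULT 1.0,
    source_regulation TEXT,
    CONSTRAINT uq_parent_link UNIQUE (control_uuid, parent_control_uuid)
)
"""

INSERT_LINK = (
    "INSERT INTO control_parent_links "
    "(id, control_uuid, parent_control_uuid, link_type, confidence, source_regulation) "
    "VALUES (?, ?, ?, ?, ?, ?) "
    "ON CONFLICT (control_uuid, parent_control_uuid) DO NOTHING"
)


def add_parent_link(
    conn: sqlite3.Connection,
    control_uuid: str,
    parent_uuid: str,
    link_type: str,
    source_regulation: str,
    confidence: float = 1.0,
) -> bool:
    """Insert a parent link; return False if the pair was already linked."""
    cur = conn.execute(
        INSERT_LINK,
        (str(uuid.uuid4()), control_uuid, parent_uuid, link_type, confidence, source_regulation),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
add_parent_link(conn, "AUTH-1072-A01", "AUTH-1001", "decomposition", "NIST")
add_parent_link(conn, "AUTH-1072-A01", "NIS2-045", "dedup_merge", "NIS2")
# A second attempt for an existing pair is a no-op instead of an error.
add_parent_link(conn, "AUTH-1072-A01", "AUTH-1001", "dedup_merge", "NIST")
```
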

---

## Review Queue

Borderline matches (similarity 0.85-0.92) are written to the review queue:

```sql
-- Migration 074: control_dedup_reviews
CREATE TABLE control_dedup_reviews (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    candidate_control_id VARCHAR(30) NOT NULL,
    candidate_title TEXT NOT NULL,
    candidate_objective TEXT,
    matched_control_uuid UUID REFERENCES canonical_controls(id),
    matched_control_id VARCHAR(30),
    similarity_score NUMERIC(4,3),
    dedup_stage VARCHAR(40) NOT NULL,
    review_status VARCHAR(20) DEFAULT 'pending',
    -- pending → accepted_link | accepted_new | rejected
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

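
The `review_status` comment in the schema describes a small state machine. A minimal sketch of how a reviewer action could be validated follows. It assumes that the three outcome states are final, which the schema comment suggests but does not state; `ALLOWED_TRANSITIONS` and `transition` are hypothetical names.

```python
# Transitions taken from the schema comment:
# pending -> accepted_link | accepted_new | rejected
# Assumption: the three outcome states are terminal.
ALLOWED_TRANSITIONS = {
    "pending": {"accepted_link", "accepted_new", "rejected"},
    "accepted_link": set(),
    "accepted_new": set(),
    "rejected": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new review_status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal review transition: {current} -> {target}")
    return target
```
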

---

## Qdrant Collection

```
Collection: atomic_controls
Dimension:  1024 (bge-m3)
Distance:   COSINE
Payload:    pattern_id, action_normalized, object_normalized, control_id, canonical_text
Index:      pattern_id (keyword), action_normalized (keyword), object_normalized (keyword)
Query:      ALWAYS with filter: pattern_id == X (drastically narrows the search)
```

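
To make the "always filter by `pattern_id`" rule concrete, the sketch below builds the JSON body for Qdrant's REST search endpoint (`POST /collections/atomic_controls/points/search`) without sending it. The helper name, the default `limit`, and the dimension check are assumptions; how the engine actually talks to Qdrant is not shown in this document.

```python
from typing import Any, Dict, List

EMBEDDING_DIM = 1024  # bge-m3, as listed in the collection settings


def build_search_request(vector: List[float], pattern_id: str, limit: int = 5) -> Dict[str, Any]:
    """Build a Qdrant search body that is always restricted to one pattern_id."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return {
        "vector": vector,
        # Keyword filter on the indexed payload field: only same-pattern
        # controls are candidates, which mirrors the Stage 1 pattern gate.
        "filter": {
            "must": [{"key": "pattern_id", "match": {"value": pattern_id}}]
        },
        "limit": limit,
        "with_payload": True,  # payload carries action/object/control_id for stages 2-3
    }
```
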

---

## Integration into Pass 0b

The dedup engine is optionally integrated into `DecompositionPass`:

```python
decomp = DecompositionPass(db=session, dedup_enabled=True)
stats = await decomp.run_pass0b(limit=100, use_anthropic=True)

# Stats include dedup metrics:
# stats["dedup_linked"] = 15      (duplicates → parent link)
# stats["dedup_review"] = 3       (borderline → review queue)
# stats["controls_created"] = 82  (new controls)
```

**Pass 0b flow with dedup:**

1. The LLM generates an atomic control
2. The dedup engine checks the 4 stages
3. **LINK:** No new control; a parent link to the existing control is added
4. **REVIEW:** No new control; an entry is written to the review queue
5. **NEW:** The control is created and indexed in Qdrant

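
The three outcomes in the flow map one-to-one onto the stats keys shown in the example. A minimal sketch of that bookkeeping, assuming each candidate ends in exactly one verdict (the example counts 15 + 3 + 82 add up to `limit=100`, which is consistent with that reading); `STAT_KEY_BY_VERDICT` and `tally` are hypothetical names.

```python
from typing import Dict, Iterable

# Verdict -> stats key, using the key names from the Pass 0b example.
STAT_KEY_BY_VERDICT = {
    "LINK": "dedup_linked",        # duplicate: parent link added
    "REVIEW": "dedup_review",      # borderline: review-queue entry
    "NEW": "controls_created",     # new control created and indexed
}


def tally(verdicts: Iterable[str]) -> Dict[str, int]:
    """Count one stats entry per candidate verdict."""
    stats = {key: 0 for key in STAT_KEY_BY_VERDICT.values()}
    for verdict in verdicts:
        stats[STAT_KEY_BY_VERDICT[verdict]] += 1
    return stats
```
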

---

## Configuration

| Environment variable | Default | Description |
|-------------------|---------|-------------|
| `DEDUP_ENABLED` | `true` | Enable or disable the dedup engine |
| `DEDUP_LINK_THRESHOLD` | `0.92` | Threshold for automatic linking |
| `DEDUP_REVIEW_THRESHOLD` | `0.85` | Threshold for the review queue |
| `DEDUP_LINK_THRESHOLD_DIFF_OBJ` | `0.95` | Threshold when the objects differ |
| `DEDUP_QDRANT_COLLECTION` | `atomic_controls` | Qdrant collection for the dedup index |
| `QDRANT_URL` | `http://host.docker.internal:6333` | Qdrant URL |
| `EMBEDDING_URL` | `http://embedding-service:8087` | Embedding service URL |

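
As a rough illustration of how these variables could be read, the sketch below loads them into a frozen dataclass using the defaults from the table. `DedupConfig`, `load_dedup_config`, the accepted boolean spellings, and the ordering check on the thresholds are all assumptions, not the module's actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DedupConfig:
    enabled: bool
    link_threshold: float
    review_threshold: float
    link_threshold_diff_obj: float
    qdrant_collection: str
    qdrant_url: str
    embedding_url: str


def _as_bool(raw: str) -> bool:
    # Assumed set of truthy spellings.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_dedup_config(env: Optional[Mapping[str, str]] = None) -> DedupConfig:
    """Read the dedup settings from env (defaults taken from the table)."""
    env = os.environ if env is None else env
    config = DedupConfig(
        enabled=_as_bool(env.get("DEDUP_ENABLED", "true")),
        link_threshold=float(env.get("DEDUP_LINK_THRESHOLD", "0.92")),
        review_threshold=float(env.get("DEDUP_REVIEW_THRESHOLD", "0.85")),
        link_threshold_diff_obj=float(env.get("DEDUP_LINK_THRESHOLD_DIFF_OBJ", "0.95")),
        qdrant_collection=env.get("DEDUP_QDRANT_COLLECTION", "atomic_controls"),
        qdrant_url=env.get("QDRANT_URL", "http://host.docker.internal:6333"),
        embedding_url=env.get("EMBEDDING_URL", "http://embedding-service:8087"),
    )
    # Assumed sanity check: review < link <= link-for-different-objects.
    if not config.review_threshold < config.link_threshold <= config.link_threshold_diff_obj:
        raise ValueError("expected DEDUP_REVIEW < DEDUP_LINK <= DEDUP_LINK_DIFF_OBJ")
    return config
```
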

---

## Source Files

| File | Description |
|-------|-------------|
| `compliance/services/control_dedup.py` | 4-stage dedup engine |
| `compliance/services/decomposition_pass.py` | Pass 0a/0b with dedup integration |
| `migrations/074_control_dedup.sql` | DB schema (parent_links, review_queue) |
| `tests/test_control_dedup.py` | 56 unit tests |

---

## Related Documentation

- [Control Generator Pipeline](control-generator-pipeline.md): 7-stage RAG→Control pipeline
- [Canonical Control Library](canonical-control-library.md): data model, domains, similarity detector