feat(multi-layer): complete Multi-Layer Control Architecture (Phases 1-8 + Pass 0)
Some checks failed
CI/CD / go-lint (push) Has been skipped
CI/CD / python-lint (push) Has been skipped
CI/CD / nodejs-lint (push) Has been skipped
CI/CD / test-go-ai-compliance (push) Failing after 47s
CI/CD / test-python-backend-compliance (push) Successful in 33s
CI/CD / test-python-document-crawler (push) Successful in 24s
CI/CD / test-python-dsms-gateway (push) Successful in 18s
CI/CD / validate-canonical-controls (push) Successful in 11s
CI/CD / Deploy (push) Has been skipped

Implements the full Multi-Layer Control Architecture for migrating ~25,000
Rich Controls into atomic, deduplicated Master Controls with full traceability.

Architecture: Legal Source → Obligation → Control Pattern → Master Control → Customer Instance

New services:
- ObligationExtractor: 3-tier extraction (exact → embedding → LLM)
- PatternMatcher: 2-tier matching (keyword + embedding + domain-bonus)
- ControlComposer: Pattern + Obligation → Master Control
- PipelineAdapter: Pipeline integration + Migration Passes 1-5
- DecompositionPass: Pass 0a/0b — Rich Control → atomic Controls
- CrosswalkRoutes: 15 API endpoints under /v1/canonical/

New DB schema:
- Migration 060: obligation_extractions, control_patterns, crosswalk_matrix
- Migration 061: obligation_candidates, parent_control_uuid tracking

Pattern Library: 50 YAML patterns (30 core + 20 IT-security)
Go SDK: Pattern loader with YAML validation and indexing
Documentation: MkDocs updated with full architecture overview

500 Python tests passing across all components.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-17 09:00:37 +01:00
parent 4f6bc8f6f6
commit 825e070ed9
23 changed files with 13553 additions and 0 deletions


@@ -707,3 +707,258 @@ The generator tests cover the following areas:
- **`TestAnchorFinder`** (2 tests): RAG search filters out Rule 3 sources, web search detects frameworks
- **`TestPipelineMocked`** (5 tests): end-to-end with mocks: license classification, Rule 3 blocking,
hash deduplication, config defaults (`batch_size: 5`), Rule 1 citation generation
---
## Multi-Layer Control Architecture
Extends the existing pipeline with a 5-layer model:
```
Legal Source → Obligation → Control Pattern → Master Control → Customer Instance
```
### Architecture Overview
| Layer | Asset | Description |
|-------|-------|-------------|
| 1: Legal Sources | Qdrant, 5 collections, 105K+ chunks | Raw RAG data |
| 2: Obligations | v2 framework (325 obligations, 9 regulations) | Legal obligations |
| 3: Control Patterns | 50 YAML patterns (30 core + 20 IT security) | Implementation patterns |
| 4: Master Controls | canonical_controls (atomic controls after dedup) | Canonical controls |
| 5: Customer Instance | TOM controls + gap mapping | Customer-specific |
### Control Levels
| Level | Description | Use |
|-------|-------------|-----|
| **Rich Controls** | Narrative, explanatory, context-rich (~25,000) | Training, audit questions, action plans |
| **Atomic Controls** | 1 obligation = 1 control (after decomposition + dedup) | System audits, code checks, gap analysis, traceability |
### Pipeline Extension (10 Stages)
```
Stage 1:  RAG SCAN            (unchanged)
Stage 2:  LICENSE CLASSIFY    (unchanged)
Stage 3:  PREFILTER           (unchanged)
Stage 4:  OBLIGATION EXTRACT  (NEW: 3-tier, exact → embedding → LLM)
Stage 5:  PATTERN MATCH       (NEW: keyword + embedding + domain bonus)
Stage 6:  CONTROL COMPOSE     (NEW: pattern + obligation → control)
Stage 7:  HARMONIZE           (unchanged)
Stage 8:  ANCHOR SEARCH       (unchanged)
Stage 9:  STORE + CROSSWALK   (extended: crosswalk matrix)
Stage 10: MARK PROCESSED      (unchanged)
```
---
### Obligation Extractor (Stage 4)
3-tier extraction (fastest tier first):
| Tier | Method | Latency | Hit rate |
|------|--------|---------|----------|
| 1 | Exact match (regulation_code + article → obligation_id) | <1 ms | ~40% |
| 2 | Embedding match (cosine > 0.80 against 325 obligations) | ~50 ms | ~30% |
| 3 | LLM extraction (local Ollama, fallback only) | ~2 s | ~25% |
**File:** `compliance/services/obligation_extractor.py`
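
The tier cascade can be sketched as follows. This is a minimal illustration, not the code in `obligation_extractor.py`: the `Extraction` type, the `extract_obligation` signature, the in-memory `exact_index` and `obligation_vectors`, and the `llm_extract` callable are all hypothetical names. Only the tier order and the 0.80 cosine threshold come from the table above.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Tuple


@dataclass
class Extraction:
    obligation_id: Optional[str]
    tier: str  # "exact", "embedding", "llm" or "none"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_obligation(
    regulation_code: str,
    article: str,
    chunk_vector: Sequence[float],
    exact_index: Dict[Tuple[str, str], str],
    obligation_vectors: Dict[str, Sequence[float]],
    llm_extract: Callable[[], Optional[str]],  # hypothetical LLM hook
    threshold: float = 0.80,
) -> Extraction:
    # Tier 1: deterministic lookup, cheapest, so it runs first.
    exact = exact_index.get((regulation_code, article))
    if exact is not None:
        return Extraction(exact, "exact")

    # Tier 2: best cosine match, accepted only above the threshold.
    best_id, best_score = None, 0.0
    for obligation_id, vector in obligation_vectors.items():
        score = cosine(chunk_vector, vector)
        if score > best_score:
            best_id, best_score = obligation_id, score
    if best_id is not None and best_score > threshold:
        return Extraction(best_id, "embedding")

    # Tier 3: the LLM is consulted only when both cheaper tiers miss.
    llm_id = llm_extract()
    return Extraction(llm_id, "llm" if llm_id else "none")
```

Passing the LLM as a callable keeps the expensive call lazy, so it is never triggered when tier 1 or tier 2 already produced a result.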
### Pattern Library (Stage 5)
50 YAML-based control patterns in 16 domains:
| File | Patterns | Domains |
|------|----------|---------|
| `core_patterns.yaml` | 30 | AUTH, CRYP, NET, DATA, LOG, ACC, SEC, INC, COMP, GOV, RES |
| `domain_it_security.yaml` | 20 | SEC, NET, AUTH, LOG, CRYP |
**Pattern ID format:** `CP-{DOMAIN}-{NNN}` (e.g. `CP-AUTH-001`)
**Matching:** 2-tier (keyword index + embedding), domain bonus (+0.10)
**Files:**
- `ai-compliance-sdk/policies/control_patterns/core_patterns.yaml`
- `ai-compliance-sdk/policies/control_patterns/domain_it_security.yaml`
- `compliance/services/pattern_matcher.py`
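
A rough sketch of how the ID format and the domain bonus could fit together. The regex follows the `CP-{DOMAIN}-{NNN}` format and the +0.10 bonus stated above. How the keyword and embedding scores are actually combined is not documented here, so taking their maximum and capping the result at 1.0 are assumptions, as are the function names.

```python
import re
from typing import Optional

# CP-{DOMAIN}-{NNN}: uppercase domain code, three-digit sequence number.
PATTERN_ID_RE = re.compile(r"^CP-([A-Z]+)-(\d{3})$")


def pattern_domain(pattern_id: str) -> Optional[str]:
    """Return the domain code of a pattern ID, or None if the ID is malformed."""
    match = PATTERN_ID_RE.match(pattern_id)
    return match.group(1) if match else None


def score_pattern(
    pattern_id: str,
    keyword_score: float,
    embedding_score: float,
    text_domain: Optional[str],
    domain_bonus: float = 0.10,
) -> float:
    # Assumption: the stronger of the two tier scores is the base score.
    score = max(keyword_score, embedding_score)
    # The bonus applies only when the text's domain matches the pattern's.
    if text_domain is not None and pattern_domain(pattern_id) == text_domain:
        score += domain_bonus
    return min(score, 1.0)
```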
### Control Composer (Stage 6)
Three composition modes:
| Mode | When | Quality |
|------|------|---------|
| Pattern-guided | Pattern found, LLM responds | High |
| Template-only | LLM error, but a pattern is available | Medium |
| Fallback | No pattern match | Basic |
**File:** `compliance/services/control_composer.py`
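
The mode selection in the table reduces to two booleans. The sketch below assumes hypothetical names (`CompositionMode`, `choose_mode`) and is not taken from `control_composer.py`.

```python
from enum import Enum


class CompositionMode(Enum):
    PATTERN_GUIDED = "pattern_guided"  # quality: high
    TEMPLATE_ONLY = "template_only"    # quality: medium
    FALLBACK = "fallback"              # quality: basic


def choose_mode(pattern_found: bool, llm_ok: bool) -> CompositionMode:
    """Pick the composition mode from the match result and LLM availability."""
    if not pattern_found:
        # Without a pattern there is no template to fall back on.
        return CompositionMode.FALLBACK
    return CompositionMode.PATTERN_GUIDED if llm_ok else CompositionMode.TEMPLATE_ONLY
```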
---
### Decomposition Pass (Pass 0)
Breaks Rich Controls down into atomic controls. Runs BEFORE Migration Passes 1-5.
#### Pass 0a — Obligation Extraction
Extracts individual normative obligations from a Rich Control via LLM.
**6 Guardrails:**
1. Normative statements only (signal words such as "müssen", "sicherzustellen", "verpflichtet", ...)
2. One main verb per obligation
3. Testing obligations kept separate
4. Reporting obligations kept separate
5. Do not decompose down to the evidence level
6. Always preserve the parent link
**Quality Gate:** Each candidate is checked against 6 criteria:
- `has_normative_signal`: normative language signal detected
- `single_action`: only one action
- `not_rationale`: not a mere rationale
- `not_evidence_only`: not a pure evidence fragment
- `min_length`: minimum length reached
- `has_parent_link`: reference to the Rich Control
Critical checks: `has_normative_signal`, `not_evidence_only`, `min_length`, `has_parent_link`
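
One way to read the quality gate: a candidate passes when every critical check holds, while the remaining checks are informational. That reading is an assumption, as are the function names below. The check names and the three signal words are the ones listed above, and the substring test for normative signals is only a stand-in for whatever detection the real pass uses.

```python
from typing import Mapping

# The four checks named as critical in the documentation.
CRITICAL_CHECKS = (
    "has_normative_signal",
    "not_evidence_only",
    "min_length",
    "has_parent_link",
)

# Signal words from guardrail 1; the real list is presumably longer.
NORMATIVE_SIGNALS = ("müssen", "sicherzustellen", "verpflichtet")


def has_normative_signal(text: str) -> bool:
    """Naive substring check for a normative signal word (illustrative only)."""
    lowered = text.lower()
    return any(signal in lowered for signal in NORMATIVE_SIGNALS)


def passes_quality_gate(results: Mapping[str, bool]) -> bool:
    """Assumption: only a failed critical check rejects a candidate."""
    return all(results.get(name, False) for name in CRITICAL_CHECKS)
```

Under this reading, a candidate that fails `single_action` or `not_rationale` would still pass the gate and could be flagged for review instead of being dropped.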
#### Pass 0b — Atomic Control Composition
Creates one atomic control from each validated obligation candidate
(LLM-assisted with template fallback).
**File:** `compliance/services/decomposition_pass.py`
---
### Migration Passes (1-5)
Non-destructive passes for existing controls:
| Pass | Description | Method |
|------|-------------|--------|
| 1 | Obligation Linkage | source_citation → article → obligation_id (deterministic) |
| 2 | Pattern Classification | Keyword matching against the pattern library |
| 3 | Quality Triage | Categorization: review / needs_obligation / needs_pattern / legacy_unlinked |
| 4 | Crosswalk Backfill | crosswalk_matrix rows for linked controls |
| 5 | Deduplication | Same obligation_id + pattern_id → mark as duplicate |
**File:** `compliance/services/pipeline_adapter.py`
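
Pass 5 can be pictured as grouping controls by their (obligation_id, pattern_id) pair. The sketch below is an assumption-laden illustration: the dict-based control records, the `find_duplicates` name, and the choice to keep the first control seen as the canonical one are not taken from `pipeline_adapter.py`. It returns a mapping instead of deleting anything, in line with the non-destructive design.

```python
from typing import Dict, Iterable, Mapping, Optional, Tuple


def find_duplicates(controls: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, str]:
    """Map each duplicate control ID to the ID of the control it duplicates."""
    first_seen: Dict[Tuple[str, str], str] = {}
    duplicates: Dict[str, str] = {}
    for control in controls:
        obligation_id = control.get("obligation_id")
        pattern_id = control.get("pattern_id")
        if obligation_id is None or pattern_id is None:
            # Unlinked controls cannot be compared; leave them alone.
            continue
        key = (obligation_id, pattern_id)
        if key in first_seen:
            duplicates[control["id"]] = first_seen[key]
        else:
            first_seen[key] = control["id"]
    return duplicates
```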
---
### Crosswalk Matrix
The "golden thread" from law to implementation:
```
Regulation → Article → Obligation → Pattern → Master Control → TOM
```
An atomic control can be required by **multiple laws** at the same time.
The crosswalk matrix models this N:M relationship.
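
To make the N:M relationship concrete, the sketch below treats each crosswalk row as one link in the chain and collects all regulations that require a given control. The `CrosswalkRow` fields mirror the chain shown above but are hypothetical; the real `crosswalk_matrix` columns may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class CrosswalkRow(NamedTuple):
    # Hypothetical field names modeled on the documented chain.
    regulation: str
    article: str
    obligation_id: str
    pattern_id: str
    control_id: str


def regulations_by_control(rows: Iterable[CrosswalkRow]) -> Dict[str, Set[str]]:
    """Collect, for every control, the set of regulations that require it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        index[row.control_id].add(row.regulation)
    return dict(index)
```

A control that appears with two different regulations in the result is exactly the case the matrix exists for: one implementation satisfying several legal sources.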
---
### DB Schema (Migrations 060 + 061)
**Migration 060:** Multi-layer base tables
| Table / column | Description |
|----------------|-------------|
| `obligation_extractions` | Chunk→obligation links (3-tier tracking) |
| `control_patterns` | DB mirror of the YAML patterns for SQL queries |
| `crosswalk_matrix` | Golden thread: Regulation→Obligation→Pattern→Control |
| `canonical_controls.pattern_id` | Pattern assignment (new column) |
| `canonical_controls.obligation_ids` | Obligation IDs as a JSONB array (new column) |
**Migration 061:** Decomposition tables
| Table / column | Description |
|----------------|-------------|
| `obligation_candidates` | Atomic obligations extracted from Rich Controls |
| `canonical_controls.parent_control_uuid` | Self-reference to the Rich Control (new column) |
| `canonical_controls.decomposition_method` | Decomposition method (new column) |
---
### API Endpoints (Crosswalk Routes)
All endpoints live under `/api/compliance/v1/canonical/`:
#### Pattern Library
| Method | Path | Description |
|--------|------|-------------|
| GET | `/patterns` | All patterns (filters: domain, category, tag) |
| GET | `/patterns/{pattern_id}` | Single pattern with details |
| GET | `/patterns/{pattern_id}/controls` | Controls derived from a pattern |
#### Obligation Extraction
| Method | Path | Description |
|--------|------|-------------|
| POST | `/obligations/extract` | Extract an obligation from text and match a pattern |
#### Crosswalk Matrix
| Method | Path | Description |
|--------|------|-------------|
| GET | `/crosswalk` | Query (filters: regulation, article, obligation, pattern) |
| GET | `/crosswalk/stats` | Coverage statistics |
#### Migration + Decomposition
| Method | Path | Description |
|--------|------|-------------|
| POST | `/migrate/decompose` | Pass 0a: obligation extraction from Rich Controls |
| POST | `/migrate/compose-atomic` | Pass 0b: atomic control composition |
| POST | `/migrate/link-obligations` | Pass 1: obligation linkage |
| POST | `/migrate/classify-patterns` | Pass 2: pattern classification |
| POST | `/migrate/triage` | Pass 3: quality triage |
| POST | `/migrate/backfill-crosswalk` | Pass 4: crosswalk backfill |
| POST | `/migrate/deduplicate` | Pass 5: deduplication |
| GET | `/migrate/status` | Migration progress |
| GET | `/migrate/decomposition-status` | Decomposition progress |
**Route file:** `compliance/api/crosswalk_routes.py`
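
As an illustration of how a client might address the crosswalk query endpoint, the helper below builds a URL from the documented base path and filter names. The host, the `crosswalk_url` helper, and the assumption that the filters are passed as query parameters named exactly `regulation`, `article`, `obligation`, and `pattern` are not confirmed by the route file.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_PATH = "/api/compliance/v1/canonical"
# Filter names from the endpoint table; exact parameter names are assumed.
CROSSWALK_FILTERS = ("regulation", "article", "obligation", "pattern")


def crosswalk_url(host: str, **filters: Optional[str]) -> str:
    """Build a GET URL for the crosswalk query endpoint."""
    unknown = set(filters) - set(CROSSWALK_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Keep a stable parameter order and skip filters that were not set.
    query = urlencode(
        {name: filters[name] for name in CROSSWALK_FILTERS if filters.get(name) is not None}
    )
    url = f"{host.rstrip('/')}{BASE_PATH}/crosswalk"
    return f"{url}?{query}" if query else url
```

For example, `crosswalk_url("http://localhost:8000", regulation="REG-A")` targets a hypothetical local instance and restricts the result to one regulation.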
---
### Multi-Layer Tests
| File | Tests | Focus |
|------|-------|-------|
| `tests/test_obligation_extractor.py` | 107 | 3-tier extraction, helpers, regex |
| `tests/test_pattern_matcher.py` | 72 | Keyword index, embedding, domain affinity |
| `tests/test_control_composer.py` | 54 | Composition, templates, license rules |
| `tests/test_pipeline_adapter.py` | 36 | Pipeline integration, 5 migration passes |
| `tests/test_crosswalk_routes.py` | 57 | 15 API endpoints, Pydantic models |
| `tests/test_decomposition_pass.py` | 68 | Pass 0a/0b, quality gate, 6 guardrails |
| `tests/test_migration_060.py` | 12 | Schema validation |
| `tests/test_control_patterns.py` | 18 | YAML validation, pattern schema |
| **Total Multi-Layer** | | **424 tests** |
### Planned Migration Flow
```
Rich Controls (~25,000, release_state=raw)
Pass 0a: Obligation Extraction (LLM + Quality Gate)
Pass 0b: Atomic Control Composition (LLM + Template Fallback)
Pass 1: Obligation Linking (deterministic)
Pass 2: Pattern Classification (Keyword + Embedding)
Pass 3: Quality Triage
Pass 4: Crosswalk Backfill
Pass 5: Dedup / Merge
Master Controls (~15,000-20,000 with full traceability)
```