docs: add test strategy instruction for dedicated session (Block C)

3 test levels: Real-World Benchmarks (10 DE websites), Adversarial Tests (30 tricky cases), Regression Harness (CI/CD quality gate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,335 @@
# Instruction: Test Strategy Block C

**Repo:** `/Users/benjaminadmin/Projekte/breakpilot-core/`
**Directory:** `control-pipeline/tests/`
**Created:** 2026-05-01
**Estimated effort:** 2-3 days

## Starting Point

- 221 existing tests in 7 files (do NOT modify!)
- 40 golden test cases (golden_controls.yaml)
- 24 demo cases (demo_cases.yaml)
- All tests are pure Python; no database required
- Pipeline v1 completed: 151,675 unique controls, 15,291 dependencies

## Task 1: Real-World Benchmarks (C1)

### What to do

Manually review 10 real German e-commerce websites and create one ground-truth YAML file per website.

### Directory

```
control-pipeline/tests/benchmarks/
├── amazon_de.yaml
├── zalando_de.yaml
├── otto_de.yaml
├── lidl_de.yaml
├── check24_de.yaml
├── booking_de.yaml
├── thomann_de.yaml
├── aboutyou_de.yaml
├── mytheresa_com.yaml
└── kleiner_shop.yaml
```

### Format per website

```yaml
website: amazon.de
url: https://www.amazon.de
checked_at: "2026-05-XX"
checked_by: "Name"

ground_truth:
  impressum:
    present: true/false
    complete: true/false              # name, address, email, commercial register number, VAT ID
    within_2_clicks: true/false
    missing_fields: []                # e.g. ["USt-ID", "Handelsregister"]

  datenschutzerklaerung:
    present: true/false
    art13_complete: true/false
    missing_art13_fields: []          # e.g. ["Speicherdauer", "Empfaenger"] (retention period, recipients)
    rechtsgrundlagen_korrekt: true/false
    wrong_legal_bases: []             # e.g. ["Analytics auf lit. f statt lit. a"]

  cookie_banner:
    present: true/false
    reject_equally_easy: true/false   # CNIL: reject must be as prominent as accept
    cookies_before_consent: true/false  # Planet49: cookies set BEFORE consent?
    dark_patterns: []                 # e.g. ["Ablehnen-Button kleiner", "Ablehnen hinter Einstellungen"]

  widerrufsbelehrung:
    present: true/false
    matches_legal_template: true/false  # statutory model withdrawal notice

  agb:
    present: true/false
    checkout_button_text: "..."       # e.g. "Jetzt kaufen" (correct) vs "Weiter" (wrong)

  google_fonts_external: true/false
  google_analytics: true/false

  third_party_services:
    - name: "Google Analytics"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
    - name: "Facebook Pixel"
      detected: true
      consent_required: true
      consent_obtained_before_load: false

expected_findings:
  - "Cookie-Banner: Ablehnen nicht gleichwertig"     # reject option not equivalent
  - "Google Analytics ohne vorherige Einwilligung"   # loaded without prior consent
  - "DSE: Rechtsgrundlage fuer Analytics falsch"     # wrong legal basis for analytics

expected_no_findings:
  - "Impressum fehlt"  # Impressum is present, so this must not be flagged
```

### Test Runner

```python
# control-pipeline/tests/test_benchmarks.py
"""
Real-world benchmark tests: compare agent findings with the manual ground truth.
Requires: the Compliance Agent must be running (https://macmini:3007/sdk/agent)
"""

import os

import pytest
import yaml

BENCHMARK_DIR = os.path.join(os.path.dirname(__file__), "benchmarks")


def load_benchmarks():
    """Load every benchmark YAML file; return [] if the directory is missing."""
    if not os.path.isdir(BENCHMARK_DIR):
        return []
    cases = []
    for f in sorted(os.listdir(BENCHMARK_DIR)):
        if f.endswith(".yaml"):
            with open(os.path.join(BENCHMARK_DIR, f), encoding="utf-8") as fh:
                cases.append(yaml.safe_load(fh))
    return cases


class TestBenchmarks:
    """Measure precision/recall against the ground truth."""

    @pytest.mark.parametrize("case", load_benchmarks(), ids=lambda c: c["website"])
    def test_benchmark(self, case):
        # TODO: run the agent against the website
        # TODO: compare findings with expected_findings
        # TODO: compute precision + recall
        pytest.skip("Benchmark comparison not implemented yet")
```

### How to create the ground truth

1. Open the website in a browser
2. Check the Impressum (all mandatory fields under § 5 DDG)
3. Read the privacy policy (Art. 13 GDPR checklist)
4. Test the cookie banner (is rejecting as easy as accepting? are cookies set before consent?)
5. Check the withdrawal notice (Widerrufsbelehrung) against the statutory model
6. Browser DevTools: Network tab → external requests before consent?
7. Document everything in YAML

**Target metrics:**
- Precision > 80% (few false positives)
- Recall > 70% (finds most of the real problems)
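
The target metrics above can be computed from `expected_findings` and the agent's reported findings. The following is a minimal sketch, assuming findings are compared by exact string match; the helper name `precision_recall` and the sample agent output are hypothetical, and real agent findings will probably need normalization before comparison.

```python
def precision_recall(expected, reported):
    """Return (precision, recall) for two collections of finding labels."""
    expected_set, reported_set = set(expected), set(reported)
    true_positives = len(expected_set & reported_set)
    # With no reported findings there are no false positives, so precision is 1.0.
    precision = true_positives / len(reported_set) if reported_set else 1.0
    # With no expected findings there is nothing to miss, so recall is 1.0.
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return precision, recall


expected = [
    "Cookie-Banner: Ablehnen nicht gleichwertig",
    "Google Analytics ohne vorherige Einwilligung",
    "DSE: Rechtsgrundlage fuer Analytics falsch",
]
# Hypothetical agent output: two correct findings and one false positive.
reported = [
    "Cookie-Banner: Ablehnen nicht gleichwertig",
    "Google Analytics ohne vorherige Einwilligung",
    "Impressum fehlt",
]
precision, recall = precision_recall(expected, reported)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

In this example both values are 2/3, so the case would miss both targets. Entries from `expected_no_findings` that show up in the agent output count as false positives under this scheme.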

---

## Task 2: Adversarial Tests (C2)

### What to do

Create 30 tricky test cases that challenge the agent and its controls.

### File

`control-pipeline/tests/adversarial_cases.yaml`

### Categories

**A. Wrong legal basis (8 cases):**
- Analytics based on lit. f instead of lit. a
- Marketing emails based on lit. b instead of lit. a
- Employee tracking based on lit. f instead of a works agreement (Betriebsvereinbarung)
- Biometric data based on lit. f instead of Art. 9
- Profiling based on lit. f instead of Art. 22
- Newsletter based on lit. b instead of lit. a
- Social login based on lit. b instead of lit. a
- Credit scoring based on lit. f instead of lit. a + Art. 22

**B. Dark patterns (6 cases):**
- Reject button exists but is 3px in size and grey
- "Accept all" is prominent; "Settings" is offered instead of "Reject"
- Cookie wall: content becomes visible only after consent
- Pre-ticked checkboxes (Planet49)
- Confirm-shaming: "No, I don't want a secure connection"
- Rejecting takes 3 clicks, accepting only 1

**C. Almost-complete documents (6 cases):**
- Impressum complete except for the VAT ID (USt-ID)
- Privacy policy without retention period
- Privacy policy without DPO contact
- Withdrawal notice with an incorrect start of the withdrawal period
- Terms and conditions (AGB) without place of jurisdiction
- Cookie policy without a list of all cookies

**D. Semantically similar but different (5 cases):**
- "Admin-MFA" vs "User-MFA" (different scopes!)
- "Delete data after contract termination" vs "Delete data after retention period"
- "Rate Limiting API" vs "Rate Limiting Login"
- "Encryption at rest" vs "Encryption in transit"
- "Incident Response Plan" vs "Business Continuity Plan"

**E. Semantically different but same-sounding (5 cases):**
- "Einwilligung" (consent, GDPR) vs "Einwilligung" (consent, advertising)
- "Verarbeitung" (processing of data) vs "Verarbeitung" (processing of food)
- "Risikobewertung" (risk assessment, GDPR DPIA) vs "Risikobewertung" (financial risk)
- "Audit" (data protection) vs "Audit" (finance)
- "Zertifizierung" (certification, ISO 27001) vs "Zertifizierung" (CE marking)

### Format

```yaml
- id: ADV-LIT-001
  category: wrong_legal_basis
  input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
  context: "Privacy policy section on Google Analytics"
  expected:
    finding: true
    finding_type: "wrong_legal_basis"
    correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
    reason: "Analytics requires consent, not legitimate interest (CJEU C-673/17 Planet49)"
  difficulty: medium  # easy / medium / hard
```
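
Because the 30 cases are written by hand, a small structural check can catch typos before the cases reach the agent. The sketch below works on the dictionaries that `yaml.safe_load` would return for the format above; the helper names `validate_case` and `duplicate_ids` are hypothetical, and the set of required keys is an assumption derived from the single example case.

```python
from collections import Counter

# Assumed from the example case above; adjust if the real format differs.
REQUIRED_KEYS = {"id", "category", "input", "context", "expected", "difficulty"}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_case(case):
    """Return a list of problems found in one adversarial case (empty if valid)."""
    problems = []
    missing = REQUIRED_KEYS - set(case)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if case.get("difficulty") not in DIFFICULTIES:
        problems.append(f"invalid difficulty: {case.get('difficulty')!r}")
    expected = case.get("expected")
    if not isinstance(expected, dict) or not isinstance(expected.get("finding"), bool):
        problems.append("expected.finding must be true or false")
    return problems


def duplicate_ids(cases):
    """Return the sorted list of ids that occur more than once."""
    counts = Counter(case.get("id") for case in cases)
    return sorted(cid for cid, n in counts.items() if n > 1)


example = {
    "id": "ADV-LIT-001",
    "category": "wrong_legal_basis",
    "input": "Wir verarbeiten Ihre Daten fuer Webanalyse ...",
    "context": "Privacy policy section on Google Analytics",
    "expected": {"finding": True, "finding_type": "wrong_legal_basis"},
    "difficulty": "medium",
}
print(validate_case(example))              # []
print(duplicate_ids([example, example]))   # ['ADV-LIT-001']
```

A parametrized pytest test could call `validate_case` for every entry in `adversarial_cases.yaml` and still run without a database or a live agent.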

---

## Task 3: Regression Harness (C3)

### What to do

1. `conftest.py` with shared fixtures
2. `test_regression.py` with snapshot tests
3. CI/CD quality gate
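
One possible shape for the snapshot tests in step 2 is to store a fingerprint per control and diff it against the current state. This is only a sketch under assumptions: the helper names `fingerprint` and `diff_snapshot` and the control ids are hypothetical, and the control fields are those selected by the `sample_controls` query (`control_id`, `title`, `category`, `severity`, `assertion`).

```python
import hashlib
import json


def fingerprint(control):
    """Stable SHA-256 over the fields of one control (key order does not matter)."""
    payload = json.dumps(control, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_snapshot(snapshot, controls):
    """Compare a stored {control_id: fingerprint} snapshot with current controls."""
    current = {c["control_id"]: fingerprint(c) for c in controls}
    return {
        "added": sorted(current.keys() - snapshot.keys()),
        "removed": sorted(snapshot.keys() - current.keys()),
        "changed": sorted(
            cid for cid in current.keys() & snapshot.keys()
            if current[cid] != snapshot[cid]
        ),
    }


# Hypothetical baseline taken before a pipeline update.
baseline = [
    {"control_id": "CTRL-001", "title": "Admin-MFA", "severity": "high"},
    {"control_id": "CTRL-002", "title": "Rate Limiting Login", "severity": "medium"},
]
snapshot = {c["control_id"]: fingerprint(c) for c in baseline}

# Hypothetical state after the update: one edit, one removal, one addition.
after_update = [
    {"control_id": "CTRL-001", "title": "Admin-MFA", "severity": "critical"},
    {"control_id": "CTRL-003", "title": "Rate Limiting API", "severity": "medium"},
]
print(diff_snapshot(snapshot, after_update))
```

A snapshot only makes sense over a deterministic selection of controls, so a randomly drawn sample would need a fixed set of ids instead.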

### conftest.py

```python
# control-pipeline/tests/conftest.py
import os

import pytest


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests; skipped if DATABASE_URL is not set."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set")
    # Imported lazily so that test collection works without a database.
    from db.session import SessionLocal
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load 100 random draft controls for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' AS assertion
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
```

### test_regression.py

```python
# control-pipeline/tests/test_regression.py
"""
Regression tests: check whether pipeline updates change existing controls.
Requires: DATABASE_URL environment variable
"""

import pytest


class TestControlStability:
    def test_draft_count_stable(self, db_session):
        """Draft count must stay within the expected range (140,000 to 200,000)."""
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count}"
        assert count < 200000, f"Draft count too high: {count}"

    def test_no_null_assertions(self, db_session):
        """Fewer than 1,000 draft controls may lack an assertion."""
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL OR generation_metadata->>'assertion' = '')"
        )).scalar()
        assert null_count < 1000, f"Too many controls without assertion: {null_count}"

    def test_dependency_graph_valid(self, db_session):
        """Enough active dependencies must exist. Note: this does not detect cycles yet."""
        from sqlalchemy import text
        dep_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert dep_count > 10000, f"Too few dependencies: {dep_count}"


class TestQualityGates:
    def test_duplicate_rate(self, db_session):
        pytest.skip("TODO: implement duplicate_rate < 5%")

    def test_evidence_leak_rate(self, db_session):
        pytest.skip("TODO: implement evidence_leak < 2%")
```
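
The dependency test above only counts rows, so it cannot guarantee a cycle-free graph. A cycle check could look like the following sketch, which uses Kahn's algorithm on a list of `(source, target)` pairs. The function name `has_cycle` and the edge format are assumptions; the actual column names in `compliance.control_dependencies` are not given here.

```python
from collections import defaultdict, deque


def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Any node that could not be removed lies on a cycle or downstream of one.
    return removed != len(nodes)


print(has_cycle([("A", "B"), ("B", "C")]))              # False
print(has_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # True
```

The algorithm runs in time linear in the number of nodes and edges, which should be unproblematic for roughly 15,000 dependencies.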

### CI/CD Quality Gate

```yaml
# .gitea/workflows/quality-gate.yml
name: Control Pipeline Quality Gate
on:
  push:
    paths:
      - 'control-pipeline/**'

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Tests
        run: |
          cd control-pipeline
          pip install -r requirements.txt pytest pyyaml
          PYTHONPATH=. pytest tests/ -v --tb=short -x
      - name: Quality Metrics
        run: |
          # Only when the container is running
          curl -sf http://127.0.0.1:8098/v1/canonical/generate/quality-metrics || echo "Pipeline not running, skip metrics"
```
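
The workflow above only fetches the metrics; it does not fail the build when they are bad. A gate could be added along the lines of this sketch. The thresholds come from the `TestQualityGates` stubs (duplicate rate < 5%, evidence leak < 2%), but the JSON keys `duplicate_rate` and `evidence_leak_rate` and the assumption that rates are fractions between 0 and 1 are hypothetical, since the response format of the quality-metrics endpoint is not specified here.

```python
import json

# Limits from the TestQualityGates stubs; the key names are assumptions.
THRESHOLDS = {"duplicate_rate": 0.05, "evidence_leak_rate": 0.02}


def gate_violations(metrics):
    """Return one message per metric that is missing or not below its limit."""
    violations = []
    for key, limit in THRESHOLDS.items():
        value = metrics.get(key)
        if value is None:
            violations.append(f"{key}: missing from metrics")
        elif value >= limit:
            violations.append(f"{key}: {value:.3f} is not below {limit:.3f}")
    return violations


def main(raw_json):
    """Print violations and return a process exit code (0 = gate passed)."""
    violations = gate_violations(json.loads(raw_json))
    for line in violations:
        print(line)
    return 1 if violations else 0


# Hypothetical response body from the quality-metrics endpoint.
sample = '{"duplicate_rate": 0.031, "evidence_leak_rate": 0.027}'
exit_code = main(sample)
print("exit code:", exit_code)
```

In a workflow step, the curl output could be passed to such a script and its return value used as the exit code, so that a violated threshold fails the job.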

---

## IMPORTANT

- Do NOT modify the 221 existing tests
- Do NOT deploy (do not restart the container)
- All new tests must run without a database (except test_regression.py, which has a skip marker)
- Create the ground-truth YAML manually (no LLM for the reference data!)
- If you have questions: read the memory under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`