feat(control-pipeline): add applicability demo test package with evaluator

6 priority demo cases with golden outputs, evaluator.py and run_demo.py:
- CASE-001: Webshop+Stripe (anti-PSD2 false positive)
- CASE-002: Bank+TAN-Generator (scope override for batteries)
- CASE-004: FinTech Wallet (true positive PSD2/AML)
- CASE-006: SaaS+SMS Gateway (anti-TKG false positive)
- CASE-008: Software→IoT Hardware (multi-regime scope)
- CASE-011: Embedded Finance (escalation case)

Self-test passes 6/6 against golden outputs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-23 19:08:31 +02:00
parent e8ec50e0fc
commit ae5c5c24eb
12 changed files with 735 additions and 0 deletions
--- a/control-pipeline/tests/applicability_demo/README.md
+++ b/control-pipeline/tests/applicability_demo/README.md
@@ -0,0 +1,53 @@
+# Applicability Engine Demo Package
+
+## Contents
+- `demo_cases.yaml`: 6 prioritized demo and regression test cases
+- `expected_outputs/CASE-*.json`: golden outputs for the 6 cases
+- `evaluator.py`: compares actual engine outputs against the assertions
+- `run_demo.py`: simple runner
+- `reports/`: target directory for JSON and Markdown reports
+
+## Quick start
+```bash
+python run_demo.py
+```
+
+This uses the golden outputs in `expected_outputs/` as input, as a self-test.
+
+## Running against real SDK outputs
+For each case, place a file `CASE-XYZ.json` with the following schema in a directory:
+
+```json
+{
+  "case_id": "CASE-001",
+  "assigned_controls": [],
+  "excluded_controls": [],
+  "escalations": [],
+  "inferred_industries": [],
+  "confidence": {
+    "overall": 0.0,
+    "industry_assignment": 0.0,
+    "control_assignment": 0.0
+  },
+  "explanation": "",
+  "uncertainty_flags": []
+}
+```
+
+Then run:
+
+```bash
+python run_demo.py --actual-dir /path/to/your/outputs
+```
+
+## Test logic
+The evaluator checks:
+- `must_assign`
+- `must_not_assign`
+- `escalate_for_legal_review`
+- `inferred_industries.must_include`
+- `inferred_industries.must_not_include`
+- `reasoning_must_contain`
+
+In addition, the evaluator emits warnings when borderline cases are escalated but no `uncertainty_flags`
+are set, or when the confidence is implausibly high.
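
Reviewer note: the checks listed in the README can be sketched roughly as below. This is not the `evaluator.py` from this commit, only a minimal illustration under stated assumptions: `evaluate_case`, the `HIGH_CONFIDENCE` threshold of 0.9, the nesting of the assertion keys, the treatment of controls and industries as plain string IDs, and the case-insensitive substring match for `reasoning_must_contain` are all hypothetical. It also assumes the confidence warning applies only to escalated cases, which the README leaves open.

```python
from __future__ import annotations

from typing import Any

# Hypothetical threshold: the README does not define "implausibly high".
HIGH_CONFIDENCE = 0.9


def evaluate_case(assertions: dict[str, Any], actual: dict[str, Any]) -> dict[str, list[str]]:
    """Check one engine output against one case's assertions (illustrative only)."""
    failures: list[str] = []
    warnings: list[str] = []

    # Assumption: controls and industries are lists of plain string IDs.
    assigned = set(actual.get("assigned_controls", []))
    industries = set(actual.get("inferred_industries", []))
    escalated = bool(actual.get("escalations"))
    explanation = actual.get("explanation", "").lower()

    for control in assertions.get("must_assign", []):
        if control not in assigned:
            failures.append(f"must_assign: {control} is missing")
    for control in assertions.get("must_not_assign", []):
        if control in assigned:
            failures.append(f"must_not_assign: {control} was assigned")

    # Only checked when the case states an expectation.
    expected = assertions.get("escalate_for_legal_review")
    if expected is not None and expected != escalated:
        failures.append(f"escalate_for_legal_review: expected {expected}, got {escalated}")

    industry_rules = assertions.get("inferred_industries", {})
    for industry in industry_rules.get("must_include", []):
        if industry not in industries:
            failures.append(f"inferred_industries.must_include: {industry} is missing")
    for industry in industry_rules.get("must_not_include", []):
        if industry in industries:
            failures.append(f"inferred_industries.must_not_include: {industry} is present")

    # Assumption: a case-insensitive substring match on the explanation text.
    for phrase in assertions.get("reasoning_must_contain", []):
        if phrase.lower() not in explanation:
            failures.append(f"reasoning_must_contain: '{phrase}' not found")

    # Warnings do not fail the case; they flag suspicious escalations.
    if escalated:
        if not actual.get("uncertainty_flags"):
            warnings.append("escalated without uncertainty_flags")
        if actual.get("confidence", {}).get("overall", 0.0) >= HIGH_CONFIDENCE:
            warnings.append("escalated with implausibly high confidence")

    return {"failures": failures, "warnings": warnings}
```

Calling `evaluate_case(case_assertions, json.load(f))` once per `CASE-*.json` file and counting the cases with an empty `failures` list would give a pass count comparable to the "6/6" reported in the commit message, assuming the real evaluator aggregates in a similar way.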