feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Massive update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs
  in compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check.
  Plus a fits_doc_type audit (v2) and a ui_only audit for DSA/e-commerce MCs
  filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered per business_profile (e.g. FRT is skipped for BMW).
- Embedding match (BGE-M3) as phase 3 after the regex match:
  per-doc_type threshold override (impressum 0.50, dse/cookie 0.60),
  short-field rescue (15-word chunks) for mandatory fields in the Impressum.
  Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the
  CMP reconstruct; the backend prefers it over DOM extraction
  when it is richer (BMW: 1824 vs 600 words).
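
The per-doc_type threshold override described above could look roughly like the sketch below. It is not the project's implementation: the function names, the fallback threshold and the plain-Python cosine are assumptions made for illustration, and only the three threshold values come from the commit message.

```python
import math

# Thresholds quoted in the commit message.
DOC_TYPE_THRESHOLDS = {"impressum": 0.50, "dse": 0.60, "cookie": 0.60}
DEFAULT_THRESHOLD = 0.60  # assumption: fallback for doc_types without an override


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_match(doc_type: str, mc_vec: list[float],
                    chunk_vecs: list[list[float]]) -> bool:
    """True if the best-matching document chunk clears the doc_type threshold."""
    threshold = DOC_TYPE_THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    best = max((cosine(mc_vec, v) for v in chunk_vecs), default=0.0)
    return best >= threshold
```

With the same similarity score, an MC can therefore pass for an Impressum and still fail for a DSE, which is the effect the lower Impressum threshold is meant to have on short mandatory fields.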

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors,
  detection of multiple providers per category, EU alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count
  + premium-feature cookies + third-party share → starter/professional/
  enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the
  upper 40-100% band of list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for clearly EU vendors
  (Adform DK, Pinterest IE).
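
A minimal sketch of the suffix heuristic, assuming the lookup table is checked first and the legal form is matched as the last token of the vendor name. The function name, the matching rule and the dictionary layout are hypothetical; only the four suffix mappings and the two vendor entries are taken from the commit message.

```python
from __future__ import annotations

import re

# Legal-form suffix -> ISO country code (pairs from the commit message).
LEGAL_FORM_COUNTRY = {"A/S": "DK", "GmbH": "DE", "Inc": "US", "B.V.": "NL"}

# Explicit vendor overrides, keyed by the lower-cased first word of the name.
VENDOR_COUNTRY = {"adform": "DK", "pinterest": "IE"}


def infer_country(vendor_name: str) -> str | None:
    """Return a country code for the vendor, or None if nothing matches."""
    name = vendor_name.strip()
    first_word = name.split()[0].lower() if name else ""
    if first_word in VENDOR_COUNTRY:
        return VENDOR_COUNTRY[first_word]
    for suffix, country in LEGAL_FORM_COUNTRY.items():
        # The suffix must be the final token; a trailing dot ("Inc.") is tolerated.
        if re.search(r"(?:^|[\s,])" + re.escape(suffix) + r"\.?$", name):
            return country
    return None
```

Checking the lookup table before the suffix matters for vendors whose EU entity does not carry a telling legal form, which is presumably why the commit pairs both mechanisms.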

Action recipes + doc anchor locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) a structured instruction with what/why/fix_text/where/example,
  ready to paste 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the
  matching paragraph in the existing customer document for each finding.
  Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor for each failed document check
  + vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with
  4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  (records of processing activities) + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video
  placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.
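
A consent log export in both formats can be sketched with the standard library alone. The column names and the entry layout below are hypothetical, since the commit does not show the actual schema behind einwilligungen_export_routes.

```python
import csv
import io
import json

# Hypothetical export columns; the real consent log schema is not shown here.
FIELDS = ["timestamp", "consent_id", "categories"]


def export_json(entries: list[dict]) -> str:
    """Serialize the consent log as pretty-printed JSON."""
    return json.dumps(entries, ensure_ascii=False, indent=2)


def export_csv(entries: list[dict]) -> str:
    """Serialize the consent log as CSV with one row per consent event."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for entry in entries:
        row = dict(entry)
        # Flatten the category list into one cell so the CSV stays tabular.
        row["categories"] = ";".join(entry.get("categories", []))
        writer.writerow(row)
    return buf.getvalue()
```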

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.
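
The embedding batching fix amounts to slicing the input before each service call. A minimal sketch, assuming a caller-supplied embed_batch function (hypothetical; the real embedding service client is not part of this diff):

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(texts: Sequence[str],
              embed_batch: Callable[[Sequence[str]], list[list[float]]],
              size: int = 32) -> list[list[float]]:
    """Embed all texts in order, one service call per batch."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, size):
        vectors.extend(embed_batch(chunk))
    return vectors
```

For the 165 inputs mentioned above this yields five batches of 32 and one of 5.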

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run summary, v1 (broken) -> v9 (all fixes):
  DSE       7.5% -> 81-83%
  Impressum 4%   -> 100% (all 6 genuine MCs satisfied)
  Cookie    0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
@@ -0,0 +1,229 @@
"""
LLM audit: checks for each text MC whether it really belongs, in substance,
to its assigned doc_type. The BMW run showed, for example, that most
Impressum MCs are actually DSA/e-commerce obligations that do not appear in
a §5 TMG Impressum at all.
Output:
- doc_type fits         → MC stays active (fits_doc_type = 1)
- doc_type does NOT fit → the pair is marked fits_doc_type = 0 so that
  rag_document_checker can filter it out
Sidecar SQLite only. NO Postgres change (schema frozen).
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5
DOC_TYPE_DESCRIPTIONS = {
"agb": "Allgemeine Geschaeftsbedingungen — Vertragsbedingungen "
"zwischen Anbieter und Kunde",
"avv": "Auftragsverarbeitungsvertrag Art. 28 DSGVO — Vertrag zwischen "
"Verantwortlichem und Auftragsverarbeiter",
"cookie": "Cookie-Richtlinie — Information ueber Zwecke, Anbieter, "
"Speicherdauer und Widerruf von Cookies/Tracking-Pixeln",
"dse": "Datenschutzerklaerung Art. 13 DSGVO — Informationen ueber "
"Verantwortlichen, Zwecke, Rechtsgrundlagen, Empfaenger, "
"Speicherdauer, Betroffenenrechte, Aufsichtsbehoerde",
"dsfa": "Datenschutz-Folgenabschaetzung Art. 35 DSGVO — Risikobewertung "
"von Verarbeitungen mit hohem Risiko",
"impressum": "Impressum §5 TMG / §18 MStV — Anbieterkennzeichnung mit "
"Name, Anschrift, Vertretungsberechtigten, Handelsregister, "
"USt-IdNr., berufsrechtliche Angaben, Aufsicht",
"loeschkonzept": "Loeschkonzept DIN 66398 — Definition Aufbewahrungs- "
"und Loeschfristen pro Datenkategorie + Prozess",
"widerruf": "Widerrufsbelehrung §312 BGB — Verbraucher-Widerrufsrecht "
"bei Fernabsatz, Frist, Folgen, Muster",
}
SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung.
Fuer jeden MC bekommst du:
- den ihm zugeordneten doc_type (z.B. 'impressum', 'cookie', 'dse')
- den Titel und die check_question
Deine Aufgabe: entscheide ob der MC fachlich wirklich zu diesem doc_type passt.
Beispiele:
- MC "Wird die USt-IdNr im Impressum genannt?" + doc_type=impressum → PASST
- MC "Monatlich aktive Endnutzer fuer Online-Vermittlungsdienste messen" + doc_type=impressum → PASST NICHT
(DSA-Pflicht, gehoert nicht in ein klassisches §5-TMG-Impressum)
- MC "Mithoeren von Funknachrichten verhindern" + doc_type=cookie → PASST NICHT
(TKG-Spezialthema, nicht Cookie-Richtlinie)
Antworte als JSON-Array, eine Zeile pro MC:
[{"control_id": "<wie input>", "doc_type": "<wie input>", "fits": true|false,
"rationale": "ein kurzer satz"}, ...]
Kein Markdown."""
def fetch_pairs_to_audit(conn) -> list[dict]:
    """Return all text MCs not audited yet (fits_doc_type is still NULL)."""
    with sqlite3.connect(SIDECAR_DB) as side:
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        if "fits_doc_type" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN fits_doc_type INTEGER")
            side.commit()
        already = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE fits_doc_type IS NOT NULL"
        ):
            already.add((cid, dt or ""))
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc""")
        all_rows = list(c.fetchall())
    # Audit only those classified as 'text' in the sidecar; process/review
    # MCs never run through doc_check anyway.
    with sqlite3.connect(SIDECAR_DB) as side:
        text_pairs = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE check_type = 'text'"
        ):
            text_pairs.add((cid, dt or ""))
    target = [r for r in all_rows
              if (r["control_id"], r["doc_type"] or "") in text_pairs
              and (r["control_id"], r["doc_type"] or "") not in already]
    return target
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    """Send one batch to the Anthropic Messages API and parse the JSON reply."""
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": (
                "Doc-Typen-Beschreibungen:\n"
                + "\n".join(f"- {k}: {v}" for k, v in DOC_TYPE_DESCRIPTIONS.items())
                + "\n\nPruefe folgende MCs:\n\n"
                + json.dumps([
                    {"control_id": m["control_id"], "doc_type": m["doc_type"],
                     "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
                    for m in batch
                ], ensure_ascii=False, indent=2)
            ),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    # Strip a Markdown code fence if the model added one despite the prompt.
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    if txt.startswith("json"):
        txt = txt[4:].strip()
    return json.loads(txt)
def store_audit(rows: list[dict]) -> None:
    """Write the audit verdicts into the sidecar (existing rows only)."""
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET fits_doc_type = ?, "
            "rationale = COALESCE(?, rationale), classified_at = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("fits") else 0,
                    (r.get("rationale") or "")[:500] or None,
                    ts,
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        c.commit()
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    pairs = fetch_pairs_to_audit(conn)
    if args.sample:
        for m in pairs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal pairs to audit: {len(pairs)}")
        return
    print(f"Auditing {len(pairs)} text MCs in batches of {BATCH_SIZE}", flush=True)
    if not pairs:
        print("Everything already audited.")
        return
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, len(pairs), BATCH_SIZE):
        batch = pairs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_audit(out)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (len(pairs) - done) / max(rate, 0.01)
            print(f" [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            failed_batches += 1
            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print("Too many errors, aborting.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    print(f"\nDone: {done}/{len(pairs)} audited ({failed_batches} failed batches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, "
            " SUM(CASE WHEN fits_doc_type = 1 THEN 1 ELSE 0 END) AS fits, "
            " SUM(CASE WHEN fits_doc_type = 0 THEN 1 ELSE 0 END) AS misfits, "
            " COUNT(*) AS total "
            "FROM mc_classification "
            "WHERE check_type = 'text' AND fits_doc_type IS NOT NULL "
            "GROUP BY doc_type ORDER BY doc_type"
        ).fetchall()
    print("\n=== Audit distribution doc_type x fits ===")
    for r in rows:
        print(f" {r['doc_type']:<14} fits={r['fits']:<4} misfits={r['misfits']:<4} total={r['total']}")


if __name__ == "__main__":
    main()
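
The fence stripping in call_claude assumes the reply either is bare JSON or starts with a code fence. A more tolerant parser could cut out the outermost bracket pair and then drop rows that were not part of the request. This is a sketch under those assumptions, not part of the commit; both helper names are hypothetical, and a preamble that itself contains "[" would still break it.

```python
import json


def parse_json_array(text: str) -> list[dict]:
    """Extract the outermost JSON array from a model reply, fenced or not."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model response")
    data = json.loads(text[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("model response is not a JSON array")
    return [row for row in data if isinstance(row, dict)]


def keep_known_pairs(rows: list[dict], batch: list[dict]) -> list[dict]:
    """Drop rows whose (control_id, doc_type) was not in the request batch."""
    known = {(m["control_id"], m["doc_type"] or "") for m in batch}
    return [r for r in rows
            if (r.get("control_id"), r.get("doc_type") or "") in known]
```

Filtering against the batch matters most for the classify scripts, which use INSERT OR REPLACE and would otherwise store a row for an ID the model made up.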
@@ -0,0 +1,216 @@
"""
A3-v2: sharpens which MCs actually target the CMP banner UI or a
process rather than the document TEXT.
The BMW run showed ~10 'clear/simple language' + 'button text wording' MCs
that the first audit marked as fitting their doc_type. They are, however, not
checkable against the cookie policy or DSE text: they ask about the
comprehensibility of the consent UI.
Sets 'scope_requires' for FRT/biometric MCs and re-classifies
these UI items to check_type='process' (which filters them out of doc_check).
In addition, sets scope_requires for obviously conditional MCs:
- 'biometric_processing' for FRT/face recognition
- 'ai_decision_making' for automated individual decisions
- 'child_targeting' for child-consent MCs
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 20
SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung
zweiter Ordnung: jeder MC wurde bereits als 'text' klassifiziert und einem
doc_type zugeordnet. Du entscheidest:
A) Prueft der MC den TEXT des Dokuments (z.B. "Wird im Impressum die
USt-IdNr genannt?") → fits_doc_type=true, ui_only=false
B) Prueft der MC die UI/UX eines Banners oder Buttons (z.B.
"Ist der Ablehnungs-Button gleich gross wie Akzeptieren?", "Ist die
Einwilligungsanfrage in einfacher Sprache formuliert?") → ui_only=true
(Diese MCs koennen NIE im Dokumenttext stehen, weil sie sich auf eine
externe UI beziehen.)
Zusaetzlich kennzeichne SCOPE-Bedingungen, wenn der MC nur fuer bestimmte
Sites relevant ist:
- 'biometric_processing' : nur bei Sites die biometrische Daten
(Gesichtserkennung/FRT, Fingerabdruecke) verarbeiten
- 'ai_decision_making' : nur bei automatisierten Einzelentscheidungen
(Art. 22 DSGVO)
- 'child_targeting' : nur bei Sites die sich an Kinder richten
- 'ecommerce' : nur bei Webshops
- 'b2c' : nur fuer Verbraucher-Geschaeft (UWG/BGB-Pflichten)
Wenn der MC fuer ALLE Sites gilt, setze scope_requires=null.
Antworte als JSON-Array — keine Erklaerung davor/danach, kein Markdown.
Format:
[{"control_id": "<wie input>", "doc_type": "<wie input>",
"ui_only": true|false, "scope_requires": "biometric_processing|ai_decision_making|child_targeting|ecommerce|b2c|null",
"rationale": "ein kurzer satz"}, ...]"""
def fetch_pairs_to_audit(conn) -> list[dict]:
    """Return all text MCs not v2-audited yet (ui_only is still NULL)."""
    with sqlite3.connect(SIDECAR_DB) as side:
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        added = False
        if "ui_only" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN ui_only INTEGER")
            added = True
        if "scope_requires" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN scope_requires TEXT")
            added = True
        if added:
            side.commit()
        already = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE ui_only IS NOT NULL"
        ):
            already.add((cid, dt or ""))
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc""")
        all_rows = list(c.fetchall())
    # Audit only those already classified as text and not rejected by the
    # fits_doc_type audit (NULL = not audited yet, 1 = fits).
    with sqlite3.connect(SIDECAR_DB) as side:
        eligible = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE check_type = 'text' AND (fits_doc_type IS NULL OR fits_doc_type = 1)"
        ):
            eligible.add((cid, dt or ""))
    target = [r for r in all_rows
              if (r["control_id"], r["doc_type"] or "") in eligible
              and (r["control_id"], r["doc_type"] or "") not in already]
    return target
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    """Send one batch to the Anthropic Messages API and parse the JSON reply."""
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Pruefe folgende MCs:\n\n" + json.dumps([
                {"control_id": m["control_id"], "doc_type": m["doc_type"],
                 "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
                for m in batch
            ], ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    # Strip a Markdown code fence if the model added one despite the prompt.
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    if txt.startswith("json"):
        txt = txt[4:].strip()
    return json.loads(txt)
def store(rows: list[dict]) -> None:
    def scope_of(r: dict) -> str | None:
        # Missing, empty and the literal string "null" all mean
        # "applies to every site" and are stored as SQL NULL.
        scope = (r.get("scope_requires") or "").strip()
        return scope if scope.lower() not in ("", "null") else None

    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET ui_only = ?, scope_requires = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("ui_only") else 0,
                    scope_of(r),
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        # MCs flagged ui_only become check_type='process' so they're not in doc_check
        c.executemany(
            "UPDATE mc_classification SET check_type='process' "
            "WHERE ui_only=1 AND control_id=? AND doc_type=?",
            [(r.get("control_id"), r.get("doc_type") or "") for r in rows
             if r.get("ui_only")],
        )
        c.commit()
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    pairs = fetch_pairs_to_audit(conn)
    if args.sample:
        for m in pairs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal: {len(pairs)}")
        return
    print(f"Audit v2: {len(pairs)} MCs in batches of {BATCH_SIZE}", flush=True)
    if not pairs:
        print("Everything already checked.")
        return
    done = 0
    fail = 0
    t0 = time.time()
    for i in range(0, len(pairs), BATCH_SIZE):
        batch = pairs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store(out)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (len(pairs) - done) / max(rate, 0.01)
            print(f" [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            fail += 1
            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", file=sys.stderr, flush=True)
            if fail >= 5:
                print("Too many errors, aborting.", file=sys.stderr)
                break
        time.sleep(0.5)
    print(f"\nDone: {done}/{len(pairs)} ({fail} failed batches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        ui = c.execute("SELECT COUNT(*) FROM mc_classification WHERE ui_only=1").fetchone()[0]
        scope = c.execute(
            "SELECT scope_requires, COUNT(*) FROM mc_classification "
            "WHERE scope_requires IS NOT NULL GROUP BY scope_requires"
        ).fetchall()
    print(f"\nui_only=1: {ui} MCs (now 'process')")
    print("scope_requires distribution:")
    for s, n in scope:
        print(f" {s}: {n}")


if __name__ == "__main__":
    main()
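
On the consuming side, the stored scope_requires value has to be compared with what a site actually does. The sketch below shows one way this could work, assuming the business_profile exposes a set of scope flags. Both function names, the whitelist check and the profile representation are assumptions; the script above stores any non-null string as returned by the model.

```python
from __future__ import annotations

# Scope values listed in the audit prompt.
ALLOWED_SCOPES = {
    "biometric_processing", "ai_decision_making",
    "child_targeting", "ecommerce", "b2c",
}


def normalize_scope(value: object) -> str | None:
    """Map a raw model value to a known scope, or None for 'all sites'."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip().lower()
    # Assumption: unknown strings (including "null") are treated as no scope.
    return cleaned if cleaned in ALLOWED_SCOPES else None


def mc_applies(scope_requires: str | None, profile_scopes: set[str]) -> bool:
    """An MC applies if it is unconditional or its scope is in the profile."""
    return scope_requires is None or scope_requires in profile_scopes
```

A profile without biometric processing then skips the FRT controls, while unconditional MCs always run.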
@@ -0,0 +1,222 @@
"""
Classify doc_check_controls (1874 MCs) into check_type:
- text : MC asks about the DOCUMENT'S WORDING ("enthaelt die DSE..", "wird im Impressum genannt..")
- process : MC asks about a TECHNICAL OR ORGANISATIONAL CONTROL ("ist sichergestellt..", "ist implementiert..")
- review : MC asks an ABSTRACT META-QUESTION that needs human judgment ("sind alle Verarbeitungen vollstaendig?")
Output goes to sidecar SQLite at /data/mc_classification.db (no Postgres schema change
per CLAUDE.md guardrails). Schema:
CREATE TABLE mc_classification (
control_id TEXT PRIMARY KEY,
doc_type TEXT,
title TEXT,
check_type TEXT, -- text|process|review
confidence REAL, -- 0..1
rationale TEXT,
classified_at TEXT
);
Run from inside bp-compliance-backend container:
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type.py [--limit N] [--sample]
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5 # sec — keep gentle for the parallel Haiku batch
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
TEXT — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments (DSE, Impressum, Cookie-Richtlinie etc.) steht.
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?" / "Wird im Impressum die USt-IdNr genannt?"
Diese MCs koennen gegen den Dokument-Text gematched werden.
PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME, die ausserhalb des Dokumenttexts liegt.
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" /
"Ist ein Loeschkonzept implementiert?" / "Wird das Consent-Tool monatlich getestet?"
Diese MCs lassen sich NICHT aus dem Dokumenttext beantworten — sie brauchen Evidence/TOM-Nachweis.
REVIEW — Der MC stellt eine abstrakte Vollstaendigkeits-/Konsistenzfrage, die menschliche Beurteilung braucht.
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?" / "Stimmen Cookie-Richtlinie und Banner ueberein?"
Diese MCs sind Checklisten-Items fuer den DSB, kein automatischer Check moeglich.
Antworte ausschliesslich als JSON-Array — keine Erklaerung davor/danach, kein Markdown-Codeblock. Format:
[{"control_id": "<wie input>", "check_type": "text|process|review", "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
def fetch_mcs(conn, limit: int | None = None, only_unclassified: bool = True) -> list[dict]:
    """Fetch MCs from Postgres, optionally skipping those already in the sidecar."""
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls
                     ORDER BY doc_type, title""")
        rows = list(c.fetchall())
    if only_unclassified:
        # Filter in Python instead of via a Postgres temp table: a failed
        # CREATE/INSERT would abort the transaction and break the SELECT,
        # and the swallowed exception would hide the cause.
        with sqlite3.connect(SIDECAR_DB) as side:
            classified = {r[0] for r in side.execute("SELECT control_id FROM mc_classification")}
        rows = [r for r in rows if r["control_id"] not in classified]
    return rows[:limit] if limit else rows
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    """Send one batch to the Anthropic Messages API and parse the JSON reply."""
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
                [{"control_id": m["control_id"],
                  "doc_type": m["doc_type"],
                  "title": m["title"],
                  "check_question": (m["check_question"] or "")[:400]}
                 for m in batch],
                ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    # Strip code fences if Sonnet adds them
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    if txt.startswith("json"):
        txt = txt[4:].strip()
    return json.loads(txt)
def ensure_sidecar() -> None:
    """Create the sidecar database and table if they do not exist yet."""
    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executescript("""
            CREATE TABLE IF NOT EXISTS mc_classification (
                control_id TEXT PRIMARY KEY,
                doc_type TEXT,
                title TEXT,
                check_type TEXT,
                confidence REAL,
                rationale TEXT,
                classified_at TEXT
            );
            CREATE INDEX IF NOT EXISTS idx_doctype ON mc_classification(doc_type);
            CREATE INDEX IF NOT EXISTS idx_type ON mc_classification(check_type);
        """)
def store_results(rows: list[dict], lookup: dict[str, dict]) -> None:
    """Upsert one classification row per control_id into the sidecar."""
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "INSERT OR REPLACE INTO mc_classification "
            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            [
                (
                    r.get("control_id"),
                    lookup.get(r.get("control_id"), {}).get("doc_type", ""),
                    lookup.get(r.get("control_id"), {}).get("title", ""),
                    (r.get("check_type") or "").lower(),
                    float(r.get("confidence") or 0),
                    (r.get("rationale") or "")[:500],
                    ts,
                )
                for r in rows
            ],
        )
        c.commit()
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--limit", type=int, default=None, help="limit number of MCs (for testing)")
    ap.add_argument("--sample", action="store_true", help="just print first 5 MCs and exit")
    ap.add_argument("--all", action="store_true", help="re-classify all (otherwise skips already classified)")
    args = ap.parse_args()
    ensure_sidecar()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    mcs = fetch_mcs(conn, limit=args.limit, only_unclassified=not args.all)
    if args.sample:
        for m in mcs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        return
    print(f"Classifying {len(mcs)} MCs in batches of {BATCH_SIZE}", flush=True)
    if not mcs:
        print("Nothing to do.")
        return
    lookup = {m["control_id"]: m for m in mcs}
    total = len(mcs)
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, total, BATCH_SIZE):
        batch = mcs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_results(out, lookup)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (total - done) / max(rate, 0.01)
            print(f" [{done:>5}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min",
                  flush=True)
        except Exception as e:
            failed_batches += 1
            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print(" Too many errors, aborting.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    print(f"\nDone: {done}/{total} classified ({failed_batches} failed batches).")
    # Summary
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
        ).fetchall()
    print("\n=== Distribution by doc_type x check_type ===")
    prev = None
    for r in rows:
        if r["doc_type"] != prev:
            print()
            print(f"[{r['doc_type']}]")
            prev = r["doc_type"]
        print(f" {r['check_type']:<8} {r['n']}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,241 @@
"""
v2: Re-classify only the (control_id, doc_type) PAIRS that aren't covered yet.
V1 used PK=control_id, so cross-doc-type variants (same control assigned
to e.g. AGB AND Widerruf with different check_questions) overwrote each
other. v2 migrates to PK=(control_id, doc_type) and classifies only the
~262 missing pairs.
Run from container:
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type_v2.py [--sample]
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
TEXT — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments steht.
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?"
Diese MCs koennen gegen den Dokument-Text gematched werden.
PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME ausserhalb des Dokumenttexts.
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?"
REVIEW — Der MC stellt eine abstrakte Vollstaendigkeitsfrage, die menschliche Beurteilung braucht.
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?"
Wichtig: derselbe Control kann fuer VERSCHIEDENE doc_types unterschiedlich klassifiziert sein —
mit doc_type-spezifisch angepasster check_question kann ein "text"-Check fuer ein Dok zum
"process"-Check fuer ein anderes werden.
Antworte ausschliesslich als JSON-Array — kein Markdown. Format:
[{"control_id": "<input>", "doc_type": "<input>", "check_type": "text|process|review",
"confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
def migrate_schema() -> None:
    """Migrate sidecar from PK=control_id to PK=(control_id, doc_type)."""
    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(SIDECAR_DB) as c:
        # Check if v2 schema already in place (composite PK)
        cols = c.execute("PRAGMA table_info(mc_classification)").fetchall()
        if not cols:
            # First run: create the table fresh
            c.executescript("""
                CREATE TABLE mc_classification (
                    control_id TEXT,
                    doc_type TEXT,
                    title TEXT,
                    check_type TEXT,
                    confidence REAL,
                    rationale TEXT,
                    classified_at TEXT,
                    PRIMARY KEY (control_id, doc_type)
                );
                CREATE INDEX idx_doctype ON mc_classification(doc_type);
                CREATE INDEX idx_type ON mc_classification(check_type);
            """)
            return
        # Check whether the existing table already has composite PK
        # (column 5 of PRAGMA table_info is the 1-based position in the PK)
        pk_cols = [r[1] for r in cols if r[5] > 0]
        if set(pk_cols) == {"control_id", "doc_type"}:
            print("Schema already v2 (composite PK). Skipping migration.")
            return
        print("Migrating sidecar schema to PK(control_id, doc_type)...")
        c.executescript("""
            CREATE TABLE mc_classification_v2 (
                control_id TEXT,
                doc_type TEXT,
                title TEXT,
                check_type TEXT,
                confidence REAL,
                rationale TEXT,
                classified_at TEXT,
                PRIMARY KEY (control_id, doc_type)
            );
            INSERT INTO mc_classification_v2
                (control_id, doc_type, title, check_type, confidence, rationale, classified_at)
            SELECT control_id, COALESCE(doc_type, ''), title, check_type, confidence, rationale, classified_at
            FROM mc_classification;
            DROP TABLE mc_classification;
            ALTER TABLE mc_classification_v2 RENAME TO mc_classification;
            CREATE INDEX idx_doctype ON mc_classification(doc_type);
            CREATE INDEX idx_type ON mc_classification(check_type);
        """)
        n = c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
        print(f"Migrated {n} existing rows.")
def fetch_unclassified_pairs(conn) -> list[dict]:
    """All (control_id, doc_type) pairs in PG that aren't yet in sidecar."""
    side_pairs: set[tuple[str, str]] = set()
    with sqlite3.connect(SIDECAR_DB) as side:
        for cid, dt in side.execute("SELECT control_id, doc_type FROM mc_classification"):
            side_pairs.add((cid, dt or ""))
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls""")
        all_rows = list(c.fetchall())
    missing = [r for r in all_rows if (r["control_id"], r["doc_type"] or "") not in side_pairs]
    return missing
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    """Send one batch to the Anthropic Messages API and parse the JSON reply."""
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
                [{"control_id": m["control_id"],
                  "doc_type": m["doc_type"],
                  "title": m["title"],
                  "check_question": (m["check_question"] or "")[:400]}
                 for m in batch],
                ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    # Strip a Markdown code fence if the model added one despite the prompt.
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    if txt.startswith("json"):
        txt = txt[4:].strip()
    return json.loads(txt)
def store_results(rows: list[dict], lookup: dict[tuple[str, str], dict]) -> None:
    """Upsert one classification row per (control_id, doc_type) pair."""
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "INSERT OR REPLACE INTO mc_classification "
            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            [
                (
                    r.get("control_id"),
                    r.get("doc_type") or "",
                    lookup.get((r.get("control_id"), r.get("doc_type") or ""), {}).get("title", ""),
                    (r.get("check_type") or "").lower(),
                    float(r.get("confidence") or 0),
                    (r.get("rationale") or "")[:500],
                    ts,
                )
                for r in rows
            ],
        )
        c.commit()
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()
    migrate_schema()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    missing = fetch_unclassified_pairs(conn)
    if args.sample:
        for m in missing[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal missing pairs: {len(missing)}")
        return
    print(f"Classifying {len(missing)} missing (control_id, doc_type) pairs in batches of {BATCH_SIZE}", flush=True)
    if not missing:
        print("Everything classified. Nothing to do.")
        return
    lookup = {(m["control_id"], m["doc_type"] or ""): m for m in missing}
    total = len(missing)
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, total, BATCH_SIZE):
        batch = missing[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_results(out, lookup)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (total - done) / max(rate, 0.01)
            print(f" [{done:>4}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            failed_batches += 1
            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print(" Too many errors, aborting.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    print(f"\nDone: {done}/{total} classified ({failed_batches} failed batches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
        ).fetchall()
    print("\n=== Final distribution doc_type x check_type ===")
    prev = None
    for r in rows:
        if r["doc_type"] != prev:
            print()
            print(f"[{r['doc_type']}]")
            prev = r["doc_type"]
        print(f" {r['check_type']:<8} {r['n']}")


if __name__ == "__main__":
    main()
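
The overwrite problem that motivated the v2 schema can be reproduced in an in-memory database. The table names v1 and v2 and the sample rows below are illustrative only; the PRAGMA check mirrors the one in migrate_schema.

```python
import sqlite3


def demo_primary_keys() -> tuple[dict[str, int], list[str]]:
    """Insert the same control under two doc_types into both schema variants."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE v1 (control_id TEXT PRIMARY KEY, "
               "doc_type TEXT, check_type TEXT)")
    db.execute("CREATE TABLE v2 (control_id TEXT, doc_type TEXT, check_type TEXT, "
               "PRIMARY KEY (control_id, doc_type))")
    rows = [("MC-1", "agb", "text"), ("MC-1", "widerruf", "process")]
    for table in ("v1", "v2"):
        # With PK=control_id the second row replaces the first one.
        db.executemany(f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?)", rows)
    counts = {t: db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
              for t in ("v1", "v2")}
    # Column 5 of PRAGMA table_info is the 1-based position within the PK.
    pk_v2 = [r[1] for r in db.execute("PRAGMA table_info(v2)") if r[5] > 0]
    db.close()
    return counts, pk_v2
```

The single-column key keeps one row per control, so the AGB variant is lost; the composite key keeps both.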