feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered per business_profile (FRT skipped for BMW, etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the Impressum. Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over DOM extraction when it is richer (BMW: 1824 vs. 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU-alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for clearly EU vendors (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...) one structured instruction with what/why/fix_text/where/example, ready to paste 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed doc check + vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). cmp_vendors are persisted in the check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts; YouTube/Maps/video placeholder until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb, similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (audit DB write permission).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE        7.5% -> 81-83%
  Impressum  4%   -> 100% (all 6 real MCs fulfilled)
  Cookie     0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
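The commit message describes deriving a vendor's country from the legal-form suffix in its name. A minimal sketch of that idea follows. The names `LEGAL_FORM_COUNTRY` and `infer_country` are hypothetical, and only the four suffix pairs are taken from the commit message. The actual cookie_link_validator may work differently, and its vendor lookup table is not modeled here.

```python
from __future__ import annotations

# Hypothetical suffix table; only these four pairs appear in the commit message.
LEGAL_FORM_COUNTRY = {
    "a/s": "DK",
    "gmbh": "DE",
    "inc": "US",
    "b.v.": "NL",
}


def infer_country(vendor_name: str) -> str | None:
    """Guess an ISO country code from the last token of a vendor name."""
    tokens = vendor_name.lower().replace(",", " ").split()
    if not tokens:
        return None
    last = tokens[-1]
    # Try the token as written first, then without a trailing dot ("inc." -> "inc").
    return LEGAL_FORM_COUNTRY.get(last) or LEGAL_FORM_COUNTRY.get(last.rstrip("."))
```

Only the last token is inspected, so a legal-form word in the middle of a name is not treated as a suffix. Names without a known suffix return `None` and would need a lookup table.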
@@ -0,0 +1,229 @@
+"""
+LLM audit: checks for every text MC whether it really belongs to its
+assigned doc_type. The BMW run showed, for example, that most
+Impressum MCs are actually DSA/e-commerce obligations that do not
+appear in a §5 TMG Impressum at all.
+
+Output:
+- doc_type fits → fits_doc_type = 1 in the sidecar (MC stays active)
+- doc_type does NOT fit → fits_doc_type = 0 in the sidecar;
+  rag_document_checker then filters these out
+
+Sidecar SQLite only. NO Postgres change (schema frozen).
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sqlite3
+import sys
+import time
+from datetime import datetime, timezone
+
+import httpx
+import psycopg2
+from psycopg2.extras import RealDictCursor
+
+ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
+MODEL = "claude-sonnet-4-6"
+SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
+BATCH_SIZE = 25
+SLEEP_BETWEEN_BATCHES = 0.5
+
+DOC_TYPE_DESCRIPTIONS = {
+    "agb": "Allgemeine Geschaeftsbedingungen — Vertragsbedingungen "
+           "zwischen Anbieter und Kunde",
+    "avv": "Auftragsverarbeitungsvertrag Art. 28 DSGVO — Vertrag zwischen "
+           "Verantwortlichem und Auftragsverarbeiter",
+    "cookie": "Cookie-Richtlinie — Information ueber Zwecke, Anbieter, "
+              "Speicherdauer und Widerruf von Cookies/Tracking-Pixeln",
+    "dse": "Datenschutzerklaerung Art. 13 DSGVO — Informationen ueber "
+           "Verantwortlichen, Zwecke, Rechtsgrundlagen, Empfaenger, "
+           "Speicherdauer, Betroffenenrechte, Aufsichtsbehoerde",
+    "dsfa": "Datenschutz-Folgenabschaetzung Art. 35 DSGVO — Risikobewertung "
+            "von Verarbeitungen mit hohem Risiko",
+    "impressum": "Impressum §5 DDG (ehemals §5 TMG) / §18 MStV - Anbieterkennzeichnung mit "
+                 "Name, Anschrift, Vertretungsberechtigten, Handelsregister, "
+                 "USt-IdNr., berufsrechtliche Angaben, Aufsicht",
+    "loeschkonzept": "Loeschkonzept DIN 66398 — Definition Aufbewahrungs- "
+                     "und Loeschfristen pro Datenkategorie + Prozess",
+    "widerruf": "Widerrufsbelehrung §312g BGB - Verbraucher-Widerrufsrecht "
+                "bei Fernabsatz, Frist, Folgen, Muster",
+}
+
+SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung.
+
+Fuer jeden MC bekommst du:
+- den ihm zugeordneten doc_type (z.B. 'impressum', 'cookie', 'dse')
+- den Titel und die check_question
+
+Deine Aufgabe: entscheide ob der MC fachlich wirklich zu diesem doc_type passt.
+
+Beispiele:
+- MC "Wird die USt-IdNr im Impressum genannt?" + doc_type=impressum → PASST
+- MC "Monatlich aktive Endnutzer fuer Online-Vermittlungsdienste messen" + doc_type=impressum → PASST NICHT
+  (DSA-Pflicht, gehoert nicht in ein klassisches §5-TMG-Impressum)
+- MC "Mithoeren von Funknachrichten verhindern" + doc_type=cookie → PASST NICHT
+  (TKG-Spezialthema, nicht Cookie-Richtlinie)
+
+Antworte als JSON-Array, eine Zeile pro MC:
+[{"control_id": "<wie input>", "doc_type": "<wie input>", "fits": true|false,
+  "rationale": "ein kurzer satz"}, ...]
+Kein Markdown."""
+
+
+def fetch_pairs_to_audit(conn) -> list[dict]:
+    """All text-MCs that haven't been audited yet (fits_doc_type IS NULL)."""
+    with sqlite3.connect(SIDECAR_DB) as side:
+        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
+        if "fits_doc_type" not in cols:
+            side.execute("ALTER TABLE mc_classification ADD COLUMN fits_doc_type INTEGER")
+            side.commit()
+        already = set()
+        for cid, dt in side.execute(
+            "SELECT control_id, doc_type FROM mc_classification "
+            "WHERE fits_doc_type IS NOT NULL"
+        ):
+            already.add((cid, dt or ""))
+
+    with conn.cursor(cursor_factory=RealDictCursor) as c:
+        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
+                     FROM compliance.doc_check_controls dc""")
+        all_rows = list(c.fetchall())
+
+    # Audit only those classified as 'text' in sidecar — process/review
+    # never run through doc_check anyway
+    with sqlite3.connect(SIDECAR_DB) as side:
+        text_pairs = set()
+        for cid, dt in side.execute(
+            "SELECT control_id, doc_type FROM mc_classification "
+            "WHERE check_type = 'text'"
+        ):
+            text_pairs.add((cid, dt or ""))
+
+    target = [r for r in all_rows
+              if (r["control_id"], r["doc_type"] or "") in text_pairs
+              and (r["control_id"], r["doc_type"] or "") not in already]
+    return target
+
+
+def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
+    payload = {
+        "model": MODEL,
+        "max_tokens": 4000,
+        "system": SYSTEM_PROMPT,
+        "messages": [{
+            "role": "user",
+            "content": (
+                "Doc-Typen-Beschreibungen:\n"
+                + "\n".join(f"- {k}: {v}" for k, v in DOC_TYPE_DESCRIPTIONS.items())
+                + "\n\nPruefe folgende MCs:\n\n"
+                + json.dumps([
+                    {"control_id": m["control_id"], "doc_type": m["doc_type"],
+                     "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
+                    for m in batch
+                ], ensure_ascii=False, indent=2)
+            ),
+        }],
+    }
+    headers = {
+        "x-api-key": api_key,
+        "anthropic-version": "2023-06-01",
+        "content-type": "application/json",
+    }
+    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
+    r.raise_for_status()
+    txt = r.json()["content"][0]["text"].strip()
+    if txt.startswith("```"):
+        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+        if txt.startswith("json"):
+            txt = txt[4:].strip()
+    return json.loads(txt)
+
+
+def store_audit(rows: list[dict]) -> None:
+    ts = datetime.now(timezone.utc).isoformat()
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.executemany(
+            "UPDATE mc_classification SET fits_doc_type = ?, "
+            "rationale = COALESCE(?, rationale), classified_at = ? "
+            "WHERE control_id = ? AND doc_type = ?",
+            [
+                (
+                    1 if r.get("fits") else 0,
+                    (r.get("rationale") or "")[:500] or None,
+                    ts,
+                    r.get("control_id"),
+                    r.get("doc_type") or "",
+                )
+                for r in rows
+            ],
+        )
+        c.commit()
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--sample", action="store_true")
+    args = ap.parse_args()
+
+    api_key = os.environ["ANTHROPIC_API_KEY"]
+    conn = psycopg2.connect(os.environ["DATABASE_URL"])
+    pairs = fetch_pairs_to_audit(conn)
+
+    if args.sample:
+        for m in pairs[:5]:
+            print(json.dumps(m, ensure_ascii=False, indent=2))
+        print(f"\nTotal pairs to audit: {len(pairs)}")
+        return
+
+    print(f"Audit {len(pairs)} text-MCs in Batches von {BATCH_SIZE}", flush=True)
+    if not pairs:
+        print("Alles auditiert.")
+        return
+
+    done = 0
+    failed_batches = 0
+    t0 = time.time()
+    for i in range(0, len(pairs), BATCH_SIZE):
+        batch = pairs[i:i + BATCH_SIZE]
+        try:
+            out = call_claude(api_key, batch)
+            store_audit(out)
+            done += len(out)
+            elapsed = time.time() - t0
+            rate = done / max(elapsed, 0.01)
+            eta = (len(pairs) - done) / max(rate, 0.01)
+            print(f"  [{done:>4}/{len(pairs)}] {rate:.1f} MC/s  ETA {eta/60:.1f}min", flush=True)
+        except Exception as e:
+            failed_batches += 1
+            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
+            if failed_batches >= 5:
+                print("Zu viele Fehler — abbrechen.", file=sys.stderr)
+                break
+        time.sleep(SLEEP_BETWEEN_BATCHES)
+
+    print(f"\nFertig: {done}/{len(pairs)} auditiert ({failed_batches} Fehlbatches).")
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.row_factory = sqlite3.Row
+        rows = c.execute(
+            "SELECT doc_type, "
+            "  SUM(CASE WHEN fits_doc_type = 1 THEN 1 ELSE 0 END) AS fits, "
+            "  SUM(CASE WHEN fits_doc_type = 0 THEN 1 ELSE 0 END) AS misfits, "
+            "  COUNT(*) AS total "
+            "FROM mc_classification "
+            "WHERE check_type = 'text' AND fits_doc_type IS NOT NULL "
+            "GROUP BY doc_type ORDER BY doc_type"
+        ).fetchall()
+        print("\n=== Audit-Verteilung doc_type x fits ===")
+        for r in rows:
+            print(f"  {r['doc_type']:<14}  fits={r['fits']:<4} misfits={r['misfits']:<4} total={r['total']}")
+
+
+if __name__ == "__main__":
+    main()
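The reply handling in `call_claude` above (strip an optional code fence, then `json.loads`) can be exercised in isolation. The sketch below is one possible standalone version, not the script's own helper. The name `parse_json_array` and the extra type checks are assumptions added for illustration.

```python
from __future__ import annotations

import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_array(text: str) -> list[dict]:
    """Strip an optional Markdown code fence, then parse a JSON array of objects."""
    txt = text.strip()
    if txt.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json").
        txt = txt.split("\n", 1)[1] if "\n" in txt else ""
        # Drop the closing fence and anything after it.
        txt = txt.rsplit(FENCE, 1)[0].strip()
    data = json.loads(txt)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    # Keep only object entries so later .get() calls cannot fail.
    return [row for row in data if isinstance(row, dict)]
```

Filtering out non-object entries is a design choice of this sketch. A stricter variant could raise instead, so that a malformed batch counts as a failed batch.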
@@ -0,0 +1,216 @@
+"""
+A3-v2: sharpen which MCs actually target the CMP banner UI or a
+process rather than the document TEXT.
+
+The BMW run showed ~10 'clear/simple language' + 'button text wording' MCs
+that the first audit marked as fitting their doc_type. HOWEVER, they cannot
+be checked against the cookie policy or DSE text: they ask about the
+comprehensibility of the consent UI.
+
+Sets 'scope_requires' for FRT/biometric MCs and re-classifies these
+UI items to check_type='process' (which filters them out of doc_check).
+
+In addition, sets scope_requires for obviously conditional MCs:
+  - 'biometric_processing' for FRT/facial recognition
+  - 'ai_decision_making' for automated individual decisions
+  - 'child_targeting' for child-consent MCs
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sqlite3
+import sys
+import time
+
+import httpx
+import psycopg2
+from psycopg2.extras import RealDictCursor
+
+ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
+MODEL = "claude-sonnet-4-6"
+SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
+BATCH_SIZE = 20
+
+SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung
+zweiter Ordnung: jeder MC wurde bereits als 'text' klassifiziert und einem
+doc_type zugeordnet. Du entscheidest:
+
+A) Prueft der MC den TEXT des Dokuments (z.B. "Wird im Impressum die
+   USt-IdNr genannt?") → fits_doc_type=true, ui_only=false
+B) Prueft der MC die UI/UX eines Banners oder Buttons (z.B.
+   "Ist der Ablehnungs-Button gleich gross wie Akzeptieren?", "Ist die
+   Einwilligungsanfrage in einfacher Sprache formuliert?") → ui_only=true
+   (Diese MCs koennen NIE im Dokumenttext stehen, weil sie sich auf eine
+   externe UI beziehen.)
+
+Zusaetzlich kennzeichne SCOPE-Bedingungen, wenn der MC nur fuer bestimmte
+Sites relevant ist:
+  - 'biometric_processing' : nur bei Sites die biometrische Daten
+    (Gesichtserkennung/FRT, Fingerabdruecke) verarbeiten
+  - 'ai_decision_making'   : nur bei automatisierten Einzelentscheidungen
+    (Art. 22 DSGVO)
+  - 'child_targeting'      : nur bei Sites die sich an Kinder richten
+  - 'ecommerce'            : nur bei Webshops
+  - 'b2c'                  : nur fuer Verbraucher-Geschaeft (UWG/BGB-Pflichten)
+Wenn der MC fuer ALLE Sites gilt, setze scope_requires=null.
+
+Antworte als JSON-Array — keine Erklaerung davor/danach, kein Markdown.
+Format:
+[{"control_id": "<wie input>", "doc_type": "<wie input>",
+  "ui_only": true|false, "scope_requires": "biometric_processing|ai_decision_making|child_targeting|ecommerce|b2c|null",
+  "rationale": "ein kurzer satz"}, ...]"""
+
+
+def fetch_pairs_to_audit(conn) -> list[dict]:
+    """All text-MCs that haven't been v2-audited yet (ui_only IS NULL)."""
+    with sqlite3.connect(SIDECAR_DB) as side:
+        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
+        added = False
+        if "ui_only" not in cols:
+            side.execute("ALTER TABLE mc_classification ADD COLUMN ui_only INTEGER")
+            added = True
+        if "scope_requires" not in cols:
+            side.execute("ALTER TABLE mc_classification ADD COLUMN scope_requires TEXT")
+            added = True
+        if added:
+            side.commit()
+        already = set()
+        for cid, dt in side.execute(
+            "SELECT control_id, doc_type FROM mc_classification "
+            "WHERE ui_only IS NOT NULL"
+        ):
+            already.add((cid, dt or ""))
+
+    with conn.cursor(cursor_factory=RealDictCursor) as c:
+        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
+                     FROM compliance.doc_check_controls dc""")
+        all_rows = list(c.fetchall())
+
+    # Audit only those already classified as text+fits in sidecar
+    with sqlite3.connect(SIDECAR_DB) as side:
+        eligible = set()
+        for cid, dt in side.execute(
+            "SELECT control_id, doc_type FROM mc_classification "
+            "WHERE check_type = 'text' AND (fits_doc_type IS NULL OR fits_doc_type = 1)"
+        ):
+            eligible.add((cid, dt or ""))
+
+    target = [r for r in all_rows
+              if (r["control_id"], r["doc_type"] or "") in eligible
+              and (r["control_id"], r["doc_type"] or "") not in already]
+    return target
+
+
+def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
+    payload = {
+        "model": MODEL,
+        "max_tokens": 4000,
+        "system": SYSTEM_PROMPT,
+        "messages": [{
+            "role": "user",
+            "content": "Pruefe folgende MCs:\n\n" + json.dumps([
+                {"control_id": m["control_id"], "doc_type": m["doc_type"],
+                 "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
+                for m in batch
+            ], ensure_ascii=False, indent=2),
+        }],
+    }
+    headers = {
+        "x-api-key": api_key,
+        "anthropic-version": "2023-06-01",
+        "content-type": "application/json",
+    }
+    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
+    r.raise_for_status()
+    txt = r.json()["content"][0]["text"].strip()
+    if txt.startswith("```"):
+        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+        if txt.startswith("json"):
+            txt = txt[4:].strip()
+    return json.loads(txt)
+
+
+def store(rows: list[dict]) -> None:
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.executemany(
+            "UPDATE mc_classification SET ui_only = ?, scope_requires = ? "
+            "WHERE control_id = ? AND doc_type = ?",
+            [
+                (
+                    1 if r.get("ui_only") else 0,
+                    (r.get("scope_requires") or "").strip()
+                       if (r.get("scope_requires") or "").strip().lower() not in ("", "null")
+                       else None,
+                    r.get("control_id"),
+                    r.get("doc_type") or "",
+                )
+                for r in rows
+            ],
+        )
+        # MCs flagged ui_only become check_type='process' so they're not in doc_check
+        c.executemany(
+            "UPDATE mc_classification SET check_type='process' "
+            "WHERE ui_only=1 AND control_id=? AND doc_type=?",
+            [(r.get("control_id"), r.get("doc_type") or "") for r in rows
+             if r.get("ui_only")],
+        )
+        c.commit()
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--sample", action="store_true")
+    args = ap.parse_args()
+
+    api_key = os.environ["ANTHROPIC_API_KEY"]
+    conn = psycopg2.connect(os.environ["DATABASE_URL"])
+    pairs = fetch_pairs_to_audit(conn)
+
+    if args.sample:
+        for m in pairs[:5]:
+            print(json.dumps(m, ensure_ascii=False, indent=2))
+        print(f"\nTotal: {len(pairs)}")
+        return
+
+    print(f"Audit-v2: {len(pairs)} MCs in Batches von {BATCH_SIZE}", flush=True)
+    if not pairs:
+        print("Alles geprueft.")
+        return
+
+    done = 0
+    fail = 0
+    t0 = time.time()
+    for i in range(0, len(pairs), BATCH_SIZE):
+        batch = pairs[i:i + BATCH_SIZE]
+        try:
+            out = call_claude(api_key, batch)
+            store(out)
+            done += len(out)
+            elapsed = time.time() - t0
+            rate = done / max(elapsed, 0.01)
+            eta = (len(pairs) - done) / max(rate, 0.01)
+            print(f"  [{done:>4}/{len(pairs)}] {rate:.1f} MC/s  ETA {eta/60:.1f}min", flush=True)
+        except Exception as e:
+            fail += 1
+            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", file=sys.stderr, flush=True)
+            if fail >= 5:
+                break
+        time.sleep(0.5)
+
+    print(f"\nFertig: {done}/{len(pairs)} ({fail} Fehlbatches).")
+    with sqlite3.connect(SIDECAR_DB) as c:
+        ui = c.execute("SELECT COUNT(*) FROM mc_classification WHERE ui_only=1").fetchone()[0]
+        scope = c.execute(
+            "SELECT scope_requires, COUNT(*) FROM mc_classification "
+            "WHERE scope_requires IS NOT NULL GROUP BY scope_requires"
+        ).fetchall()
+        print(f"\nui_only=1: {ui} MCs (jetzt 'process')")
+        print("scope_requires Verteilung:")
+        for s, n in scope:
+            print(f"  {s}: {n}")
+
+
+if __name__ == "__main__":
+    main()
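The system prompt above asks the model for `scope_requires` as one of five scope names or the string "null". A hedged sketch of a stricter normalizer follows. `ALLOWED_SCOPES` and `normalize_scope` are hypothetical names. Lower-casing the value and dropping anything outside the prompt's list are assumptions that go beyond what `store` does.

```python
from __future__ import annotations

# Scope names as listed in the system prompt of the v2 audit script.
ALLOWED_SCOPES = {
    "biometric_processing",
    "ai_decision_making",
    "child_targeting",
    "ecommerce",
    "b2c",
}


def normalize_scope(value: object) -> str | None:
    """Map a model-provided scope to a known scope name, or None."""
    if not isinstance(value, str):
        # JSON null, numbers, or lists are treated as "applies to all sites".
        return None
    cleaned = value.strip().lower()
    # "null", the empty string, and unknown labels all collapse to None.
    return cleaned if cleaned in ALLOWED_SCOPES else None
```

Rejecting unknown labels keeps the `scope_requires` column limited to values a downstream business_profile filter can match. The trade-off is that a new scope invented by the model is silently dropped.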
@@ -0,0 +1,222 @@
+"""
+Classify doc_check_controls (1874 MCs) into check_type:
+  - text    : MC asks about the DOCUMENT'S WORDING ("enthaelt die DSE..", "wird im Impressum genannt..")
+  - process : MC asks about a TECHNICAL OR ORGANISATIONAL CONTROL ("ist sichergestellt..", "ist implementiert..")
+  - review  : MC asks an ABSTRACT META-QUESTION that needs human judgment ("sind alle Verarbeitungen vollstaendig?")
+
+Output goes to sidecar SQLite at /data/mc_classification.db (no Postgres schema change
+per CLAUDE.md guardrails). Schema:
+
+  CREATE TABLE mc_classification (
+    control_id TEXT PRIMARY KEY,
+    doc_type   TEXT,
+    title      TEXT,
+    check_type TEXT,    -- text|process|review
+    confidence REAL,    -- 0..1
+    rationale  TEXT,
+    classified_at TEXT
+  );
+
+Run from inside bp-compliance-backend container:
+  docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type.py [--limit N] [--sample]
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sqlite3
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+import httpx
+import psycopg2
+from psycopg2.extras import RealDictCursor
+
+ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
+MODEL = "claude-sonnet-4-6"
+SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
+BATCH_SIZE = 25
+SLEEP_BETWEEN_BATCHES = 0.5   # sec — keep gentle for the parallel Haiku batch
+
+SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
+
+TEXT  — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments (DSE, Impressum, Cookie-Richtlinie etc.) steht.
+        Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?" / "Wird im Impressum die USt-IdNr genannt?"
+        Diese MCs koennen gegen den Dokument-Text gematched werden.
+
+PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME, die ausserhalb des Dokumenttexts liegt.
+          Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" /
+                    "Ist ein Loeschkonzept implementiert?" / "Wird das Consent-Tool monatlich getestet?"
+          Diese MCs lassen sich NICHT aus dem Dokumenttext beantworten — sie brauchen Evidence/TOM-Nachweis.
+
+REVIEW — Der MC stellt eine abstrakte Vollstaendigkeits-/Konsistenzfrage, die menschliche Beurteilung braucht.
+         Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?" / "Stimmen Cookie-Richtlinie und Banner ueberein?"
+         Diese MCs sind Checklisten-Items fuer den DSB, kein automatischer Check moeglich.
+
+Antworte ausschliesslich als JSON-Array — keine Erklaerung davor/danach, kein Markdown-Codeblock. Format:
+[{"control_id": "<wie input>", "check_type": "text|process|review", "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
+
+
+def fetch_mcs(conn, limit: int | None = None, only_unclassified: bool = True) -> list[dict]:
+    """Load MCs from Postgres, optionally skipping those already in the sidecar."""
+    sql = """SELECT control_id, doc_type, title, check_question
+             FROM compliance.doc_check_controls
+             ORDER BY doc_type, title"""
+    with conn.cursor(cursor_factory=RealDictCursor) as c:
+        c.execute(sql)
+        rows = list(c.fetchall())
+    if only_unclassified:
+        # Filter against the sidecar in Python so no temp table is needed in Postgres.
+        with sqlite3.connect(SIDECAR_DB) as side:
+            classified = {r[0] for r in side.execute(
+                "SELECT control_id FROM mc_classification")}
+        rows = [r for r in rows if r["control_id"] not in classified]
+    return rows[:limit] if limit else rows
+
+
+def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
+    payload = {
+        "model": MODEL,
+        "max_tokens": 4000,
+        "system": SYSTEM_PROMPT,
+        "messages": [{
+            "role": "user",
+            "content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
+                [{"control_id": m["control_id"],
+                  "doc_type": m["doc_type"],
+                  "title": m["title"],
+                  "check_question": (m["check_question"] or "")[:400]}
+                 for m in batch],
+                ensure_ascii=False, indent=2),
+        }],
+    }
+    headers = {
+        "x-api-key": api_key,
+        "anthropic-version": "2023-06-01",
+        "content-type": "application/json",
+    }
+    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
+    r.raise_for_status()
+    txt = r.json()["content"][0]["text"].strip()
+    # Strip code fences if Sonnet adds them
+    if txt.startswith("```"):
+        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+        if txt.startswith("json"):
+            txt = txt[4:].strip()
+    return json.loads(txt)
+
+
+def ensure_sidecar() -> None:
+    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.executescript("""
+            CREATE TABLE IF NOT EXISTS mc_classification (
+                control_id    TEXT PRIMARY KEY,
+                doc_type      TEXT,
+                title         TEXT,
+                check_type    TEXT,
+                confidence    REAL,
+                rationale     TEXT,
+                classified_at TEXT
+            );
+            CREATE INDEX IF NOT EXISTS idx_doctype ON mc_classification(doc_type);
+            CREATE INDEX IF NOT EXISTS idx_type    ON mc_classification(check_type);
+        """)
+
+
+def store_results(rows: list[dict], lookup: dict[str, dict]) -> None:
+    ts = datetime.now(timezone.utc).isoformat()
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.executemany(
+            "INSERT OR REPLACE INTO mc_classification "
+            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
+            "VALUES (?, ?, ?, ?, ?, ?, ?)",
+            [
+                (
+                    r.get("control_id"),
+                    lookup.get(r.get("control_id"), {}).get("doc_type", ""),
+                    lookup.get(r.get("control_id"), {}).get("title", ""),
+                    (r.get("check_type") or "").lower(),
+                    float(r.get("confidence") or 0),
+                    (r.get("rationale") or "")[:500],
+                    ts,
+                )
+                for r in rows
+            ],
+        )
+        c.commit()
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--limit", type=int, default=None, help="limit number of MCs (for testing)")
+    ap.add_argument("--sample", action="store_true", help="just print first 5 MCs and exit")
+    ap.add_argument("--all", action="store_true", help="re-classify all (otherwise skips already classified)")
+    args = ap.parse_args()
+
+    ensure_sidecar()
+    api_key = os.environ["ANTHROPIC_API_KEY"]
+    conn = psycopg2.connect(os.environ["DATABASE_URL"])
+    mcs = fetch_mcs(conn, limit=args.limit, only_unclassified=not args.all)
+
+    if args.sample:
+        for m in mcs[:5]:
+            print(json.dumps(m, ensure_ascii=False, indent=2))
+        return
+
+    print(f"Klassifiziere {len(mcs)} MCs in Batches von {BATCH_SIZE}", flush=True)
+    if not mcs:
+        print("Nichts zu tun.")
+        return
+
+    lookup = {m["control_id"]: m for m in mcs}
+    total = len(mcs)
+    done = 0
+    failed_batches = 0
+    t0 = time.time()
+    for i in range(0, total, BATCH_SIZE):
+        batch = mcs[i:i + BATCH_SIZE]
+        try:
+            out = call_claude(api_key, batch)
+            store_results(out, lookup)
+            done += len(out)
+            elapsed = time.time() - t0
+            rate = done / max(elapsed, 0.01)
+            eta = (total - done) / max(rate, 0.01)
+            print(f"  [{done:>5}/{total}] {rate:.1f} MC/s  ETA {eta/60:.1f}min",
+                  flush=True)
+        except Exception as e:
+            failed_batches += 1
+            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
+            if failed_batches >= 5:
+                print("  Zu viele Fehler — abbrechen.", file=sys.stderr)
+                break
+        time.sleep(SLEEP_BETWEEN_BATCHES)
+
+    print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
+    # Summary
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.row_factory = sqlite3.Row
+        rows = c.execute(
+            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
+            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
+        ).fetchall()
+        print("\n=== Verteilung nach doc_type x check_type ===")
+        prev = None
+        for r in rows:
+            if r["doc_type"] != prev:
+                print()
+                print(f"[{r['doc_type']}]")
+                prev = r["doc_type"]
+            print(f"  {r['check_type']:<8} {r['n']}")
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,241 @@
+"""
+v2: Re-classify only the (control_id, doc_type) PAIRS that aren't covered yet.
+
+V1 used PK=control_id, so cross-doc-type variants (same control assigned
+to e.g. AGB AND Widerruf with different check_questions) overwrote each
+other. v2 migrates to PK=(control_id, doc_type) and classifies only the
+~262 missing pairs.
+
+Run from container:
+  docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type_v2.py [--sample]
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sqlite3
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+import httpx
+import psycopg2
+from psycopg2.extras import RealDictCursor
+
+ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
+MODEL = "claude-sonnet-4-6"
+SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
+BATCH_SIZE = 25
+SLEEP_BETWEEN_BATCHES = 0.5
+
+SYSTEM_PROMPT = """You classify master controls (MCs) of the GDPR/TMG/TDDDG compliance check into exactly three classes:
+
+TEXT: The MC checks whether specific content appears in the TEXT of a document.
+        Example: "Does the privacy policy name the supervisory authority?"
+        These MCs can be matched against the document text.
+
+PROCESS: The MC checks a technical or organizational MEASURE outside the document text.
+          Example: "Is it ensured that cookies are only stored after consent?"
+
+REVIEW: The MC asks an abstract completeness question that requires human judgment.
+         Example: "Are ALL processing purposes fully covered?"
+
+Important: the same control can be classified differently for DIFFERENT doc_types. With a
+check_question adapted to the doc_type, a "text" check for one document can become a
+"process" check for another.
+
+Reply exclusively with a JSON array, no Markdown. Format:
+[{"control_id": "<input>", "doc_type": "<input>", "check_type": "text|process|review",
+  "confidence": 0.9, "rationale": "one short sentence"}, ...]"""
+
+
+def migrate_schema() -> None:
+    """Migrate sidecar from PK=control_id to PK=(control_id, doc_type)."""
+    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
+    with sqlite3.connect(SIDECAR_DB) as c:
+        # Inspect the current table; an empty result means it does not exist yet.
+        cols = c.execute("PRAGMA table_info(mc_classification)").fetchall()
+        if not cols:
+            # First run — create fresh
+            c.executescript("""
+                CREATE TABLE mc_classification (
+                    control_id    TEXT,
+                    doc_type      TEXT,
+                    title         TEXT,
+                    check_type    TEXT,
+                    confidence    REAL,
+                    rationale     TEXT,
+                    classified_at TEXT,
+                    PRIMARY KEY (control_id, doc_type)
+                );
+                CREATE INDEX idx_doctype ON mc_classification(doc_type);
+                CREATE INDEX idx_type    ON mc_classification(check_type);
+            """)
+            return
+
+        # Check whether the existing table already has composite PK
+        pk_cols = [r[1] for r in cols if r[5] > 0]
+        if set(pk_cols) == {"control_id", "doc_type"}:
+            print("Schema already v2 (composite PK). Skipping migration.")
+            return
+
+        print("Migrating sidecar schema to PK(control_id, doc_type)...")
+        c.executescript("""
+            CREATE TABLE mc_classification_v2 (
+                control_id    TEXT,
+                doc_type      TEXT,
+                title         TEXT,
+                check_type    TEXT,
+                confidence    REAL,
+                rationale     TEXT,
+                classified_at TEXT,
+                PRIMARY KEY (control_id, doc_type)
+            );
+            INSERT INTO mc_classification_v2
+              (control_id, doc_type, title, check_type, confidence, rationale, classified_at)
+            SELECT control_id, COALESCE(doc_type, ''), title, check_type, confidence, rationale, classified_at
+            FROM mc_classification;
+            DROP TABLE mc_classification;
+            ALTER TABLE mc_classification_v2 RENAME TO mc_classification;
+            CREATE INDEX idx_doctype ON mc_classification(doc_type);
+            CREATE INDEX idx_type ON mc_classification(check_type);
+        """)
+        n = c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
+        print(f"Migrated {n} existing rows.")
+
+
+def fetch_unclassified_pairs(conn) -> list[dict]:
+    """All (control_id, doc_type) pairs in PG that aren't yet in sidecar."""
+    side_pairs: set[tuple[str, str]] = set()
+    with sqlite3.connect(SIDECAR_DB) as side:
+        for cid, dt in side.execute("SELECT control_id, doc_type FROM mc_classification"):
+            side_pairs.add((cid, dt or ""))
+
+    with conn.cursor(cursor_factory=RealDictCursor) as c:
+        c.execute("""SELECT control_id, doc_type, title, check_question
+                     FROM compliance.doc_check_controls""")
+        all_rows = list(c.fetchall())
+
+    missing = [r for r in all_rows if (r["control_id"], r["doc_type"] or "") not in side_pairs]
+    return missing
+
+
+def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
+    payload = {
+        "model": MODEL,
+        "max_tokens": 4000,
+        "system": SYSTEM_PROMPT,
+        "messages": [{
+            "role": "user",
+            "content": "Classify the following MCs:\n\n" + json.dumps(
+                [{"control_id": m["control_id"],
+                  "doc_type": m["doc_type"],
+                  "title": m["title"],
+                  "check_question": (m["check_question"] or "")[:400]}
+                 for m in batch],
+                ensure_ascii=False, indent=2),
+        }],
+    }
+    headers = {
+        "x-api-key": api_key,
+        "anthropic-version": "2023-06-01",
+        "content-type": "application/json",
+    }
+    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
+    r.raise_for_status()
+    txt = r.json()["content"][0]["text"].strip()
+    if txt.startswith("```"):  # strip a Markdown fence if the model added one despite the prompt
+        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+        if txt.startswith("json"):
+            txt = txt[4:].strip()
+    return json.loads(txt)
+
+
+def store_results(rows: list[dict], lookup: dict[tuple[str, str], dict]) -> None:
+    ts = datetime.now(timezone.utc).isoformat()
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.executemany(
+            "INSERT OR REPLACE INTO mc_classification "
+            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
+            "VALUES (?, ?, ?, ?, ?, ?, ?)",
+            [
+                (
+                    r.get("control_id"),
+                    r.get("doc_type") or "",
+                    lookup.get((r.get("control_id"), r.get("doc_type") or ""), {}).get("title", ""),
+                    (r.get("check_type") or "").lower(),
+                    float(r.get("confidence") or 0),
+                    (r.get("rationale") or "")[:500],
+                    ts,
+                )
+                for r in rows
+            ],
+        )
+        c.commit()
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--sample", action="store_true")
+    args = ap.parse_args()
+
+    migrate_schema()
+    api_key = os.environ["ANTHROPIC_API_KEY"]
+    conn = psycopg2.connect(os.environ["DATABASE_URL"])
+    missing = fetch_unclassified_pairs(conn)
+
+    if args.sample:
+        for m in missing[:5]:
+            print(json.dumps(m, ensure_ascii=False, indent=2))
+        print(f"\nTotal missing pairs: {len(missing)}")
+        return
+
+    print(f"Classifying {len(missing)} missing (control_id, doc_type) pairs in batches of {BATCH_SIZE}", flush=True)
+    if not missing:
+        print("Everything is classified. Nothing to do.")
+        return
+
+    lookup = {(m["control_id"], m["doc_type"] or ""): m for m in missing}
+    total = len(missing)
+    done = 0
+    failed_batches = 0
+    t0 = time.time()
+    for i in range(0, total, BATCH_SIZE):
+        batch = missing[i:i + BATCH_SIZE]
+        try:
+            out = call_claude(api_key, batch)
+            store_results(out, lookup)
+            done += len(out)
+            elapsed = time.time() - t0
+            rate = done / max(elapsed, 0.01)
+            eta = (total - done) / max(rate, 0.01)
+            print(f"  [{done:>4}/{total}] {rate:.1f} MC/s  ETA {eta/60:.1f}min", flush=True)
+        except Exception as e:
+            failed_batches += 1
+            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
+            if failed_batches >= 5:
+                print("  Too many failures, aborting.", file=sys.stderr)
+                break
+        time.sleep(SLEEP_BETWEEN_BATCHES)
+
+    print(f"\nDone: {done}/{total} classified ({failed_batches} failed batches).")
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.row_factory = sqlite3.Row
+        rows = c.execute(
+            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
+            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
+        ).fetchall()
+        print("\n=== Final distribution by doc_type x check_type ===")
+        prev = None
+        for r in rows:
+            if r["doc_type"] != prev:
+                print(f"\n[{r['doc_type']}]")
+                prev = r["doc_type"]
+            print(f"  {r['check_type']:<8} {r['n']}")
+
+
+if __name__ == "__main__":
+    main()