feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features
CI / nodejs-build (push) Successful in 2m47s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered by business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the Impressum. Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over DOM extraction when it is richer (BMW: 1824 vs. 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU-alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for clearly EU-based vendors (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: one structured instruction per finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...) with what/why/fix_text/where/example, ready to paste 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check, plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (Compliance-Check -> Customer Banner/Documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register + privacy-policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). cmp_vendors are persisted in the check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video placeholder until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

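The country inference from the legal form can be pictured roughly as follows. This is a minimal sketch, not the code of cookie_link_validator: the function name `infer_country`, the table `LEGAL_FORM_COUNTRY` and the shape of the lookup table are assumptions made for illustration. Only the four suffix/country pairs and the idea of an additional vendor lookup table come from the commit message.

```python
from __future__ import annotations

import re

# Suffix -> ISO country code. The four pairs are the ones named in the commit
# message; the regexes and the table layout are assumptions of this sketch.
LEGAL_FORM_COUNTRY = [
    (re.compile(r"\bA/S$", re.IGNORECASE), "DK"),
    (re.compile(r"\bGmbH$", re.IGNORECASE), "DE"),
    (re.compile(r"\bInc\.?$", re.IGNORECASE), "US"),
    (re.compile(r"\bB\.V\.?$", re.IGNORECASE), "NL"),
]


def infer_country(vendor_name: str, lookup: dict[str, str] | None = None) -> str | None:
    """Return a country code for a vendor name, or None if nothing matches.

    An explicit lookup table (hypothetical: lower-cased vendor name -> code)
    wins over the legal-form suffix.
    """
    name = vendor_name.strip().rstrip(",")
    if lookup and name.lower() in lookup:
        return lookup[name.lower()]
    for pattern, country in LEGAL_FORM_COUNTRY:
        if pattern.search(name):
            return country
    return None
```

Checking the lookup table first lets a vendor whose legal form is missing from its display name still resolve to a country, which is the case the no_country flag is meant to avoid.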
Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb, similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control-Library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master-Controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (audit DB write permission).
- Cookie text routing bug (cmp_reconstructed > DOM-extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE 7.5% -> 81-83%
  Impressum 4% -> 100% (all 6 real MCs fulfilled)
  Cookie 0% -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
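The embedding batching fix (32 texts per call instead of 165 at once) amounts to chunking the input before calling the embedding service. A minimal sketch under stated assumptions: `embed_in_batches` and the `embed_fn` callback are hypothetical names, and the real service client is not shown in this commit view.

```python
from __future__ import annotations

from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Call embed_fn on slices of at most batch_size texts and concatenate results.

    embed_fn stands in for the embedding service call (an assumption of this
    sketch); it must return one vector per input text, in input order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors
```

With 165 inputs and the default batch size this issues six calls (five of 32 texts and one of 5), and the output order matches the input order.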
@@ -0,0 +1,229 @@
"""
LLM audit: checks for every text MC whether it really fits, on the merits,
the doc_type it is assigned to. The BMW run showed, for example, that most
Impressum MCs are actually DSA/e-commerce obligations that do not belong in
a §5 TMG Impressum at all.

Output:
  - doc_type fits → fits_doc_type = 1, MC stays active (no Postgres update)
  - doc_type does NOT fit → fits_doc_type = 0 is stored in the sidecar
    (this script does not change check_type); rag_document_checker filters those out

Sidecar SQLite. NO Postgres change (schema frozen).
"""

from __future__ import annotations

import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone

import httpx
import psycopg2
from psycopg2.extras import RealDictCursor

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5

DOC_TYPE_DESCRIPTIONS = {
    "agb": "Allgemeine Geschaeftsbedingungen — Vertragsbedingungen "
           "zwischen Anbieter und Kunde",
    "avv": "Auftragsverarbeitungsvertrag Art. 28 DSGVO — Vertrag zwischen "
           "Verantwortlichem und Auftragsverarbeiter",
    "cookie": "Cookie-Richtlinie — Information ueber Zwecke, Anbieter, "
              "Speicherdauer und Widerruf von Cookies/Tracking-Pixeln",
    "dse": "Datenschutzerklaerung Art. 13 DSGVO — Informationen ueber "
           "Verantwortlichen, Zwecke, Rechtsgrundlagen, Empfaenger, "
           "Speicherdauer, Betroffenenrechte, Aufsichtsbehoerde",
    "dsfa": "Datenschutz-Folgenabschaetzung Art. 35 DSGVO — Risikobewertung "
            "von Verarbeitungen mit hohem Risiko",
    "impressum": "Impressum §5 TMG / §18 MStV — Anbieterkennzeichnung mit "
                 "Name, Anschrift, Vertretungsberechtigten, Handelsregister, "
                 "USt-IdNr., berufsrechtliche Angaben, Aufsicht",
    "loeschkonzept": "Loeschkonzept DIN 66398 — Definition Aufbewahrungs- "
                     "und Loeschfristen pro Datenkategorie + Prozess",
    "widerruf": "Widerrufsbelehrung §312 BGB — Verbraucher-Widerrufsrecht "
                "bei Fernabsatz, Frist, Folgen, Muster",
}

SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung.

Fuer jeden MC bekommst du:
- den ihm zugeordneten doc_type (z.B. 'impressum', 'cookie', 'dse')
- den Titel und die check_question

Deine Aufgabe: entscheide ob der MC fachlich wirklich zu diesem doc_type passt.

Beispiele:
- MC "Wird die USt-IdNr im Impressum genannt?" + doc_type=impressum → PASST
- MC "Monatlich aktive Endnutzer fuer Online-Vermittlungsdienste messen" + doc_type=impressum → PASST NICHT
  (DSA-Pflicht, gehoert nicht in ein klassisches §5-TMG-Impressum)
- MC "Mithoeren von Funknachrichten verhindern" + doc_type=cookie → PASST NICHT
  (TKG-Spezialthema, nicht Cookie-Richtlinie)

Antworte als JSON-Array, eine Zeile pro MC:
[{"control_id": "<wie input>", "doc_type": "<wie input>", "fits": true|false,
  "rationale": "ein kurzer satz"}, ...]
Kein Markdown."""


def fetch_pairs_to_audit(conn) -> list[dict]:
    """All text-MCs that haven't been audited yet (fits_doc_type IS NULL)."""
    with sqlite3.connect(SIDECAR_DB) as side:
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        if "fits_doc_type" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN fits_doc_type INTEGER")
            side.commit()
        already = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE fits_doc_type IS NOT NULL"
        ):
            already.add((cid, dt or ""))

    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc
                     WHERE dc.control_id IN (
                         SELECT control_id FROM compliance.doc_check_controls
                     )""")
        all_rows = list(c.fetchall())

    # Audit only those classified as 'text' in the sidecar; process/review
    # never run through doc_check anyway
    with sqlite3.connect(SIDECAR_DB) as side:
        text_pairs = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE check_type = 'text'"
        ):
            text_pairs.add((cid, dt or ""))

    target = [r for r in all_rows
              if (r["control_id"], r["doc_type"] or "") in text_pairs
              and (r["control_id"], r["doc_type"] or "") not in already]
    return target


def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": (
                "Doc-Typen-Beschreibungen:\n"
                + "\n".join(f"- {k}: {v}" for k, v in DOC_TYPE_DESCRIPTIONS.items())
                + "\n\nPruefe folgende MCs:\n\n"
                + json.dumps([
                    {"control_id": m["control_id"], "doc_type": m["doc_type"],
                     "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
                    for m in batch
                ], ensure_ascii=False, indent=2)
            ),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)


def store_audit(rows: list[dict]) -> None:
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET fits_doc_type = ?, "
            "rationale = COALESCE(?, rationale), classified_at = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("fits") else 0,
                    (r.get("rationale") or "")[:500] or None,
                    ts,
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        c.commit()


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()

    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    pairs = fetch_pairs_to_audit(conn)

    if args.sample:
        for m in pairs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal pairs to audit: {len(pairs)}")
        return

    print(f"Audit {len(pairs)} text-MCs in Batches von {BATCH_SIZE}", flush=True)
    if not pairs:
        print("Alles auditiert.")
        return

    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, len(pairs), BATCH_SIZE):
        batch = pairs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_audit(out)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (len(pairs) - done) / max(rate, 0.01)
            print(f"  [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            failed_batches += 1
            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print("Zu viele Fehler — abbrechen.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)

    print(f"\nFertig: {done}/{len(pairs)} auditiert ({failed_batches} Fehlbatches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, "
            "  SUM(CASE WHEN fits_doc_type = 1 THEN 1 ELSE 0 END) AS fits, "
            "  SUM(CASE WHEN fits_doc_type = 0 THEN 1 ELSE 0 END) AS misfits, "
            "  COUNT(*) AS total "
            "FROM mc_classification "
            "WHERE check_type = 'text' AND fits_doc_type IS NOT NULL "
            "GROUP BY doc_type ORDER BY doc_type"
        ).fetchall()
    print("\n=== Audit-Verteilung doc_type x fits ===")
    for r in rows:
        print(f"  {r['doc_type']:<14} fits={r['fits']:<4} misfits={r['misfits']:<4} total={r['total']}")


if __name__ == "__main__":
    main()
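The code-fence handling inside call_claude() can be isolated and tested without any network call. The following is a minimal sketch of that step, assuming a hypothetical helper name `parse_json_array` that is not part of the commit; the extra check that the reply is a list is also an assumption added here.

```python
import json


def parse_json_array(raw: str) -> list:
    """Parse a model reply that should be a JSON array, tolerating ``` fences.

    Hypothetical helper: mirrors the fence stripping in call_claude() and
    additionally rejects replies that are valid JSON but not an array.
    """
    txt = raw.strip()
    if txt.startswith("```"):
        # Drop the opening fence line (it may carry a tag such as "json"),
        # then cut everything from the last closing fence onwards.
        _, _, rest = txt.partition("\n")
        txt = rest.rsplit("```", 1)[0].strip()
    data = json.loads(txt)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    return data
```

Because json.JSONDecodeError is a subclass of ValueError, a caller can treat malformed JSON and a non-array reply as the same kind of batch failure.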
@@ -0,0 +1,216 @@
"""
A3-v2: sharpens which MCs really target the CMP banner UI / a
process rather than the document TEXT.

The BMW run showed ~10 'language clear/simple' + 'button text worded' MCs
that the first audit marked as fitting their doc_type. They can NOT,
however, be checked against the cookie policy or DSE text: they ask about
the comprehensibility of the consent UI.

Sets 'scope_requires' on FRT/biometric MCs and re-classifies
these UI items to check_type='process' (which filters them out of doc_check).

Plus: sets scope_requires for obviously conditional MCs:
  - 'biometric_processing' for FRT/face recognition
  - 'ai_decision_making' for automated individual decisions
  - 'child_targeting' for child-consent MCs
"""

from __future__ import annotations

import argparse
import json
import os
import sqlite3
import sys
import time

import httpx
import psycopg2
from psycopg2.extras import RealDictCursor

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 20

SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung
zweiter Ordnung: jeder MC wurde bereits als 'text' klassifiziert und einem
doc_type zugeordnet. Du entscheidest:

A) Prueft der MC den TEXT des Dokuments (z.B. "Wird im Impressum die
   USt-IdNr genannt?") → fits_doc_type=true, ui_only=false
B) Prueft der MC die UI/UX eines Banners oder Buttons (z.B.
   "Ist der Ablehnungs-Button gleich gross wie Akzeptieren?", "Ist die
   Einwilligungsanfrage in einfacher Sprache formuliert?") → ui_only=true
   (Diese MCs koennen NIE im Dokumenttext stehen, weil sie sich auf eine
   externe UI beziehen.)

Zusaetzlich kennzeichne SCOPE-Bedingungen, wenn der MC nur fuer bestimmte
Sites relevant ist:
- 'biometric_processing' : nur bei Sites die biometrische Daten
  (Gesichtserkennung/FRT, Fingerabdruecke) verarbeiten
- 'ai_decision_making' : nur bei automatisierten Einzelentscheidungen
  (Art. 22 DSGVO)
- 'child_targeting' : nur bei Sites die sich an Kinder richten
- 'ecommerce' : nur bei Webshops
- 'b2c' : nur fuer Verbraucher-Geschaeft (UWG/BGB-Pflichten)
Wenn der MC fuer ALLE Sites gilt, setze scope_requires=null.

Antworte als JSON-Array — keine Erklaerung davor/danach, kein Markdown.
Format:
[{"control_id": "<wie input>", "doc_type": "<wie input>",
  "ui_only": true|false, "scope_requires": "biometric_processing|ai_decision_making|child_targeting|ecommerce|b2c|null",
  "rationale": "ein kurzer satz"}, ...]"""


def fetch_pairs_to_audit(conn) -> list[dict]:
    """All text-MCs that haven't been v2-audited yet (ui_only IS NULL)."""
    with sqlite3.connect(SIDECAR_DB) as side:
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        added = False
        if "ui_only" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN ui_only INTEGER")
            added = True
        if "scope_requires" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN scope_requires TEXT")
            added = True
        if added:
            side.commit()
        already = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE ui_only IS NOT NULL"
        ):
            already.add((cid, dt or ""))

    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc""")
        all_rows = list(c.fetchall())

    # Audit only those already classified as text+fits in sidecar
    with sqlite3.connect(SIDECAR_DB) as side:
        eligible = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE check_type = 'text' AND (fits_doc_type IS NULL OR fits_doc_type = 1)"
        ):
            eligible.add((cid, dt or ""))

    target = [r for r in all_rows
              if (r["control_id"], r["doc_type"] or "") in eligible
              and (r["control_id"], r["doc_type"] or "") not in already]
    return target


def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Pruefe folgende MCs:\n\n" + json.dumps([
                {"control_id": m["control_id"], "doc_type": m["doc_type"],
                 "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
                for m in batch
            ], ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)


def store(rows: list[dict]) -> None:
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET ui_only = ?, scope_requires = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("ui_only") else 0,
                    ((r.get("scope_requires") or "").strip() or None)
                    if (r.get("scope_requires") or "").strip().lower() not in ("", "null")
                    else None,
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        # MCs flagged ui_only become check_type='process' so they're not in doc_check
        c.executemany(
            "UPDATE mc_classification SET check_type='process' "
            "WHERE ui_only=1 AND control_id=? AND doc_type=?",
            [(r.get("control_id"), r.get("doc_type") or "") for r in rows
             if r.get("ui_only")],
        )
        c.commit()


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()

    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    pairs = fetch_pairs_to_audit(conn)

    if args.sample:
        for m in pairs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal: {len(pairs)}")
        return

    print(f"Audit-v2: {len(pairs)} MCs in Batches von {BATCH_SIZE}", flush=True)
    if not pairs:
        print("Alles geprueft.")
        return

    done = 0
    fail = 0
    t0 = time.time()
    for i in range(0, len(pairs), BATCH_SIZE):
        batch = pairs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store(out)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (len(pairs) - done) / max(rate, 0.01)
            print(f"  [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            fail += 1
            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", file=sys.stderr, flush=True)
            if fail >= 5: break
        time.sleep(0.5)

    print(f"\nFertig: {done}/{len(pairs)} ({fail} Fehlbatches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        ui = c.execute("SELECT COUNT(*) FROM mc_classification WHERE ui_only=1").fetchone()[0]
        scope = c.execute(
            "SELECT scope_requires, COUNT(*) FROM mc_classification "
            "WHERE scope_requires IS NOT NULL GROUP BY scope_requires"
        ).fetchall()
    print(f"\nui_only=1: {ui} MCs (jetzt 'process')")
    print("scope_requires Verteilung:")
    for s, n in scope:
        print(f"  {s}: {n}")


if __name__ == "__main__":
    main()
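The inline conditional that store() uses for scope_requires could also be written as a small, testable function. This is a sketch only: `normalize_scope` and `ALLOWED_SCOPES` are hypothetical names, the five values are copied from the system prompt in this script, and both lower-casing the value and rejecting anything outside that list are assumptions that go beyond what store() does.

```python
from __future__ import annotations

# The five scope values listed in the system prompt of the v2 audit script.
ALLOWED_SCOPES = {
    "biometric_processing",
    "ai_decision_making",
    "child_targeting",
    "ecommerce",
    "b2c",
}


def normalize_scope(value: object) -> str | None:
    """Map a model-supplied scope_requires value to a stored value or None.

    None, non-strings, "", "null" and (as an assumption of this sketch)
    any unknown label all become None.
    """
    if not isinstance(value, str):
        return None
    cleaned = value.strip().lower()
    if cleaned in ("", "null"):
        return None
    return cleaned if cleaned in ALLOWED_SCOPES else None
```

Restricting the stored values to a closed set would keep the GROUP BY summary at the end of main() limited to the known scopes, at the cost of silently dropping labels the model invents.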
@@ -0,0 +1,222 @@
|
||||
"""
|
||||
Classify doc_check_controls (1874 MCs) into check_type:
|
||||
- text : MC asks about the DOCUMENT'S WORDING ("enthaelt die DSE..", "wird im Impressum genannt..")
|
||||
- process : MC asks about a TECHNICAL OR ORGANISATIONAL CONTROL ("ist sichergestellt..", "ist implementiert..")
|
||||
- review : MC asks an ABSTRACT META-QUESTION that needs human judgment ("sind alle Verarbeitungen vollstaendig?")
|
||||
|
||||
Output goes to sidecar SQLite at /data/mc_classification.db (no Postgres schema change
|
||||
per CLAUDE.md guardrails). Schema:
|
||||
|
||||
CREATE TABLE mc_classification (
|
||||
control_id TEXT PRIMARY KEY,
|
||||
doc_type TEXT,
|
||||
title TEXT,
|
||||
check_type TEXT, -- text|process|review
|
||||
confidence REAL, -- 0..1
|
||||
rationale TEXT,
|
||||
classified_at TEXT
|
||||
);
|
||||
|
||||
Run from inside bp-compliance-backend container:
|
||||
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type.py [--limit N] [--sample]
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sqlite3
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
import psycopg2
|
||||
from psycopg2.extras import RealDictCursor
|
||||
|
||||
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||
MODEL = "claude-sonnet-4-6"
|
||||
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||
BATCH_SIZE = 25
|
||||
SLEEP_BETWEEN_BATCHES = 0.5 # sec — keep gentle for the parallel Haiku batch
|
||||
|
||||
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
|
||||
|
||||
TEXT — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments (DSE, Impressum, Cookie-Richtlinie etc.) steht.
|
||||
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?" / "Wird im Impressum die USt-IdNr genannt?"
|
||||
Diese MCs koennen gegen den Dokument-Text gematched werden.
|
||||
|
||||
PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME, die ausserhalb des Dokumenttexts liegt.
|
||||
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" /
|
||||
"Ist ein Loeschkonzept implementiert?" / "Wird das Consent-Tool monatlich getestet?"
|
||||
Diese MCs lassen sich NICHT aus dem Dokumenttext beantworten — sie brauchen Evidence/TOM-Nachweis.
|
||||
|
||||
REVIEW — Der MC stellt eine abstrakte Vollstaendigkeits-/Konsistenzfrage, die menschliche Beurteilung braucht.
|
||||
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?" / "Stimmen Cookie-Richtlinie und Banner ueberein?"
|
||||
Diese MCs sind Checklisten-Items fuer den DSB, kein automatischer Check moeglich.
|
||||
|
||||
Antworte ausschliesslich als JSON-Array — keine Erklaerung davor/danach, kein Markdown-Codeblock. Format:
|
||||
[{"control_id": "<wie input>", "check_type": "text|process|review", "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
|
||||
|
||||
|
||||
def fetch_mcs(conn, limit: int | None = None, only_unclassified: bool = True) -> list[dict]:
|
||||
sql = """SELECT control_id, doc_type, title, check_question
|
||||
FROM compliance.doc_check_controls"""
|
||||
if only_unclassified:
|
||||
sql += " WHERE control_id NOT IN (SELECT control_id FROM _classified_already)"
|
||||
sql += " ORDER BY doc_type, title"
|
||||
if limit:
|
||||
sql += f" LIMIT {limit}"
|
||||
with conn.cursor(cursor_factory=RealDictCursor) as c:
|
||||
try:
|
||||
c.execute("CREATE TEMP TABLE _classified_already (control_id TEXT)")
|
||||
with sqlite3.connect(SIDECAR_DB) as side:
|
||||
rows = side.execute("SELECT control_id FROM mc_classification").fetchall()
|
||||
if rows:
|
||||
c.executemany("INSERT INTO _classified_already VALUES (%s)", [(r[0],) for r in rows])
|
||||
except Exception:
|
||||
pass
|
||||
c.execute(sql)
|
||||
return list(c.fetchall())
|
||||
|
||||
|
||||
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
|
||||
payload = {
|
||||
"model": MODEL,
|
||||
"max_tokens": 4000,
|
||||
"system": SYSTEM_PROMPT,
|
||||
"messages": [{
|
||||
"role": "user",
|
||||
"content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
|
||||
[{"control_id": m["control_id"],
|
||||
"doc_type": m["doc_type"],
|
||||
"title": m["title"],
|
||||
"check_question": (m["check_question"] or "")[:400]}
|
||||
for m in batch],
|
||||
ensure_ascii=False, indent=2),
|
||||
}],
|
||||
}
|
||||
headers = {
|
||||
"x-api-key": api_key,
|
||||
"anthropic-version": "2023-06-01",
|
||||
"content-type": "application/json",
|
||||
}
|
||||
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
|
||||
r.raise_for_status()
|
||||
txt = r.json()["content"][0]["text"].strip()
|
||||
# Strip code fences if Sonnet adds them
|
||||
if txt.startswith("```"):
|
||||
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
|
||||
if txt.startswith("json"):
|
||||
txt = txt[4:].strip()
|
||||
return json.loads(txt)
|
||||
|
||||
|
||||
def ensure_sidecar() -> None:
|
||||
Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
|
||||
with sqlite3.connect(SIDECAR_DB) as c:
|
||||
c.executescript("""
|
||||
CREATE TABLE IF NOT EXISTS mc_classification (
|
||||
control_id TEXT PRIMARY KEY,
|
||||
doc_type TEXT,
|
||||
title TEXT,
|
||||
check_type TEXT,
|
||||
confidence REAL,
|
||||
rationale TEXT,
|
||||
classified_at TEXT
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_doctype ON mc_classification(doc_type);
|
||||
CREATE INDEX IF NOT EXISTS idx_type ON mc_classification(check_type);
|
||||
""")
|
||||
|
||||
|
||||
def store_results(rows: list[dict], lookup: dict[str, dict]) -> None:
|
||||
ts = datetime.now(timezone.utc).isoformat()
|
||||
with sqlite3.connect(SIDECAR_DB) as c:
|
||||
c.executemany(
|
||||
"INSERT OR REPLACE INTO mc_classification "
|
||||
"(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
|
||||
"VALUES (?, ?, ?, ?, ?, ?, ?)",
|
||||
[
|
||||
(
|
||||
r.get("control_id"),
|
||||
lookup.get(r.get("control_id"), {}).get("doc_type", ""),
|
||||
lookup.get(r.get("control_id"), {}).get("title", ""),
|
||||
(r.get("check_type") or "").lower(),
|
||||
float(r.get("confidence") or 0),
|
||||
(r.get("rationale") or "")[:500],
|
||||
ts,
|
||||
)
|
||||
for r in rows
|
||||
],
|
||||
)
|
||||
c.commit()
|
||||
|
||||
|
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--limit", type=int, default=None, help="limit number of MCs (for testing)")
    ap.add_argument("--sample", action="store_true", help="just print first 5 MCs and exit")
    ap.add_argument("--all", action="store_true", help="re-classify all (otherwise skips already classified)")
    args = ap.parse_args()

    ensure_sidecar()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    mcs = fetch_mcs(conn, limit=args.limit, only_unclassified=not args.all)

    if args.sample:
        for m in mcs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        return

    print(f"Klassifiziere {len(mcs)} MCs in Batches von {BATCH_SIZE}", flush=True)
    if not mcs:
        print("Nichts zu tun.")
        return

    lookup = {m["control_id"]: m for m in mcs}
    total = len(mcs)
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, total, BATCH_SIZE):
        batch = mcs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_results(out, lookup)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (total - done) / max(rate, 0.01)
            print(f" [{done:>5}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min",
                  flush=True)
        except Exception as e:
            failed_batches += 1
            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print(" Zu viele Fehler — abbrechen.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)

    print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
    # Summary
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
        ).fetchall()
    print("\n=== Verteilung nach doc_type x check_type ===")
    prev = None
    for r in rows:
        if r["doc_type"] != prev:
            print(); print(f"[{r['doc_type']}]")
            prev = r["doc_type"]
        print(f" {r['check_type']:<8} {r['n']}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,241 @@
"""
v2: Re-classify only the (control_id, doc_type) PAIRS that aren't covered yet.

V1 used PK=control_id, so cross-doc-type variants (same control assigned
to e.g. AGB AND Widerruf with different check_questions) overwrote each
other. v2 migrates to PK=(control_id, doc_type) and classifies only the
~262 missing pairs.

Run from container:
    docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type_v2.py [--sample]
"""

from __future__ import annotations

import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import httpx
import psycopg2
from psycopg2.extras import RealDictCursor

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5

SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:

TEXT — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments steht.
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?"
Diese MCs koennen gegen den Dokument-Text gematched werden.

PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME ausserhalb des Dokumenttexts.
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?"

REVIEW — Der MC stellt eine abstrakte Vollstaendigkeitsfrage, die menschliche Beurteilung braucht.
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?"

Wichtig: derselbe Control kann fuer DIFFERENT doc_types unterschiedlich klassifiziert sein —
mit doc_type-spezifisch angepasster check_question kann ein "text"-Check fuer ein Dok zum
"process"-Check fuer ein anderes werden.

Antworte ausschliesslich als JSON-Array — kein Markdown. Format:
[{"control_id": "<input>", "doc_type": "<input>", "check_type": "text|process|review",
"confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""


def migrate_schema() -> None:
    """Migrate sidecar from PK=control_id to PK=(control_id, doc_type)."""
    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(SIDECAR_DB) as c:
        # Check if v2 schema already in place (composite PK)
        cols = c.execute("PRAGMA table_info(mc_classification)").fetchall()
        if not cols:
            # First run — create fresh
            c.executescript("""
                CREATE TABLE mc_classification (
                    control_id TEXT,
                    doc_type TEXT,
                    title TEXT,
                    check_type TEXT,
                    confidence REAL,
                    rationale TEXT,
                    classified_at TEXT,
                    PRIMARY KEY (control_id, doc_type)
                );
                CREATE INDEX idx_doctype ON mc_classification(doc_type);
                CREATE INDEX idx_type ON mc_classification(check_type);
            """)
            return

        # Check whether the existing table already has composite PK
        pk_cols = [r[1] for r in cols if r[5] > 0]
        if set(pk_cols) == {"control_id", "doc_type"}:
            print("Schema already v2 (composite PK). Skipping migration.")
            return

        print("Migrating sidecar schema to PK(control_id, doc_type)...")
        c.executescript("""
            CREATE TABLE mc_classification_v2 (
                control_id TEXT,
                doc_type TEXT,
                title TEXT,
                check_type TEXT,
                confidence REAL,
                rationale TEXT,
                classified_at TEXT,
                PRIMARY KEY (control_id, doc_type)
            );
            INSERT INTO mc_classification_v2
                (control_id, doc_type, title, check_type, confidence, rationale, classified_at)
            SELECT control_id, COALESCE(doc_type, ''), title, check_type, confidence, rationale, classified_at
            FROM mc_classification;
            DROP TABLE mc_classification;
            ALTER TABLE mc_classification_v2 RENAME TO mc_classification;
            CREATE INDEX idx_doctype ON mc_classification(doc_type);
            CREATE INDEX idx_type ON mc_classification(check_type);
        """)
        n = c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
        print(f"Migrated {n} existing rows.")


def fetch_unclassified_pairs(conn) -> list[dict]:
    """All (control_id, doc_type) pairs in PG that aren't yet in sidecar."""
    side_pairs: set[tuple[str, str]] = set()
    with sqlite3.connect(SIDECAR_DB) as side:
        for cid, dt in side.execute("SELECT control_id, doc_type FROM mc_classification"):
            side_pairs.add((cid, dt or ""))

    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls""")
        all_rows = list(c.fetchall())

    missing = [r for r in all_rows if (r["control_id"], r["doc_type"] or "") not in side_pairs]
    return missing


def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
                [{"control_id": m["control_id"],
                  "doc_type": m["doc_type"],
                  "title": m["title"],
                  "check_question": (m["check_question"] or "")[:400]}
                 for m in batch],
                ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)


def store_results(rows: list[dict], lookup: dict[tuple[str, str], dict]) -> None:
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "INSERT OR REPLACE INTO mc_classification "
            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            [
                (
                    r.get("control_id"),
                    r.get("doc_type") or "",
                    lookup.get((r.get("control_id"), r.get("doc_type") or ""), {}).get("title", ""),
                    (r.get("check_type") or "").lower(),
                    float(r.get("confidence") or 0),
                    (r.get("rationale") or "")[:500],
                    ts,
                )
                for r in rows
            ],
        )
        c.commit()


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()

    migrate_schema()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    missing = fetch_unclassified_pairs(conn)

    if args.sample:
        for m in missing[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal missing pairs: {len(missing)}")
        return

    print(f"Klassifiziere {len(missing)} fehlende (control_id, doc_type)-Paare in Batches von {BATCH_SIZE}", flush=True)
    if not missing:
        print("Alles klassifiziert. Nichts zu tun.")
        return

    lookup = {(m["control_id"], m["doc_type"] or ""): m for m in missing}
    total = len(missing)
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, total, BATCH_SIZE):
        batch = missing[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_results(out, lookup)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (total - done) / max(rate, 0.01)
            print(f" [{done:>4}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            failed_batches += 1
            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print(" Zu viele Fehler — abbrechen.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)

    print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
        ).fetchall()
    print("\n=== Final-Verteilung doc_type x check_type ===")
    prev = None
    for r in rows:
        if r["doc_type"] != prev:
            print(); print(f"[{r['doc_type']}]")
            prev = r["doc_type"]
        print(f" {r['check_type']:<8} {r['n']}")


if __name__ == "__main__":
    main()
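
The v2 docstring says that with PK=control_id, variants of the same control for different doc_types overwrote each other. The following is a minimal standalone sketch of that effect, not part of the commit: it uses an in-memory SQLite table with a reduced column set, and the control ID and rows are hypothetical values chosen for illustration.

```python
import sqlite3

# Hypothetical control ID and classifications, for illustration only.
ROWS = [
    ("MC-0001", "agb", "text"),
    ("MC-0001", "widerruf", "process"),
]


def count_rows(primary_key: str) -> int:
    """Insert ROWS into a fresh in-memory table and return the row count."""
    with sqlite3.connect(":memory:") as c:
        c.execute(
            "CREATE TABLE mc_classification ("
            "control_id TEXT, doc_type TEXT, check_type TEXT, "
            f"PRIMARY KEY ({primary_key}))"
        )
        # Same statement shape as store_results(): a key conflict replaces the row.
        c.executemany(
            "INSERT OR REPLACE INTO mc_classification VALUES (?, ?, ?)", ROWS
        )
        return c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]


v1_count = count_rows("control_id")            # second insert replaces the first
v2_count = count_rows("control_id, doc_type")  # both doc_type variants survive
print(v1_count, v2_count)
```

With the single-column key only one row remains per control, which matches the motivation given for migrating the sidecar to the composite key.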