feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core Compliance Check
- Sonnet check_type classification: text/process/review for all 1874 MCs
  in compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check.
  Also a fits_doc_type audit (v2) and a ui_only audit for DSA/e-commerce MCs
  filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered per business_profile (e.g. FRT is skipped for BMW).
- Embedding match (BGE-M3) as phase 3 after the regex match:
  per-doc_type threshold override (impressum 0.50, dse/cookie 0.60),
  short-field rescue (15-word chunks) for mandatory fields in the Impressum.
  Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the
  CMP reconstruct; the backend prefers it over the DOM extraction
  when it is richer (BMW: 1824 vs 600 words).
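
The per-doc_type threshold override described for the embedding match could look roughly like the sketch below. It is not the code from this commit: `DOC_TYPE_THRESHOLDS` mirrors the values stated above, while `DEFAULT_THRESHOLD`, `cosine` and `embedding_match` are hypothetical names, and plain Python lists stand in for BGE-M3 vectors.

```python
import math

# Values from the commit message; DEFAULT_THRESHOLD is an assumption.
DEFAULT_THRESHOLD = 0.60
DOC_TYPE_THRESHOLDS = {"impressum": 0.50, "dse": 0.60, "cookie": 0.60}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_match(control_vec: list[float],
                    chunk_vecs: list[list[float]],
                    doc_type: str) -> bool:
    """True if any document chunk reaches the threshold for this doc_type."""
    threshold = DOC_TYPE_THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    best = max((cosine(control_vec, c) for c in chunk_vecs), default=0.0)
    return best >= threshold
```

With a lower bar for `impressum`, the same chunk can pass there and still fail for `dse` or `cookie`, which is the effect the override is meant to have.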

Vendor Redundancy + EU Alternatives + Cost Saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors,
  detection of multiple providers per category, EU alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count
  + premium-feature cookies + third-party share → starter/professional/
  enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the
  upper 40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.
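
One possible reading of the tier-aware saving range, as a minimal sketch: for enterprise/premier the lower bound moves up to 40% of the span between the lowest and highest list price, while other tiers keep the full range. This interpretation and the function name `tier_aware_range` are assumptions; the actual logic in vendor_cost_estimator is not shown in this commit view.

```python
def tier_aware_range(list_low: int, list_high: int, tier: str) -> tuple[int, int]:
    """Narrow a list-price range to its upper band for large customers.

    Assumption: "upper 40-100% band" means the interval from 40% to 100%
    of the span between the lowest and the highest list price.
    """
    if tier in {"enterprise", "premier"}:
        span = list_high - list_low
        # Integer arithmetic avoids float rounding on currency amounts
        return (list_low + span * 40 // 100, list_high)
    return (list_low, list_high)
```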

Cookie Knowledge DB + Functional Classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country Inference from Legal Form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for unambiguously EU vendors
  (Adform DK, Pinterest IE).
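
A minimal sketch of how country inference from a legal-form suffix might work. The four suffix mappings and the two vendor examples come from the commit message; the names `LEGAL_FORM_COUNTRY`, `VENDOR_COUNTRY` and `infer_country`, and the matching rules, are hypothetical and not taken from cookie_link_validator.

```python
from __future__ import annotations

import re

# Suffix -> ISO country code (only the mappings named in the commit message)
LEGAL_FORM_COUNTRY = {"A/S": "DK", "GmbH": "DE", "Inc": "US", "B.V.": "NL"}
# Assumed shape of the vendor lookup table, consulted before the suffix rule
VENDOR_COUNTRY = {"adform": "DK", "pinterest": "IE"}


def infer_country(vendor_name: str) -> str | None:
    """Return a country code for the vendor, or None if nothing matches."""
    name = vendor_name.strip()
    lowered = name.lower()
    for known, country in VENDOR_COUNTRY.items():
        if known in lowered:
            return country
    for suffix, country in LEGAL_FORM_COUNTRY.items():
        # Match the suffix as a trailing token, tolerating a final dot ("Inc.")
        if re.search(rf"(?:^|[\s,]){re.escape(suffix)}\.?$", name):
            return country
    return None
```

Checking the lookup table first lets a known vendor override a suffix that points to a different country than its EU establishment.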

Action Recipes + Doc Anchor Locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) a structured instruction with what/why/fix_text/where/example,
  ready to paste 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the
  matching paragraph in the existing customer document for each finding.
  Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor for each failed doc check,
  plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).
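
To illustrate the recipe shape named above (what/why/fix_text/where/example), here is a hedged sketch. The diff imports `recipe_for` from finding_action_recipes, but its body is not shown, so the lookup below is an assumption, `first_fix_line` is a hypothetical helper, and all recipe texts are placeholders rather than the shipped wording.

```python
from __future__ import annotations

# Placeholder recipe; the texts are illustrative only.
RECIPES: dict[str, dict[str, str]] = {
    "no_country": {
        "what": "Placeholder: vendor entry states no country of establishment.",
        "why": "Placeholder for the legal basis cited by the real recipe.",
        "fix_text": "Placeholder sentence naming the vendor's country.\nSecond line.",
        "where": "Placeholder: vendor table of the cookie policy.",
        "example": "Placeholder example row.",
    },
}


def recipe_for(finding: str) -> dict[str, str] | None:
    """Look up the structured instruction for a finding type."""
    return RECIPES.get(finding)


def first_fix_line(finding: str, max_chars: int = 200) -> str:
    """First line of fix_text, safe for unknown findings and empty texts."""
    rec = recipe_for(finding)
    if not rec:
        return ""
    lines = rec.get("fix_text", "").splitlines() or [""]
    return lines[0][:max_chars]
```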

Migration Pipeline (Compliance Check -> Customer Banner/Documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with
  4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.
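
The vendor-list-to-banner mapping could be sketched as follows. The commit only states "4 categories + review flags"; the category names, the `category` field on a vendor, and the function `vendors_to_banner_config` are assumptions made for illustration, not the structure of the real CookieBannerConfig.

```python
# Assumed category names; the commit message only says there are four.
CATEGORIES = ("necessary", "functional", "statistics", "marketing")


def vendors_to_banner_config(vendors: list[dict]) -> dict:
    """Group vendor names by category; flag anything unclassified for review."""
    categories: dict[str, list[str]] = {c: [] for c in CATEGORIES}
    needs_review: list[str] = []
    for vendor in vendors:
        category = vendor.get("category")
        if category in categories:
            categories[category].append(vendor["name"])
        else:
            # Unknown or missing category: never guess, let a human decide
            needs_review.append(vendor["name"])
    return {"categories": categories, "needs_review": needs_review}
```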

Borlabs-Parity Cookie Banner Features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video
  placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.
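
As a rough illustration of the CSV/JSON consent export, the sketch below serializes consent rows with the standard library. The column names in `FIELDS` and both function names are hypothetical; the real einwilligungen_export_routes module reads from database models that are only partially visible in this diff.

```python
import csv
import io
import json

# Hypothetical export columns; the real model fields are not shown here.
FIELDS = ["consent_id", "status", "granted_at"]


def consents_to_csv(rows: list[dict]) -> str:
    """Render rows as CSV with a header line; unknown keys are dropped."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def consents_to_json(rows: list[dict]) -> str:
    """Render the same columns as a JSON array."""
    return json.dumps(
        [{k: r.get(k, "") for k in FIELDS} for r in rows],
        ensure_ascii=False,
        indent=2,
    )
```

Restricting both formats to one explicit column list keeps the two exports consistent and avoids leaking internal columns by accident.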

Bug Fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500s).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- The A3-v2 audit reclassified 24 UI-language MCs as 'process'.
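
The embedding batching fix (32 per call instead of 165 at once) can be sketched like this. `batched`, `embed_all` and the `embed_batch` callback are hypothetical names; the callback stands in for whatever client call the embedding service actually exposes.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    size: int = 32,
) -> list[list[float]]:
    """Embed all texts in fixed-size batches, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_batch(list(batch)))
    return vectors
```

For 165 inputs this produces five batches of 32 and a final batch of 5.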

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE       7.5% -> 81-83%
  Impressum 4%   -> 100% (all 6 real MCs fulfilled)
  Cookie    0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
+3 -2
@@ -39,8 +39,9 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
-# Create non-root user
-RUN useradd --create-home --shell /bin/bash appuser
+# Create non-root user + pre-create /data so volume mount inherits ownership
+RUN useradd --create-home --shell /bin/bash appuser && \
+    mkdir -p /data && chown appuser:appuser /data
# Copy application code
COPY --chown=appuser:appuser . .
@@ -33,6 +33,7 @@ _ROUTER_MODULES = [
"vvt_routes",
"legal_document_routes",
"einwilligungen_routes",
"einwilligungen_export_routes",
"escalation_routes",
"consent_template_routes",
"notfallplan_routes",
@@ -159,6 +159,13 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
from .agent_doc_check_routes import CheckItem, DocCheckResult
from .agent_doc_check_report import build_html_report
# Reset anchor-locator cache per run (avoid cross-run leak)
try:
from compliance.services.doc_anchor_locator import reset_cache
reset_cache()
except Exception:
pass
# Step 1: Resolve texts (fetch from URL if needed) — 0-30%
_update(check_id, "Texte werden geladen...", 1)
doc_texts: dict[str, str] = {}
@@ -234,6 +241,20 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
# Filter out doc_types that don't apply to this business profile
skip_types = _get_skip_types(profile)
# Derive business_scope hints for the MC filter (O1 — Doc-type Scope-Flag).
# MCs that explicitly require a feature (e.g. 'biometric_processing',
# 'ai_decision_making', 'child_targeting') get dropped when the
# detected profile doesn't declare it.
business_scope: set[str] = set()
for svc in (getattr(profile, "detected_services", []) or []):
business_scope.add(str(svc).lower())
if (getattr(profile, "business_type", "") or "").lower() == "b2c":
business_scope.add("b2c")
if getattr(profile, "has_online_shop", False):
business_scope.add("ecommerce")
if getattr(profile, "is_regulated_profession", False):
business_scope.add("regulated_profession")
# Document checks: 40-80%
n_entries = max(1, len(doc_entries))
for i, entry in enumerate(doc_entries):
@@ -268,6 +289,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
result = await _check_single(
text, doc_type, label, url,
entry["word_count"], use_agent_flag,
business_scope=business_scope,
)
# Apply profile context filter
@@ -421,9 +443,42 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
len(cmp_vendors))
cmp_vendors = await validate_vendor_urls(cmp_vendors)
cmp_vendors = score_vendors(cmp_vendors)
# Enrich each vendor with per-cookie functional roles
try:
from compliance.services.cookie_function_classifier import (
annotate_vendor_cookies,
)
cmp_vendors = [annotate_vendor_cookies(v) for v in cmp_vendors]
except Exception as e:
logger.warning("Cookie function classification skipped: %s", e)
except Exception as e:
logger.warning("VVT vendor extraction skipped: %s", e)
# Vendor redundancy + EU alternatives + cost/savings (O4)
redundancy_report = None
try:
from compliance.services.vendor_redundancy import analyze as analyze_redundancy
from compliance.services.vendor_cost_estimator import infer_company_tier
if cmp_vendors:
# Derive the company tier from business_profile; it shapes the
# cost range so that, e.g. for DAX corporations, starter prices
# do not drag down the lower bound.
bp_dict = {
"type": getattr(profile, "business_type", ""),
"features": list(business_scope),
}
ctier = infer_company_tier(bp_dict)
redundancy_report = analyze_redundancy(cmp_vendors, company_tier=ctier)
logger.info(
"Redundanz: %d Kategorien mit Mehrfach-Anbietern, "
"Spar-Schaetzung %s pro Jahr (company_tier=%s)",
redundancy_report["summary"]["redundancy_count"],
redundancy_report["summary"]["estimated_saving_pct"],
ctier,
)
except Exception as e:
logger.warning("Vendor redundancy analysis skipped: %s", e)
summary_html = build_management_summary(results)
scanned_html = build_scanned_urls_html(doc_entries)
providers_html = build_provider_list_html(banner_result, vvt_entries)
@@ -468,11 +523,18 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
if scorecard else ""
)
-report_html = build_html_report(results, None)
+report_html = build_html_report(results, None, doc_texts)
profile_html = _build_profile_html(profile)
# O4: vendor redundancy / EU alternatives + cost-savings block,
# placed between VVT and doc report so that management sees
# the savings before going into the detailed review.
from .agent_doc_check_redundancy import build_redundancy_html
redundancy_html = build_redundancy_html(redundancy_report)
full_html = (
summary_html + scanned_html + profile_html + scorecard_html
-    + providers_html + vvt_html + report_html
+    + providers_html + vvt_html + redundancy_html + report_html
)
# Step 6: Send email — derive site name primarily from entered URL.
@@ -602,6 +664,7 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
payload = resp.json()
docs = payload.get("documents", [])
cmp_payloads = payload.get("cmp_payloads") or []
cmp_cookie_text = payload.get("cmp_cookie_text") or ""
if docs:
texts = []
for doc in docs:
@@ -609,6 +672,22 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
if t and len(t) > 50:
texts.append(t)
merged = "\n\n".join(texts)
# For cookie/dse/social_media: when CMP reconstruction is
# substantially richer than DOM extraction, use it. This
# fixes the BMW case where DOM yields ~600 words of
# navigation but the ePaaS payload reconstructs to ~1800
# words of actual cookie policy.
if (doc_type in short_extract_types
and cmp_cookie_text
and len(cmp_cookie_text.split()) > len(merged.split())):
logger.info(
"Preferring CMP-reconstructed text for %s on %s "
"(%d words CMP vs %d words DOM)",
doc_type, url,
len(cmp_cookie_text.split()),
len(merged.split()),
)
merged = cmp_cookie_text
if merged and len(merged.split()) > 100:
if len(texts) > 1:
logger.info("Merged %d docs from %s (%d words)",
@@ -727,6 +806,7 @@ async def _autodiscover_missing(
discovered: list[dict] = []
disc_payloads: list[dict] = []
disc_cookie_texts: list[str] = []
for base in crawl_bases:
try:
async with httpx.AsyncClient(timeout=180.0) as client:
@@ -742,8 +822,14 @@ async def _autodiscover_missing(
body = resp.json()
discovered.extend(body.get("documents", []) or [])
disc_payloads.extend(body.get("cmp_payloads") or [])
-logger.info("auto-discovery on %s: %d docs",
-            base, len(body.get("documents", []) or []))
+cmp_text = body.get("cmp_cookie_text") or ""
+if cmp_text:
+    disc_cookie_texts.append(cmp_text)
+logger.info("auto-discovery on %s: %d docs, %d CMP payloads, "
+            "cmp_cookie_text=%d words", base,
+            len(body.get("documents", []) or []),
+            len(body.get("cmp_payloads") or []),
+            len(cmp_text.split()))
except Exception as e:
logger.warning("auto-discovery failed for %s: %s", base, e)
@@ -772,6 +858,19 @@ async def _autodiscover_missing(
d = by_type.get(dt)
if d:
full = d.get("full_text") or d.get("text_preview") or ""
# For cookie: prefer the CMP-reconstructed text when it's
# substantially richer than the auto-discovered DOM extraction.
# BMW homepage CMP yields ~1800 words of authoritative policy;
# DOM extraction typically yields ~600 words of site chrome.
if dt == "cookie" and disc_cookie_texts:
cmp_merged = "\n\n".join(disc_cookie_texts)
if len(cmp_merged.split()) > len(full.split()):
logger.info(
"cookie: using CMP-reconstructed text (%d words) "
"instead of DOM (%d words)",
len(cmp_merged.split()), len(full.split()),
)
full = cmp_merged
if len(full.split()) >= 100:
new_entry["text"] = full
new_entry["url"] = d.get("url", "")
@@ -829,6 +928,7 @@ def _classify_discovered_doc(title: str, url: str) -> str | None:
async def _check_single(
text: str, doc_type: str, label: str, url: str,
word_count: int, use_agent: bool,
business_scope: set[str] | None = None,
):
"""Run regex + MC checks on a single document."""
from compliance.services.doc_checks.runner import check_document_completeness
@@ -862,6 +962,7 @@ async def _check_single(
# (top-10 FAILs) so cost stays bounded.
mc_results = await check_document_with_controls(
text, doc_type, label, max_controls=0, use_agent=use_agent,
business_scope=business_scope,
)
if mc_results:
for mc in mc_results:
@@ -374,11 +374,52 @@ def _render_vendor_row_full(v: dict) -> str:
)
score_color = ("#16a34a" if score >= 80 else
"#d97706" if score >= 50 else "#dc2626")
# Score explanation: what was evaluated, what is missing.
# Assumption: score = passed criteria / total criteria * 100.
# Typically 5 criteria for EXT: country, cookies, opt_out, privacy, scoring.
# For INTERNAL/GROUP: opt_out + privacy are not scored (3 criteria).
n_criteria = 3 if is_own else 5
# Clamp so the "x of n" display never goes negative when more flags than criteria are set
n_failed = min(len(flags), n_criteria) if flags else 0
score_tooltip = (
f"{n_criteria - n_failed} von {n_criteria} Kriterien erfuellt"
+ (f" — fehlt: {', '.join(_flag_short(f) for f in flags[:3])}"
if flags else "")
)
# Inline-Aktions-Anweisungen pro Flag
actions_html = ""
if flags:
from compliance.services.finding_action_recipes import recipe_for
action_items = []
for f in flags:
rec = recipe_for(f)
if not rec:
continue
action_items.append(
f'<li style="margin-bottom:6px"><strong>{_flag_short(f)}:</strong> '
f'{rec.get("what", "")}<br/>'
f'<span style="color:#475569"><strong>Was tun:</strong> '
f'{(rec.get("fix_text", "").splitlines() or [""])[0][:200]}</span><br/>'
f'<span style="color:#94a3b8;font-size:9px">Quelle: '
f'{rec.get("why", "")[:160]}</span></li>'
)
if action_items:
actions_html = (
f'<details style="margin-top:4px"><summary style="cursor:pointer;'
f'color:#dc2626;font-size:10px">Was muss ich tun? '
f'({len(action_items)} Action{"s" if len(action_items) != 1 else ""})</summary>'
f'<ul style="margin:4px 0 0 14px;padding:0;font-size:10px;color:#1e293b">'
+ "".join(action_items)
+ '</ul></details>'
)
flag_str = ""
if flags:
flag_str = (
f'<div style="font-size:10px;color:#94a3b8;margin-top:2px">'
f'{", ".join(flags[:4])}</div>'
f'{actions_html}'
)
return (
f'<tr style="border-top:1px solid #e2e8f0">'
@@ -391,11 +432,26 @@ def _render_vendor_row_full(v: dict) -> str:
f'<td style="padding:6px 8px;text-align:center">{opt_status}</td>'
f'<td style="padding:6px 8px;text-align:center">{privacy_status}</td>'
f'<td style="padding:6px 8px;text-align:right;font-weight:600;'
-f'color:{score_color};font-size:11px">{score}%</td>'
+f'color:{score_color};font-size:11px" title="{score_tooltip}">'
+f'{score}%<div style="font-size:9px;font-weight:400;color:#94a3b8">'
+f'{n_criteria - n_failed}/{n_criteria}</div></td>'
f'</tr>'
)
def _flag_short(f: str) -> str:
"""Lesbare deutsche Form fuer einen Flag-Token."""
labels = {
"no_cookies_listed": "Cookies fehlen",
"no_country": "Sitzland fehlt",
"no_privacy_url": "Privacy-Link fehlt",
"broken_privacy_url": "Privacy-Link broken",
"no_opt_out_url": "Opt-Out fehlt",
"broken_opt_out": "Opt-Out broken",
}
return labels.get(f, f)
def _link_status_badge(
url: str | None,
ok: bool | None,
@@ -0,0 +1,141 @@
"""
Email renderer for the vendor redundancy + EU alternatives + cost/savings block.
Embedded in the email body below the VVT.
"""
from __future__ import annotations
def _fmt_eur(low: int, high: int) -> str:
    """Format an amount or a low-high range with German thousands separators."""
    if not low and not high:
        return "im Listpreis bundled"
    if low == high:
        return f"~{low:,}".replace(",", ".")
    return f"{low:,}-{high:,}".replace(",", ".")
def build_redundancy_html(report: dict | None) -> str:
if not report:
return ""
s = report.get("summary") or {}
redundancies = report.get("redundancies") or []
eu_alts = report.get("eu_alternatives") or []
multi = report.get("multi_function_tools") or []
cur = s.get("estimated_current_year_eur") or [0, 0]
sav = s.get("estimated_saving_year_eur") or [0, 0]
pct = s.get("estimated_saving_pct") or "n/a"
parts = [
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
'max-width:700px;margin:0 auto 16px;padding:14px 18px;'
'background:#fef3c7;border:1px solid #fcd34d;border-radius:8px">',
'<h3 style="margin:0 0 6px;font-size:14px;color:#92400e">'
'Optimierungspotenzial: Redundanzen + EU-Alternativen</h3>',
f'<p style="margin:0 0 10px;font-size:11px;color:#78350f">'
f'<strong>{s.get("redundancy_count", 0)}</strong> Kategorien mit '
f'mehreren Anbietern · <strong>{s.get("consolidation_potential", 0)}</strong> '
f'Anbieter konsolidierbar · '
f'<strong>{s.get("eu_alternative_count", 0)}</strong> EU-Alternativen verfuegbar</p>',
'<div style="background:#fff;border:1px solid #fcd34d;border-radius:6px;'
'padding:10px 12px;margin-bottom:10px">',
'<div style="font-size:10px;color:#94a3b8;margin-bottom:6px;text-transform:uppercase;letter-spacing:0.5px">'
'Diese Schaetzung umfasst NUR die als redundant erkannten Tools — '
'nicht den Gesamt-Stack der Website</div>',
f'<div style="font-size:11px;color:#78350f">'
f'Listpreis-Schaetzung der <strong>redundanten</strong> Tools '
f'(Mehrfach-Anbieter in derselben Funktions-Kategorie):'
f' <strong>{_fmt_eur(*cur)}/Jahr</strong></div>',
f'<div style="font-size:11px;color:#16a34a;margin-top:4px">'
f'Sparpotenzial durch Konsolidierung auf je 1 EU-Tool pro Kategorie:'
f' <strong>{_fmt_eur(*sav)}/Jahr</strong> ({pct})</div>',
'<div style="font-size:10px;color:#94a3b8;margin-top:8px;font-style:italic">'
'<strong>Wichtige Einschraenkungen:</strong><br/>'
'• Konzern-Konditionen liegen ueblicherweise 30-50% unter Listpreis, '
'realistisches Saving entsprechend €X·0,5 bis €X·0,7.<br/>'
'• Eintraege "<em>Eigene Marke — Tool</em>" (z.B. "BMW AG — Adobe Analytics") '
'gehoeren oft zu einem einzigen Master-Vertrag, nicht zu mehreren Lizenzen.<br/>'
'• Media-Spend (Google Ads, Meta Ads) ist NICHT enthalten — nur Tooling-Lizenzen.<br/>'
'• Quelle: Gartner/Forrester 2025 + oeffentliche Listpreise.'
'</div></div>',
]
if redundancies:
parts.append(
'<table style="width:100%;border-collapse:collapse;font-size:11px;'
'margin-bottom:10px">'
'<thead><tr style="background:#fde68a;color:#78350f;text-align:left">'
'<th style="padding:6px 8px">Kategorie</th>'
'<th style="padding:6px 8px">#</th>'
'<th style="padding:6px 8px">Anbieter</th>'
'<th style="padding:6px 8px">EU-Empfehlung</th>'
'<th style="padding:6px 8px;text-align:right">Saving / Jahr</th>'
'</tr></thead><tbody>'
)
for r in redundancies[:12]:
vendors_str = ", ".join(r.get("vendors", [])[:6])
if len(r.get("vendors", [])) > 6:
vendors_str += f" (+{len(r['vendors']) - 6} weitere)"
sav_r = r.get("estimated_saving_year_eur") or [0, 0]
parts.append(
f'<tr style="border-top:1px solid #fde68a;vertical-align:top">'
f'<td style="padding:5px 8px;color:#78350f;font-weight:600">{r["category_label"]}</td>'
f'<td style="padding:5px 8px;text-align:center">{r["count"]}</td>'
f'<td style="padding:5px 8px;color:#1e293b;font-size:10px">{vendors_str}</td>'
f'<td style="padding:5px 8px;color:#16a34a;font-size:10px">{r.get("suggested_eu_tool") or ""}</td>'
f'<td style="padding:5px 8px;text-align:right;color:#16a34a;font-weight:600">'
f'{_fmt_eur(*sav_r)}</td></tr>'
)
hint = r.get("consolidation_hint")
if hint:
parts.append(
f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px;font-style:italic">'
f'Hinweis: {hint}</td></tr>'
)
caveats = r.get("caveats") or []
if caveats:
parts.append(
f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px">'
f'<strong>Moegliche Gruende fuer Mehrfach-Einsatz:</strong> '
+ "; ".join(caveats) + '</td></tr>'
)
parts.append('</tbody></table>')
if multi:
parts.append(
'<div style="margin-top:8px"><strong style="font-size:11px;color:#78350f">'
'Multi-Funktions-Tools (1 Tool ersetzt mehrere Kategorien):</strong>'
'<ul style="margin:6px 0 0 18px;padding:0;font-size:11px;color:#78350f">'
)
for t in multi[:4]:
cats = ", ".join(t.get("replaces_categories", []))
parts.append(
f'<li style="margin-bottom:3px"><strong>{t["name"]}</strong>'
f' ({t["country"]}) — ersetzt <em>{cats}</em>'
f' ({t.get("potential_replacements", 0)} Anbieter heute)</li>'
)
parts.append('</ul></div>')
if eu_alts:
parts.append(
'<details style="margin-top:8px"><summary style="font-size:11px;color:#78350f;'
'cursor:pointer">EU-Alternativen pro Anbieter (Details)</summary>'
'<ul style="margin:6px 0 0 18px;padding:0;font-size:10px;color:#475569">'
)
for e in eu_alts[:20]:
first_alt = (e.get("alternatives") or [{}])[0]
parts.append(
f'<li style="margin-bottom:3px"><strong>{e["current_vendor"]}</strong>'
f' → {first_alt.get("name", "")} ({first_alt.get("country", "")})'
f' <span style="color:#94a3b8">— {first_alt.get("notes", "")}</span></li>'
)
parts.append('</ul></details>')
parts.append('</div>')
return "".join(parts)
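
The renderer above reads most fields with `.get(...)` but indexes a few directly (`r["category_label"]`, `r["count"]`, `t["name"]`, `t["country"]`, `e["current_vendor"]`), so a report missing one of those would raise `KeyError` during email rendering. A small pre-check along these lines could surface that earlier; `REQUIRED_KEYS` and `missing_required_keys` are hypothetical names, and the sample values are purely illustrative.

```python
# Keys that build_redundancy_html accesses with [] rather than .get()
REQUIRED_KEYS = {
    "redundancies": ("category_label", "count"),
    "multi_function_tools": ("name", "country"),
    "eu_alternatives": ("current_vendor",),
}


def missing_required_keys(report: dict) -> list[str]:
    """List every directly indexed key that is absent from the report."""
    missing: list[str] = []
    for section, keys in REQUIRED_KEYS.items():
        for i, entry in enumerate(report.get(section) or []):
            for key in keys:
                if key not in entry:
                    missing.append(f"{section}[{i}].{key}")
    return missing


# Illustrative report shape (values are made up for the example)
sample_report = {
    "summary": {"redundancy_count": 1, "estimated_saving_pct": "n/a"},
    "redundancies": [{"category_label": "Analytics", "count": 2, "vendors": ["A", "B"]}],
    "multi_function_tools": [{"name": "Matomo Pro", "country": "DE"}],
    "eu_alternatives": [{"current_vendor": "A", "alternatives": []}],
}
```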
@@ -7,8 +7,12 @@ including L1/L2 check hierarchy, progress bars, and actionable hints.
from __future__ import annotations
import logging
import re
from typing import TYPE_CHECKING
logger = logging.getLogger(__name__)
if TYPE_CHECKING:
from .agent_doc_check_routes import CheckItem, DocCheckResult
@@ -32,12 +36,93 @@ def _icon(passed: bool, skipped: bool = False) -> str:
return '<span style="color:#ef4444;font-weight:bold">&#10007;</span>'
-def _hint_box(hint: str) -> str:
-    return (
def _first_sentence(text: str, max_chars: int = 300) -> str:
    """Return the first complete sentence rather than the first line; robust
    against multi-line fix texts that start with bullet lists."""
    if not text:
        return ""
    # Look for sentence-ending punctuation before max_chars
    snippet = text[:max_chars]
    m = re.search(r"^(.+?[\.\?\!])(?:\s|$)", snippet, re.DOTALL)
    if m:
        first = m.group(1).strip()
        # If the "sentence" is only a variant header such as "Variante A:",
        # keep going; the real content only starts after it
        if re.fullmatch(r"(Variante [A-Z]\s*\([^\)]+\):?|Beispiel\s*\d*:?)",
                        first, re.IGNORECASE):
            rest = text[m.end():].lstrip()
            return _first_sentence(rest, max_chars)
        return first
    # No sentence-ending punctuation: take the first line, truncated to max_chars
    line = (text.splitlines() or [""])[0]
    return line[:max_chars] + ("..." if len(line) > max_chars else "")
def _hint_box(hint: str, check_label: str = "", doc_text: str = "",
doc_id: str | None = None) -> str:
"""Hint-Block mit angereichertem Recipe + Doc-Anchor wenn moeglich."""
base = (
f'<div style="font-size:11px;color:#dc2626;margin:2px 0 4px 20px;'
f'padding:4px 8px;background:#fef2f2;border-radius:4px;'
-f'border-left:3px solid #fca5a5">{hint}</div>'
+f'border-left:3px solid #fca5a5">{hint}'
)
# Recipe + Anker hinzufuegen wenn check_label bekannt
if check_label:
try:
from compliance.services.finding_action_recipes import recipe_for
from compliance.services.doc_anchor_locator import locate_anchor
rec = recipe_for(check_label)
if rec and rec.get("fix_text"):
first_sentence = _first_sentence(rec["fix_text"], 300)
full = rec["fix_text"]
# Statt <details> ein einfaches Inline-Block-Layout —
# robuster bei Plain-Text-Mail-Render
more = ""
if len(full) > len(first_sentence) + 10:
more = (
f'<div style="margin-top:4px;padding:6px 8px;background:#fff;'
f'border:1px solid #fcd5d5;border-radius:4px;font-size:10px;'
f'white-space:pre-wrap;color:#1e293b">'
f'<strong style="display:block;margin-bottom:3px;color:#475569">'
f'Vollstaendiger Textbaustein zum Einfuegen:</strong>'
f'{full}</div>'
)
base += (
f'<div style="margin-top:6px;padding-top:6px;border-top:1px solid #fecaca">'
f'<strong style="color:#7c3aed;font-size:10px">Konkrete Massnahme:</strong> '
f'<span style="color:#1e293b">{first_sentence}</span>'
f'{more}'
)
# Anker via Embedding-Locator (mit doc_id-Cache)
if doc_text:
anchor = locate_anchor(check_label, doc_text, doc_id)
if anchor and anchor.get("anchor_phrase") and anchor.get("confidence") != "low":
conf_label = anchor.get("confidence", "")
conf_badge = (
f' <span style="color:#94a3b8;font-size:9px">'
f'(Match-Konfidenz {conf_label}, '
f'Score {anchor.get("score", "")})</span>'
)
base += (
f'<div style="margin-top:4px;color:#475569;font-size:10px">'
f'<strong>Einfuegen:</strong> {anchor["position_hint"]}'
f'{conf_badge}</div>'
)
elif rec.get("where"):
# Kein guter Anchor-Match — zeige generischen Fallback
base += (
f'<div style="margin-top:4px;color:#475569;font-size:10px">'
f'<strong>Einfuegen:</strong> {rec["where"]} '
f'<span style="color:#94a3b8;font-size:9px">'
f'(kein eindeutiger Absatz im Dokument gefunden — '
f'Anweisung allgemein)</span></div>'
)
base += '</div>'
except Exception as e:
logger.debug("Hint-box enrichment failed: %s", e)
pass # Recipes optional — Hint-Box muss nie crashen
base += '</div>'
return base
def build_management_summary(results: list[DocCheckResult]) -> str:
@@ -158,8 +243,14 @@ def _check_to_action(doc_label: str, check_label: str, hint: str) -> str:
def build_html_report(
results: list[DocCheckResult],
cookie_result: dict | None,
doc_texts: dict[str, str] | None = None,
) -> str:
-"""Build HTML email report styled like the frontend."""
+"""Build HTML email report styled like the frontend.
+`doc_texts` maps doc_type to text so hint boxes can locate the
+relevant paragraph in the original document for the insertion recommendation.
+"""
+doc_texts = doc_texts or {}
ok_count = sum(1 for r in results if r.completeness_pct == 100)
html = [
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
@@ -170,7 +261,7 @@ def build_html_report(
]
for r in results:
-_render_document(html, r)
+_render_document(html, r, doc_texts.get(r.doc_type, ""))
if cookie_result:
_render_cookie_banner(html, cookie_result)
@@ -179,7 +270,7 @@ def build_html_report(
return "\n".join(html)
-def _render_document(html: list[str], r: DocCheckResult) -> None:
+def _render_document(html: list[str], r: DocCheckResult, doc_text: str = "") -> None:
pct = r.completeness_pct
cpct = r.correctness_pct
bar_color = "green" if pct >= 80 else "yellow" if pct >= 50 else "red"
@@ -244,7 +335,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
else:
html.append('<div style="padding:8px 16px 12px">')
for c in l1_checks:
-_render_l1_check(html, c, l2_by_parent.get(c.id, []))
+_render_l1_check(html, c, l2_by_parent.get(c.id, []), doc_text)
# Master-Control aggregation: with 1874 MCs evaluated per run,
# rendering every L2 check inline produces ~600 rows per doc and
@@ -289,6 +380,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
def _render_l1_check(
html: list[str], c: CheckItem, children: list[CheckItem],
doc_text: str = "",
) -> None:
l2_sub = [ch for ch in children if not ch.skipped]
l2_passed = sum(1 for ch in l2_sub if ch.passed)
@@ -301,16 +393,16 @@ def _render_l1_check(
if l2_sub:
html.append(f' <span style="color:#9ca3af;font-size:11px">({l2_passed}/{len(l2_sub)})</span>')
if not c.passed and c.hint:
-html.append(_hint_box(c.hint))
+html.append(_hint_box(c.hint, c.label, doc_text))
html.append('</div>')
for ch in children:
if ch.skipped:
continue
-_render_l2_check(html, ch)
+_render_l2_check(html, ch, doc_text)
-def _render_l2_check(html: list[str], ch: CheckItem) -> None:
+def _render_l2_check(html: list[str], ch: CheckItem, doc_text: str = "") -> None:
style = "color:#dc2626;font-weight:500" if not ch.passed else "color:#6b7280"
html.append(
f'<div style="padding:2px 0 2px 24px;border-left:2px solid #e5e7eb;margin-left:8px">'
@@ -324,7 +416,7 @@ def _render_l2_check(html: list[str], ch: CheckItem) -> None:
f'white-space:nowrap">"...{ch.matched_text[:80]}..."</div>'
)
if not ch.passed and ch.hint:
-html.append(_hint_box(ch.hint))
+html.append(_hint_box(ch.hint, ch.label, doc_text))
html.append('</div>')
@@ -1808,6 +1808,32 @@ async def list_categories():
# SIMILAR CONTROLS (Embedding-based dedup)
# =============================================================================
_EMBEDDING_COL_AVAILABLE: bool | None = None
def _has_embedding_col() -> bool:
    """Cache whether canonical_controls has the embedding column.
    Returns False on systems where pgvector + embedding backfill weren't
    set up. Saves the per-request 500 + log spam.
    """
    global _EMBEDDING_COL_AVAILABLE
    if _EMBEDDING_COL_AVAILABLE is not None:
        return _EMBEDDING_COL_AVAILABLE
    try:
        with SessionLocal() as db:
            r = db.execute(text(
                "SELECT 1 FROM information_schema.columns "
                "WHERE table_schema='compliance' "
                "AND table_name='canonical_controls' "
                "AND column_name='embedding'"
            )).fetchone()
            _EMBEDDING_COL_AVAILABLE = bool(r)
    except Exception:
        _EMBEDDING_COL_AVAILABLE = False
    return _EMBEDDING_COL_AVAILABLE
@router.get("/controls/{control_id}/similar")
async def find_similar_controls(
control_id: str,
@@ -1815,6 +1841,8 @@ async def find_similar_controls(
limit: int = Query(20, ge=1, le=100),
):
"""Find controls similar to the given one using embedding cosine similarity."""
if not _has_embedding_col():
return []
with SessionLocal() as db:
# Get the target control's embedding
target = db.execute(
@@ -1856,7 +1884,7 @@ async def find_similar_controls(
 "title": r.title,
 "severity": r.severity,
 "release_state": r.release_state,
-"tags": r.tags or [],
+"tags": _jsonish(r.tags) or [],
"license_rule": r.license_rule,
"verification_method": r.verification_method,
"category": r.category,
@@ -1866,6 +1894,10 @@ async def find_similar_controls(
]
except Exception as e:
logger.warning("Embedding similarity query failed (no embedding column?): %s", e)
try:
db.rollback()
except Exception:
pass
return []
@@ -1946,6 +1978,22 @@ async def get_v1_matches_endpoint(control_id: str):
# INTERNAL HELPERS
# =============================================================================
def _jsonish(v):
    """Parse v as JSON if it's a string that looks like JSON, otherwise return it as-is.
    Some canonical_controls rows were inserted with jsonb columns containing
    raw JSON strings (e.g. '["a","b"]' stored as TEXT). The frontend expects
    real arrays, so coerce here to make .map() work.
    """
    if isinstance(v, str) and v and v[0] in "[{":
        try:
            import json as _j
            return _j.loads(v)
        except Exception:
            return v
    return v
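To illustrate the coercion the docstring describes, here is a minimal standalone sketch. The name `jsonish_demo` is hypothetical and not part of the commit; it mirrors the helper's logic so the behavior for stringified JSON, native values and malformed input can be checked in isolation.

```python
import json


def jsonish_demo(v):
    """Hypothetical stand-in for _jsonish: decode JSON-looking strings, pass the rest through."""
    if isinstance(v, str) and v and v[0] in "[{":
        try:
            return json.loads(v)
        except ValueError:
            # Looked like JSON but is not valid; keep the raw string.
            return v
    return v


print(jsonish_demo('["a", "b"]'))   # stringified array becomes a real list
print(jsonish_demo(["a", "b"]))     # native list is returned unchanged
print(jsonish_demo("[broken"))      # malformed input stays a string
print(jsonish_demo(None) or [])     # the "or []" fallback used in _control_row
```

Note that the `or []` fallback only covers empty or missing values; a malformed string still reaches the caller as a string.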
def _control_row(r) -> dict:
return {
"id": str(r.id),
@@ -1954,17 +2002,17 @@ def _control_row(r) -> dict:
"title": r.title,
"objective": r.objective,
"rationale": r.rationale,
- "scope": r.scope,
- "requirements": r.requirements,
- "test_procedure": r.test_procedure,
- "evidence": r.evidence,
+ "scope": _jsonish(r.scope),
+ "requirements": _jsonish(r.requirements),
+ "test_procedure": _jsonish(r.test_procedure) or [],
+ "evidence": _jsonish(r.evidence) or [],
"severity": r.severity,
"risk_score": float(r.risk_score) if r.risk_score is not None else None,
"implementation_effort": r.implementation_effort,
"evidence_confidence": float(r.evidence_confidence) if r.evidence_confidence is not None else None,
- "open_anchors": r.open_anchors,
+ "open_anchors": _jsonish(r.open_anchors) or [],
"release_state": r.release_state,
- "tags": r.tags or [],
+ "tags": _jsonish(r.tags) or [],
"license_rule": r.license_rule,
"source_original_text": r.source_original_text,
"source_citation": r.source_citation,
@@ -0,0 +1,181 @@
"""
Consent log export (Borlabs parity + DPO audit requirement).
Auditors routinely ask for an extract of all granted and revoked
consents per tenant. Until now the data protection officer (DPO) had to
write SQL by hand for that. These endpoints deliver CSV and JSON straight
to the browser.
Endpoints:
GET /einwilligungen/export/consents.csv
GET /einwilligungen/export/consents.json
GET /einwilligungen/export/history.csv (change history)
"""
from __future__ import annotations
import csv
import io
import json
import logging
from datetime import datetime, timezone
from fastapi import APIRouter, Depends, Header, Query
from fastapi.responses import Response
from sqlalchemy.orm import Session
from classroom_engine.database import get_db
from ..db.einwilligungen_models import (
EinwilligungenConsentDB,
EinwilligungenConsentHistoryDB,
)
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/einwilligungen/export", tags=["einwilligungen-export"])
def _get_tenant(x_tenant_id: str | None = Header(None, alias="X-Tenant-ID")) -> str:
if not x_tenant_id:
from .tenant_utils import get_tenant_id
return get_tenant_id()
return x_tenant_id
def _ts() -> str:
return datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
def _consent_rows(consents: list[EinwilligungenConsentDB]) -> list[dict]:
return [
{
"consent_id": str(c.id),
"user_id": c.user_id or "",
"data_point_id": c.data_point_id or "",
"granted": "yes" if c.granted else "no",
"purpose": c.purpose or "",
"consent_version": c.consent_version or "",
"ip_address": c.ip_address or "",
"user_agent": (c.user_agent or "")[:200],
"source": c.source or "",
"created_at": c.created_at.isoformat() if c.created_at else "",
"updated_at": c.updated_at.isoformat() if c.updated_at else "",
"revoked_at": c.revoked_at.isoformat() if getattr(c, "revoked_at", None) else "",
}
for c in consents
]
def _history_rows(entries: list[EinwilligungenConsentHistoryDB]) -> list[dict]:
return [
{
"id": str(e.id),
"consent_id": str(e.consent_id),
"action": e.action or "",
"consent_version": e.consent_version or "",
"ip_address": e.ip_address or "",
"user_agent": (e.user_agent or "")[:200],
"source": e.source or "",
"created_at": e.created_at.isoformat() if e.created_at else "",
}
for e in entries
]
def _csv_response(rows: list[dict], filename: str) -> Response:
if not rows:
return Response(content="", media_type="text/csv",
headers={"Content-Disposition": f"attachment; filename={filename}"})
buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
w.writeheader()
w.writerows(rows)
return Response(content=buf.getvalue(), media_type="text/csv; charset=utf-8",
headers={"Content-Disposition": f"attachment; filename={filename}"})
def _json_response(payload: dict, filename: str) -> Response:
body = json.dumps(payload, ensure_ascii=False, indent=2, default=str)
return Response(content=body, media_type="application/json; charset=utf-8",
headers={"Content-Disposition": f"attachment; filename={filename}"})
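The CSV helper relies on `csv.DictWriter` with `csv.QUOTE_ALL`. The following sketch shows what that produces for values containing commas or quotes. The function name `rows_to_csv` and the sample rows are hypothetical and only serve the illustration.

```python
import csv
import io


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize dict rows to CSV text, quoting every field (hypothetical helper)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample = [
    {"consent_id": "1", "granted": "yes", "purpose": "newsletter, weekly"},
    {"consent_id": "2", "granted": "no", "purpose": 'say "hi"'},
]
csv_text = rows_to_csv(sample)
print(csv_text)
```

Every field, including the header, is wrapped in double quotes, and embedded quotes are doubled. That keeps free-text columns such as `purpose` or `user_agent` from breaking the column layout when the file is opened in a spreadsheet.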
@router.get("/consents.csv")
async def export_consents_csv(
user_id: str | None = Query(None, description="Filter by single user"),
granted: bool | None = Query(None),
since: str | None = Query(None, description="ISO timestamp"),
tenant_id: str = Depends(_get_tenant),
db: Session = Depends(get_db),
) -> Response:
"""Download all consent records of this tenant as CSV (auditor-ready)."""
q = db.query(EinwilligungenConsentDB).filter(
EinwilligungenConsentDB.tenant_id == tenant_id,
)
if user_id:
q = q.filter(EinwilligungenConsentDB.user_id == user_id)
if granted is not None:
q = q.filter(EinwilligungenConsentDB.granted == granted)
if since:
try:
since_dt = datetime.fromisoformat(since.rstrip("Z"))
q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
except Exception:
pass
rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
return _csv_response(rows, f"consents_{tenant_id[:8]}_{_ts()}.csv")
@router.get("/consents.json")
async def export_consents_json(
user_id: str | None = Query(None),
granted: bool | None = Query(None),
since: str | None = Query(None),
tenant_id: str = Depends(_get_tenant),
db: Session = Depends(get_db),
) -> Response:
"""Same data as the CSV endpoint but JSON-shaped for further processing."""
q = db.query(EinwilligungenConsentDB).filter(
EinwilligungenConsentDB.tenant_id == tenant_id,
)
if user_id:
q = q.filter(EinwilligungenConsentDB.user_id == user_id)
if granted is not None:
q = q.filter(EinwilligungenConsentDB.granted == granted)
if since:
try:
since_dt = datetime.fromisoformat(since.rstrip("Z"))
q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
except Exception:
pass
rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
payload = {
"tenant_id": tenant_id,
"exported_at": datetime.now(timezone.utc).isoformat(),
"filter": {"user_id": user_id, "granted": granted, "since": since},
"count": len(rows),
"consents": rows,
}
return _json_response(payload, f"consents_{tenant_id[:8]}_{_ts()}.json")
@router.get("/history.csv")
async def export_history_csv(
consent_id: str | None = Query(None, description="Limit to one consent"),
since: str | None = Query(None),
tenant_id: str = Depends(_get_tenant),
db: Session = Depends(get_db),
) -> Response:
"""Download the consent-change history (burden of proof under Art. 7(1) GDPR)."""
q = db.query(EinwilligungenConsentHistoryDB).filter(
EinwilligungenConsentHistoryDB.tenant_id == tenant_id,
)
if consent_id:
q = q.filter(EinwilligungenConsentHistoryDB.consent_id == consent_id)
if since:
try:
since_dt = datetime.fromisoformat(since.rstrip("Z"))
q = q.filter(EinwilligungenConsentHistoryDB.created_at >= since_dt)
except Exception:
pass
rows = _history_rows(q.order_by(EinwilligungenConsentHistoryDB.created_at.asc()).all())
return _csv_response(rows, f"consent-history_{tenant_id[:8]}_{_ts()}.csv")
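All three endpoints parse the `since` parameter the same way and skip the filter when parsing fails. As a sketch of that behavior in one place, the hypothetical helper `parse_since` below (not part of the commit) returns `None` for missing or malformed input, so a caller could decide whether to ignore the filter or reject the request.

```python
from datetime import datetime
from typing import Optional


def parse_since(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 timestamp; return None when absent or malformed."""
    if not value:
        return None
    try:
        # Dropping a trailing "Z" keeps this working on Python < 3.11, where
        # fromisoformat() rejects the suffix. The result is a naive datetime.
        return datetime.fromisoformat(value.rstrip("Z"))
    except ValueError:
        return None


print(parse_since("2024-05-01T12:30:00Z"))
print(parse_since("not-a-date"))
```

Because the result is naive, it is compared against `created_at` without any timezone conversion. Whether that matches how the column is stored is an assumption worth verifying.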
@@ -0,0 +1,167 @@
"""
Cookie function classifier: determines what each individual cookie does.
So far we have one category per vendor (analytics/advertising/...).
However, a vendor often sets 10-50 different cookies, and not every cookie
of a marketing platform serves advertising. Many handle session management,
language preference, scroll position and so on.
This module classifies each cookie by:
- functional_role : what the cookie does technically (session_id,
csrf_token, ab_test, user_id, ad_id, etc.)
- data_collected : which data sits behind it (visitor_id,
page_view, click, conversion_event, etc.)
- blocking_impact : what happens when the cookie is blocked
(none, no_personalization, no_tracking, site_breaks)
This lets the vendor redundancy analyzer state precisely:
"Adobe Analytics sets 55 cookies: 12 for tracking, 8 for A/B tests
and 35 for internal performance. Matomo covers the 12 tracking and 8 A/B test
cookies, so 55 Adobe cookies become 20 Matomo cookies."
"""
from __future__ import annotations
import re
from typing import Iterable
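# Pattern → (functional_role, blocking_impact)
# Order matters: the first matching pattern wins, so more specific patterns come first.
_PATTERNS: list[tuple[str, str, str]] = [
# Session / authentication
(r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
# CSRF must precede the auth rule: names such as "csrftoken" or "xsrf-token"
# would otherwise hit the generic "token" alternative and be tagged auth_token.
(r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
(r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),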
# Language setting / region
(r"lang|locale|culture|region", "preference", "no_personalization"),
# User preferences (theme, view, bookmark)
(r"theme|dark|mode|view|sort|filter", "ui_preference", "no_personalization"),
(r"bookmark|favorite|favorit", "user_data", "no_personalization"),
# The consent cookie itself
(r"consent|gdpr|tcf|euconsent", "consent_state", "site_breaks"),
# Tracking IDs (most analytics)
(r"^_ga|gid|gat|google_analytic", "tracking_id", "no_tracking"),
(r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
(r"^s_|s\.cc|adobesite|aam", "tracking_id", "no_tracking"), # Adobe
(r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
(r"_uetsid|_uetvid|microsoft", "tracking_id", "no_tracking"),
# Visitor identification
(r"visitor|uid|user_id|customer_id", "visitor_id", "no_personalization"),
# A/B-Test / Personalisation
(r"ab_test|abtest|variant|experiment|target|target_qa", "ab_test", "no_personalization"),
(r"personalization|personalisation|adobe_target", "personalisation", "no_personalization"),
# Advertising / retargeting
(r"fbp|fbc|fb_id|facebook|meta_pixel|^fr$", "ad_pixel", "no_tracking"),
(r"adform|criteo|outbrain|taboola|tapad|adsrvr", "ad_pixel", "no_tracking"),
# Short names are anchored so that "ide" and "nid" do not match inside longer names.
(r"doubleclick|test_cookie|^ide$|^nid$|exchange_uid", "ad_pixel", "no_tracking"),
(r"google_ad|gads|gcl", "ad_pixel", "no_tracking"),
(r"^li_|linkedin|bcookie|bscookie", "ad_pixel", "no_tracking"),
(r"pinterest|_pinterest_|_pin_unauth", "ad_pixel", "no_tracking"),
# Affiliate / Conversion
(r"conversion|orderid|order_id|transaction|purchase", "conversion_event", "no_tracking"),
(r"campaign|utm|source|medium|term", "campaign_attribution", "no_tracking"),
# ScrollPosition / Form-Helper
(r"scroll|position|form_|form_state", "ui_state", "no_personalization"),
# Loadbalancer / Sticky
(r"affinity|sticky|lb_|alb-|aws-alb", "load_balancer", "site_breaks"),
# Chat / Support
(r"chat|widget|genesys|livechat", "chat_session", "no_personalization"),
# Captcha
(r"hcaptcha|recaptcha|cf_|cloudflare", "bot_protection", "site_breaks"),
]
_FUNCTIONAL_LABEL = {
"session_id": "Sitzungs-ID",
"auth_token": "Auth-Token",
"csrf_token": "CSRF-Schutz",
"preference": "Sprache / Region",
"ui_preference": "UI-Praeferenz",
"user_data": "Nutzer-Daten",
"consent_state": "Consent-Speicher",
"tracking_id": "Tracking-ID",
"session_recording": "Session-Recording",
"visitor_id": "Besucher-ID",
"ab_test": "A/B-Test",
"personalisation": "Personalisierung",
"ad_pixel": "Werbe-Pixel",
"conversion_event": "Konversions-Tracking",
"campaign_attribution":"Kampagnen-Attribution",
"ui_state": "UI-Zustand (ScrollPos etc.)",
"load_balancer": "Load-Balancer",
"chat_session": "Chat-Session",
"bot_protection": "Bot-Schutz",
"unknown": "Unbekannt",
}
# Which functional_roles overlap in function. Used by
# vendor_redundancy.analyze() to detect real consolidation opportunities
# instead of merely counting duplicate providers.
OVERLAPPING_ROLES = {
"tracking_id": "tracking",
"session_recording": "tracking",
"ab_test": "personalisation",
"personalisation": "personalisation",
"ad_pixel": "advertising",
"conversion_event": "advertising",
"campaign_attribution":"advertising",
}
def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"
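Since the classifier returns on the first matching pattern, the order of the rule table decides the outcome for names that match several rules. The sketch below uses a hypothetical, shortened rule table (`RULES`) and a standalone function `classify` to show this; it is an illustration under those assumptions, not the module's actual table.

```python
import re

# Hypothetical, shortened rule table: (pattern, functional_role, blocking_impact).
RULES = [
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
    (r"auth|login|token|jwt", "auth_token", "site_breaks"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
]


def classify(name, rules=RULES):
    """Return (role, impact) of the first rule whose pattern matches the lowercased name."""
    normalized = (name or "").lower().strip()
    for pattern, role, impact in rules:
        if re.search(pattern, normalized):
            return role, impact
    return "unknown", "no_tracking"


print(classify("XSRF-TOKEN"))                               # CSRF rule is checked first
print(classify("XSRF-TOKEN", rules=list(reversed(RULES))))  # generic "token" rule wins instead
```

With the rules reversed, the same cookie lands in `auth_token`, which is why broad alternatives such as `token` are best placed after the narrower rules they overlap with.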
def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }
def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: which functional roles occur, and how often?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        if not roles and v.get("cookies"):
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
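The aggregation boils down to summing per-role counts across vendors and recording which vendors contribute to each role. A compact sketch of that idea follows; the function `aggregate_roles` and the demo vendors are hypothetical, and `collections.Counter` is a choice made here for brevity rather than what the module uses.

```python
from collections import Counter, defaultdict


def aggregate_roles(vendors):
    """Sum per-role cookie counts across vendors and list the vendors behind each role."""
    total = Counter()
    vendors_per_role = defaultdict(list)
    for vendor in vendors:
        for role, count in (vendor.get("role_distribution") or {}).items():
            total[role] += count
            if count > 0:
                vendors_per_role[role].append(vendor.get("name", ""))
    return {"total_per_role": dict(total), "vendors_per_role": dict(vendors_per_role)}


demo = [
    {"name": "Vendor A", "role_distribution": {"tracking_id": 12, "ab_test": 8}},
    {"name": "Vendor B", "role_distribution": {"tracking_id": 3}},
]
result = aggregate_roles(demo)
print(result)
```

A role served by more than one vendor, such as `tracking_id` here, is the kind of overlap a redundancy analysis would look at first.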
@@ -0,0 +1,608 @@
"""
Cookie knowledge base: the maximum extractable knowledge per cookie name.
Each entry records:
- vendor : Provider that sets the cookie (full company name + country of domicile)
- exact_purpose : What EXACTLY the cookie does (not just the category)
- data_collected : Which data fields (client ID, timestamp, IP, etc.)
- ip_relevant : Is the IP address captured/transmitted?
- ip_anonymized : Anonymized by default?
- tcf_purpose_ids : IAB TCF v2.2 purpose IDs (1-11)
- iab_vendor_id : IAB Global Vendor List ID (for TCF sync)
- typical_lifetime : How long the cookie persists
- reid_risk : Re-identification risk (low/medium/high)
- technical_necessity: Required under §25(2) TDDDG? (none/partial/full)
- schrems_ii_status : Third-country transfer assessment
- eugh_rulings : Relevant CJEU/CNIL/LfDI decisions
- eu_alternative_* : EU cookie/vendor replacement with the same function
- notes : Other remarks (avoidance, configuration)
Sources: Cookiepedia, IAB Europe TCF v2.2 Vendor List, Cookiebot DB,
CNIL Cookies & Trackers Guidelines 2024, EDPB Cookie Guidelines 2/2023,
DSK guidance (Orientierungshilfe) on telemedia 2021, vendors' own documentation.
As of: 2026-05.
Extending: pull requests are welcome. For the format see TEMPLATE_ENTRY at the
end of the file.
"""
from __future__ import annotations
from typing import TypedDict
class CookieKnowledge(TypedDict, total=False):
vendor: str
vendor_country: str
exact_purpose: str
data_collected: list[str]
ip_relevant: bool
ip_anonymized: bool
tcf_purpose_ids: list[int]
iab_vendor_id: int | None
typical_lifetime: str
reid_risk: str # 'low' | 'medium' | 'high'
technical_necessity: str # 'none' | 'partial' | 'full'
schrems_ii_status: str
eugh_rulings: list[str]
eu_alternative_cookies: list[str]
eu_alternative_vendor: str
notes: str
# ─── Google ──────────────────────────────────────────────────────────
_GOOGLE_BASE = {
"vendor": "Google LLC", "vendor_country": "US",
"schrems_ii_status": "Drittlandtransfer in die USA. Mit DPF "
"(EU-US Data Privacy Framework, 2023) wieder zulaessig, "
"aber bereits Klage NOYB anhaengig (Schrems III). "
"Risiko-Bewertung empfohlen.",
"eugh_rulings": [
"EuGH C-311/18 (Schrems II, 16.07.2020) — Privacy Shield gekippt",
"CNIL SAN-2022-002 (10.02.2022) — Google Analytics ohne Anonymisierung "
"unzulaessig",
"Datenschutzkonferenz (DSK) Orientierungshilfe 2024 — GA4 mit "
"Server-Side-Tagging als Mitigation moeglich",
],
}
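The knowledge base entries reuse this base record through dict unpacking. The sketch below shows the two properties worth knowing about that pattern: later keys override inherited ones, and the copy is shallow. `BASE`, `entry` and the `rulings` key are hypothetical stand-ins for illustration only.

```python
# Hypothetical shortened base record; the list stands in for "eugh_rulings".
BASE = {"vendor": "Google LLC", "vendor_country": "US", "rulings": ["C-311/18"]}

# Keys written after **BASE override the inherited values.
entry = {**BASE, "vendor": "Google LLC (DoubleClick)", "typical_lifetime": "13 Monate"}

print(entry["vendor"])           # overridden value
print(entry["vendor_country"])   # inherited from BASE
# The unpacking is shallow: nested lists are shared, not copied.
print(entry["rulings"] is BASE["rulings"])
```

Because nested lists are shared, appending to one entry's list would change every entry built from the same base. Treating the base lists as read-only avoids that.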
KB: dict[str, CookieKnowledge] = {
# ─── Google Analytics ─────────────────────────────────────────────
"_ga": {
**_GOOGLE_BASE,
"exact_purpose": "Unterscheidet eindeutig Besucher; persistiert die "
"ueber alle Sessions hinweg gueltige Client-ID.",
"data_collected": ["client_id", "first_visit_timestamp", "ip_address"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [8, 10],
"iab_vendor_id": 755,
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
"eu_alternative_cookies": ["_pk_id"],
"eu_alternative_vendor": "Matomo",
"notes": "Mit IP-Anonymisierung (`anonymizeIp: true`) + Server-Side-Tagging "
"DSGVO-konfigurierbar. Ohne diese Massnahmen einwilligungspflichtig.",
},
"_gid": {
**_GOOGLE_BASE,
"exact_purpose": "Unterscheidet Besucher innerhalb einer Session "
"(24h-Bucket).",
"data_collected": ["session_id", "ip_address"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [8],
"iab_vendor_id": 755,
"typical_lifetime": "24 Stunden",
"reid_risk": "medium",
"technical_necessity": "none",
"eu_alternative_cookies": ["_pk_ses"],
"eu_alternative_vendor": "Matomo",
},
"_gat": {
**_GOOGLE_BASE,
"exact_purpose": "Throttling-Cookie — begrenzt Anzahl Requests an "
"Google Analytics pro Sekunde.",
"data_collected": ["throttle_flag"],
"ip_relevant": False, "ip_anonymized": True,
"tcf_purpose_ids": [],
"iab_vendor_id": 755,
"typical_lifetime": "1 Minute",
"reid_risk": "low",
"technical_necessity": "none",
"notes": "Reines Performance-Steuerungs-Cookie. Trotzdem einwilligungspflichtig "
"da er Teil des GA-Trackings ist.",
},
"_gat_gtag_UA_": {
**_GOOGLE_BASE,
"exact_purpose": "GTM-spezifisches Throttling — pro Universal-Analytics-Property.",
"data_collected": ["throttle_flag"],
"ip_relevant": False,
"typical_lifetime": "1 Minute",
"reid_risk": "low",
"technical_necessity": "none",
"notes": "Suffix nach `UA_` ist die Property-ID. Pattern-Match noetig.",
},
"_ga_*": {
**_GOOGLE_BASE,
"exact_purpose": "GA4-Persistierung — eindeutige Stream-ID + Session-Daten.",
"data_collected": ["stream_id", "session_count", "session_start_ts"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [8, 10],
"iab_vendor_id": 755,
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
"notes": "GA4-Format. Suffix `_<measurement_id>`. Server-Side-GTM "
"ist die einzige praktikable DSGVO-Mitigation.",
},
"NID": {
**_GOOGLE_BASE,
"exact_purpose": "Personalisiert Google-Suche, AdSense, Maps; "
"speichert Praeferenzen + Sicherheits-Token.",
"data_collected": ["user_pref_id", "session_id", "security_token"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 755,
"typical_lifetime": "6 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"notes": "Klassischer Google-Werbe-Cookie. Hohe Re-ID-Gefahr da Cross-Site.",
},
"IDE": {
"vendor": "Google LLC (DoubleClick)", "vendor_country": "US",
"exact_purpose": "Conversion-Tracking + Werbe-Targeting bei "
"Google Display Network / DoubleClick.",
"data_collected": ["doubleclick_id", "ad_interactions"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 755,
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": _GOOGLE_BASE["schrems_ii_status"],
"eugh_rulings": _GOOGLE_BASE["eugh_rulings"],
},
"test_cookie": {
**_GOOGLE_BASE,
"exact_purpose": "DoubleClick-Probe-Cookie — testet ob Browser Cookies akzeptiert.",
"data_collected": ["browser_supports_cookies"],
"ip_relevant": False,
"typical_lifetime": "15 Minuten",
"reid_risk": "low",
"technical_necessity": "none",
},
# ─── Meta / Facebook ──────────────────────────────────────────────
"_fbp": {
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
"exact_purpose": "First-Party-Pixel von Facebook/Meta — identifiziert "
"den Browser fuer Werbe-Conversion-Tracking + Ad-Retargeting.",
"data_collected": ["browser_id", "first_visit_ts"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 891,
"typical_lifetime": "90 Tage",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "Daten gehen an Meta US trotz IE-Sitz. "
"Aktuell DPF-abgedeckt, aber Schrems III erwartet.",
"eugh_rulings": [
"EuGH C-311/18 (Schrems II)",
"EDSA Aufsichts-Statement 2023 zu Meta-Pixel",
"LDA Bayern Pruefverfuegung 2024",
],
"eu_alternative_vendor": "Smart AdServer (Equativ, FR)",
"notes": "Conversions API (CAPI) als Server-Side-Variante reduziert Cookie-"
"Abhaengigkeit. Trotzdem werden Daten an Meta gesendet.",
},
"_fbc": {
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
"exact_purpose": "Speichert die ClickID einer Meta-Ad-Kampagne; "
"ordnet Conversion dem urspruenglichen Ad-Klick zu.",
"data_collected": ["fbclid", "ad_campaign_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9],
"iab_vendor_id": 891,
"typical_lifetime": "90 Tage",
"reid_risk": "high",
"technical_necessity": "none",
},
"fr": {
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
"exact_purpose": "Werbe-Targeting + Cross-Site-Tracking auf "
"Facebook-Plattform.",
"data_collected": ["encrypted_user_id", "session_data"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 891,
"typical_lifetime": "3 Monate",
"reid_risk": "high",
"technical_necessity": "none",
},
# ─── Adobe ────────────────────────────────────────────────────────
"s_cc": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Cookie-Test — prueft ob der Browser Cookies "
"akzeptiert (Adobe Analytics Bootstrap).",
"data_collected": ["browser_supports_cookies"],
"ip_relevant": False,
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "partial",
"schrems_ii_status": "Adobe-IE-Hosting, jedoch US-Datentransfer fuer "
"Cloud-Services. DPF-abgedeckt.",
},
"s_sq": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Speichert den letzten Klick (URL + Position) "
"fuer Click-Map-Reports.",
"data_collected": ["last_click_url", "last_click_xy"],
"ip_relevant": False,
"tcf_purpose_ids": [8],
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "none",
},
"AMCV_": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Adobe Marketing Cloud Visitor ID — Cross-Tool-ID fuer "
"Analytics + Target + Audience Manager.",
"data_collected": ["marketing_cloud_visitor_id", "first_visit_ts"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 8, 9, 10],
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
"notes": "Suffix nach `AMCV_` ist die Org-ID. Klassischer Cross-Tool-Track.",
},
"mbox": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Adobe Target — A/B-Testing, Personalisierung, "
"Audience-Targeting.",
"data_collected": ["mbox_visitor_id", "experiment_assignments"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
},
"s_target_qa": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Adobe Target Premium-Feature — QA-Modus fuer Personalisations-Tests.",
"data_collected": ["target_qa_session"],
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "none",
"notes": "Premium-Feature-Cookie. Anwesenheit = Enterprise-Lizenz Indikator.",
},
# ─── Microsoft / Bing ─────────────────────────────────────────────
"MUID": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Microsoft-User-ID fuer Bing, Microsoft Advertising, "
"Clarity Heatmaps.",
"data_collected": ["microsoft_user_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 8, 9, 10],
"iab_vendor_id": 165,
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "US-Transfer, DPF-zertifiziert.",
},
"_uetsid": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Universal Event Tracking — Session-Cookie fuer "
"Microsoft Advertising Conversion-Tracking.",
"data_collected": ["session_id"],
"ip_relevant": True,
"tcf_purpose_ids": [9],
"typical_lifetime": "30 Minuten",
"reid_risk": "medium",
"technical_necessity": "none",
},
"_uetvid": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Universal Event Tracking — persistente Visitor-ID.",
"data_collected": ["visitor_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9],
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
},
# ─── LinkedIn ─────────────────────────────────────────────────────
"bcookie": {
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
"exact_purpose": "Browser-Identifikation; wird genutzt fuer Anmelde-"
"Vorgang + LinkedIn Insight-Tag-Tracking.",
"data_collected": ["browser_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 8, 9],
"iab_vendor_id": 14,
"typical_lifetime": "1 Jahr",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "US-Datenuebermittlung (Mutter Microsoft).",
},
"lidc": {
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
"exact_purpose": "Daten-Routing zwischen LinkedIn-Datacentern.",
"data_collected": ["routing_id"],
"ip_relevant": True,
"typical_lifetime": "1 Tag",
"reid_risk": "low",
"technical_necessity": "partial",
},
"li_gc": {
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
"exact_purpose": "Speichert die Cookie-Einwilligung fuer LinkedIn-Embeds.",
"data_collected": ["consent_state"],
"ip_relevant": False,
"typical_lifetime": "6 Monate",
"reid_risk": "low",
"technical_necessity": "full",
},
# ─── Matomo (EU-Alternative) ──────────────────────────────────────
"_pk_id": {
"vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
"exact_purpose": "Matomo Visitor-ID — Pendant zu _ga, aber DSGVO-konform "
"wenn IP-Anonymisierung aktiv.",
"data_collected": ["visitor_id", "first_visit_ts"],
"ip_relevant": True, "ip_anonymized": True,
"tcf_purpose_ids": [8],
"typical_lifetime": "13 Monate",
"reid_risk": "low", # bei aktivierter Anonymisierung
"technical_necessity": "none",
"schrems_ii_status": "Bei On-Premise-Deployment KEIN Drittlandtransfer. "
"Matomo Cloud EU mit Frankfurt-Hosting verfuegbar.",
"notes": "Empfohlener Drop-in-Ersatz fuer Google Analytics.",
},
"_pk_ses": {
"vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
"exact_purpose": "Matomo Session-Cookie.",
"data_collected": ["session_id"],
"ip_relevant": False,
"typical_lifetime": "30 Minuten",
"reid_risk": "low",
"technical_necessity": "none",
},
# ─── Captcha ──────────────────────────────────────────────────────
"hcaptcha": {
"vendor": "Intuition Machines Inc. (hCaptcha)", "vendor_country": "US",
"exact_purpose": "hCaptcha Bot-Erkennung — Session-Token fuer Challenge-Solver.",
"data_collected": ["bot_score", "session_id", "ip_address"],
"ip_relevant": True,
"typical_lifetime": "Session",
"reid_risk": "medium",
"technical_necessity": "full",
"schrems_ii_status": "US-Transfer (Intuition Machines, San Francisco).",
"eu_alternative_vendor": "Friendly Captcha (DE), Turnstile (Cloudflare EU)",
"notes": "Technisch erforderlich fuer Bot-Schutz, aber EU-Alternativen "
"ohne Drittland-Risiko verfuegbar.",
},
"cf_clearance": {
"vendor": "Cloudflare Inc.", "vendor_country": "US",
"exact_purpose": "Cloudflare-Bot-Management Pro — bestaetigt dass User "
"die JS-Challenge bestanden hat.",
"data_collected": ["challenge_token"],
"ip_relevant": True,
"typical_lifetime": "30 Minuten",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Premium-Feature-Cookie. Anwesenheit = Cloudflare Bot-Management "
"Pro im Einsatz.",
},
# ─── CDN / Performance ────────────────────────────────────────────
"__cf_bm": {
"vendor": "Cloudflare Inc.", "vendor_country": "US",
"exact_purpose": "Cloudflare Bot Management — basis Bot-Erkennung.",
"data_collected": ["bot_score", "client_hash"],
"ip_relevant": True,
"typical_lifetime": "30 Minuten",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Strictly necessary nach §25(2) TDDDG (Sicherheit). Keine Einwilligung noetig.",
},
"aws-alb": {
"vendor": "Amazon Web Services Inc.", "vendor_country": "US",
"exact_purpose": "AWS Application Load Balancer Sticky Sessions — "
"routet Anfragen konsistent an dieselbe Backend-Instanz.",
"data_collected": ["target_instance_id"],
"ip_relevant": False,
"typical_lifetime": "1 Stunde",
"reid_risk": "low",
"technical_necessity": "full",
"schrems_ii_status": "AWS Frankfurt verfuegbar — bei korrektem Region-Setup "
"kein US-Transfer.",
},
# ─── Retargeting / Advertising ────────────────────────────────────
"_pin_unauth": {
"vendor": "Pinterest Europe Ltd.", "vendor_country": "IE/US",
"exact_purpose": "Pinterest Tag — Conversion-Tracking + Audience-Aufbau.",
"data_collected": ["pinterest_user_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 762,
"typical_lifetime": "1 Jahr",
"reid_risk": "high",
"technical_necessity": "none",
},
"cto_dna": {
"vendor": "Criteo S.A.", "vendor_country": "FR",
"exact_purpose": "Criteo Dynamic Retargeting: produktspezifische "
"Werbeauslieferung basierend auf Browser-History.",
"data_collected": ["criteo_user_id", "product_views"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 91,
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "Criteo ist FR-basiert, aber Daten gehen auch in die USA. "
"Multi-Region-Setup pruefen.",
"notes": "Trotz FR-Sitz nicht automatisch DSGVO-konform: CNIL-Bussgeld "
"EUR 40M (Juni 2023) wegen mangelnder Einwilligungs-Granularitaet.",
},
"afm": {
"vendor": "Adform A/S", "vendor_country": "DK",
"exact_purpose": "Adform Audience Matching — Cross-Device-Identifikation "
"fuer programmatische Werbung.",
"data_collected": ["adform_user_id", "device_signals"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 50,
"typical_lifetime": "30 Tage",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "Adform ist DK-basiert, EU-Hosting Standard. Keine "
"Schrems-II-Probleme bei Standard-Setup.",
},
# ─── Consent / Funktional (Strictly Necessary) ────────────────────
"JSESSIONID": {
"vendor": "Java EE / Tomcat (Site-Software)", "vendor_country": "N/A",
"exact_purpose": "Server-Session-ID fuer Java-basierte Anwendungen.",
"data_collected": ["session_id"],
"ip_relevant": False,
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Strictly necessary. Keine Einwilligung nach §25(2) TDDDG.",
},
"PHPSESSID": {
"vendor": "PHP (Site-Software)", "vendor_country": "N/A",
"exact_purpose": "PHP-Session-ID — serverseitige Sitzungs-Persistierung.",
"data_collected": ["session_id"],
"ip_relevant": False,
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "full",
},
"cookie_consent": {
"vendor": "BreakPilot Consent-Banner (own)", "vendor_country": "DE",
"exact_purpose": "Speichert die Einwilligungsentscheidung des Nutzers "
"pro Kategorie.",
"data_collected": ["consent_state_per_category", "timestamp"],
"ip_relevant": False,
"typical_lifetime": "180 Tage",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Strictly necessary nach Art. 7(1) DSGVO Nachweispflicht.",
},
# ─── Templated / pattern-based entries (variable suffix) ──────────
# These are caught via the regex lookup; the entry serves as a fallback.
"_uet_": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Microsoft Universal Event Tracking — Suffix nach `_uet_`.",
"data_collected": ["event_id"],
"ip_relevant": True,
"typical_lifetime": "30 Minuten",
"reid_risk": "medium",
"technical_necessity": "none",
},
}
# ─── Pattern lookup for cookies with a variable suffix ───────────────
_PATTERN_LOOKUPS: list[tuple[str, str]] = [
(r"^_ga_[A-Z0-9_]+$", "_ga_*"),
(r"^_gat_gtag_UA_", "_gat_gtag_UA_"),
(r"^AMCV_", "AMCV_"),
(r"^_uet[a-z]+", "_uet_"),
(r"^aws-alb", "aws-alb"),
(r"^_pk_id\.", "_pk_id"),
(r"^_pk_ses\.", "_pk_ses"),
]
def lookup_cookie(name: str) -> CookieKnowledge | None:
    """Return rich knowledge for a cookie name, or None if unknown."""
    import re
    if not name:
        return None
    # Direct hit
    if name in KB:
        return KB[name]
    # Pattern-based
    for pattern, kb_key in _PATTERN_LOOKUPS:
        if re.search(pattern, name):
            return KB.get(kb_key)
    # Strip common suffixes (.bmw.de, .domain etc.)
    base = name.split(".", 1)[0]
    if base != name and base in KB:
        return KB[base]
    return None
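# Illustration only (not part of the commit): a minimal sketch of the three
# lookup stages in lookup_cookie (exact key, regex pattern, dot-stripped base),
# run against a hypothetical two-entry table. DEMO_KB, DEMO_PATTERNS and
# demo_lookup are made-up names for this example.

```python
import re
from typing import Optional

# Hypothetical stand-in for the real KB; only two entries.
DEMO_KB = {
    "PHPSESSID": {"vendor": "PHP (Site-Software)", "reid_risk": "low"},
    "_uet_": {"vendor": "Microsoft Corp.", "reid_risk": "medium"},
}
# (regex, KB key) pairs, same shape as _PATTERN_LOOKUPS.
DEMO_PATTERNS = [(r"^_uet[a-z]+", "_uet_")]


def demo_lookup(name: str) -> Optional[dict]:
    """Resolve a cookie name: exact key, then pattern, then dot-stripped base."""
    if not name:
        return None
    if name in DEMO_KB:
        return DEMO_KB[name]
    for pattern, kb_key in DEMO_PATTERNS:
        if re.search(pattern, name):
            return DEMO_KB.get(kb_key)
    base = name.split(".", 1)[0]
    if base != name and base in DEMO_KB:
        return DEMO_KB[base]
    return None


exact = demo_lookup("PHPSESSID")              # stage 1: direct hit
by_pattern = demo_lookup("_uetsid")           # stage 2: variable suffix
by_base = demo_lookup("PHPSESSID.bmw.de")     # stage 3: domain suffix stripped
unknown = demo_lookup("some_unknown_cookie")  # no stage matches
```

# The order matters: a direct hit always wins over a pattern, so a curated
# entry for one concrete cookie name can override the templated fallback.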
def enrich_vendor_with_knowledge(vendor: dict) -> dict:
    """Add per-cookie knowledge to each cookie in vendor['cookies']."""
    cookies = vendor.get("cookies") or []
    enriched = []
    for c in cookies:
        info = lookup_cookie(c.get("name", ""))
        if info:
            enriched.append({**c, "knowledge": info})
        else:
            enriched.append(c)
    return {**vendor, "cookies": enriched}
# ─── Helpers for aggregate reports ──────────────────────────────────
def summarize_compliance_risk(vendor: dict) -> dict:
    """Aggregate re-identification risk and third-country status across all cookies of a vendor."""
    cookies = vendor.get("cookies") or []
    risk_counts = {"high": 0, "medium": 0, "low": 0}
    schrems_affected = 0
    technical_only = 0
    for c in cookies:
        k = c.get("knowledge") or lookup_cookie(c.get("name", ""))
        if not k:
            continue
        risk = k.get("reid_risk", "low")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1
        # Compare whole ISO codes ("US", "IE/US") so that a value such as
        # "Austria" is not counted as a US transfer through a substring hit.
        country_codes = {p.strip().upper() for p in k.get("vendor_country", "").split("/")}
        if "US" in country_codes or "schrems" in k.get("schrems_ii_status", "").lower():
            schrems_affected += 1
        if k.get("technical_necessity") == "full":
            technical_only += 1
    return {
        "reid_risk_distribution": risk_counts,
        "high_risk_cookie_count": risk_counts["high"],
        "schrems_ii_affected_cookies": schrems_affected,
        "strictly_necessary_cookies": technical_only,
        "total_classified": sum(risk_counts.values()),
    }
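# Illustration only (not part of the commit): a small sketch of the aggregation
# idea in summarize_compliance_risk, applied to three hypothetical cookies.
# demo_summary and demo_cookies are made-up names; matching the country as a
# whole ISO code is an assumption of this sketch.

```python
def demo_summary(cookies: list) -> dict:
    """Count classified cookies per risk tier, US transfers and necessary cookies."""
    risk_counts = {"high": 0, "medium": 0, "low": 0}
    us_transfer = 0
    strictly_necessary = 0
    for cookie in cookies:
        k = cookie.get("knowledge")
        if not k:
            continue  # unclassified cookies are skipped, not counted as "low"
        risk = k.get("reid_risk", "low")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1
        codes = {c.strip().upper() for c in k.get("vendor_country", "").split("/")}
        if "US" in codes:
            us_transfer += 1
        if k.get("technical_necessity") == "full":
            strictly_necessary += 1
    return {
        "reid_risk_distribution": risk_counts,
        "us_transfer_cookies": us_transfer,
        "strictly_necessary_cookies": strictly_necessary,
        "total_classified": sum(risk_counts.values()),
    }


demo_cookies = [
    {"name": "_uetsid", "knowledge": {
        "reid_risk": "medium", "vendor_country": "US", "technical_necessity": "none"}},
    {"name": "PHPSESSID", "knowledge": {
        "reid_risk": "low", "vendor_country": "N/A", "technical_necessity": "full"}},
    {"name": "mystery"},  # no knowledge attached
]
summary = demo_summary(demo_cookies)
```

# Note that total_classified can be smaller than the number of cookies: a
# report built on it should show the unclassified remainder separately.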
# ─── TEMPLATE_ENTRY (for curating new cookies) ─────────────────────
TEMPLATE_ENTRY: CookieKnowledge = {
    "vendor": "<Full company name>",
    "vendor_country": "<ISO-2 code, or 'IE/US' for a dual location>",
    "exact_purpose": "<1-2 sentences on what EXACTLY the cookie does>",
    "data_collected": ["<field_name_1>", "<field_name_2>"],
    "ip_relevant": False,
    "ip_anonymized": False,
    "tcf_purpose_ids": [],          # TCF v2.2: 1-11
    "iab_vendor_id": None,          # From https://iabeurope.eu/tcf-vendor-list/
    "typical_lifetime": "<Session | XX Tage | XX Monate | XX Jahre>",
    "reid_risk": "low",             # low | medium | high
    "technical_necessity": "none",  # none | partial | full
    "schrems_ii_status": "<Third-country transfer assessment>",
    "eugh_rulings": [],
    "eu_alternative_cookies": [],
    "eu_alternative_vendor": "",
    "notes": "",
}
@@ -220,10 +220,16 @@ def score_vendors(vendors: list[dict]) -> list[dict]:
            flags.append("no_purpose")
        # Country: only for external processors / controllers
        # If country is empty, infer it from the legal-form suffix in the name.
        if country_required:
            max_score += 10
            if v.get("country"):
                score += 10
            elif _country_from_name(v.get("name", "")):
                inferred = _country_from_name(v.get("name", ""))
                v["country"] = inferred
                v["country_inferred"] = True
                score += 10
            else:
                flags.append("no_country")
@@ -321,3 +327,153 @@ def build_check_items(validated: list[LinkCheck]) -> list[dict]:
"hint": hint,
})
return items
# ─── Country inference from the legal-form suffix ──────────────────
#
# When a vendor's "country" field is empty, it can often be inferred
# from the company suffix or name:
#   Adform A/S                 → DK (Denmark, Aktieselskab)
#   Pinterest Europe Ltd.      → IE (Ireland, Limited; via the known-vendor table)
#   Salesforce Inc.            → US (Incorporated)
#   Adobe ... Ireland Limited  → IE
#   Genesys ... B.V.           → NL (Netherlands, Besloten Vennootschap)
#   Equativ S.A.               → FR (Société Anonyme)
#   SAP SE                     → DE (Societas Europaea, mostly registered in DE)
#
# Combined strategy (applied most specific first, i.e. 3, then 2, then 1):
#   1) Suffix pattern
#   2) Country name in the company name ('Ireland', 'Deutschland')
#   3) Specific vendor (Google Inc / Meta Platforms Ireland Ltd → vendor-specific)
import re as _re
_SUFFIX_COUNTRY: list[tuple[str, str]] = [
    # Pattern (at the end of a word or before further tokens) → ISO code.
    # The first match wins, so specific forms must precede generic ones.
    # Suffixes that end in a dot use (?!\w): \b never matches between "."
    # and a space or the end of the string.
    (r"\bA/S\b", "DK"),                    # Aktieselskab
    (r"\bApS\b", "DK"),                    # Anpartsselskab
    (r"\bAB\b", "SE"),                     # Aktiebolag
    (r"\bAS\b(?!\w)", "NO"),               # Aksjeselskap
    (r"\bOy\b", "FI"),                     # Osakeyhtiö
    (r"\bAG\b(?!\w)", "DE"),               # CH/AT also possible, default DE
    (r"\bGmbH\b", "DE"),
    (r"\bUG\b", "DE"),
    (r"\beG\b", "DE"),
    (r"\bKG\b", "DE"),
    (r"\bOHG\b", "DE"),
    (r"\bSE\b", "DE"),                     # Societas Europaea; verify for SAP SE etc.
    (r"\bS\.A\.\sde C\.V\.(?!\w)", "MX"),  # must precede the plain S.A. pattern
    (r"\bS\.A\.(?!\w)", "FR"),             # France / SE / ES
    (r"\bSAS\b", "FR"),
    (r"\bS\.A\.S\.(?!\w)", "FR"),
    (r"\bSARL\b", "FR"),
    (r"\bS\.r\.l\.(?!\w)", "IT"),
    (r"\bS\.p\.A\.(?!\w)", "IT"),
    (r"\bSpA\b", "IT"),
    (r"\bB\.V\.(?!\w)", "NL"),
    (r"\bN\.V\.(?!\w)", "NL"),
    (r"\bSL\b", "ES"),
    (r"\bd\.o\.o\.(?!\w)", "SI"),          # Slovenia
    (r"\bd\.d\.(?!\w)", "HR"),             # Croatia
    (r"\bz\s?o\.o\.(?!\w)", "PL"),
    (r"\bInc\.?\b", "US"),
    (r"\bIncorporated\b", "US"),
    (r"\bCorp\.?\b", "US"),
    (r"\bCorporation\b", "US"),
    (r"\bLLC\b", "US"),
    (r"\bL\.L\.C\.(?!\w)", "US"),
    (r"\bPte\.?\sLtd\.?\b", "SG"),         # must precede the generic Ltd pattern
    (r"\bPty\b", "AU"),                    # "Pty Ltd": must precede Ltd as well
    (r"\bLtd\.?\b", "GB"),                 # UK Limited, default
    (r"\bLimited\b", "GB"),
    (r"\bPLC\b", "GB"),
    (r"\bK\.K\.(?!\w)", "JP"),             # Kabushiki-Kaisha
]
# Country names inside the company name (e.g. "Adobe Systems Software Ireland Limited")
_COUNTRY_NAME_TOKENS: list[tuple[str, str]] = [
("ireland", "IE"),
("deutschland", "DE"),
("germany", "DE"),
("netherlands", "NL"),
("france", "FR"),
("united kingdom", "GB"),
("uk", "GB"),
("usa", "US"),
("united states", "US"),
("austria", "AT"),
("oesterreich", "AT"),
("schweiz", "CH"),
("switzerland", "CH"),
("luxembourg", "LU"),
("luxemburg", "LU"),
("denmark", "DK"),
("daenemark", "DK"),
("sweden", "SE"),
("schweden", "SE"),
("norway", "NO"),
("norwegen", "NO"),
("finland", "FI"),
("finnland", "FI"),
]
# Known vendors with an unambiguous registered seat (override)
_KNOWN_VENDOR_COUNTRY: dict[str, str] = {
"google inc": "US",
"google llc": "US",
"google ireland": "IE",
"meta platforms ireland": "IE",
"facebook ireland": "IE",
"amazon.com inc": "US",
"amazon web services": "US",
"amazon web services inc": "US",
"linkedin inc": "US",
"salesforce inc": "US",
"salesforce.com": "US",
"outbrain inc": "US",
"taboola inc": "US",
"pinterest europe ltd": "IE",
"intuition machines inc": "US",
"akamai technologies inc": "US",
"criteo s.a": "FR",
"criteo sa": "FR",
"adform a/s": "DK",
"speedcurve limited": "GB",
"longtail ad solutions": "US",
"genesys cloud services b.v": "NL",
"qualtrics": "US",
"teads sa": "FR",
"teads s.a": "FR",
"salesviewer gmbh": "DE",
"baqend gmbh": "DE",
"zenweshare sas": "FR",
"nayoki gmbh": "DE",
"psyma": "DE",
"matomo": "NZ", # InnoCraft is NZ-based, but Matomo can be hosted in the EU
"adobe systems software ireland": "IE",
"microsoft corporation": "US",
"microsoft corp": "US",
}
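def _country_from_name(vendor_name: str) -> str:
    """Best effort: derive the ISO-2 country code from the vendor name."""
    if not vendor_name:
        return ""
    # Vendor names often look like "<company> [dash] <tool>"; only the company
    # part is used. The separator is assumed to be an en or em dash and is
    # written as an escape; splitting on an empty string raises ValueError.
    firm = _re.split(r"[\u2013\u2014]", vendor_name, maxsplit=1)[0].strip()
    firm_l = firm.lower()
    # 1) Known vendor lookup (most specific)
    for k, v in _KNOWN_VENDOR_COUNTRY.items():
        if k in firm_l:
            return v
    # 2) Country name in the company name, matched as a whole word so that
    #    short tokens such as "uk" or "usa" do not fire inside other words
    for token, code in _COUNTRY_NAME_TOKENS:
        if _re.search(rf"\b{_re.escape(token)}\b", firm_l):
            return code
    # 3) Legal-form suffix
    for pattern, code in _SUFFIX_COUNTRY:
        if _re.search(pattern, firm):
            return code
    return ""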
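# Illustration only (not part of the commit): a compact sketch of the
# three-stage inference (known vendor, country word, legal-form suffix) with
# shortened, hypothetical tables. DEMO_KNOWN, DEMO_COUNTRY_TOKENS,
# DEMO_SUFFIXES and demo_country are made-up names; the whole-word token match
# and the dash separator are assumptions of this sketch.

```python
import re

DEMO_KNOWN = {"google ireland": "IE"}
DEMO_COUNTRY_TOKENS = [("ireland", "IE"), ("uk", "GB")]
DEMO_SUFFIXES = [
    (r"\bB\.V\.(?!\w)", "NL"),  # dotted suffix: lookahead instead of \b
    (r"\bGmbH\b", "DE"),
    (r"\bInc\.?(?!\w)", "US"),
]


def demo_country(vendor_name: str) -> str:
    """Return an ISO-2 code, or "" when nothing can be inferred."""
    # Keep only the company part in front of an en or em dash.
    firm = re.split(r"[\u2013\u2014]", vendor_name, maxsplit=1)[0].strip()
    firm_l = firm.lower()
    for key, code in DEMO_KNOWN.items():
        if key in firm_l:
            return code
    for token, code in DEMO_COUNTRY_TOKENS:
        # Whole-word match, so "uk" does not fire inside "Suzuki".
        if re.search(rf"\b{re.escape(token)}\b", firm_l):
            return code
    for pattern, code in DEMO_SUFFIXES:
        if re.search(pattern, firm):
            return code
    return ""


known = demo_country("Google Ireland Ltd. \u2014 Analytics")
by_token = demo_country("Adobe Systems Software Ireland Limited")
by_suffix = demo_country("Genesys Cloud Services B.V.")
no_false_uk = demo_country("Suzuki GmbH")
nothing = demo_country("Unknown Vendor")
```

# An inferred country is a guess: a "Ltd." can be Irish and an "AG" can be
# Swiss, which is why the scoring hunk above also sets country_inferred so
# that a report can mark such values for manual review.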
@@ -0,0 +1,350 @@
"""
Doc-Anchor-Locator: find the best-fitting insertion point in the existing
document for a finding.

Primary strategy: BGE-M3 embedding match between a per-finding anchor query
and all paragraphs of the doc. This is real semantic matching (BMW writes
"Verarbeiter" instead of "Auftragsverarbeiter"; a keyword match would miss
it, the embedding catches it).

Fallback: keyword match on the finding label that returns a static position
hint (used when no embedding service is reachable).

Output per anchor:
- anchor_phrase : excerpt of the original text (None for fallback results)
- position_hint : "Nach Absatz X von Y: '...'"
- confidence : 'high' | 'medium' | 'low'
- score : float (cosine similarity; absent for the keyword fallback)
- method : 'embedding' | 'embedding-no-match' | 'fallback'
"""
from __future__ import annotations
import logging
import math
import os
import re
import threading
from typing import Iterable
import httpx
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
# One semantically rich anchor query per finding type, which the embedding
# matcher runs against the doc text. Richer than the short MC check_question.
# It does NOT look for the mandatory text itself (that is what is missing) but
# for the paragraph the fix should go INTO, i.e. the thematically related context.
_ANCHOR_QUERIES: list[tuple[str, str, str]] = [
# (finding_label_partial, anchor_query, fallback_hint)
(
"Auftragsverarbeiter erwaehnt",
"Empfaenger der Daten Verarbeiter Dienstleister Cloud Hosting CRM "
"Auftragsverarbeitung Weitergabe Datenuebermittlung an Dritte",
"Im Abschnitt 'Empfaenger' oder 'Datenuebermittlung'",
),
(
"Automatisierte Entscheidungen",
"Betroffenenrechte automatisierte Entscheidung Profiling Logik "
"Tragweite Auswirkung Art. 22 DSGVO",
"Am Ende des Abschnitts 'Betroffenenrechte'",
),
(
"Konkrete Aufsichtsbehoerde",
"Beschwerderecht Datenschutzaufsicht Aufsichtsbehoerde Beschwerde "
"bei der Behoerde einreichen Recht auf Beschwerde",
"Im Abschnitt 'Beschwerderecht'",
),
(
"Angemessenheitsbeschluss",
"Drittlandtransfer USA Standardvertragsklauseln SCC DPF Data Privacy "
"Framework Angemessenheitsbeschluss internationale Datenuebermittlung",
"Im Abschnitt 'Drittlandtransfer'",
),
(
"Anschrift des Verantwortlichen",
"Verantwortlicher Verantwortliche Stelle Datenschutz Betreiber dieser "
"Website Firma Anschrift Kontakt",
"Am Anfang der Datenschutzerklaerung / Cookie-Richtlinie",
),
(
"Konkrete Cookie-Namen",
"Welche Cookies verwenden wir Cookie-Tabelle Liste der Cookies "
"Cookie-Kategorien Auflistung der Cookies Name Anbieter Zweck",
"Im Abschnitt 'Welche Cookies verwenden wir?'",
),
(
"Konkrete Anbieter/Dienste",
"Drittanbieter Dienste Anbieter wir nutzen folgende Dienste "
"Empfaenger der Cookie-Daten Liste der Dienstleister",
"In der Drittanbieter-Liste der Cookie-Richtlinie",
),
(
"Analytics-/Statistik-Tools konkret benannt",
"Statistik Analytics Reichweitenmessung Webanalyse Tracking "
"Google Analytics Matomo Adobe Analytics",
"Im Abschnitt 'Statistik / Analyse-Cookies'",
),
(
"Konkrete Speicherdauer",
"Speicherdauer Lebensdauer wie lange Ablauf Cookie-Tabelle Spalte "
"Speicherdauer pro Cookie",
"In der Cookie-Tabelle pro Eintrag",
),
(
"Opt-Out-Links",
"Widerruf widersprechen deaktivieren Cookie-Einstellungen aendern "
"Opt-Out Einstellungen anpassen",
"Im Abschnitt 'Wie kann ich widersprechen?'",
),
(
"Privacy-Policy-Links",
"Datenschutzerklaerung des Drittanbieters Privacy Policy Link auf "
"Datenschutzhinweise der Drittanbieter",
"Im Drittanbieter-Listing der Cookie-Richtlinie",
),
(
"Verbraucherstreitbeilegung",
"Online-Streitbeilegung OS-Plattform Verbraucherschlichtungsstelle "
"Streitbeilegung Verbraucher",
"Am Ende des Impressums (eigener Abschnitt 'Streitbeilegung')",
),
(
"Rechtswidriger Haftungsausschluss",
"Haftung Disclaimer wir distanzieren uns Links zu externen Webseiten "
"Haftungsausschluss Drittinhalte",
"Am Ende des Impressums (Disclaimer-Absatz)",
),
(
"Name der vertretungsberechtigten",
"Vertreten durch Vorstand Geschaeftsfuehrung Geschaeftsfuehrer "
"vertretungsberechtigt Repraesentant",
"Im Impressum nach Firmenname + Anschrift",
),
(
"Zustaendige Kammer",
"Berufsrecht Kammer Aufsichtsbehoerde berufsrechtliche Angaben "
"zustaendige Kammer",
"Im Impressum im Abschnitt 'Berufsrechtliche Angaben'",
),
(
"Drittlaender",
"Drittland Drittlaender USA Indien China internationale Datenuebermittlung "
"Datenexport in Nicht-EU-Staaten",
"Im Abschnitt 'Drittlandtransfer'",
),
(
"Schutzgarantien",
"Schutzgarantien Schutzvorkehrungen Schutzmassnahmen SCC "
"Standardvertragsklauseln einsehen Anforderung",
"Im Abschnitt 'Drittlandtransfer / Schutzgarantien'",
),
]
# ─── Thread-local cache for doc chunks + embeddings ────────────────
# Per compliance-check run the doc texts are embedded once and kept in
# thread-local storage, so that multiple findings in the same run do not
# each trigger a new embedding call.
_tls = threading.local()


def _get_cache() -> dict:
    if not hasattr(_tls, "cache"):
        _tls.cache = {}
    return _tls.cache


def reset_cache() -> None:
    """Clear the per-request cache (should be called at the start of every
    doc-check run so that data from a previous run does not leak)."""
    if hasattr(_tls, "cache"):
        _tls.cache = {}
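# Illustration only (not part of the commit): a minimal sketch of why a
# threading.local cache isolates runs. demo_cache and worker are made-up names;
# the two-thread setup is an assumption chosen to make the effect visible.

```python
import threading

_demo_tls = threading.local()


def demo_cache() -> dict:
    """Return this thread's private cache, creating it on first use."""
    if not hasattr(_demo_tls, "cache"):
        _demo_tls.cache = {}
    return _demo_tls.cache


def worker(results: dict) -> None:
    # Runs in a second thread: it starts with an empty cache of its own.
    results["seen_in_worker"] = dict(demo_cache())
    demo_cache()["doc-b"] = "worker data"


demo_cache()["doc-a"] = "main data"
results: dict = {}
t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()
main_view = dict(demo_cache())  # unchanged by the worker thread
```

# Isolation between threads does not help when a server reuses one worker
# thread for many requests; that is the case reset_cache() is meant to cover.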
# ─── Helpers ───────────────────────────────────────────────────────
def _normalize(text: str) -> str:
    return (text or "").lower().replace("\xad", "").replace("ß", "ss")


def _split_paragraphs(text: str) -> list[str]:
    """Split a doc into paragraphs (by blank line; falls back to sentence
    boundaries when that yields fewer than three paragraphs)."""
    if not text:
        return []
    paras = re.split(r"\n\s*\n", text)
    if len(paras) < 3:
        paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
    return [p.strip() for p in paras if p.strip()]
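# Illustration only (not part of the commit): the two splitting modes of
# _split_paragraphs on hypothetical inputs. demo_split repeats the same two
# regular expressions under a made-up name so the block runs on its own.

```python
import re


def demo_split(text: str) -> list:
    """Blank-line split first; sentence split when fewer than 3 parts result."""
    if not text:
        return []
    paras = re.split(r"\n\s*\n", text)
    if len(paras) < 3:
        # Split after . ? ! when the next sentence starts with a capital letter.
        paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
    return [p.strip() for p in paras if p.strip()]


# Three blank-line separated blocks: the paragraph structure is kept.
structured = demo_split("Impressum\n\nAngaben zum Anbieter\n\nKontakt")
# One flat block: the fallback cuts it into sentences.
flat = demo_split("Wir nutzen Cookies. Details stehen unten! Fragen? Schreiben Sie uns.")
```

# The sentence fallback matters for text scraped from the DOM, which often
# arrives as one long line; without it the whole doc would be a single anchor.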
def _embed_sync(texts: list[str], timeout: float = 60.0,
                batch_size: int = 32) -> list[list[float]]:
    """Synchronous batched embed call (anchor localisation runs inside the
    sync HTML render, not in an async context).

    Always returns exactly one vector per input text; failed or missing
    embeddings are represented by an empty list."""
    if not texts:
        return []
    out: list[list[float]] = []
    with httpx.Client(timeout=timeout) as client:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                r = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                r.raise_for_status()
                vecs = list(r.json().get("embeddings") or [])
            except Exception as e:
                logger.warning("Anchor embed sub-batch [%d-%d] failed: %s",
                               i, i + len(batch), e)
                vecs = []
            # Pad or trim to the batch size so vectors stay aligned with texts
            # even when the service returns too few or too many embeddings.
            vecs = vecs[:len(batch)] + [[] for _ in range(len(batch) - len(vecs))]
            out.extend(vecs)
    return out
def _cosine(a: list[float], b: list[float]) -> float:
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
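# Illustration only (not part of the commit): a few worked values for the
# cosine similarity used above, on tiny 2-dimensional vectors instead of the
# 1024-dimensional BGE-M3 output. demo_cosine is a made-up name.

```python
import math


def demo_cosine(a: list, b: list) -> float:
    """Cosine similarity; 0.0 for empty, mismatched or zero-length vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


same_direction = demo_cosine([2.0, 0.0], [5.0, 0.0])  # length is ignored
orthogonal = demo_cosine([1.0, 0.0], [0.0, 1.0])
diagonal = demo_cosine([1.0, 1.0], [1.0, 0.0])        # 45 degrees apart
failed_embedding = demo_cosine([], [1.0, 0.0])        # empty vector from a failed call
```

# Returning 0.0 for an empty vector is what lets a failed sub-batch degrade
# gracefully: such paragraphs can never become the best match.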
def _doc_paragraphs_and_vectors(
    doc_id: str, doc_text: str,
) -> tuple[list[str], list[list[float]]]:
    """Lazy cache: paragraphs + vectors per doc ID. Computed exactly once
    per doc and run."""
    cache = _get_cache()
    if doc_id in cache:
        return cache[doc_id]
    paras = _split_paragraphs(doc_text)
    if not paras:
        cache[doc_id] = ([], [])
        return cache[doc_id]
    vecs = _embed_sync(paras)
    cache[doc_id] = (paras, vecs)
    return cache[doc_id]
def _keyword_fallback(fl: str, doc_text: str) -> dict | None:
    """Fallback when the embedding service is down or the match quality is too low."""
    # Reuse the _ANCHOR_QUERIES list, but only for its static fallback hint.
    for label_partial, _query, fallback_hint in _ANCHOR_QUERIES:
        if _normalize(label_partial) in fl:
            return {
                "anchor_phrase": None,
                "position_hint": fallback_hint,
                "confidence": "low",
                "method": "fallback",
            }
    return None
def locate_anchor(
    finding_label: str,
    doc_text: str,
    doc_id: str | None = None,
) -> dict | None:
    """Find the best-fitting anchor paragraph in the doc text for a finding.

    Primary: embedding match, with a small bonus for heading-like
    paragraphs. Fallback: the static per-finding position hint when the
    embedding service is unreachable.

    `doc_id` is a cache key (e.g. doc_type or url). If None, it is derived
    from a hash of doc_text.
    """
    if not doc_text or not finding_label:
        return None
    fl = _normalize(finding_label)
    # Which anchor query matches this finding?
    query = None
    fallback_hint = None
    matched_label = None
    for label_partial, q, fb in _ANCHOR_QUERIES:
        if _normalize(label_partial) in fl:
            query, fallback_hint, matched_label = q, fb, label_partial
            break
    if not query:
        return None
    doc_id = doc_id or f"doc-{hash(doc_text) & 0xffffffff:08x}"
    # 1) Embedding match
    paras, doc_vecs = _doc_paragraphs_and_vectors(doc_id, doc_text)
    if not paras:
        return None
    embeddings_available = any(v for v in doc_vecs)
    if not embeddings_available:
        return _keyword_fallback(fl, doc_text)
    try:
        q_vec = _embed_sync([query])[0] if query else None
    except Exception:
        q_vec = None
    if not q_vec:
        return _keyword_fallback(fl, doc_text)
    # Per-paragraph score = cosine + heading bonus
    best_idx = -1
    best_score = 0.0
    for i, (p, dv) in enumerate(zip(paras, doc_vecs)):
        if not dv:
            continue
        sim = _cosine(q_vec, dv)
        # Heading bonus: short paragraphs + Markdown heading markers
        if len(p.split()) <= 8 or p.strip().startswith("#"):
            sim += 0.05
        if sim > best_score:
            best_score = sim
            best_idx = i
    # Confidence thresholds, calibrated on the BMW run
    if best_idx < 0 or best_score < 0.40:
        # Match too weak: use the static fallback hint
        return {
            "anchor_phrase": None,
            "position_hint": fallback_hint,
            "confidence": "low",
            "score": round(best_score, 3) if best_idx >= 0 else 0,
            "method": "embedding-no-match",
        }
    if best_score >= 0.62:
        confidence = "high"
    elif best_score >= 0.50:
        confidence = "medium"
    else:
        confidence = "low"
    anchor = paras[best_idx]
    words = anchor.split()
    # Mark a truncated excerpt with an ellipsis.
    snippet = " ".join(words[:30]) + ("…" if len(words) > 30 else "")
    return {
        "anchor_phrase": snippet,
        "anchor_index": best_idx,
        "total_paragraphs": len(paras),
        "position_hint": f"Nach Absatz {best_idx + 1} von {len(paras)}: '{snippet}'",
        "confidence": confidence,
        "score": round(best_score, 3),
        "method": "embedding",
    }
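# Illustration only (not part of the commit): the scoring rule of
# locate_anchor in isolation, i.e. the 0.05 heading bonus followed by the
# 0.40 / 0.50 / 0.62 tiers. demo_paragraph_score and demo_confidence are
# made-up names; the thresholds are the ones from the code above.

```python
def demo_paragraph_score(similarity: float, paragraph: str) -> float:
    """Add the heading bonus for short or '#'-prefixed paragraphs."""
    heading_like = len(paragraph.split()) <= 8 or paragraph.strip().startswith("#")
    return similarity + 0.05 if heading_like else similarity


def demo_confidence(score: float) -> str:
    """Map a final score to a tier; 'no-match' means: use the static hint."""
    if score < 0.40:
        return "no-match"
    if score >= 0.62:
        return "high"
    if score >= 0.50:
        return "medium"
    return "low"


# The same raw similarity lands in different tiers depending on the bonus.
heading_score = demo_paragraph_score(0.58, "## Drittlandtransfer")
body_score = demo_paragraph_score(0.58, " ".join(["wort"] * 20))
heading_tier = demo_confidence(heading_score)
body_tier = demo_confidence(body_score)
```

# Because the bonus is applied before the tiering, a heading can reach "high"
# with a raw similarity that would only be "medium" for a body paragraph.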
def annotate_findings_with_anchors(
    findings: Iterable[dict], doc_text: str, doc_id: str | None = None,
) -> list[dict]:
    """Look up the anchor for each finding and extend its dict with 'anchor'."""
    out = []
    for f in findings:
        a = locate_anchor(f.get("label") or f.get("title") or "", doc_text, doc_id)
        out.append({**f, "anchor": a})
    return out
@@ -0,0 +1,353 @@
"""
Action-Recipes: one actionable instruction per finding type:
WHAT to do, WHY (legal basis), HOW to phrase it (concrete text block),
WHERE to insert it (doc-section hint).

Without recipes a finding is only a diagnosis. With a recipe the
customer knows immediately which sentence to put in which place.

Usage:
    from compliance.services.finding_action_recipes import recipe_for
    rec = recipe_for("no_cookies_listed")  # → dict with what/why/fix_text/where/example
"""
from __future__ import annotations
from typing import TypedDict
class ActionRecipe(TypedDict, total=False):
    what: str       # one-sentence diagnosis
    why: str        # legal basis / risk
    fix_text: str   # concrete text block to insert
    where: str      # which doc section
    example: str    # real-world usage example
    severity: str   # 'critical' | 'high' | 'medium' | 'low'
# ─── Vendor / cookie findings (in the VVT block) ───────────────────
VENDOR_FINDINGS: dict[str, ActionRecipe] = {
"no_cookies_listed": {
"what": "Anbieter ist erfasst, aber es sind keine konkreten Cookies "
"dokumentiert.",
"why": "Die DSK-Orientierungshilfe Telemedien 2024 fordert pro Anbieter "
"eine Auflistung aller gesetzten Cookies mit Name + Zweck + "
"Speicherdauer. 'Verwendet Cookies' ohne Details erfuellt "
"Art. 13 Abs. 1 lit. e DSGVO nicht.",
"fix_text": "Ergaenzen Sie pro Cookie eine Zeile mit folgenden Feldern:\n"
" • Cookie-Name (z.B. _ga, _fbp, NID)\n"
" • Setzender Anbieter (Firma + Sitzland)\n"
" • Zweck (1 Satz, z.B. 'Reichweitenmessung pseudonym')\n"
" • Speicherdauer (z.B. '2 Jahre', '24h', 'Session')",
"where": "Cookie-Richtlinie unter dem Abschnitt 'Kategorien' "
"(Notwendig / Marketing / Statistik / ...).",
"example": "_ga (Google Ireland Ltd.) — Statistik — eindeutige "
"Besucher-ID — Speicherdauer 2 Jahre",
"severity": "high",
},
"no_country": {
"what": "Anbieter-Sitzland ist nicht dokumentiert.",
"why": "Art. 30 Abs. 1 lit. d DSGVO fordert die Kategorien der Empfaenger "
"inkl. Sitz. Bei Drittland-Empfaengern (US, IN, CN ...) ist das "
"zusaetzlich Pflicht nach Art. 13 Abs. 1 lit. f DSGVO.",
"fix_text": "Fuegen Sie hinter dem Anbieternamen das Sitzland in "
"Klammern ein. Bei Drittland (ausserhalb EU/EWR) zusaetzlich "
"den Transfer-Mechanismus (SCC, DPF, Angemessenheitsbeschluss).",
"where": "VVT-Eintrag bei 'Empfaenger' oder im Cookie-Anbieter-Listing.",
"example": "Google Ireland Ltd., Dublin, Irland (EU). Bei US-Transfer: "
"'Google LLC, Mountain View, US — DPF-zertifiziert'.",
"severity": "high",
},
"no_privacy_url": {
"what": "Kein direkter Link zur Datenschutzerklaerung des Anbieters.",
"why": "Art. 13 Abs. 2 lit. f DSGVO Transparenzgebot: der Nutzer muss "
"die Datenverarbeitung beim Drittanbieter eigenverantwortlich "
"nachvollziehen koennen.",
"fix_text": "Ergaenzen Sie einen klickbaren Link zur Datenschutzerklaerung "
"des Anbieters direkt neben dem Anbieternamen.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag. Idealerweise als "
"letzter Spalteneintrag oder Inline-Link.",
"example": "Google Analytics — Datenschutz: https://policies.google.com/privacy",
"severity": "medium",
},
"broken_privacy_url": {
"what": "Der angegebene Privacy-Link gibt einen Fehler zurueck "
"(404 / 403 / Timeout).",
"why": "Ein toter Link erfuellt Art. 13(2)(f) DSGVO nicht — die "
"Transparenz-Pflicht laeuft ins Leere.",
"fix_text": "1. Pruefen Sie den Link aus einem normalen Browser. "
"Funktioniert er dort? → Anbieter blockt automatisierte Pruefer (kein Mangel).\n"
"2. Funktioniert er nicht? → Aktualisieren Sie den Link in Ihrer "
"Cookie-Richtlinie auf die heute gueltige Privacy-URL des Anbieters.",
"where": "Cookie-Richtlinie / Drittanbieter-Liste.",
"example": "Wenn Adobe-Privacy-Link 404 gibt, ersetzen durch "
"https://www.adobe.com/privacy/policy.html",
"severity": "high",
},
"no_opt_out_url": {
"what": "Kein Opt-Out-Link fuer diesen Anbieter dokumentiert.",
"why": "Art. 7 Abs. 3 DSGVO: der Widerruf der Einwilligung muss so "
"einfach wie die Erteilung sein. Pro Drittanbieter muss eine "
"Opt-Out-Moeglichkeit angeboten werden.",
"fix_text": "Recherchieren Sie den offiziellen Opt-Out-Link des "
"Anbieters und ergaenzen Sie ihn. Falls Ihr Cookie-Banner "
"ein 'Einstellungen aendern' anbietet, ist das oft "
"ausreichend — der Link sollte trotzdem als Backup "
"dokumentiert sein.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
"example": "Google Analytics Opt-Out: https://tools.google.com/dlpage/gaoptout",
"severity": "high",
},
"broken_opt_out": {
"what": "Der angegebene Opt-Out-Link funktioniert nicht "
"(404 / 403 / Timeout).",
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit ohne funktionierenden "
"Link ist nicht gegeben.",
"fix_text": "1. Pruefen Sie ob der Link aus einem Browser funktioniert. "
"403 kommt oft von Bot-Schutz und ist kein realer Mangel.\n"
"2. Bei echtem 404: aktualisieren Sie auf den heute gueltigen "
"Opt-Out-Link.\n"
"3. Alternativ verlinken Sie auf Ihren eigenen Cookie-Banner "
"'Einstellungen aendern'-Trigger.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
"example": "Wenn Criteo-Opt-Out 403 gibt, ist das oft Cloudflare-Schutz — "
"Link aus dem Browser klickbar → kein Mangel. Alternativ: "
"https://www.youronlinechoices.com/de/",
"severity": "medium",
},
}
# ─── Doc-check findings (in the DSE / Impressum / cookie check block) ───
DOC_CHECK_FINDINGS: dict[str, ActionRecipe] = {
"Auftragsverarbeiter erwaehnt": {
"what": "Auftragsverarbeiter (Hosting, CRM, Newsletter) sind nicht "
"explizit erwaehnt + es fehlt Hinweis auf AVVs nach Art. 28 DSGVO.",
"why": "Art. 28 DSGVO Auftragsverarbeitung erfordert dokumentierte AVVs. "
"Art. 13(1)(e) DSGVO erfordert die Nennung der Empfaenger-"
"Kategorien. Fehlt der Hinweis = klassischer Pruefpunkt der "
"Aufsichtsbehoerden.",
"fix_text": "Wir setzen sorgfaeltig ausgewaehlte Auftragsverarbeiter "
"(z.B. fuer Hosting, Webanalyse, Customer Service) ein. Mit "
"allen Auftragsverarbeitern haben wir Vertraege zur "
"Auftragsverarbeitung nach Art. 28 DSGVO geschlossen. Die "
"Auftragsverarbeiter handeln ausschliesslich auf unsere "
"Weisung und sind vertraglich zu angemessenen technischen "
"und organisatorischen Massnahmen verpflichtet.",
"where": "Datenschutzerklaerung im Abschnitt 'Empfaenger' oder "
"'Datenuebermittlung'. Direkt nach der Aufzaehlung der "
"Empfaenger-Kategorien.",
"example": "Empfaenger: BMW Niederlassungen, Vertragshaendler, "
"Auftragsverarbeiter (Cloud-Hosting AWS, CRM Salesforce, "
"Webanalyse Adobe Analytics — mit allen sind AVVs nach "
"Art. 28 DSGVO geschlossen).",
"severity": "high",
},
"Automatisierte Entscheidungen / Profiling": {
"what": "Keine Aussage zu automatisierten Einzelentscheidungen "
"oder Profiling nach Art. 22 DSGVO.",
"why": "Art. 13 Abs. 2 lit. f DSGVO: bei automatisierten "
"Einzelentscheidungen muessen Logik, Tragweite und Auswirkungen "
"erklaert werden. Bei KEINEM Profiling muss das explizit "
"verneint werden — sonst lassen Aufsichtsbehoerden Unsicherheit "
"offen.",
"fix_text": "Variante A (kein Profiling):\n"
" 'Es findet keine automatisierte Entscheidungsfindung "
"im Sinne des Art. 22 DSGVO statt. Sofern wir Profiling "
"zur Anzeige personalisierter Inhalte einsetzen, erfolgt "
"dies ausschliesslich auf Basis Ihrer Einwilligung und "
"wird im Abschnitt [X] erlaeutert.'\n\n"
"Variante B (Profiling vorhanden, z.B. fuer Werbung):\n"
" 'Wir nutzen Profiling zur Anzeige personalisierter "
"Werbung. Die Logik basiert auf [Klick-Historie / "
"Besuchsverhalten / Praeferenzen]. Tragweite: "
"Anpassung der angezeigten Anzeigen. Auswirkung: keine "
"rechtlichen oder erheblichen Auswirkungen — Sie koennen "
"jederzeit widersprechen unter [Link/Kontakt].'",
"where": "Datenschutzerklaerung am Ende des Abschnitts "
"'Betroffenenrechte' oder als eigener Absatz unter "
"'Automatisierte Entscheidungen'.",
"example": "Standardformulierung Variante A — falls Sie KEIN Profiling "
"betreiben, ist das der sichere Default-Text.",
"severity": "high",
},
"Konkrete Aufsichtsbehoerde benannt": {
"what": "Aufsichtsbehoerde wird nicht namentlich genannt.",
"why": "Art. 13(2)(d) DSGVO: das Beschwerderecht muss konkret "
"kommuniziert werden — nicht 'die zustaendige Behoerde', sondern "
"Name + Anschrift + Website.",
"fix_text": "Sie haben das Recht, sich bei einer Datenschutz-"
"Aufsichtsbehoerde zu beschweren. Zustaendig ist:\n\n"
" [Aufsichtsbehoerde mit Namen + Strasse + PLZ + Ort + Web]\n\n"
"Beispiel: 'Bayerisches Landesamt fuer Datenschutzaufsicht "
"(BayLDA), Promenade 18, 91522 Ansbach, www.lda.bayern.de'",
"where": "Datenschutzerklaerung im Abschnitt 'Betroffenenrechte' oder "
"'Beschwerderecht'.",
"example": "Fuer BMW AG (Sitz Muenchen, Bayern): BayLDA, Promenade 18, "
"91522 Ansbach, www.lda.bayern.de",
"severity": "high",
},
"Angemessenheitsbeschluss der Kommission": {
"what": "Drittlandtransfer erwaehnt, aber kein Hinweis auf den "
"konkreten Angemessenheitsbeschluss / DPF / SCC.",
"why": "Art. 13(1)(f) DSGVO: bei Drittlandtransfer muss der "
"Transfer-Mechanismus benannt sein (Angemessenheitsbeschluss, "
"Standardvertragsklauseln, BCR, Ausnahmen nach Art. 49).",
"fix_text": "Fuer Datenuebermittlungen in die USA stuetzen wir uns auf "
"den Angemessenheitsbeschluss der EU-Kommission vom "
"10.07.2023 (EU-US Data Privacy Framework / DPF). Sofern "
"der Empfaenger DPF-zertifiziert ist, ist die Uebermittlung "
"rechtlich abgesichert. Bei nicht-zertifizierten Empfaengern "
"ergaenzen wir EU-Standardvertragsklauseln (SCC) gemaess "
"Durchfuehrungsbeschluss 2021/914.",
"where": "Datenschutzerklaerung im Abschnitt 'Drittlandtransfer' oder "
"'Internationale Datenuebermittlung'.",
"example": "Konkret: Google LLC (USA) ist DPF-zertifiziert "
"(Zertifikat einsehbar unter dataprivacyframework.gov).",
"severity": "high",
},
"Anschrift des Verantwortlichen": {
"what": "In der Cookie-Richtlinie fehlt die vollstaendige Anschrift.",
"why": "Art. 13(1)(a) DSGVO: der Verantwortliche muss eindeutig "
"identifizierbar sein. Cookie-Richtlinie + DSE muessen "
"konsistente Angaben enthalten.",
"fix_text": "Verantwortlich fuer die Datenverarbeitung im Sinne der "
"DSGVO ist:\n [Firmenname]\n [Strasse + Hausnummer]\n "
"[PLZ + Ort]\n [Land]\n E-Mail: [...]",
"where": "Cookie-Richtlinie am Anfang ODER Verweis-Link zur DSE.",
"example": "Bayerische Motoren Werke Aktiengesellschaft, Petuelring 130, "
"80809 Muenchen, Deutschland",
"severity": "high",
},
"Konkrete Cookie-Namen aufgelistet": {
"what": "Cookies werden allgemein erwaehnt aber ohne konkrete Namen + "
"Speicherdauer.",
"why": "DSK-Orientierungshilfe Telemedien: Auflistung jedes einzelnen "
"Cookies mit Name. Generische Aussagen ('wir nutzen "
"Werbe-Cookies') sind unzureichend.",
"fix_text": "Erweitern Sie die Cookie-Tabelle um die Spalten:\n"
" Name | Anbieter | Zweck | Speicherdauer\n\n"
"Browser-Devtools (Application > Cookies) zeigt die "
"tatsaechlich gesetzten Namen — bitte Cookie-Liste "
"regelmaessig synchronisieren.",
"where": "Cookie-Richtlinie im Abschnitt 'Verwendete Cookies'.",
"example": "_ga | Google Ireland | Statistik (Besucher-ID) | 2 Jahre\n"
"_fbp | Meta Platforms IE | Werbung (Browser-ID) | 90 Tage",
"severity": "high",
},
"Konkrete Speicherdauern pro Cookie": {
"what": "Speicherdauer nur pauschal oder als generischer Bereich.",
"why": "Art. 13(2)(a) DSGVO: konkrete Speicherdauer oder Kriterien "
"fuer die Festlegung. DSK fordert Speicherdauer PRO Cookie.",
"fix_text": "Pro Cookie eine konkrete Zeitangabe in der Cookie-Tabelle "
"ergaenzen: 'Session', '24 Stunden', '90 Tage', '2 Jahre'.",
"where": "Cookie-Richtlinie in der Cookie-Tabelle.",
"example": "JSESSIONID: Session — _ga: 2 Jahre — _fbp: 90 Tage",
"severity": "high",
},
"Opt-Out-Links pro Drittanbieter": {
"what": "Pro Drittanbieter ist kein direkter Opt-Out-Link angegeben.",
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit. EuGH Planet49 "
"(C-673/17) bestaetigt: Widerruf muss so einfach wie Einwilligung sein.",
"fix_text": "Pro Drittanbieter eine Spalte 'Opt-Out' ergaenzen mit "
"direktem Link. Alternativ: zentralen 'Cookie-"
"Einstellungen aendern'-Button im Footer der Webseite + "
"Hinweis darauf in der Cookie-Richtlinie.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag oder als zentraler "
"Abschnitt 'Wie kann ich widersprechen?'.",
"example": "Google Analytics: https://tools.google.com/dlpage/gaoptout\n"
"Meta Pixel: ueber Facebook-Konto-Einstellungen",
"severity": "high",
},
"Privacy-Policy-Links pro Drittanbieter": {
"what": "Pro Drittanbieter ist kein direkter Datenschutz-Link gesetzt.",
"why": "Art. 13(2)(f) DSGVO Transparenzgebot — Nutzer muss die "
"Datenverarbeitung beim Drittanbieter eigenverantwortlich "
"nachvollziehen koennen.",
"fix_text": "Pro Drittanbieter Link auf dessen Datenschutzerklaerung "
"ergaenzen. Tabelle 'Anbieter | Datenschutz-Link'.",
"where": "Cookie-Richtlinie im Drittanbieter-Listing.",
"example": "Adobe Analytics: https://www.adobe.com/privacy/policy.html",
"severity": "medium",
},
"Rechtswidriger Haftungsausschluss fuer Links": {
"what": "Klassischer Link-Disclaimer ('Wir distanzieren uns von verlinkten "
"Inhalten') ist im Impressum.",
"why": "BGH I ZR 317/01: solche Disclaimer sind rechtlich wirkungslos. "
"Sie befreien NICHT von der Stoererhaftung und koennen sogar "
"den gegenteiligen Effekt haben (Anerkennung der eigenen "
"Pruefpflicht).",
"fix_text": "Entfernen Sie pauschale Link-Disclaimer vollstaendig aus "
"dem Impressum. Bei Bedarf ein knapper, zutreffender Satz:\n"
" 'Fuer den Inhalt verlinkter externer Webseiten ist "
"ausschliesslich deren Betreiber verantwortlich.'",
"where": "Impressum am Ende des Dokuments.",
"example": "Statt 'Wir distanzieren uns ausdruecklich von allen "
"Inhalten verlinkter Seiten' — einfach nichts schreiben.",
"severity": "low",
},
"Verbraucherstreitbeilegung / OS-Plattform": {
"what": "Kein Link zur EU-OS-Plattform bzw. keine Aussage zur "
"Streitbeilegung.",
"why": "Art. 14 EU-VO 524/2013 (B2C-Anbieter, Online-Verkauf): "
"klickbarer Link auf https://ec.europa.eu/consumers/odr "
"PFLICHT. §36 VSBG: Aussage ob teilnehmend oder nicht.",
"fix_text": "Die EU-Kommission stellt eine Plattform zur Online-"
"Streitbeilegung (OS) bereit, die Sie unter "
"<a href='https://ec.europa.eu/consumers/odr'>"
"https://ec.europa.eu/consumers/odr</a> erreichen.\n\n"
"Wir sind nicht bereit oder verpflichtet, an "
"Streitbeilegungsverfahren vor einer "
"Verbraucherschlichtungsstelle teilzunehmen.",
"where": "Impressum am Ende, eigener Abschnitt 'Streitbeilegung'.",
"example": "Standardformulierung anwendbar fuer alle B2C-Anbieter ohne "
"ODR-Teilnahme.",
"severity": "high",
},
"Name der vertretungsberechtigten Person": {
"what": "Vertretungsberechtigte Person ist nicht namentlich mit "
"Funktionsbezeichnung genannt.",
"why": "§5 Abs. 1 Nr. 1 DDG (vormals TMG): bei juristischen Personen sind die "
"Vertretungsberechtigten namentlich zu nennen.",
"fix_text": "Ergaenzen Sie nach der Firmenbezeichnung:\n"
" 'Vertreten durch: [Vorstand / Geschaeftsfuehrung] "
"[Vorname Nachname]'",
"where": "Impressum direkt nach Firmenname + Anschrift.",
"example": "Vertreten durch: Vorstandsvorsitzender Oliver Zipse",
"severity": "high",
},
}
def recipe_for(finding_key: str) -> ActionRecipe | None:
    """Look up a recipe by finding key. Vendor and doc findings share one lookup."""
    if finding_key in VENDOR_FINDINGS:
        return VENDOR_FINDINGS[finding_key]
    if finding_key in DOC_CHECK_FINDINGS:
        return DOC_CHECK_FINDINGS[finding_key]
    # Fuzzy match on doc findings (the label can vary). An empty key is a
    # substring of every label, so bail out before the substring test.
    fk = (finding_key or "").strip().lower()
    if not fk:
        return None
    for k, v in DOC_CHECK_FINDINGS.items():
        if k.lower() in fk or fk in k.lower():
            return v
    return None
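The exact-then-fuzzy lookup used by `recipe_for` can be illustrated with a minimal, self-contained sketch. The `RECIPES` table and the `lookup` helper below are hypothetical stand-ins for illustration, not the module's real data or API.

```python
from __future__ import annotations

# Hypothetical stand-in for DOC_CHECK_FINDINGS; the real table holds full recipes.
RECIPES: dict[str, dict] = {
    "Verbraucherstreitbeilegung / OS-Plattform": {"severity": "high"},
    "Name der vertretungsberechtigten Person": {"severity": "high"},
}


def lookup(finding_key: str) -> dict | None:
    """Exact match first, then a case-insensitive substring match in either direction."""
    if finding_key in RECIPES:
        return RECIPES[finding_key]
    fk = (finding_key or "").strip().lower()
    if not fk:
        # An empty string is a substring of every label, so reject it early.
        return None
    for label, recipe in RECIPES.items():
        if label.lower() in fk or fk in label.lower():
            return recipe
    return None


hit = lookup("os-plattform")
miss = lookup("unbekanntes finding")
```

Because the fuzzy pass returns the first label that matches, the order of entries in the table decides which recipe wins when several labels overlap.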
@@ -0,0 +1,309 @@
"""
MC Embedding Match: semantic fallback for the regex-based doc_check.
The Sonnet classifier filtered MCs to `check_type='text'` (matchable
against doc text). But the regex matcher is still too strict: BMW
writes "Speicherdauer 2 Jahre", the MC pattern expects
"\\d+\\s*(Tag|Jahr)". We catch these via BGE-M3 embeddings + cosine
similarity:
1. Embed the MC's title + check_question (once, cached in sidecar)
2. Embed the doc text in 50-word chunks
3. max over chunks of cosine(MC, chunk) >= threshold -> MC passes via "semantic"
This recovers ~50% of failed MCs at BMW-scale (estimated).
Embeddings come from bp-core-embedding-service (BGE-M3, 1024-dim,
multilingual). Sidecar SQLite stores 1024 × 4 bytes = 4KB per MC.
"""
from __future__ import annotations
import logging
import math
import os
import re
import sqlite3
import struct
from typing import Iterable
import httpx
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
DIM = 1024 # BGE-M3
SIMILARITY_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD", "0.55"))
CHUNK_SIZE_WORDS = 50
CHUNK_STRIDE = 30 # overlap so multi-sentence MCs aren't cut
# Short mandatory fields (Impressum: HRB no., VAT ID, address) get lost in
# 50-word chunks. We ADDITIONALLY scan the doc with finer 15-word windows
# and a looser threshold for Impressum/AVV types.
SHORT_FIELD_CHUNK_WORDS = 15
SHORT_FIELD_STRIDE = 8
SHORT_FIELD_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD_SHORT", "0.50"))
SHORT_FIELD_DOC_TYPES = {"impressum", "avv"}
# Doc-type-specific threshold overrides, calibrated on the BMW v7 run:
# at 0.55, DSE + cookie sat systematically at 93% (over-firing because
# 8000-word texts vaguely match everything). 0.60 brings that down to the
# realistic ~80%. Impressum has only 6 real MCs + short-field rescue -> 0.50 is ok.
THRESHOLD_OVERRIDE = {
"impressum": 0.50,
"avv": 0.55,
"dse": 0.60,
"cookie": 0.60,
"widerruf": 0.58,
"loeschkonzept": 0.55,
"dsfa": 0.55,
}
def _ensure_schema() -> None:
    """Add embedding column to mc_classification if not present."""
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
            if "embedding" not in cols:
                c.execute("ALTER TABLE mc_classification ADD COLUMN embedding BLOB")
                logger.info("Added embedding column to mc_classification")
    except Exception as e:
        logger.warning("Embedding schema migration skipped: %s", e)


def _vec_to_blob(v: list[float]) -> bytes:
    return struct.pack(f"{len(v)}f", *v)


def _blob_to_vec(b: bytes) -> list[float]:
    return list(struct.unpack(f"{len(b)//4}f", b))
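The blob helpers store each vector as packed 32-bit floats, which is where the "1024 × 4 bytes = 4KB per MC" figure comes from. A minimal round-trip sketch, using made-up vector values, shows the size and the float32 precision loss:

```python
import struct

DIM = 1024  # BGE-M3 vector length, as stated in the module


def vec_to_blob(v: list) -> bytes:
    # "f" is a 4-byte float, so the blob is 4 * len(v) bytes long.
    return struct.pack(f"{len(v)}f", *v)


def blob_to_vec(b: bytes) -> list:
    return list(struct.unpack(f"{len(b) // 4}f", b))


vec = [i / DIM for i in range(DIM)]  # illustrative values, not a real embedding
blob = vec_to_blob(vec)
restored = blob_to_vec(blob)

blob_size = len(blob)
# Python floats are 64-bit; the round trip through float32 loses a little precision.
max_err = max(abs(a - b) for a, b in zip(vec, restored))
```

The small error is harmless for cosine similarity, but it means restored vectors should be compared with a tolerance rather than with `==`.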
EMBED_BATCH_SIZE = 32
async def _embed_texts(texts: list[str], timeout: float = 120.0) -> list[list[float]]:
    """Call the central embedding-service in batches; returns one vector per input.
    BGE-M3 hangs / times out on >100 inputs at once on a CPU-only host.
    We split the input into batches of 32 and collect the results.
    """
    if not texts:
        return []
    out: list[list[float]] = []
    async with httpx.AsyncClient(timeout=timeout) as client:
        for i in range(0, len(texts), EMBED_BATCH_SIZE):
            batch = texts[i:i + EMBED_BATCH_SIZE]
            try:
                r = await client.post(
                    f"{EMBEDDING_URL}/embed", json={"texts": batch},
                )
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
                if len(vecs) != len(batch):
                    # Keep index alignment even if the service returns a short list
                    vecs = (list(vecs) + [[] for _ in batch])[:len(batch)]
                out.extend(vecs)
            except httpx.HTTPError as e:
                logger.warning("Embed sub-batch [%d-%d] failed: %s %s",
                               i, i + len(batch), type(e).__name__, e)
                # Pad with empty vectors so caller can still align by index
                out.extend([[] for _ in batch])
    return out
async def ensure_mc_embeddings(batch_size: int = 64, force: bool = False) -> int:
    """One-shot: embed every text-MC missing an embedding. Returns the number stored.
    Embeds the title + (rough) check_question for each MC to give
    BGE-M3 enough context. Title alone is too terse for the model to
    discriminate against full-paragraph doc text.
    Idempotent: only fills NULL rows unless force=True. Safe to call on
    every run.
    """
    _ensure_schema()
    # Pull check_question from the PG source table once per call (needs
    # context that's not in the sidecar)
    try:
        import psycopg2
        pg = psycopg2.connect(os.environ["DATABASE_URL"])
        with pg.cursor() as c:
            c.execute("SELECT control_id, doc_type, title, check_question "
                      "FROM compliance.doc_check_controls")
            pg_rows = c.fetchall()
        pg.close()
        pg_lookup = {(r[0], r[1] or ""): (r[2] or "", r[3] or "") for r in pg_rows}
    except Exception as e:
        logger.warning("ensure_mc_embeddings PG load failed: %s", e)
        pg_lookup = {}
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            where = ("WHERE check_type = 'text'" + ("" if force else " AND embedding IS NULL"))
            rows = c.execute(
                f"SELECT control_id, doc_type, title FROM mc_classification {where}"
            ).fetchall()
    except Exception as e:
        logger.warning("ensure_mc_embeddings query failed: %s", e)
        return 0
    if not rows:
        return 0
    logger.info("Embedding %d text-MCs (force=%s) via %s ...",
                len(rows), force, EMBEDDING_URL)
    done = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Compose "title. check_question" so the embedding captures both
        # the topic (title) and the concrete check phrasing (question).
        # That helps BMW's actual policy language land in the same vector
        # neighbourhood as our control wording.
        texts: list[str] = []
        for cid, dt, t in batch:
            title_text, question = pg_lookup.get((cid, dt or ""), (t or "", ""))
            combined = f"{title_text}. {question}".strip()
            texts.append(combined[:600])
        try:
            embs = await _embed_texts(texts)
        except Exception as e:
            logger.warning("Embed batch failed (i=%d): %s", i, e)
            continue
        with sqlite3.connect(SIDECAR_DB) as c:
            for (cid, dt, _t), vec in zip(batch, embs):
                if not vec or len(vec) != DIM:
                    continue
                c.execute(
                    "UPDATE mc_classification SET embedding = ? "
                    "WHERE control_id = ? AND doc_type = ?",
                    (_vec_to_blob(vec), cid, dt),
                )
                # Count only rows that actually received a vector
                done += 1
            c.commit()
    logger.info("ensure_mc_embeddings: filled %d/%d", done, len(rows))
    return done
def _chunk_text(text: str, size: int = CHUNK_SIZE_WORDS,
                stride: int = CHUNK_STRIDE) -> list[str]:
    """Sliding-window chunking: overlap helps catch MCs that span 2 sentences."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    out: list[str] = []
    i = 0
    while i < len(words):
        out.append(" ".join(words[i:i + size]))
        i += stride
    return out


def _cosine(a: list[float], b: list[float]) -> float:
    """Plain Python cosine: fast enough for our scale, no numpy import."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
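The matching step built on these two helpers (chunk the document, embed each chunk, take the best cosine per MC, compare against a threshold) can be sketched without the embedding service. The `toy_embed` function below is a hypothetical bag-of-words stand-in for BGE-M3, used only so the example runs offline; the vocabulary, the window sizes and the 0.5 threshold are illustrative assumptions.

```python
import math
import re


def chunk_text(text: str, size: int = 50, stride: int = 30) -> list:
    """Sliding-window chunks of `size` words, advancing by `stride` words."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(0, len(words), stride)]


def cosine(a: list, b: list) -> float:
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical stand-in for the embedding service: word counts over a tiny vocabulary.
VOCAB = ["speicherdauer", "jahre", "cookies", "impressum"]


def toy_embed(text: str) -> list:
    tokens = re.findall(r"\w+", text.lower())
    return [float(tokens.count(w)) for w in VOCAB]


def best_similarity(mc_text: str, doc_text: str, size: int, stride: int) -> float:
    """Best cosine between the MC vector and any chunk vector of the document."""
    mc_vec = toy_embed(mc_text)
    chunks = chunk_text(doc_text, size, stride)
    return max((cosine(mc_vec, toy_embed(c)) for c in chunks), default=0.0)


doc = "Wir setzen Cookies ein. Die Speicherdauer betraegt 2 Jahre. Mehr im Impressum."
chunks = chunk_text(doc, size=6, stride=3)
score = best_similarity("Speicherdauer Jahre", doc, size=6, stride=3)
passed = score >= 0.5  # illustrative threshold
```

Smaller windows isolate the one sentence that mentions the retention period, which is the same effect the short-field rescue pass relies on for one-line fields in an Impressum.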
async def embedding_match(
    doc_text: str,
    mc_records: Iterable[dict],
    doc_type: str | None = None,
    threshold: float | None = None,
) -> set[str]:
    """Return the subset of MC control_ids that semantically match doc_text.
    For Impressum/AVV-types we ADDITIONALLY scan the doc with smaller
    15-word windows and a looser threshold so that short mandatory fields
    (HRB, USt-IdNr, postal address) land in their own chunk and aren't
    diluted by 50-word neighbourhoods of unrelated text.
    """
    if not doc_text or not mc_records:
        return set()
    candidates = list(mc_records)
    if not candidates:
        return set()
    cid_set = {c.get("control_id") for c in candidates if c.get("control_id")}
    if not cid_set:
        return set()
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            placeholders = ",".join("?" * len(cid_set))
            q = ("SELECT control_id, embedding FROM mc_classification "
                 f"WHERE control_id IN ({placeholders}) "
                 "AND check_type='text' AND embedding IS NOT NULL")
            params = list(cid_set)
            if doc_type:
                q += " AND doc_type = ?"
                params.append(doc_type)
            rows = c.execute(q, params).fetchall()
    except Exception as e:
        logger.warning("embedding lookup failed: %s", e)
        return set()
    if not rows:
        return set()
    mc_embeddings = {cid: _blob_to_vec(blob) for cid, blob in rows}
    # An explicit threshold (including 0.0) wins over the per-doc-type override.
    if threshold is not None:
        effective_threshold = threshold
    else:
        effective_threshold = THRESHOLD_OVERRIDE.get(
            (doc_type or "").lower(), SIMILARITY_THRESHOLD)
    chunks = _chunk_text(doc_text)
    if not chunks:
        return set()
    try:
        chunk_vecs = await _embed_texts(chunks)
    except Exception as e:
        logger.warning("doc chunk embedding failed: %s %s",
                       type(e).__name__, str(e) or "(empty msg)", exc_info=True)
        return set()
    # Filter empty vectors (failed sub-batches return [] placeholders)
    chunk_vecs = [v for v in chunk_vecs if v and len(v) == DIM]
    if not chunk_vecs:
        logger.warning("doc chunk embedding: no usable vectors (all batches failed)")
        return set()
    matched: set[str] = set()
    for cid, mc_vec in mc_embeddings.items():
        best = max((_cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
        if best >= effective_threshold:
            matched.add(cid)
    # Short-field rescue pass for Impressum-type docs: small windows +
    # looser threshold catch one-line mandatory fields that 50-word chunks
    # dilute (HRB-Nr, USt-IdNr, postal address). Only runs for MCs not
    # yet matched in the main pass.
    if (doc_type or "").lower() in SHORT_FIELD_DOC_TYPES:
        unmatched = {cid: vec for cid, vec in mc_embeddings.items() if cid not in matched}
        if unmatched:
            short_chunks = _chunk_text(doc_text, size=SHORT_FIELD_CHUNK_WORDS,
                                       stride=SHORT_FIELD_STRIDE)
            try:
                short_vecs = await _embed_texts(short_chunks)
            except Exception as e:
                logger.warning("short-chunk embedding failed: %s", e)
                short_vecs = []
            # Same placeholder filter as in the main pass
            short_vecs = [v for v in short_vecs if v and len(v) == DIM]
            if short_vecs:
                short_passes = 0
                for cid, mc_vec in unmatched.items():
                    best = max((_cosine(mc_vec, cv) for cv in short_vecs), default=0.0)
                    if best >= SHORT_FIELD_THRESHOLD:
                        matched.add(cid)
                        short_passes += 1
                if short_passes:
                    logger.info(
                        "embedding short-field rescue for %s: +%d MCs (threshold %.2f, %d chunks)",
                        doc_type, short_passes, SHORT_FIELD_THRESHOLD, len(short_chunks),
                    )
    logger.info(
        "embedding match for %s: %d/%d MCs passed semantic threshold (main=%.2f)",
        doc_type or "?", len(matched), len(mc_embeddings), effective_threshold,
    )
    return matched
@@ -101,11 +101,36 @@ def build_scorecard(check_results: list[dict]) -> dict:
}
_DEDUP_KEYWORDS = [
"einfache sprache", "verstaendliche sprache", "verständliche sprache",
"klare sprache", "einwilligungstexte", "einwilligungsaufforderung",
"einwilligungserklaerung", "einwilligungserklärung",
"mehrdeutige", "verstaendliche form", "verständliche form",
"fachbegriffe erklaeren", "fachbegriffe erklären",
]
def _dedup_key(label: str) -> str:
    """Cluster label to a stable dedup-key: if it contains one of the
    well-known repetitive Sprache/Einwilligungs-Aufforderungs-Concepts,
    collapse them all to that single concept. Otherwise return original."""
    lowered = (label or "").lower()
    for kw in _DEDUP_KEYWORDS:
        if kw in lowered:
            return f"_dup:{kw}"
    return label
def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
"""Return top-N failing MCs sorted by severity then label.
Skipped + passed MCs are excluded. INFO severity is excluded by
default since those are guidance, not findings.
Near-duplicates (multiple MCs that all complain about "einfache
Sprache" / "Einwilligungsaufforderung" / ...) are collapsed to ONE
representative entry; otherwise UI-language hints dominate the
top list and real leaks get buried.
"""
fails = [
r for r in (check_results or [])
@@ -116,7 +141,17 @@ def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
_SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
r.get("label", ""),
))
return fails[:n]
seen_keys: set[str] = set()
deduped: list[dict] = []
for r in fails:
k = _dedup_key(r.get("label", ""))
if k in seen_keys:
continue
seen_keys.add(k)
deduped.append(r)
if len(deduped) >= n:
break
return deduped
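The collapse of near-duplicate findings can be tried in isolation. The sketch below is a simplified version with a shortened keyword list and made-up findings; `dedup_top` is a hypothetical helper name, not part of the scorecard module.

```python
from __future__ import annotations

DEDUP_KEYWORDS = ["einfache sprache", "einwilligungsaufforderung"]  # shortened list


def dedup_key(label: str) -> str:
    lowered = (label or "").lower()
    for kw in DEDUP_KEYWORDS:
        if kw in lowered:
            return f"_dup:{kw}"
    return label


def dedup_top(fails: list[dict], n: int) -> list[dict]:
    """Keep the first entry per dedup key, preserving the incoming sort order."""
    seen: set[str] = set()
    out: list[dict] = []
    for r in fails:
        key = dedup_key(r.get("label", ""))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
        if len(out) >= n:
            break
    return out


# Made-up findings, assumed to be already sorted by severity and label.
fails = [
    {"label": "Einfache Sprache im Banner", "severity": "MEDIUM"},
    {"label": "Einfache Sprache in der DSE", "severity": "MEDIUM"},
    {"label": "Speicherdauer fehlt", "severity": "MEDIUM"},
]
top = dedup_top(fails, n=10)
```

Since the first entry per key is kept, the representative is always the highest-ranked duplicate in the sorted input.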
def full_audit_records(
@@ -37,6 +37,7 @@ async def check_document_with_controls(
db_url: str = "",
max_controls: int = 0, # 0 = no limit, check ALL
use_agent: bool = False, # Use LLM agent for intelligent evaluation
business_scope: set[str] | None = None,
) -> list[dict]:
"""Check document against ALL doc_check_controls for this doc_type.
@@ -56,7 +57,7 @@ async def check_document_with_controls(
mapped_type = _map_doc_type(doc_type)
# Load ALL controls for this doc_type
controls = await _load_controls(mapped_type, db_url, max_controls)
controls = await _load_controls(mapped_type, db_url, max_controls, business_scope)
if not controls:
logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
return []
@@ -71,6 +72,31 @@ async def check_document_with_controls(
if result:
results.append(result)
# Semantic fallback (Phase 3): MCs that failed via regex get a second
# chance via BGE-M3 cosine similarity. BMW writes "Speicherdauer 2
# Jahre" — the regex misses, embedding catches it.
failed_ids = {r.get("control_id") for r in results
if not r.get("passed") and r.get("control_id")}
if failed_ids:
try:
from compliance.services.mc_embedding_matcher import (
ensure_mc_embeddings, embedding_match,
)
await ensure_mc_embeddings() # idempotent: only embeds new MCs
failed_mcs = [c for c in controls if c.get("control_id") in failed_ids]
semantic_passes = await embedding_match(
text, failed_mcs, doc_type=mapped_type,
)
if semantic_passes:
for r in results:
cid = r.get("control_id")
if cid and cid in semantic_passes and not r.get("passed"):
r["passed"] = True
r["matched_text"] = "[semantischer Treffer via Embedding]"
r["hint"] = (r.get("hint") or "") + " (passed via Embedding-Match, BGE-M3 cosine)"
except Exception as e:
logger.warning("Embedding fallback skipped: %s", e, exc_info=True)
passed = sum(1 for r in results if r["passed"])
failed_results = [r for r in results if not r["passed"]]
logger.info("MC results: %d passed, %d failed out of %d for '%s'",
@@ -161,6 +187,7 @@ def _check_mc_deterministic(text_lower: str, mc: dict) -> Optional[dict]:
return {
"id": f"mc-{control_id}",
"control_id": control_id,
"label": mc.get("title", "")[:80],
"passed": passed,
"severity": severity,
@@ -266,11 +293,72 @@ _MC_ALIAS_FALLBACK = {
}
async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
def _load_text_only_ids(
    doc_type: str | None = None,
    business_scope: set[str] | None = None,
) -> set[str]:
    """Return control_ids that the Sonnet-classifier flagged as 'text'.
    Filters applied:
    1. check_type='text' (only doc-text-matchable MCs)
    2. doc_type matches (per-doc-type variant from v2-Sidecar)
    3. fits_doc_type=1 or NULL (LLM auditor approved this MC for this
       doc_type, or has not audited it yet)
    4. scope_requires NULL or contained in business_scope
       (e.g. MCs with scope_requires='biometric_processing' are skipped
       on sites that don't do biometric processing; the Art. 22 FRT MC
       was a false positive for BMW)
    `business_scope` comes from the business_profiler (set of detected
    site characteristics like 'b2c', 'shop', 'biometric_processing',
    'ai_decision_making', 'child_targeting').
    Returns empty set if the sidecar doesn't exist yet.
    """
    import sqlite3
    db_path = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
    try:
        with sqlite3.connect(db_path) as c:
            cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
            has_fit = "fits_doc_type" in cols
            has_scope = "scope_requires" in cols
            fit_clause = " AND (fits_doc_type IS NULL OR fits_doc_type = 1)" if has_fit else ""
            # Older sidecars lack scope_requires; select NULL in its place.
            scope_col = "scope_requires" if has_scope else "NULL"
            base = (f"SELECT control_id, {scope_col} FROM mc_classification "
                    "WHERE check_type = 'text'" + fit_clause)
            params: list = []
            if doc_type:
                base += " AND doc_type = ?"
                params.append(doc_type)
            rows = c.execute(base, params).fetchall()
            scope = {s.lower() for s in (business_scope or set())}
            keep: set[str] = set()
            for cid, req in rows:
                if not req:
                    keep.add(cid)
                else:
                    # Multiple requirements separated by '|': ALL must
                    # be in scope to include. Empty req tokens are skipped.
                    needed = {r.strip().lower() for r in req.split("|") if r.strip()}
                    if needed.issubset(scope):
                        keep.add(cid)
            return keep
    except sqlite3.OperationalError:
        return set()
    except Exception as e:
        logger.warning("MC classification lookup failed: %s", e)
        return set()
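The `scope_requires` rule (all '|'-separated requirements must be present in the business scope) reduces to a subset test. A minimal sketch, with `in_scope` as a hypothetical helper name and made-up scope values:

```python
from __future__ import annotations


def in_scope(scope_requires: str | None, business_scope: set[str]) -> bool:
    """True if the MC has no scope requirement or all of its requirements are met."""
    if not scope_requires:
        return True
    needed = {t.strip().lower() for t in scope_requires.split("|") if t.strip()}
    return needed.issubset({s.lower() for s in business_scope})


site_scope = {"B2C", "shop"}  # made-up output of a business profiler
unconditional = in_scope(None, site_scope)
biometric = in_scope("biometric_processing", site_scope)
shop_b2c = in_scope("b2c|shop", site_scope)
partial = in_scope("b2c|child_targeting", site_scope)
```

With this rule an MC that requires biometric processing is dropped for a site whose profile lacks that characteristic, while unconditional MCs always stay in.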
async def _load_controls(doc_type: str, db_url: str, limit: int,
business_scope: set[str] | None = None) -> list[dict]:
"""Load all doc_check_controls for a doc_type from PostgreSQL.
Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
type (e.g. 'nutzungsbedingungen' -> 'agb').
Filters to only check_type='text' MCs when the classification sidecar
is present; process/review MCs are routed to other modules.
"""
try:
import asyncpg
@@ -297,7 +385,17 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
fallback = _MC_ALIAS_FALLBACK[doc_type]
logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
rows = await conn.fetch(query, fallback)
return [dict(r) for r in rows]
controls = [dict(r) for r in rows]
text_only = _load_text_only_ids(doc_type, business_scope)
if text_only:
before = len(controls)
controls = [c for c in controls if c.get("control_id") in text_only]
logger.info(
"MC filter (text only) for %s: %d/%d MCs after Sonnet check_type filter",
doc_type, len(controls), before,
)
return controls
except Exception as e:
logger.warning("MC query failed: %s", e)
return []
@@ -0,0 +1,407 @@
"""
Vendor cost estimator: derives a pricing tier per vendor from cookie
signals and returns an intensity-based annual cost estimate.
Cookie signals we evaluate:
- Number of cookies per vendor (proxy for module depth)
- Premium-feature cookies (e.g. 's_target_qa', '_ab_test' -> enterprise add-on)
- Edge/region cookies (multi-region -> premier-tier CDN)
- Cookie persistence (multi-year -> heavy-tracking licence)
Plus business_profile for company-tier inference.
Output per vendor:
- inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
- tier_signals: list of indicators that led to the tier
- cost_year_eur_range: (low, high) based on tier × vendor pricing
- confidence: 'low' | 'medium' | 'high' ('none' if no pricing entry matches)
This module complements vendor_redundancy.py: the simple low/high
flat rates there are replaced here by dynamic, signal-based values.
"""
from __future__ import annotations
import logging
import re
from typing import Iterable
logger = logging.getLogger(__name__)
# ─── Premium-feature cookies: indicator for enterprise add-ons ─────
#
# If a vendor sets these cookies, the customer is very likely on an
# enterprise plan.
_PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
# (regex, vendor_key, premium_feature_label)
(r"^s_target_qa$", "adobe analytics", "Adobe Target Add-on"),
(r"adobe.*target", "adobe target", "Personalization Enterprise"),
(r"^aam_uuid", "adobe analytics", "Audience Manager Enterprise"),
(r"^s_ecid", "adobe analytics", "Experience Cloud ID Service"),
(r"^_pcid_", "adobe analytics", "People-Based Destinations"),
(r"^_gat_gtag_UA", "google analytics", "GA360 Multi-Tracker"),
(r"^_ga_[A-Z0-9]+_[A-Z0-9]+", "google analytics", "GA4 Enterprise Stream"),
(r"^_uetmsdns", "microsoft advertising", "Custom Conversion Tracking"),
(r"^_fbp.*test", "meta pixel", "Conversions API Premium"),
(r"^_pin_unauth_premium", "pinterest", "Pinterest Premium-API"),
(r"^afm", "adform", "Affinity-Module"),
(r"^cto_dna", "criteo", "Dynamic Retargeting Premium"),
# CDN / Infra Premium
(r"^aws-alb-[a-z0-9]+", "amazon web services", "ALB + Multi-Region"),
(r"^aws-waf", "amazon web services", "WAF Enterprise"),
(r"^cf_clearance", "cloudflare", "Bot-Management Pro"),
(r"^akm_[a-z]+", "akamai", "Adaptive Media Delivery Enterprise"),
# Salesforce Customer-360
(r"^bid_n_", "salesforce", "Marketing Cloud Personalization"),
(r"^_cs_", "salesforce", "CDP Premium"),
]
# ─── Tier pricing per vendor (annual, EUR) ──────────────────────────
#
# 4 tiers: starter (SME), professional (mid-market), enterprise (corporate
# group), premier (global brand / heavy user).
_TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
"adobe analytics": {
"starter": ( 10_000, 30_000),
"professional": ( 60_000, 150_000),
"enterprise": (200_000, 500_000),
"premier": (500_000, 900_000),
},
"adobe target": {
"starter": ( 8_000, 25_000),
"professional": ( 40_000, 100_000),
"enterprise": (120_000, 300_000),
"premier": (300_000, 600_000),
},
"adobe campaign": {
"starter": ( 10_000, 30_000),
"professional": ( 40_000, 100_000),
"enterprise": (120_000, 280_000),
"premier": (280_000, 500_000),
},
"google analytics": {
"starter": ( 0, 0), # GA4 free
"professional": ( 0, 0),
"enterprise": ( 80_000, 150_000), # GA360
"premier": (150_000, 300_000),
},
"matomo": {
"starter": ( 0, 3_000), # On-prem free / Cloud Starter
"professional": ( 6_000, 20_000),
"enterprise": ( 20_000, 80_000),
"premier": ( 60_000, 150_000),
},
"content square": {
"starter": ( 12_000, 40_000),
"professional": ( 60_000, 150_000),
"enterprise": (150_000, 350_000),
"premier": (350_000, 700_000),
},
"contentsquare": {
"starter": ( 12_000, 40_000),
"professional": ( 60_000, 150_000),
"enterprise": (150_000, 350_000),
"premier": (350_000, 700_000),
},
"dynatrace": {
"starter": ( 5_000, 15_000),
"professional": ( 30_000, 80_000),
"enterprise": (100_000, 300_000),
"premier": (300_000, 800_000),
},
"qualtrics": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 200_000),
"premier": (200_000, 500_000),
},
# Advertising / Retargeting (Lizenz + Self-Service; Media-Spend SEPARAT)
"criteo": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 250_000),
"premier": (250_000, 600_000),
},
"adform": {
"starter": ( 12_000, 40_000),
"professional": ( 60_000, 150_000),
"enterprise": (150_000, 400_000),
"premier": (400_000, 800_000),
},
"outbrain": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 200_000),
"premier": (200_000, 500_000),
},
"taboola": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 200_000),
"premier": (200_000, 500_000),
},
"teads": {
"starter": ( 6_000, 18_000),
"professional": ( 20_000, 60_000),
"enterprise": ( 60_000, 150_000),
"premier": (150_000, 350_000),
},
"pinterest": {
"starter": ( 3_000, 15_000),
"professional": ( 15_000, 50_000),
"enterprise": ( 50_000, 150_000),
"premier": (150_000, 400_000),
},
"linkedin insight": {
"starter": ( 3_000, 12_000),
"professional": ( 12_000, 40_000),
"enterprise": ( 40_000, 120_000),
"premier": (120_000, 300_000),
},
# CDN / Cloud
"akamai": {
"starter": ( 20_000, 60_000),
"professional": ( 80_000, 200_000),
"enterprise": (200_000, 500_000),
"premier": (500_000, 1_500_000),
},
"amazon web services": {
"starter": ( 12_000, 60_000),
"professional": ( 60_000, 300_000),
"enterprise": (300_000, 1_500_000),
"premier": (1_500_000, 8_000_000),
},
"baqend": {
"starter": ( 3_000, 12_000),
"professional": ( 12_000, 40_000),
"enterprise": ( 40_000, 120_000),
"premier": (120_000, 300_000),
},
"speedkit": {
"starter": ( 3_000, 12_000),
"professional": ( 12_000, 40_000),
"enterprise": ( 40_000, 120_000),
"premier": (120_000, 300_000),
},
"speedcurve": {
"starter": ( 1_200, 4_800),
"professional": ( 6_000, 18_000),
"enterprise": ( 18_000, 60_000),
"premier": ( 60_000, 120_000),
},
# CRM / Marketing
"salesforce": {
"starter": ( 20_000, 60_000),
"professional": ( 80_000, 250_000),
"enterprise": (250_000, 800_000),
"premier": (800_000, 2_500_000),
},
"genesys": {
"starter": ( 24_000, 80_000),
"professional": ( 80_000, 250_000),
"enterprise": (250_000, 800_000),
"premier": (800_000, 2_000_000),
},
# Captcha
"hcaptcha": {
"starter": ( 0, 2_400),
"professional": ( 2_400, 12_000),
"enterprise": ( 12_000, 40_000),
"premier": ( 40_000, 100_000),
},
# Lead-Tracking
"salesviewer": {
"starter": ( 1_200, 3_600),
"professional": ( 3_600, 12_000),
"enterprise": ( 12_000, 40_000),
"premier": ( 40_000, 100_000),
},
}
def _vendor_key(vendor_name: str) -> str | None:
    """Map a vendor name to a known pricing-table key."""
    n = (vendor_name or "").lower()
    for k in _TIER_PRICING:
        if k in n:
            return k
    return None
def infer_company_tier(business_profile: dict | None) -> str:
    """Coarse company-tier from business profile.
    Used as the baseline when vendor-specific signals are weak.
    """
    if not business_profile:
        return "professional"
    bp = business_profile
    features = {f.lower() for f in (bp.get("features") or [])}
    btype = (bp.get("type") or "").lower()
    # Heavy enterprise-only signals
    if any(f in features for f in ("multi_country", "konzern", "enterprise",
                                   "international", "automotive", "banking",
                                   "luxury", "premium")):
        return "premier"
    # Large but maybe single-country
    if "shop" in features or "konfigurator" in features or btype == "b2c":
        return "enterprise"
    return "professional"
def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
    """Infer pricing tier for a single vendor from its cookie footprint.
    Starts at the company tier and moves up as signals accumulate:
    - cookie_count >= 60                     -> +1 tier
    - cookie_count >= 30                     -> signal only (no tier change)
    - premium-feature cookie hit             -> +1 tier per matching pattern
    - >= 60% third-party cookies (>= 10 cookies) -> signal only
    - >= 3 cookies with expiry >= 2 years    -> signal only
    Signal-only entries leave the tier unchanged but raise the confidence
    reported by estimate_vendor_cost().
    """
    cookies = vendor.get("cookies") or []
    n_cookies = len(cookies)
    cookie_names = [(c.get("name") or "").lower() for c in cookies]
    signals: list[str] = []
    base_tiers = ["starter", "professional", "enterprise", "premier"]
    # Start at company-tier as baseline
    idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1
    if n_cookies >= 60:
        idx = min(len(base_tiers) - 1, idx + 1)
        signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
    elif n_cookies >= 30:
        signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")
    # Premium feature detection
    vk = _vendor_key(vendor.get("name", ""))
    for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
        if vk and vk != expected_key and expected_key not in (vendor.get("name") or "").lower():
            continue
        for cn in cookie_names:
            if re.search(pattern, cn):
                idx = min(len(base_tiers) - 1, idx + 1)
                signals.append(f"Premium-Feature-Cookie: {feature_label}")
                break
    # Heavy third-party tracking
    third_party_ratio = sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
    if third_party_ratio >= 0.6 and n_cookies >= 10:
        signals.append(f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies, Tracking-Heavy")
    # Long-lived cookies
    long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
    if long_lived >= 3:
        signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")
    return base_tiers[idx], signals
def _expiry_years(expiry_str: str) -> float:
    """Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → 0.25"""
    s = (expiry_str or "").lower()
    m = re.search(r"(\d+)\s*(jahr|year)", s)
    if m:
        return float(m.group(1))
    m = re.search(r"(\d+)\s*(monat|month)", s)
    if m:
        return float(m.group(1)) / 12.0
    m = re.search(r"(\d+)\s*(tag|day)", s)
    if m:
        return float(m.group(1)) / 365.0
    return 0.0
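The expiry parser feeds the "long-lived cookies" signal. A compact, table-driven variant of the same idea is sketched below; the cookie list is made up, and `expiry_years` is an illustrative rewrite rather than the module's function.

```python
import re

# (pattern, units per year), checked in this order: years, months, days.
_UNITS = (
    (r"(\d+)\s*(jahr|year)", 1.0),
    (r"(\d+)\s*(monat|month)", 12.0),
    (r"(\d+)\s*(tag|day)", 365.0),
)


def expiry_years(expiry: str) -> float:
    """Rough conversion of strings like '2 Jahre' or '24 Monate' into years."""
    s = (expiry or "").lower()
    for pattern, per_year in _UNITS:
        m = re.search(pattern, s)
        if m:
            return float(m.group(1)) / per_year
    return 0.0  # 'Session' and unparseable values count as zero


cookies = [  # made-up cookie records
    {"name": "a", "expiry": "2 Jahre"},
    {"name": "b", "expiry": "24 Monate"},
    {"name": "c", "expiry": "90 Tage"},
    {"name": "d", "expiry": "Session"},
]
long_lived = sum(1 for c in cookies if expiry_years(c["expiry"]) >= 2)
```

Note that 90 days comes out as 90/365, roughly 0.25 years, so it stays well below the two-year cut-off.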
def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
    """Return cost estimation for one vendor incl. tier inference + signals."""
    vk = _vendor_key(vendor.get("name", ""))
    company_tier = infer_company_tier(business_profile)
    if not vk:
        return {
            "vendor": vendor.get("name", ""),
            "matched_pricing_key": None,
            "inferred_tier": None,
            "tier_signals": [],
            "company_tier_baseline": company_tier,
            "cost_year_eur_range": (0, 0),
            "confidence": "none",
            "note": "Kein Pricing-Eintrag fuer diesen Anbieter; Saving-Schaetzung uebergangen.",
        }
    tier, signals = infer_vendor_tier(vendor, company_tier)
    pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
    confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")
    return {
        "vendor": vendor.get("name", ""),
        "matched_pricing_key": vk,
        "inferred_tier": tier,
        "tier_signals": signals,
        "company_tier_baseline": company_tier,
        "cost_year_eur_range": pricing,
        "confidence": confidence,
    }
def estimate_total_stack_cost(
    vendors: Iterable[dict],
    business_profile: dict | None = None,
) -> dict:
    """Aggregate cost estimation over all vendors.
    Returns:
    - per_vendor: list with one estimate per vendor
    - total_year_eur_range: summed (low, high) range
    - master_contracts_counted: number of distinct contract keys counted
    - disclaimer: user-facing note on how to read the estimate
    Master-contract dedup: vendors with recipient_type INTERNAL (lines
    such as 'BMW AG - Adobe XYZ') are bundled into ONE master contract
    per pricing key and are not double-counted.
    """
    per_vendor: list[dict] = []
    seen_master_keys: set[tuple[str, str]] = set()
    total_low = 0
    total_high = 0
    for v in vendors:
        est = estimate_vendor_cost(v, business_profile)
        per_vendor.append(est)
        if not est["matched_pricing_key"]:
            continue
        rtype = (v.get("recipient_type") or "").upper()
        master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
        if rtype == "INTERNAL" and master_key in seen_master_keys:
            # Same Adobe contract serves many "BMW AG - Adobe XYZ" lines:
            # count cost only ONCE per (key, internal).
            est["bundled_into_master_contract"] = True
            continue
        seen_master_keys.add(master_key)
        lo, hi = est["cost_year_eur_range"]
        total_low += lo
        total_high += hi
    return {
        "per_vendor": per_vendor,
        "total_year_eur_range": (total_low, total_high),
        "master_contracts_counted": len(seen_master_keys),
        "disclaimer": (
            "Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
            "Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
            "koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
            "Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
        ),
    }
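The master-contract rule in the aggregation can be isolated into a few lines. The sketch below uses made-up estimates and a simplified record shape (`pricing_key`, `recipient_type`, `range` are illustrative field names, not the module's output keys).

```python
from __future__ import annotations


def total_cost_range(estimates: list[dict]) -> tuple[int, int]:
    """Sum (low, high) ranges, counting each INTERNAL pricing key only once."""
    seen_internal: set[str] = set()
    low = high = 0
    for est in estimates:
        key = est["pricing_key"]
        if est["recipient_type"].upper() == "INTERNAL":
            if key in seen_internal:
                continue  # same master contract, already counted
            seen_internal.add(key)
        lo, hi = est["range"]
        low += lo
        high += hi
    return low, high


estimates = [  # made-up records; two internal lines share one Adobe contract
    {"pricing_key": "adobe analytics", "recipient_type": "INTERNAL", "range": (200_000, 500_000)},
    {"pricing_key": "adobe analytics", "recipient_type": "INTERNAL", "range": (200_000, 500_000)},
    {"pricing_key": "criteo", "recipient_type": "PROCESSOR", "range": (80_000, 250_000)},
]
total = total_cost_range(estimates)
```

Only internal lines are deduplicated here, mirroring the function above; external vendors that share a pricing key are each counted.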
@@ -0,0 +1,727 @@
"""
Vendor Redundancy + EU-Alternatives Analyzer.
Input: list of vendors from the CMP capture (e.g. BMW, 90 vendors).
Output: four structured lists that are rendered in the email and the
migration modal:
1. functional_categories : vendor -> functional class (analytics,
advertising, cdn, captcha, chat, ...)
2. redundancies : categories with >= 2 vendors that do the same thing
-> consolidation potential
3. eu_alternatives : per US vendor a matching EU replacement from a
curated lookup table (Matomo instead of
Adobe Analytics, IONOS instead of AWS, etc.)
4. multi_function_tools : EU tools that cover several categories
(e.g. SAP CX = analytics + CRM + marketing)
"""
from __future__ import annotations
import logging
import re
from collections import defaultdict
from typing import Iterable
logger = logging.getLogger(__name__)
# ─── Categorisation ───────────────────────────────────────────────────
# Substring match (lowercase) -> category. First match wins.
_CATEGORY_RULES: list[tuple[str, str]] = [
# Web Analytics / Behavior
("adobe analytics", "web_analytics"),
("adobe target", "personalisation"),
("adobe campaign", "marketing_automation"),
("adobe staging library", "tag_management"),
("adobelaunch", "tag_management"),
("google analytics", "web_analytics"),
("matomo", "web_analytics"),
("hotjar", "web_analytics"),
("content square", "web_analytics"),
("contentsquare", "web_analytics"),
("dynatrace", "monitoring"),
("performance analytics", "web_analytics"),
("form analytics", "web_analytics"),
("form campaign analytics","web_analytics"),
("psyma", "survey"),
("qualtrics", "survey"),
# Tag Management
("google tag manager", "tag_management"),
("gtm", "tag_management"),
# Advertising / Retargeting
("google ads", "advertising"),
("google advertising", "advertising"),
("doubleclick", "advertising"),
("googleads", "advertising"),
("meta pixel", "advertising"),
("meta platforms", "advertising"),
("facebook", "advertising"),
("adform", "advertising"),
("criteo", "advertising"),
("outbrain", "advertising"),
("taboola", "advertising"),
("teads", "advertising"),
("pinterest", "advertising"),
("linkedin insight", "advertising"),
("youtube performance", "advertising"),
("youtube player", "external_media"),
("amazon advertising", "advertising"),
("instagram", "advertising"),
("dotaki", "advertising"),
# Video / Embeds
("youtube", "external_media"),
("vimeo", "external_media"),
("jw player", "external_media"),
("jw video", "external_media"),
("jwplayer", "external_media"),
("jwconnatix", "external_media"),
# Maps / Geo
("google maps", "maps"),
("google geolocation", "maps"),
("geolocation", "maps"),
# CDN / Infrastructure
("akamai", "cdn"),
("amazon web services", "cloud_infra"),
("aws", "cloud_infra"),
("baqend", "cdn"),
("speedkit", "cdn"),
("speedcurve", "monitoring"),
("salesforce", "crm"),
# Chat / Support
("genesys", "chat"),
("ckm", "chat"),
("chat widget", "chat"),
# Captcha / Bot-Protection
("hcaptcha", "captcha"),
("recaptcha", "captcha"),
# Sales / Lead-Tracking
("salesviewer", "lead_tracking"),
# Marketing/Sales overlay
("nayoki", "social_aggregator"),
# Site-eigene Funktionen
("infrastructure", "site_infra"),
("infrastrukturbereit", "site_infra"),
("javaserverpages", "site_infra"),
("single sign-on", "auth"),
("mybmw account", "auth"),
("sso", "auth"),
("consent", "consent_management"),
("session", "site_infra"),
("scroll", "site_infra"),
("sticky", "site_infra"),
("sidebar", "site_infra"),
("dealer search", "site_feature"),
("test drive", "site_feature"),
("vehicle configurator", "site_feature"),
("stocklocator", "site_feature"),
("eshop", "site_feature"),
("shop", "site_feature"),
("language", "site_infra"),
("sprach", "site_infra"),
("region", "site_infra"),
("ip popup", "site_infra"),
("popup", "site_infra"),
]
def classify_vendor(name: str) -> str:
"""Map a vendor name to a functional category."""
n = (name or "").lower()
for needle, cat in _CATEGORY_RULES:
if needle in n:
return cat
return "other"
# ─── EU-Alternativen ─────────────────────────────────────────────────
# Kuratierte Liste — pro US-/Nicht-EU-Vendor passende(r) EU-Ersatz.
# Quellen: Matomo Vergleich, etracker SoMo-Studie, IONOS-Pakete,
# Friendly Captcha Whitepaper, SAP CX-Suite, Brevo / CleverReach DE-Listen.
_EU_ALTERNATIVES: dict[str, list[dict]] = {
"adobe analytics": [
{"name": "Matomo (On-Premise)", "vendor": "InnoCraft", "country": "DE-self-hosted",
"license": "GPL", "notes": "100% DSGVO, keine 3rd-Country, gleicher Funktionsumfang"},
{"name": "etracker Analytics", "vendor": "etracker GmbH", "country": "DE",
"license": "Commercial", "notes": "DSGVO-konform aus Hamburg, IP-Anonymisierung"},
{"name": "Mapp Intelligence", "vendor": "Mapp Digital", "country": "DE",
"license": "Commercial", "notes": "Enterprise-Alternative, Server in DE"},
],
"google analytics": [
{"name": "Matomo", "vendor": "InnoCraft", "country": "DE-self-hosted",
"license": "GPL", "notes": "Direkter Drop-in-Ersatz mit GA-Migrationspfad"},
{"name": "Plausible Analytics", "vendor": "Plausible Insights", "country": "EE",
"license": "AGPL/Commercial", "notes": "Cookielos, ohne Einwilligung nutzbar"},
{"name": "Fathom Analytics EU", "vendor": "Fathom", "country": "DE-Region",
"license": "Commercial", "notes": "Cookielos, EU-Hosting"},
],
"content square": [
{"name": "Mouseflow EU", "vendor": "Mouseflow ApS", "country": "DK",
"license": "Commercial", "notes": "Session-Recording + Heatmaps EU-Hosting"},
{"name": "Hotjar EU", "vendor": "Hotjar Ltd", "country": "MT",
"license": "Commercial", "notes": "EU-DataCenter (Frankfurt), Einwilligung erforderlich"},
],
"dynatrace": [
{"name": "Dynatrace EU", "vendor": "Dynatrace", "country": "AT",
"license": "Commercial", "notes": "Bereits EU (Linz). Cluster auf EU einstellen"},
],
"speedcurve": [
{"name": "SpeedCurve EU", "vendor": "SpeedCurve", "country": "EU-tenant",
"license": "Commercial", "notes": "Region-Tenant explizit konfigurieren"},
{"name": "Calibre", "vendor": "Calibre", "country": "AU/EU",
"license": "Commercial", "notes": "Performance Monitoring, EU-Region"},
],
"akamai": [
{"name": "Bunny CDN", "vendor": "BunnyWay d.o.o.", "country": "SI",
"license": "Commercial", "notes": "Slowenischer CDN, EU-Backbone"},
{"name": "Cloudflare EU-Only", "vendor": "Cloudflare", "country": "Multi",
"license": "Commercial", "notes": "EU-Datacenter erzwingbar via 'Geo Steering'"},
{"name": "IONOS CDN", "vendor": "IONOS SE", "country": "DE",
"license": "Commercial", "notes": "100% DE-Hosting"},
],
"amazon web services": [
{"name": "IONOS Cloud", "vendor": "IONOS SE", "country": "DE",
"license": "Commercial", "notes": "DE-Hosting, BSI C5-zertifiziert"},
{"name": "OVHcloud", "vendor": "OVH SAS", "country": "FR",
"license": "Commercial", "notes": "FR-Hosting, SecNumCloud-zertifiziert"},
{"name": "Hetzner Cloud", "vendor": "Hetzner Online GmbH", "country": "DE",
"license": "Commercial", "notes": "DE/FI-Hosting, sehr kostenguenstig"},
{"name": "STACKIT", "vendor": "Schwarz IT (Lidl-Gruppe)", "country": "DE",
"license": "Commercial", "notes": "Souveraener DE-Cloud, fuer Enterprise"},
],
"salesforce": [
{"name": "SAP Customer Experience", "vendor": "SAP SE", "country": "DE",
"license": "Commercial", "notes": "Vollstaendige CRM-Suite EU-Hosting"},
{"name": "weclapp", "vendor": "weclapp SE", "country": "DE",
"license": "Commercial", "notes": "Cloud-CRM aus Marburg"},
],
"adobe campaign": [
{"name": "CleverReach", "vendor": "CleverReach GmbH", "country": "DE",
"license": "Commercial", "notes": "E-Mail-Marketing DE-Hosting"},
{"name": "Brevo (Sendinblue)", "vendor": "Brevo", "country": "FR",
"license": "Commercial", "notes": "Marketing-Automation EU-Hosting"},
{"name": "Inxmail", "vendor": "Inxmail GmbH", "country": "DE",
"license": "Commercial", "notes": "Enterprise-E-Mail-Marketing aus Freiburg"},
],
"google ads": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "FR-Hosting, Programmatic + Direct-Sold"},
{"name": "Bing Ads (Microsoft Advertising EU)", "vendor": "Microsoft", "country": "Multi",
"license": "Commercial", "notes": "EU-Datacenter optional"},
],
"google maps": [
{"name": "HERE Maps", "vendor": "HERE Technologies", "country": "DE",
"license": "Commercial", "notes": "Berliner Anbieter, professionelle Karten + Routing"},
{"name": "OpenStreetMap (self-host)", "vendor": "OSM Foundation", "country": "DE-self-host",
"license": "ODbL", "notes": "Frei, OSM-Tiles self-hosted oder via Maptiler EU"},
{"name": "Maptiler Cloud EU", "vendor": "MapTiler", "country": "CH",
"license": "Commercial", "notes": "Schweizer Anbieter, EU-Tiles"},
],
"criteo": [ # criteo IS EU but use as example for retargeting alts
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "Retargeting + Display, FR-Hosting"},
],
"hcaptcha": [
{"name": "Friendly Captcha", "vendor": "Friendly Captcha GmbH", "country": "DE",
"license": "Commercial", "notes": "100% DSGVO, ohne Cookie, Hosting in DE"},
{"name": "Turnstile (Cloudflare EU-Only)", "vendor": "Cloudflare", "country": "Multi",
"license": "Commercial", "notes": "Ohne Cookie, EU-Region erzwingbar"},
],
"qualtrics": [
{"name": "LamaPoll", "vendor": "Lamano GmbH", "country": "DE",
"license": "Commercial", "notes": "DSGVO-Surveys aus Berlin"},
{"name": "evasys", "vendor": "evasys GmbH", "country": "DE",
"license": "Commercial", "notes": "Enterprise-Survey-Plattform aus Lueneburg"},
],
"meta pixel": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "EU-Alternative fuer Conversion-Tracking"},
],
"facebook": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "Programmatic ohne Meta"},
],
"linkedin insight": [
{"name": "Xing Insights", "vendor": "New Work SE", "country": "DE",
"license": "Commercial", "notes": "DE/AT/CH B2B-Targeting aus Hamburg"},
],
"outbrain": [
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
],
"taboola": [
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
],
"genesys": [
{"name": "Userlike", "vendor": "Userlike UG", "country": "DE",
"license": "Commercial", "notes": "Live-Chat aus Koeln, BSI-konform"},
{"name": "LiveZilla / EasyChat EU", "vendor": "LiveZilla GmbH", "country": "DE",
"license": "Commercial", "notes": "DSGVO-Live-Chat"},
],
"salesviewer": [
{"name": "Leadinfo", "vendor": "Leadinfo BV", "country": "NL",
"license": "Commercial", "notes": "B2B-Webvisitor-Tracking EU"},
{"name": "Albacross EU", "vendor": "Albacross", "country": "SE",
"license": "Commercial", "notes": "EU-Tenant verfuegbar"},
],
"youtube": [
{"name": "Vimeo Pro EU", "vendor": "Vimeo", "country": "Multi",
"license": "Commercial", "notes": "EU-Region waehlbar, weniger Tracking"},
{"name": "Self-hosted video (BunnyStream)", "vendor": "BunnyWay", "country": "SI",
"license": "Commercial", "notes": "Eigene Player + CDN ohne Drittanbieter"},
],
"amazon advertising": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "Retail-Media-Alternative FR"},
],
"instagram": [
{"name": "Pinterest EU + Owned-Channels", "vendor": "Mix", "country": "Multi",
"license": "Commercial", "notes": "Owned-Channels (Newsletter via CleverReach)"},
],
}
# ─── Kosten-Annahmen (oeffentliche Listenpreise, Schaetzung) ──────
#
# Format: (low_year_eur, high_year_eur, tier_assumption)
# Tier: 'sme' = <100 Mitarbeiter, 'mid' = 100-1000, 'ent' = >1000.
# Quellen: oeffentliche Listenpreise + Branchen-Benchmarks (Gartner,
# Forrester 2025). Konkrete Vertrags-Konditionen koennen 30-70% abweichen
# (Volumen-Rabatte, Bundling). Werden im Output explizit als
# 'Schaetzbereich' markiert.
_COST_LOOKUP: dict[str, tuple[int, int, str]] = {
"adobe analytics": (120_000, 600_000, "ent"),
"adobe target": ( 80_000, 350_000, "ent"),
"adobe campaign": ( 60_000, 250_000, "ent"),
"adobe staging library":( 0, 0, "ent"), # bundled
"google analytics": ( 0, 150_000, "ent"), # GA4 free, GA360 ~150k
"matomo": ( 6_000, 30_000, "mid"), # Cloud/On-Prem
"hotjar": ( 3_600, 18_000, "mid"),
"content square": ( 60_000, 300_000, "ent"),
"contentsquare": ( 60_000, 300_000, "ent"),
"dynatrace": ( 50_000, 400_000, "ent"), # per-host pricing
"performance analytics":( 5_000, 40_000, "mid"),
"qualtrics": ( 25_000, 150_000, "ent"),
# Self-Service-Werbung — KEIN Tool-Lizenz, nur Media-Spend (separat).
# Wir zaehlen 0 hier, weil "Sparpotenzial bei der Lizenz" = 0 ist.
# Konsolidierung wuerde nur Media-Spend reduzieren — anderes Thema.
"google ads": ( 0, 0, "ent"),
"google advertising": ( 0, 0, "ent"),
"doubleclick": ( 0, 0, "ent"),
"meta pixel": ( 0, 0, "ent"),
"facebook": ( 0, 0, "ent"),
"amazon advertising": ( 0, 0, "ent"),
"youtube performance": ( 0, 0, "ent"),
"youtube player": ( 0, 0, "ent"),
"instagram": ( 0, 0, "ent"),
# Genuine DSP/platform licenses: here the customer pays a SaaS fee
# ON TOP of the media spend. Range deliberately kept narrower (factor max 4x).
"adform": ( 80_000, 300_000, "ent"),
"criteo": ( 50_000, 200_000, "ent"),
"outbrain": ( 30_000, 120_000, "ent"),
"taboola": ( 30_000, 120_000, "ent"),
"teads": ( 25_000, 100_000, "ent"),
"pinterest": ( 15_000, 60_000, "ent"),
"linkedin insight": ( 10_000, 50_000, "ent"),
"google maps": ( 2_000, 30_000, "mid"),
"akamai": ( 50_000, 500_000, "ent"),
"amazon web services": (100_000, 3_000_000, "ent"),
"baqend": ( 6_000, 60_000, "mid"),
"speedkit": ( 6_000, 60_000, "mid"),
"speedcurve": ( 2_400, 24_000, "mid"),
"salesforce": (100_000, 1_500_000, "ent"), # CRM seats
"genesys": ( 80_000, 800_000, "ent"), # contact-center seats
"ckm": ( 15_000, 120_000, "mid"),
"hcaptcha": ( 0, 12_000, "sme"), # free tier OR pro
"salesviewer": ( 3_600, 18_000, "mid"),
"youtube": ( 0, 50_000, "ent"), # embed kostenlos, Production-Kosten variieren
}
# ─── EU-Alternativen-Kosten (gleiche Tier-Logik) ───────────────────
_EU_ALT_COSTS: dict[str, tuple[int, int]] = {
"Matomo (On-Premise)": ( 3_000, 15_000),
"Matomo (Pro / Cloud EU)": ( 6_000, 30_000),
"Matomo": ( 6_000, 30_000),
"etracker Analytics": ( 10_000, 60_000),
"Mapp Intelligence": ( 40_000, 200_000),
"Plausible Analytics": ( 240, 6_000),
"Fathom Analytics EU": ( 240, 6_000),
"Mouseflow EU": ( 12_000, 60_000),
"Hotjar EU": ( 3_600, 18_000),
"Dynatrace EU": ( 50_000, 400_000), # gleicher Preis, nur Region
"SpeedCurve EU": ( 2_400, 24_000),
"Calibre": ( 3_600, 30_000),
"Bunny CDN": ( 1_200, 12_000),
"Cloudflare EU-Only": ( 6_000, 80_000),
"IONOS CDN": ( 3_000, 30_000),
"IONOS Cloud": ( 30_000, 600_000),
"OVHcloud": ( 30_000, 600_000),
"Hetzner Cloud": ( 6_000, 120_000),
"STACKIT": ( 50_000, 800_000),
"SAP Customer Experience": ( 80_000, 1_200_000),
"weclapp": ( 12_000, 80_000),
"CleverReach": ( 2_400, 24_000),
"Brevo (Sendinblue)": ( 600, 24_000),
"Inxmail": ( 8_000, 60_000),
"Smart AdServer (Equativ)": ( 30_000, 300_000),
"Bing Ads (Microsoft Advertising EU)": ( 30_000, 3_000_000),
"HERE Maps": ( 1_200, 24_000),
"OpenStreetMap (self-host)": ( 0, 6_000), # nur Server-Kosten
"Maptiler Cloud EU": ( 600, 12_000),
"Friendly Captcha": ( 600, 9_600),
"Turnstile (Cloudflare EU-Only)": ( 0, 6_000),
"LamaPoll": ( 1_200, 24_000),
"evasys": ( 6_000, 60_000),
"Xing Insights": ( 6_000, 60_000),
"Plista": ( 20_000, 150_000),
"Userlike": ( 1_200, 30_000),
"LiveZilla / EasyChat EU": ( 600, 12_000),
"Leadinfo": ( 1_200, 12_000),
"Albacross EU": ( 3_600, 24_000),
"Vimeo Pro EU": ( 900, 6_000),
"Self-hosted video (BunnyStream)": ( 600, 12_000),
"Pinterest EU + Owned-Channels": ( 600, 24_000),
}
# ─── Bekannte Gruende fuer Duplikate (sollen Konsolidierung NICHT empfehlen) ─
_DUPLICATION_CAVEATS = {
"web_analytics": [
"A/B-Vergleich verschiedener Anbieter waehrend Migration",
"Marketing nutzt Adobe, Produkt nutzt Matomo — Inhouse-Politik",
"Regional split (Adobe fuer DE, GA fuer International)",
],
"advertising": [
"Brand-Kampagne vs Performance-Kampagne (verschiedene DSPs)",
"Saisonal: Black Friday/Super Bowl nutzt mehr Kanaele",
"Markenspezifisch: BMW M-Modelle anders targetet als 1er-Serie",
],
"cdn": [
"Multi-CDN-Strategie fuer Ausfallsicherheit (Akamai + Cloudflare)",
"Event-CDN-Spike (Auto-Show, Modell-Launch) braucht Skalierung",
"Regionale Latenz-Optimierung (Akamai APAC, AWS US)",
],
"marketing_automation": [
"Salesforce Marketing Cloud fuer B2C, Adobe Campaign fuer B2B",
"Lead-Generierung (Adobe) vs Loyalitaet (Salesforce)",
],
"monitoring": [
"APM (Dynatrace) misst Backend, RUM (SpeedCurve) misst Frontend",
],
"captcha": [
"Stufenweise Migration zu cookieless Captcha",
],
}
def _company_tier_bounds(company_tier: str | None) -> tuple[float, float]:
"""Fraction of the list-price range to actually use, depending on the
company tier. For 'enterprise' / 'premier' we use the UPPER part of the
range (40-85% and 70-100% respectively) rather than the whole
starter-to-premier span.
"""
t = (company_tier or "professional").lower()
if t == "premier": return (0.70, 1.00)
if t == "enterprise": return (0.40, 0.85)
if t == "professional": return (0.20, 0.60)
return (0.05, 0.40) # 'sme' / starter
def _estimate_savings_for_redundancy(
redundancy: dict, vendors: Iterable[dict],
company_tier: str = "enterprise",
) -> dict:
"""Estimated range per redundancy: current costs + EU consolidation saving.
Takes the company_tier into account: for a corporation like BMW we do not
want to include the starter range. The realistic range is obtained by
interpolating within (low, high) using the tier bounds.
"""
low_frac, high_frac = _company_tier_bounds(company_tier)
current_low = current_high = 0
matched_vendors = []
cat_vendors = [v for v in vendors if v.get("name") in redundancy.get("vendors", [])]
for v in cat_vendors:
name = (v.get("name") or "").lower()
for k, (lo, hi, _tier) in _COST_LOOKUP.items():
if k in name:
# Tier-aware: nimm low_frac..high_frac des Pricing-Bereichs
span = hi - lo
current_low += int(lo + span * low_frac)
current_high += int(lo + span * high_frac)
matched_vendors.append(v.get("name"))
break
# Konsolidierung: ein einziges EU-Tool ersetzt alle in der Kategorie
suggested_eu = None
suggested_low = suggested_high = 0
# 1. Multi-Funktions-Tool das diese Kategorie abdeckt
for tool in _MULTI_FUNCTION_TOOLS:
if redundancy["category"] in tool["covers"]:
suggested_eu = tool["name"]
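# Exact match first; fall back to a prefix match because some tool names
# carry a suffix that the cost table lacks (e.g. "Userlike Suite" vs
# "Userlike"). Without the fallback the suggested cost stays 0 and the
# saving is overstated.
cost = _EU_ALT_COSTS.get(tool["name"]) or next(
(c for k, c in _EU_ALT_COSTS.items() if tool["name"].startswith(k)), None)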
if cost:
suggested_low, suggested_high = cost
break
# 2. Otherwise: EU alternative from the lookup entries, BUT ONLY FOR VENDORS
# IN THE CURRENT CATEGORY (otherwise Userlike would be suggested for advertising)
if not suggested_eu:
for v in cat_vendors:
n = (v.get("name") or "").lower()
for k, alts in _EU_ALTERNATIVES.items():
if k in n and alts:
suggested_eu = alts[0]["name"]
cost = _EU_ALT_COSTS.get(alts[0]["name"])
if cost:
suggested_low, suggested_high = cost
break
if suggested_eu:
break
saving_low = max(0, current_low - suggested_high)
saving_high = max(0, current_high - suggested_low)
return {
"current_estimate_year_eur": [current_low, current_high],
"suggested_eu_tool": suggested_eu,
"suggested_estimate_year_eur": [suggested_low, suggested_high],
"estimated_saving_year_eur": [saving_low, saving_high],
"caveats": _DUPLICATION_CAVEATS.get(redundancy["category"], []),
"cost_disclaimer": (
"Schaetzbereich auf Basis oeffentlicher Listenpreise. Tatsaechliche "
"Vertragspreise koennen 30-70% niedriger liegen (Volumen, Bundling, "
"Konzern-Konditionen). Bitte mit der jeweiligen Einkaufsabteilung verifizieren."
),
}
# ─── Multi-Funktions-Tools (Konsolidierungs-Ankerpunkte) ───────────
_MULTI_FUNCTION_TOOLS = [
{
"name": "Matomo (Pro / Cloud EU)",
"vendor": "InnoCraft",
"country": "DE-self-host / EU",
"covers": ["web_analytics", "tag_management", "personalisation"],
"notes": "Ersetzt Adobe Analytics + GTM + Adobe Target in einem Tool. "
"100% DSGVO ohne Einwilligung wenn IP anonymisiert.",
},
{
"name": "SAP Customer Experience Suite",
"vendor": "SAP SE",
"country": "DE",
"covers": ["crm", "marketing_automation", "personalisation", "survey"],
"notes": "Ersetzt Salesforce + Adobe Campaign + Qualtrics. EU-Hosting, "
"tiefe ERP-Integration.",
},
{
"name": "IONOS Cloud (Compute + CDN + Storage + DNS)",
"vendor": "IONOS SE",
"country": "DE",
"covers": ["cloud_infra", "cdn", "monitoring"],
"notes": "Ersetzt AWS + Akamai + zusaetzliches Monitoring in einer "
"DE-Cloud (BSI C5).",
},
{
"name": "Userlike Suite",
"vendor": "Userlike UG",
"country": "DE",
"covers": ["chat", "consent_management"],
"notes": "Ersetzt Genesys Chat. Bietet eigenes Consent-Modul.",
},
{
"name": "Smart AdServer (Equativ)",
"vendor": "Equativ",
"country": "FR",
"covers": ["advertising"],
"notes": "Ersetzt Mehrfach-DSPs (Adform/Criteo/Outbrain/Taboola/Meta) "
"durch Programmatic+Direct-Sold EU-Stack.",
},
{
"name": "HERE Maps",
"vendor": "HERE Technologies",
"country": "DE",
"covers": ["maps"],
"notes": "Berliner Anbieter, professionelle Karten + Routing.",
},
{
"name": "Vimeo Pro EU (oder self-hosted BunnyStream)",
"vendor": "Vimeo / BunnyWay",
"country": "Multi / SI",
"covers": ["external_media"],
"notes": "Ersetzt YouTube-Embeds + JW Player in einem Player.",
},
{
"name": "LamaPoll",
"vendor": "Lamano GmbH",
"country": "DE",
"covers": ["survey"],
"notes": "DSGVO-Surveys aus Berlin. Ersetzt Qualtrics / Psyma.",
},
]
# ─── Analyse ─────────────────────────────────────────────────────────
def analyze(vendors: Iterable[dict], company_tier: str = "enterprise") -> dict:
"""Main entry. Returns categorised view + redundancies + EU options.
`company_tier` (starter|professional|enterprise|premier) steuert die
Cost-Range so dass z.B. fuer einen DAX-Konzern nicht starter-Preise
in der unteren Schranke landen.
"""
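# Materialize once: `vendors` is iterated several times below, and a
# generator would be exhausted after the first pass.
vendors = list(vendors)
by_cat: dict[str, list[dict]] = defaultdict(list)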
for v in vendors:
cat = classify_vendor(v.get("name", ""))
by_cat[cat].append(v)
# Redundancies: any category with ≥2 vendors (excl. site-internal cats)
skip_redundancy_cats = {"site_infra", "site_feature", "consent_management",
"auth", "other"}
all_vendors_list = list(vendors)
redundancies: list[dict] = []
for cat, vs in by_cat.items():
if cat in skip_redundancy_cats or len(vs) < 2:
continue
red = {
"category": cat,
"category_label": _CATEGORY_LABEL.get(cat, cat),
"count": len(vs),
"vendors": [v.get("name", "") for v in vs],
"consolidation_hint": _CONSOLIDATION_HINT.get(cat, ""),
}
red.update(_estimate_savings_for_redundancy(
red, all_vendors_list, company_tier))
redundancies.append(red)
redundancies.sort(key=lambda r: -(r.get("estimated_saving_year_eur") or [0, 0])[1])
# EU alternatives lookup
eu_alternatives: list[dict] = []
seen = set()
for v in vendors:
name = v.get("name") or ""
n_lower = name.lower()
for k, alts in _EU_ALTERNATIVES.items():
if k in n_lower and k not in seen:
eu_alternatives.append({
"current_vendor": name,
"current_recipient_type": v.get("recipient_type", ""),
"matched_key": k,
"alternatives": alts,
})
seen.add(k)
break
# Multi-function tool recommendations: only if the customer has vendors
# across the categories the tool covers
present_cats = set(by_cat.keys())
multi_function = []
for tool in _MULTI_FUNCTION_TOOLS:
covered_here = [c for c in tool["covers"] if c in present_cats]
if len(covered_here) >= 2:
# Vendor-Namen sammeln statt nur summieren — dedupliziert
unique_vendors: set[str] = set()
for c in covered_here:
for v in by_cat[c]:
unique_vendors.add(v.get("name", ""))
multi_function.append({
**tool,
"replaces_categories": covered_here,
"potential_replacements": len(unique_vendors),
})
multi_function.sort(key=lambda t: -t["potential_replacements"])
total_current_low = sum((r.get("current_estimate_year_eur") or [0, 0])[0] for r in redundancies)
total_current_high = sum((r.get("current_estimate_year_eur") or [0, 0])[1] for r in redundancies)
total_saving_low = sum((r.get("estimated_saving_year_eur") or [0, 0])[0] for r in redundancies)
total_saving_high = sum((r.get("estimated_saving_year_eur") or [0, 0])[1] for r in redundancies)
return {
"summary": {
"total_vendors": len(all_vendors_list),
"distinct_categories": len([c for c in by_cat if c != "other"]),
"redundancy_count": len(redundancies),
"eu_alternative_count": len(eu_alternatives),
"consolidation_potential": sum(r["count"] - 1 for r in redundancies),
"estimated_current_year_eur": [total_current_low, total_current_high],
"estimated_saving_year_eur": [total_saving_low, total_saving_high],
"estimated_saving_pct": (
# Beide Bounds gegen denselben Nenner (Mittelwert der
# aktuellen Schaetzung) — sonst explodiert die obere
# Schranke wenn current_low klein ist. Cap auf 95%.
(lambda mid: (
f"{min(95, int(100 * total_saving_low / mid))}-"
f"{min(95, int(100 * total_saving_high / mid))}%"
))((total_current_low + total_current_high) / 2)
if total_current_high else "n/a"
),
"cost_disclaimer": (
"Schaetzbereich auf Basis oeffentlicher Listenpreise (Gartner, Forrester 2025). "
"Vertragspreise koennen 30-70% niedriger liegen (Volumen-Rabatte, Konzern-Konditionen, "
"Bundling). Werte dienen als Diskussionsgrundlage mit dem Einkauf, NICHT als Angebot."
),
},
"by_category": {cat: [v.get("name", "") for v in vs]
for cat, vs in by_cat.items()},
"redundancies": redundancies,
"eu_alternatives": eu_alternatives,
"multi_function_tools": multi_function,
}
_CATEGORY_LABEL = {
"web_analytics": "Web-Analytics",
"advertising": "Werbung / Retargeting",
"tag_management": "Tag-Management",
"marketing_automation": "Marketing-Automation",
"personalisation": "Personalisierung",
"external_media": "Externe Medien (Video)",
"maps": "Karten / Geo",
"cdn": "CDN",
"cloud_infra": "Cloud-Infrastruktur",
"monitoring": "Performance-Monitoring",
"crm": "CRM",
"chat": "Chat / Support",
"captcha": "Bot-Schutz",
"lead_tracking": "Lead-Tracking",
"survey": "Umfragen",
"social_aggregator": "Social-Media-Aggregation",
"consent_management": "Consent-Management",
"auth": "Authentifizierung",
"site_infra": "Eigene Infrastruktur",
"site_feature": "Eigene Features",
"other": "Sonstige",
}
_CONSOLIDATION_HINT = {
"web_analytics": "Mehrere Analytics-Tools sammeln meist redundante Daten. Ein Tool genuegt — Matomo (DE) ist DSGVO-Standard.",
"advertising": "Werbe-/Retargeting-Pixel sind oft austauschbar. Konzentration auf 2-3 Kanaele senkt Drittland-Risiko.",
"external_media": "Mehrere Video-Embeds nur wenn fachlich noetig. Self-hosted (BunnyStream/Vimeo) reduziert Tracking.",
"maps": "Eine Karten-Loesung reicht. HERE Maps (DE) als EU-Alternative zu Google Maps.",
"cdn": "Ein CDN+Performance-Stack genuegt. IONOS oder Bunny vereinen mehrere Funktionen.",
"marketing_automation": "Marketing-Cloud + separates E-Mail-Tool sind oft Dopplung — SAP CX oder CleverReach allein moeglich.",
"chat": "Ein Chat-System genuegt. Userlike (DE) ersetzt Genesys-Stack.",
"monitoring": "RUM + APM koennen in einem Tool gebuendelt werden (Dynatrace EU oder Sentry-Self-host).",
"survey": "Eine Survey-Plattform genuegt — LamaPoll (DE) oder Mapp.",
}
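Review note: a minimal sketch of the cost arithmetic used by `_estimate_savings_for_redundancy` and by the summary percentage, assuming the interpolation `lo + (hi - lo) * frac` and the 95% cap shown in the diff. The helper names `tier_range`, `saving_interval` and `saving_pct` are hypothetical and not part of the commit; the sample numbers are the `_COST_LOOKUP` entries for Adobe Analytics and Google Analytics and the `_EU_ALT_COSTS` entry for "Matomo (Pro / Cloud EU)".

```python
def tier_range(price: tuple[int, int], bounds: tuple[float, float]) -> tuple[int, int]:
    """Interpolate inside a (low, high) list-price range using tier fractions."""
    lo, hi = price
    span = hi - lo
    return int(lo + span * bounds[0]), int(lo + span * bounds[1])


def saving_interval(current: tuple[int, int], suggested: tuple[int, int]) -> tuple[int, int]:
    """Pessimistic bound: cheapest current vs. most expensive replacement,
    optimistic bound: the reverse. Negative savings are clamped to 0."""
    return (max(0, current[0] - suggested[1]),
            max(0, current[1] - suggested[0]))


def saving_pct(current: tuple[int, int], saving: tuple[int, int], cap: int = 95) -> str:
    """Both bounds are divided by the midpoint of the current estimate and capped."""
    mid = (current[0] + current[1]) / 2
    if not mid:
        return "n/a"
    low = min(cap, int(100 * saving[0] / mid))
    high = min(cap, int(100 * saving[1] / mid))
    return f"{low}-{high}%"


if __name__ == "__main__":
    enterprise = (0.40, 0.85)  # fractions used for company_tier == "enterprise"
    adobe = tier_range((120_000, 600_000), enterprise)
    ga = tier_range((0, 150_000), enterprise)
    current = (adobe[0] + ga[0], adobe[1] + ga[1])
    saving = saving_interval(current, (6_000, 30_000))
    print(current, saving, saving_pct(current, saving))
```

Dividing both bounds by the same midpoint keeps the upper percentage from exploding when the lower current estimate is small, which is the reason the diff gives for this choice.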
@@ -0,0 +1,229 @@
"""
LLM audit: checks for every text MC whether it really belongs, in substance,
to its assigned doc_type. The BMW run showed, for example, that most
Impressum MCs are actually DSA/e-commerce obligations that do not appear in
a §5 TMG Impressum at all.
Output:
- doc_type fits → MC stays active (no DB update)
- doc_type does NOT fit → MC is marked as misclassified (fits_doc_type = 0
in the sidecar); rag_document_checker then filters it out
Sidecar SQLite. NO Postgres change (schema frozen).
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5
DOC_TYPE_DESCRIPTIONS = {
"agb": "Allgemeine Geschaeftsbedingungen — Vertragsbedingungen "
"zwischen Anbieter und Kunde",
"avv": "Auftragsverarbeitungsvertrag Art. 28 DSGVO — Vertrag zwischen "
"Verantwortlichem und Auftragsverarbeiter",
"cookie": "Cookie-Richtlinie — Information ueber Zwecke, Anbieter, "
"Speicherdauer und Widerruf von Cookies/Tracking-Pixeln",
"dse": "Datenschutzerklaerung Art. 13 DSGVO — Informationen ueber "
"Verantwortlichen, Zwecke, Rechtsgrundlagen, Empfaenger, "
"Speicherdauer, Betroffenenrechte, Aufsichtsbehoerde",
"dsfa": "Datenschutz-Folgenabschaetzung Art. 35 DSGVO — Risikobewertung "
"von Verarbeitungen mit hohem Risiko",
"impressum": "Impressum §5 TMG / §18 MStV — Anbieterkennzeichnung mit "
"Name, Anschrift, Vertretungsberechtigten, Handelsregister, "
"USt-IdNr., berufsrechtliche Angaben, Aufsicht",
"loeschkonzept": "Loeschkonzept DIN 66398 — Definition Aufbewahrungs- "
"und Loeschfristen pro Datenkategorie + Prozess",
"widerruf": "Widerrufsbelehrung §312 BGB — Verbraucher-Widerrufsrecht "
"bei Fernabsatz, Frist, Folgen, Muster",
}
SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung.
Fuer jeden MC bekommst du:
- den ihm zugeordneten doc_type (z.B. 'impressum', 'cookie', 'dse')
- den Titel und die check_question
Deine Aufgabe: entscheide ob der MC fachlich wirklich zu diesem doc_type passt.
Beispiele:
- MC "Wird die USt-IdNr im Impressum genannt?" + doc_type=impressum → PASST
- MC "Monatlich aktive Endnutzer fuer Online-Vermittlungsdienste messen" + doc_type=impressum → PASST NICHT
(DSA-Pflicht, gehoert nicht in ein klassisches §5-TMG-Impressum)
- MC "Mithoeren von Funknachrichten verhindern" + doc_type=cookie → PASST NICHT
(TKG-Spezialthema, nicht Cookie-Richtlinie)
Antworte als JSON-Array, eine Zeile pro MC:
[{"control_id": "<wie input>", "doc_type": "<wie input>", "fits": true|false,
"rationale": "ein kurzer satz"}, ...]
Kein Markdown."""
def fetch_pairs_to_audit(conn) -> list[dict]:
"""All text-MCs that haven't been audited yet (fits_doc_type is still NULL)."""
with sqlite3.connect(SIDECAR_DB) as side:
cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
if "fits_doc_type" not in cols:
side.execute("ALTER TABLE mc_classification ADD COLUMN fits_doc_type INTEGER")
side.commit()
already = set()
for cid, dt in side.execute(
"SELECT control_id, doc_type FROM mc_classification "
"WHERE fits_doc_type IS NOT NULL"
):
already.add((cid, dt or ""))
with conn.cursor(cursor_factory=RealDictCursor) as c:
c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
FROM compliance.doc_check_controls dc""")
all_rows = list(c.fetchall())
# Audit only those classified as 'text' in sidecar — process/review
# never run through doc_check anyway
with sqlite3.connect(SIDECAR_DB) as side:
text_pairs = set()
for cid, dt in side.execute(
"SELECT control_id, doc_type FROM mc_classification "
"WHERE check_type = 'text'"
):
text_pairs.add((cid, dt or ""))
target = [r for r in all_rows
if (r["control_id"], r["doc_type"] or "") in text_pairs
and (r["control_id"], r["doc_type"] or "") not in already]
return target
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
payload = {
"model": MODEL,
"max_tokens": 4000,
"system": SYSTEM_PROMPT,
"messages": [{
"role": "user",
"content": (
"Doc-Typen-Beschreibungen:\n"
+ "\n".join(f"- {k}: {v}" for k, v in DOC_TYPE_DESCRIPTIONS.items())
+ "\n\nPruefe folgende MCs:\n\n"
+ json.dumps([
{"control_id": m["control_id"], "doc_type": m["doc_type"],
"title": m["title"], "check_question": (m["check_question"] or "")[:300]}
for m in batch
], ensure_ascii=False, indent=2)
),
}],
}
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
r.raise_for_status()
txt = r.json()["content"][0]["text"].strip()
if txt.startswith("```"):
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
if txt.startswith("json"):
txt = txt[4:].strip()
return json.loads(txt)
def store_audit(rows: list[dict]) -> None:
ts = datetime.now(timezone.utc).isoformat()
with sqlite3.connect(SIDECAR_DB) as c:
c.executemany(
"UPDATE mc_classification SET fits_doc_type = ?, "
"rationale = COALESCE(?, rationale), classified_at = ? "
"WHERE control_id = ? AND doc_type = ?",
[
(
1 if r.get("fits") else 0,
(r.get("rationale") or "")[:500] or None,
ts,
r.get("control_id"),
r.get("doc_type") or "",
)
for r in rows
],
)
c.commit()
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--sample", action="store_true")
args = ap.parse_args()
api_key = os.environ["ANTHROPIC_API_KEY"]
conn = psycopg2.connect(os.environ["DATABASE_URL"])
pairs = fetch_pairs_to_audit(conn)
if args.sample:
for m in pairs[:5]:
print(json.dumps(m, ensure_ascii=False, indent=2))
print(f"\nTotal pairs to audit: {len(pairs)}")
return
print(f"Audit {len(pairs)} text-MCs in Batches von {BATCH_SIZE}", flush=True)
if not pairs:
print("Alles auditiert.")
return
done = 0
failed_batches = 0
t0 = time.time()
for i in range(0, len(pairs), BATCH_SIZE):
batch = pairs[i:i + BATCH_SIZE]
try:
out = call_claude(api_key, batch)
store_audit(out)
done += len(out)
elapsed = time.time() - t0
rate = done / max(elapsed, 0.01)
eta = (len(pairs) - done) / max(rate, 0.01)
print(f" [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
except Exception as e:
failed_batches += 1
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
if failed_batches >= 5:
print("Zu viele Fehler — abbrechen.", file=sys.stderr)
break
time.sleep(SLEEP_BETWEEN_BATCHES)
print(f"\nFertig: {done}/{len(pairs)} auditiert ({failed_batches} Fehlbatches).")
with sqlite3.connect(SIDECAR_DB) as c:
c.row_factory = sqlite3.Row
rows = c.execute(
"SELECT doc_type, "
" SUM(CASE WHEN fits_doc_type = 1 THEN 1 ELSE 0 END) AS fits, "
" SUM(CASE WHEN fits_doc_type = 0 THEN 1 ELSE 0 END) AS misfits, "
" COUNT(*) AS total "
"FROM mc_classification "
"WHERE check_type = 'text' AND fits_doc_type IS NOT NULL "
"GROUP BY doc_type ORDER BY doc_type"
).fetchall()
print("\n=== Audit-Verteilung doc_type x fits ===")
for r in rows:
print(f" {r['doc_type']:<14} fits={r['fits']:<4} misfits={r['misfits']:<4} total={r['total']}")
if __name__ == "__main__":
main()
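Review note: `call_claude` strips an optional Markdown code fence from the model reply before calling `json.loads`. A minimal sketch of that step in isolation, assuming the reply is either bare JSON or JSON wrapped in a single fence. `parse_json_reply` is a hypothetical helper name, not part of the commit. It uses `str.partition` instead of `split("\n", 1)[1]`, so a reply that consists of a fence line only ends in a JSON decode error rather than an `IndexError`.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(txt: str):
    """Parse a model reply that may be wrapped in a Markdown code fence."""
    txt = txt.strip()
    if txt.startswith(FENCE):
        # Drop the opening fence line, including an optional language tag.
        _, _, txt = txt.partition("\n")
        # Drop the closing fence and anything after it.
        txt = txt.rsplit(FENCE, 1)[0].strip()
    return json.loads(txt)


if __name__ == "__main__":
    reply = FENCE + 'json\n[{"control_id": "X-1", "fits": true}]\n' + FENCE
    print(parse_json_reply(reply))
```

Because the whole opening line is discarded, a separate check for a leading `json` tag is not needed in this variant.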
@@ -0,0 +1,216 @@
"""
A3-v2: sharpen the detection of MCs that actually target the CMP banner UI
or a process rather than the document TEXT.
The BMW run showed ~10 'clear/simple language' and 'button text wording' MCs
that the first audit marked as fitting their doc_type. However, they cannot
be checked against the cookie-policy or DSE text: they ask about the
comprehensibility of the consent UI.
Sets 'scope_requires' for FRT/biometric MCs and reclassifies these UI items
to check_type='process' (which filters them out of doc_check).
In addition, sets scope_requires for obviously conditional MCs:
- 'biometric_processing' for FRT/facial recognition
- 'ai_decision_making' for automated individual decisions
- 'child_targeting' for child-consent MCs
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 20
SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung
zweiter Ordnung: jeder MC wurde bereits als 'text' klassifiziert und einem
doc_type zugeordnet. Du entscheidest:
A) Prueft der MC den TEXT des Dokuments (z.B. "Wird im Impressum die
USt-IdNr genannt?") → fits_doc_type=true, ui_only=false
B) Prueft der MC die UI/UX eines Banners oder Buttons (z.B.
"Ist der Ablehnungs-Button gleich gross wie Akzeptieren?", "Ist die
Einwilligungsanfrage in einfacher Sprache formuliert?") → ui_only=true
(Diese MCs koennen NIE im Dokumenttext stehen, weil sie sich auf eine
externe UI beziehen.)
Zusaetzlich kennzeichne SCOPE-Bedingungen, wenn der MC nur fuer bestimmte
Sites relevant ist:
- 'biometric_processing' : nur bei Sites die biometrische Daten
(Gesichtserkennung/FRT, Fingerabdruecke) verarbeiten
- 'ai_decision_making' : nur bei automatisierten Einzelentscheidungen
(Art. 22 DSGVO)
- 'child_targeting' : nur bei Sites die sich an Kinder richten
- 'ecommerce' : nur bei Webshops
- 'b2c' : nur fuer Verbraucher-Geschaeft (UWG/BGB-Pflichten)
Wenn der MC fuer ALLE Sites gilt, setze scope_requires=null.
Antworte als JSON-Array, keine Erklaerung davor/danach, kein Markdown.
Format:
[{"control_id": "<wie input>", "doc_type": "<wie input>",
"ui_only": true|false, "scope_requires": "biometric_processing|ai_decision_making|child_targeting|ecommerce|b2c|null",
"rationale": "ein kurzer satz"}, ...]"""
def fetch_pairs_to_audit(conn) -> list[dict]:
    """Return text-MCs that have not been v2-audited yet (ui_only IS NULL)."""
    side = sqlite3.connect(SIDECAR_DB)
    try:
        # Add the v2 audit columns on first run.
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        if "ui_only" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN ui_only INTEGER")
        if "scope_requires" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN scope_requires TEXT")
        side.commit()
        # Audit only pairs classified as text that fit their doc_type (or were
        # never fits-audited) and that have no ui_only verdict yet.
        eligible = {
            (cid, dt or "")
            for cid, dt in side.execute(
                "SELECT control_id, doc_type FROM mc_classification "
                "WHERE check_type = 'text' "
                "AND (fits_doc_type IS NULL OR fits_doc_type = 1) "
                "AND ui_only IS NULL"
            )
        }
    finally:
        # `with sqlite3.connect(...)` only manages the transaction, so close explicitly.
        side.close()
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc""")
        all_rows = list(c.fetchall())
    return [r for r in all_rows
            if (r["control_id"], r["doc_type"] or "") in eligible]
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
payload = {
"model": MODEL,
"max_tokens": 4000,
"system": SYSTEM_PROMPT,
"messages": [{
"role": "user",
"content": "Pruefe folgende MCs:\n\n" + json.dumps([
{"control_id": m["control_id"], "doc_type": m["doc_type"],
"title": m["title"], "check_question": (m["check_question"] or "")[:300]}
for m in batch
], ensure_ascii=False, indent=2),
}],
}
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
r.raise_for_status()
txt = r.json()["content"][0]["text"].strip()
if txt.startswith("```"):
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
if txt.startswith("json"):
txt = txt[4:].strip()
return json.loads(txt)
def store(rows: list[dict]) -> None:
    """Persist ui_only/scope_requires and demote ui_only MCs to check_type='process'."""
    def scope_of(r: dict) -> str | None:
        # The model may answer with the literal string "null" for "applies to all sites".
        value = str(r.get("scope_requires") or "").strip()
        return None if value.lower() in ("", "null") else value

    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET ui_only = ?, scope_requires = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [(1 if r.get("ui_only") else 0, scope_of(r),
              r.get("control_id"), r.get("doc_type") or "")
             for r in rows],
        )
        # MCs flagged ui_only become check_type='process' so doc_check skips them.
        c.executemany(
            "UPDATE mc_classification SET check_type = 'process' "
            "WHERE ui_only = 1 AND control_id = ? AND doc_type = ?",
            [(r.get("control_id"), r.get("doc_type") or "")
             for r in rows if r.get("ui_only")],
        )
        c.commit()
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--sample", action="store_true")
args = ap.parse_args()
api_key = os.environ["ANTHROPIC_API_KEY"]
conn = psycopg2.connect(os.environ["DATABASE_URL"])
pairs = fetch_pairs_to_audit(conn)
if args.sample:
for m in pairs[:5]:
print(json.dumps(m, ensure_ascii=False, indent=2))
print(f"\nTotal: {len(pairs)}")
return
print(f"Audit-v2: {len(pairs)} MCs in Batches von {BATCH_SIZE}", flush=True)
if not pairs:
print("Alles geprueft.")
return
done = 0
fail = 0
t0 = time.time()
for i in range(0, len(pairs), BATCH_SIZE):
batch = pairs[i:i + BATCH_SIZE]
try:
out = call_claude(api_key, batch)
store(out)
done += len(out)
elapsed = time.time() - t0
rate = done / max(elapsed, 0.01)
eta = (len(pairs) - done) / max(rate, 0.01)
print(f" [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
except Exception as e:
fail += 1
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", file=sys.stderr, flush=True)
if fail >= 5: break
time.sleep(0.5)
print(f"\nFertig: {done}/{len(pairs)} ({fail} Fehlbatches).")
with sqlite3.connect(SIDECAR_DB) as c:
ui = c.execute("SELECT COUNT(*) FROM mc_classification WHERE ui_only=1").fetchone()[0]
scope = c.execute(
"SELECT scope_requires, COUNT(*) FROM mc_classification "
"WHERE scope_requires IS NOT NULL GROUP BY scope_requires"
).fetchall()
print(f"\nui_only=1: {ui} MCs (jetzt 'process')")
print("scope_requires Verteilung:")
for s, n in scope:
print(f" {s}: {n}")
if __name__ == "__main__":
main()
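Note on `call_claude` and `store` above: both depend on the model returning a clean JSON array and a well-formed `scope_requires` value. A more defensive variant could look like the sketch below. `parse_json_array` and `normalize_scope` are hypothetical helpers, not part of the commit, and the fallback to the outermost `[...]` span is an assumption about how replies may deviate from the requested format.

```python
from __future__ import annotations

import json


def parse_json_array(reply: str) -> list:
    """Extract a JSON array from a model reply that may be wrapped in a code fence."""
    txt = reply.strip()
    if txt.startswith("```"):
        # Drop the opening fence line (with optional language tag) and the closing fence.
        txt = txt.split("\n", 1)[1] if "\n" in txt else ""
        txt = txt.rsplit("```", 1)[0].strip()
    # Fall back to the outermost [...] span if prose surrounds the array.
    start, end = txt.find("["), txt.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    data = json.loads(txt[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("reply is not a JSON array")
    return data


def normalize_scope(value) -> str | None:
    """Map '', 'null' and None (any case or padding) to None, else the trimmed string."""
    text = str(value or "").strip()
    return None if text.lower() in ("", "null") else text
```

A reply that cannot be parsed raises, which the batch loop in `main()` already counts as a failed batch.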
@@ -0,0 +1,222 @@
"""
Classify doc_check_controls (1874 MCs) into check_type:
- text : MC asks about the DOCUMENT'S WORDING ("enthaelt die DSE..", "wird im Impressum genannt..")
- process : MC asks about a TECHNICAL OR ORGANISATIONAL CONTROL ("ist sichergestellt..", "ist implementiert..")
- review : MC asks an ABSTRACT META-QUESTION that needs human judgment ("sind alle Verarbeitungen vollstaendig?")
Output goes to sidecar SQLite at /data/mc_classification.db (no Postgres schema change
per CLAUDE.md guardrails). Schema:
CREATE TABLE mc_classification (
control_id TEXT PRIMARY KEY,
doc_type TEXT,
title TEXT,
check_type TEXT, -- text|process|review
confidence REAL, -- 0..1
rationale TEXT,
classified_at TEXT
);
Run from inside bp-compliance-backend container:
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type.py [--limit N] [--sample]
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5 # sec — keep gentle for the parallel Haiku batch
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
TEXT Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments (DSE, Impressum, Cookie-Richtlinie etc.) steht.
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?" / "Wird im Impressum die USt-IdNr genannt?"
Diese MCs koennen gegen den Dokument-Text gematched werden.
PROCESS Der MC prueft eine technische oder organisatorische MASSNAHME, die ausserhalb des Dokumenttexts liegt.
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" /
"Ist ein Loeschkonzept implementiert?" / "Wird das Consent-Tool monatlich getestet?"
Diese MCs lassen sich NICHT aus dem Dokumenttext beantworten; sie brauchen Evidence/TOM-Nachweis.
REVIEW Der MC stellt eine abstrakte Vollstaendigkeits-/Konsistenzfrage, die menschliche Beurteilung braucht.
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?" / "Stimmen Cookie-Richtlinie und Banner ueberein?"
Diese MCs sind Checklisten-Items fuer den DSB, kein automatischer Check moeglich.
Antworte ausschliesslich als JSON-Array, keine Erklaerung davor/danach, kein Markdown-Codeblock. Format:
[{"control_id": "<wie input>", "check_type": "text|process|review", "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
def fetch_mcs(conn, limit: int | None = None, only_unclassified: bool = True) -> list[dict]:
    """Load MCs from Postgres, optionally skipping those already in the sidecar."""
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls
                     ORDER BY doc_type, title""")
        rows = list(c.fetchall())
    if only_unclassified:
        # Filter in Python instead of via a Postgres temp table: a failed
        # CREATE TEMP TABLE was silently swallowed before, which left the
        # transaction aborted and made the following SELECT fail.
        with sqlite3.connect(SIDECAR_DB) as side:
            classified = {r[0] for r in side.execute("SELECT control_id FROM mc_classification")}
        rows = [r for r in rows if r["control_id"] not in classified]
    return rows[:limit] if limit else rows
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
payload = {
"model": MODEL,
"max_tokens": 4000,
"system": SYSTEM_PROMPT,
"messages": [{
"role": "user",
"content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
[{"control_id": m["control_id"],
"doc_type": m["doc_type"],
"title": m["title"],
"check_question": (m["check_question"] or "")[:400]}
for m in batch],
ensure_ascii=False, indent=2),
}],
}
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
r.raise_for_status()
txt = r.json()["content"][0]["text"].strip()
# Strip code fences if Sonnet adds them
if txt.startswith("```"):
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
if txt.startswith("json"):
txt = txt[4:].strip()
return json.loads(txt)
def ensure_sidecar() -> None:
Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(SIDECAR_DB) as c:
c.executescript("""
CREATE TABLE IF NOT EXISTS mc_classification (
control_id TEXT PRIMARY KEY,
doc_type TEXT,
title TEXT,
check_type TEXT,
confidence REAL,
rationale TEXT,
classified_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_doctype ON mc_classification(doc_type);
CREATE INDEX IF NOT EXISTS idx_type ON mc_classification(check_type);
""")
def store_results(rows: list[dict], lookup: dict[str, dict]) -> None:
ts = datetime.now(timezone.utc).isoformat()
with sqlite3.connect(SIDECAR_DB) as c:
c.executemany(
"INSERT OR REPLACE INTO mc_classification "
"(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
"VALUES (?, ?, ?, ?, ?, ?, ?)",
[
(
r.get("control_id"),
lookup.get(r.get("control_id"), {}).get("doc_type", ""),
lookup.get(r.get("control_id"), {}).get("title", ""),
(r.get("check_type") or "").lower(),
float(r.get("confidence") or 0),
(r.get("rationale") or "")[:500],
ts,
)
for r in rows
],
)
c.commit()
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--limit", type=int, default=None, help="limit number of MCs (for testing)")
ap.add_argument("--sample", action="store_true", help="just print first 5 MCs and exit")
ap.add_argument("--all", action="store_true", help="re-classify all (otherwise skips already classified)")
args = ap.parse_args()
ensure_sidecar()
api_key = os.environ["ANTHROPIC_API_KEY"]
conn = psycopg2.connect(os.environ["DATABASE_URL"])
mcs = fetch_mcs(conn, limit=args.limit, only_unclassified=not args.all)
if args.sample:
for m in mcs[:5]:
print(json.dumps(m, ensure_ascii=False, indent=2))
return
print(f"Klassifiziere {len(mcs)} MCs in Batches von {BATCH_SIZE}", flush=True)
if not mcs:
print("Nichts zu tun.")
return
lookup = {m["control_id"]: m for m in mcs}
total = len(mcs)
done = 0
failed_batches = 0
t0 = time.time()
for i in range(0, total, BATCH_SIZE):
batch = mcs[i:i + BATCH_SIZE]
try:
out = call_claude(api_key, batch)
store_results(out, lookup)
done += len(out)
elapsed = time.time() - t0
rate = done / max(elapsed, 0.01)
eta = (total - done) / max(rate, 0.01)
print(f" [{done:>5}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min",
flush=True)
except Exception as e:
failed_batches += 1
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
if failed_batches >= 5:
print(" Zu viele Fehler — abbrechen.", file=sys.stderr)
break
time.sleep(SLEEP_BETWEEN_BATCHES)
print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
# Summary
with sqlite3.connect(SIDECAR_DB) as c:
c.row_factory = sqlite3.Row
rows = c.execute(
"SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
"GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
).fetchall()
print("\n=== Verteilung nach doc_type x check_type ===")
prev = None
for r in rows:
if r["doc_type"] != prev:
print(); print(f"[{r['doc_type']}]")
prev = r["doc_type"]
print(f" {r['check_type']:<8} {r['n']}")
if __name__ == "__main__":
main()
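Note on `store_results` above: it writes whatever `check_type` the model returns, so a typo or an invented class would end up in the sidecar unnoticed. A minimal validation step, assuming the three classes from the system prompt are the only valid values, might look like this. `ALLOWED_CHECK_TYPES` and `validate_rows` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

ALLOWED_CHECK_TYPES = {"text", "process", "review"}


def validate_rows(rows: list[dict], expected_ids: set[str]) -> tuple[list[dict], list[dict]]:
    """Split model rows into (accepted, rejected).

    A row is accepted if its control_id was part of the request and its
    check_type is one of the three known classes. Confidence is clamped to 0..1.
    """
    accepted: list[dict] = []
    rejected: list[dict] = []
    for row in rows:
        check_type = str(row.get("check_type") or "").strip().lower()
        try:
            confidence = float(row.get("confidence") or 0)
        except (TypeError, ValueError):
            confidence = 0.0
        if row.get("control_id") in expected_ids and check_type in ALLOWED_CHECK_TYPES:
            accepted.append({**row, "check_type": check_type,
                             "confidence": min(max(confidence, 0.0), 1.0)})
        else:
            rejected.append(row)
    return accepted, rejected
```

Rejected rows stay unclassified in the sidecar, so the next run without `--all` would pick them up again.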
@@ -0,0 +1,241 @@
"""
v2: Re-classify only the (control_id, doc_type) PAIRS that aren't covered yet.
V1 used PK=control_id, so cross-doc-type variants (same control assigned
to e.g. AGB AND Widerruf with different check_questions) overwrote each
other. v2 migrates to PK=(control_id, doc_type) and classifies only the
~262 missing pairs.
Run from container:
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type_v2.py [--sample]
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
import httpx
import psycopg2
from psycopg2.extras import RealDictCursor
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
MODEL = "claude-sonnet-4-6"
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
BATCH_SIZE = 25
SLEEP_BETWEEN_BATCHES = 0.5
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
TEXT Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments steht.
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?"
Diese MCs koennen gegen den Dokument-Text gematched werden.
PROCESS Der MC prueft eine technische oder organisatorische MASSNAHME ausserhalb des Dokumenttexts.
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?"
REVIEW Der MC stellt eine abstrakte Vollstaendigkeitsfrage, die menschliche Beurteilung braucht.
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?"
Wichtig: derselbe Control kann fuer VERSCHIEDENE doc_types unterschiedlich klassifiziert sein;
mit doc_type-spezifisch angepasster check_question kann ein "text"-Check fuer ein Dok zum
"process"-Check fuer ein anderes werden.
Antworte ausschliesslich als JSON-Array, kein Markdown. Format:
[{"control_id": "<input>", "doc_type": "<input>", "check_type": "text|process|review",
"confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
def migrate_schema() -> None:
"""Migrate sidecar from PK=control_id to PK=(control_id, doc_type)."""
Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(SIDECAR_DB) as c:
# Check if v2 schema already in place (composite PK)
cols = c.execute("PRAGMA table_info(mc_classification)").fetchall()
if not cols:
# First run — create fresh
c.executescript("""
CREATE TABLE mc_classification (
control_id TEXT,
doc_type TEXT,
title TEXT,
check_type TEXT,
confidence REAL,
rationale TEXT,
classified_at TEXT,
PRIMARY KEY (control_id, doc_type)
);
CREATE INDEX idx_doctype ON mc_classification(doc_type);
CREATE INDEX idx_type ON mc_classification(check_type);
""")
return
# Check whether the existing table already has composite PK
pk_cols = [r[1] for r in cols if r[5] > 0]
if set(pk_cols) == {"control_id", "doc_type"}:
print("Schema already v2 (composite PK). Skipping migration.")
return
print("Migrating sidecar schema to PK(control_id, doc_type)...")
c.executescript("""
CREATE TABLE mc_classification_v2 (
control_id TEXT,
doc_type TEXT,
title TEXT,
check_type TEXT,
confidence REAL,
rationale TEXT,
classified_at TEXT,
PRIMARY KEY (control_id, doc_type)
);
INSERT INTO mc_classification_v2
(control_id, doc_type, title, check_type, confidence, rationale, classified_at)
SELECT control_id, COALESCE(doc_type, ''), title, check_type, confidence, rationale, classified_at
FROM mc_classification;
DROP TABLE mc_classification;
ALTER TABLE mc_classification_v2 RENAME TO mc_classification;
CREATE INDEX idx_doctype ON mc_classification(doc_type);
CREATE INDEX idx_type ON mc_classification(check_type);
""")
n = c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
print(f"Migrated {n} existing rows.")
def fetch_unclassified_pairs(conn) -> list[dict]:
"""All (control_id, doc_type) pairs in PG that aren't yet in sidecar."""
side_pairs: set[tuple[str, str]] = set()
with sqlite3.connect(SIDECAR_DB) as side:
for cid, dt in side.execute("SELECT control_id, doc_type FROM mc_classification"):
side_pairs.add((cid, dt or ""))
with conn.cursor(cursor_factory=RealDictCursor) as c:
c.execute("""SELECT control_id, doc_type, title, check_question
FROM compliance.doc_check_controls""")
all_rows = list(c.fetchall())
missing = [r for r in all_rows if (r["control_id"], r["doc_type"] or "") not in side_pairs]
return missing
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
payload = {
"model": MODEL,
"max_tokens": 4000,
"system": SYSTEM_PROMPT,
"messages": [{
"role": "user",
"content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
[{"control_id": m["control_id"],
"doc_type": m["doc_type"],
"title": m["title"],
"check_question": (m["check_question"] or "")[:400]}
for m in batch],
ensure_ascii=False, indent=2),
}],
}
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
r.raise_for_status()
txt = r.json()["content"][0]["text"].strip()
if txt.startswith("```"):
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
if txt.startswith("json"):
txt = txt[4:].strip()
return json.loads(txt)
def store_results(rows: list[dict], lookup: dict[tuple[str, str], dict]) -> None:
ts = datetime.now(timezone.utc).isoformat()
with sqlite3.connect(SIDECAR_DB) as c:
c.executemany(
"INSERT OR REPLACE INTO mc_classification "
"(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
"VALUES (?, ?, ?, ?, ?, ?, ?)",
[
(
r.get("control_id"),
r.get("doc_type") or "",
lookup.get((r.get("control_id"), r.get("doc_type") or ""), {}).get("title", ""),
(r.get("check_type") or "").lower(),
float(r.get("confidence") or 0),
(r.get("rationale") or "")[:500],
ts,
)
for r in rows
],
)
c.commit()
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--sample", action="store_true")
args = ap.parse_args()
migrate_schema()
api_key = os.environ["ANTHROPIC_API_KEY"]
conn = psycopg2.connect(os.environ["DATABASE_URL"])
missing = fetch_unclassified_pairs(conn)
if args.sample:
for m in missing[:5]:
print(json.dumps(m, ensure_ascii=False, indent=2))
print(f"\nTotal missing pairs: {len(missing)}")
return
print(f"Klassifiziere {len(missing)} fehlende (control_id, doc_type)-Paare in Batches von {BATCH_SIZE}", flush=True)
if not missing:
print("Alles klassifiziert. Nichts zu tun.")
return
lookup = {(m["control_id"], m["doc_type"] or ""): m for m in missing}
total = len(missing)
done = 0
failed_batches = 0
t0 = time.time()
for i in range(0, total, BATCH_SIZE):
batch = missing[i:i + BATCH_SIZE]
try:
out = call_claude(api_key, batch)
store_results(out, lookup)
done += len(out)
elapsed = time.time() - t0
rate = done / max(elapsed, 0.01)
eta = (total - done) / max(rate, 0.01)
print(f" [{done:>4}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
except Exception as e:
failed_batches += 1
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
if failed_batches >= 5:
print(" Zu viele Fehler — abbrechen.", file=sys.stderr)
break
time.sleep(SLEEP_BETWEEN_BATCHES)
print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
with sqlite3.connect(SIDECAR_DB) as c:
c.row_factory = sqlite3.Row
rows = c.execute(
"SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
"GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
).fetchall()
print("\n=== Final-Verteilung doc_type x check_type ===")
prev = None
for r in rows:
if r["doc_type"] != prev:
print(); print(f"[{r['doc_type']}]")
prev = r["doc_type"]
print(f" {r['check_type']:<8} {r['n']}")
if __name__ == "__main__":
main()
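Note on the v1 to v2 migration above: the overwrite problem described in the docstring can be reproduced with an in-memory SQLite database. The sketch below uses two throwaway tables `v1` and `v2` and a made-up control id `MC-1`; none of these names come from the commit.

```python
import sqlite3

# Same control assigned to two doc_types with different classifications.
rows = [
    ("MC-1", "agb", "text"),
    ("MC-1", "widerruf", "process"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE v1 (control_id TEXT PRIMARY KEY, doc_type TEXT, check_type TEXT)")
con.execute(
    "CREATE TABLE v2 (control_id TEXT, doc_type TEXT, check_type TEXT, "
    "PRIMARY KEY (control_id, doc_type))"
)
# INSERT OR REPLACE silently overwrites on a primary-key conflict.
con.executemany("INSERT OR REPLACE INTO v1 VALUES (?, ?, ?)", rows)
con.executemany("INSERT OR REPLACE INTO v2 VALUES (?, ?, ?)", rows)

v1_count = con.execute("SELECT COUNT(*) FROM v1").fetchone()[0]
v2_count = con.execute("SELECT COUNT(*) FROM v2").fetchone()[0]
# Column 5 of PRAGMA table_info is the 1-based position within the primary key.
pk_cols = [r[1] for r in con.execute("PRAGMA table_info(v2)") if r[5] > 0]

print(v1_count, v2_count, pk_cols)
```

With the single-column key only the last write per `control_id` survives, while the composite key keeps both pairs. The `pk_cols` check is the same test `migrate_schema` uses to detect an already migrated table.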