feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core Compliance Check
- Sonnet check_type classification: text/process/review for all 1874 MCs
  in compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check.
  Also adds a fits_doc_type audit (v2) and a ui_only audit for DSA/e-commerce
  MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered by business_profile (e.g. FRT is skipped for BMW).
- Embedding match (BGE-M3) as phase 3 after the regex match:
  per-doc_type threshold override (impressum 0.50, dse/cookie 0.60),
  short-field rescue (15-word chunks) for mandatory fields in the Impressum.
  Title + check_question serve as the embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the
  CMP reconstruct; the backend prefers it over DOM extraction when it is
  richer (BMW: 1824 vs. 600 words).
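
The per-doc_type threshold override from the embedding bullet above can be sketched as follows. This is a minimal illustration, not the code from this commit: `threshold_for`, `cosine`, `is_embedding_match` and `DEFAULT_THRESHOLD` are hypothetical names, the default value is an assumption, and only the impressum/dse/cookie values come from the commit message.

```python
import math

# Thresholds for impressum, dse and cookie are taken from the commit message.
DOC_TYPE_THRESHOLDS = {"impressum": 0.50, "dse": 0.60, "cookie": 0.60}
DEFAULT_THRESHOLD = 0.60  # assumption: the commit does not state a default


def threshold_for(doc_type: str) -> float:
    """Return the similarity threshold for a doc_type, falling back to the default."""
    return DOC_TYPE_THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_embedding_match(doc_type: str, control_vec, chunk_vec) -> bool:
    """A chunk matches a control when similarity reaches the doc_type threshold."""
    return cosine(control_vec, chunk_vec) >= threshold_for(doc_type)
```

With a lower threshold for the Impressum, the same similarity score can count as a match there while being rejected for a privacy policy, which fits short mandatory fields that carry little surrounding context.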

Vendor Redundancy + EU Alternatives + Cost Saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors,
  detection of multiple providers per category, EU alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie
  count + premium-feature cookies + third-party share → starter/professional/
  enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper
  40-100% band of the list prices rather than the full starter→premier span.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.
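
The redundancy detection described in the first bullet can be sketched as a group-by over functional categories. This is an illustrative sketch only: `consolidation_hints`, the `category`/`name` keys and the two-entry `EU_ALTERNATIVES` table are assumptions, not the actual vendor_redundancy.analyze() implementation.

```python
from collections import defaultdict

# Illustrative subset; the real EU alternative lookup is presumably larger.
EU_ALTERNATIVES = {"analytics": "Matomo", "captcha": "Friendly Captcha"}


def consolidation_hints(vendors):
    """Return categories served by two or more vendors, with an EU alternative if known."""
    by_category = defaultdict(list)
    for vendor in vendors:
        by_category[vendor.get("category", "unknown")].append(vendor.get("name", ""))
    return {
        category: {
            "vendors": sorted(names),
            "eu_alternative": EU_ALTERNATIVES.get(category),
        }
        for category, names in by_category.items()
        if len(names) > 1  # a single vendor per category is not redundant
    }
```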

Cookie Knowledge DB + Functional Classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country Inference from Legal Form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for clearly EU-based vendors
  (Adform DK, Pinterest IE).
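
A legal-form lookup of this kind might look like the sketch below. The four suffix/country pairs are the ones named in the commit message; the function name `infer_country_from_legal_form` and the regex details are assumptions and do not reproduce the real `_country_from_name` helper.

```python
import re

# Legal-form suffix -> ISO country code (pairs named in the commit message).
_LEGAL_FORM_COUNTRY = [
    (r"\bA/S$", "DK"),
    (r"\bGmbH$", "DE"),
    (r"\bInc\.?$", "US"),
    (r"\bB\.V\.$", "NL"),
]


def infer_country_from_legal_form(vendor_name):
    """Return a country code if the name ends in a known legal form, else None."""
    name = (vendor_name or "").strip()
    for pattern, country in _LEGAL_FORM_COUNTRY:
        if re.search(pattern, name):
            return country
    return None
```

Names without a distinctive suffix (for example "Pinterest Europe Ltd.") return None here, which is where the separate vendor lookup table mentioned above would take over.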

Action Recipes + Doc Anchor Locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) one structured instruction with what/why/fix_text/where/example,
  ready to paste 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching
  paragraph in the existing customer document for each finding.
  Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor for each failed document check,
  plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).
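
The keyword fallback of the doc anchor locator could work roughly like this sketch, which picks the paragraph containing the most finding keywords. The function name and the simple substring scoring are assumptions for illustration; the commit does not show the actual fallback logic.

```python
def locate_anchor_by_keywords(paragraphs, keywords):
    """Return the index of the paragraph with the most keyword hits, or None."""
    wanted = {k.lower() for k in keywords}
    best_index, best_hits = None, 0
    for index, paragraph in enumerate(paragraphs):
        text = paragraph.lower()
        hits = sum(1 for k in wanted if k in text)
        # Strict ">" keeps the earliest paragraph when several tie.
        if hits > best_hits:
            best_index, best_hits = index, hits
    return best_index
```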

Migration Pipeline (Compliance Check -> Customer Banner/Documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with
  4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  (record of processing activities) + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.

Borlabs-Parity Cookie Banner Features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts shows YouTube/Maps/video
  placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.
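
A consent log export to CSV or JSON can be sketched with the standard library alone. The function name `export_consent_log` and the entry fields used in the usage note are hypothetical; the actual einwilligungen_export_routes are not shown in this commit view.

```python
import csv
import io
import json


def export_consent_log(entries, fmt="csv"):
    """Serialize a list of consent-log dicts as a CSV or JSON string."""
    if fmt == "json":
        return json.dumps(entries, ensure_ascii=False, indent=2)
    if fmt != "csv":
        raise ValueError(f"unsupported format: {fmt!r}")
    # Union of all keys, sorted so the column order is deterministic.
    fieldnames = sorted({key for entry in entries for key in entry})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(entries)  # missing keys are written as empty cells
    return buffer.getvalue()
```

For example, `export_consent_log([{"consent_id": "c1", "categories": "analytics"}])` yields a two-line CSV with a header row followed by one data row.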

Bug Fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500s).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- The A3 v2 audit reclassified 24 UI-language MCs as 'process'.
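
The embedding batching fix (batches of 32 instead of one call with 165 texts) follows a common chunking pattern, sketched below. `batched`, `embed_in_batches` and the `embed_fn` callback are hypothetical names; the real embedding service client is not part of this view.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn once per batch and concatenate the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

With 165 inputs this produces five full batches of 32 and one final batch of 5, so no single request carries the whole payload.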

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE        7.5% -> 81-83%
  Impressum  4%   -> 100% (all 6 genuine MCs fulfilled)
  Cookie     0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated savings of 200k-3M per year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
@@ -0,0 +1,167 @@
"""
Cookie function classifier: determines the functional purpose of each cookie.

Today each vendor has a single category (analytics/advertising/...).
But a vendor often sets 10-50 different cookies, and not every cookie
from a marketing platform serves advertising; many handle session
management, language preference, scroll position, etc.

This module classifies each cookie by:
- functional_role : what the cookie does technically (session_id,
                    csrf_token, ab_test, user_id, ad_id, ...)
- data_collected  : which data is behind it (visitor_id,
                    page_view, click, conversion_event, ...)
- blocking_impact : what happens when the cookie is blocked
                    (none, no_personalization, no_tracking, site_breaks)

This lets the vendor redundancy analyzer state precisely:
"Adobe Analytics sets 55 cookies: 12 for tracking, 8 for A/B tests
and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B
test cookies, so 55 Adobe cookies become 20 Matomo cookies."
"""
from __future__ import annotations
import re
from typing import Iterable
# Pattern → (functional_role, blocking_impact)
# Order matters: the first matching pattern wins, so more specific
# patterns must come first.
_PATTERNS: list[tuple[str, str, str]] = [
    # Session / authentication
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    # CSRF must precede the auth pattern, whose "token" alternative would
    # otherwise claim names such as csrf_token.
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
    # (?<!un) keeps "auth" from matching Pinterest's _pin_unauth ad cookie.
    (r"sso|signon|(?<!un)auth|login|token|jwt|bearer", "auth_token", "site_breaks"),
    # Language setting / region
    (r"lang|locale|culture|region", "preference", "no_personalization"),
    # User preferences (theme, view, bookmark)
    (r"theme|dark|mode|view|sort|filter", "ui_preference", "no_personalization"),
    (r"bookmark|favorite|favorit", "user_data", "no_personalization"),
    # The consent cookie itself
    (r"consent|gdpr|tcf|euconsent", "consent_state", "site_breaks"),
    # Tracking IDs (most analytics)
    (r"^_ga|gid|gat|google_analytic", "tracking_id", "no_tracking"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
    (r"^s_|s\.cc|adobesite|aam", "tracking_id", "no_tracking"),  # Adobe
    (r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
    (r"_uetsid|_uetvid|microsoft", "tracking_id", "no_tracking"),
    # Visitor identification
    (r"visitor|uid|user_id|customer_id", "visitor_id", "no_personalization"),
    # A/B test / personalisation
    (r"ab_test|abtest|variant|experiment|target|target_qa", "ab_test", "no_personalization"),
    (r"personalization|personalisation|adobe_target", "personalisation", "no_personalization"),
    # Advertising / retargeting
    (r"fbp|fbc|fb_id|facebook|meta_pixel|fr$", "ad_pixel", "no_tracking"),
    (r"adform|criteo|outbrain|taboola|tapad|adsrvr", "ad_pixel", "no_tracking"),
    (r"doubleclick|test_cookie|ide|nid|exchange_uid", "ad_pixel", "no_tracking"),
    (r"google_ad|gads|gcl", "ad_pixel", "no_tracking"),
    (r"^li_|linkedin|bcookie|bscookie", "ad_pixel", "no_tracking"),
    (r"pinterest|_pinterest_|_pin_unauth", "ad_pixel", "no_tracking"),
    # Affiliate / conversion
    (r"conversion|orderid|order_id|transaction|purchase", "conversion_event", "no_tracking"),
    (r"campaign|utm|source|medium|term", "campaign_attribution", "no_tracking"),
    # Scroll position / form helper
    (r"scroll|position|form_|form_state", "ui_state", "no_personalization"),
    # Load balancer / sticky sessions
    (r"affinity|sticky|lb_|alb-|aws-alb", "load_balancer", "site_breaks"),
    # Chat / support
    (r"chat|widget|genesys|livechat", "chat_session", "no_personalization"),
    # Captcha
    (r"hcaptcha|recaptcha|cf_|cloudflare", "bot_protection", "site_breaks"),
]
_FUNCTIONAL_LABEL = {
"session_id": "Sitzungs-ID",
"auth_token": "Auth-Token",
"csrf_token": "CSRF-Schutz",
"preference": "Sprache / Region",
"ui_preference": "UI-Praeferenz",
"user_data": "Nutzer-Daten",
"consent_state": "Consent-Speicher",
"tracking_id": "Tracking-ID",
"session_recording": "Session-Recording",
"visitor_id": "Besucher-ID",
"ab_test": "A/B-Test",
"personalisation": "Personalisierung",
"ad_pixel": "Werbe-Pixel",
"conversion_event": "Konversions-Tracking",
"campaign_attribution":"Kampagnen-Attribution",
"ui_state": "UI-Zustand (ScrollPos etc.)",
"load_balancer": "Load-Balancer",
"chat_session": "Chat-Session",
"bot_protection": "Bot-Schutz",
"unknown": "Unbekannt",
}
# Which functional_roles overlap functionally. Used by
# vendor_redundancy.analyze() to detect real consolidation opportunities
# instead of merely counting duplicate providers.
OVERLAPPING_ROLES = {
"tracking_id": "tracking",
"session_recording": "tracking",
"ab_test": "personalisation",
"personalisation": "personalisation",
"ad_pixel": "advertising",
"conversion_event": "advertising",
"campaign_attribution":"advertising",
}
def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"
def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }
def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: how often does each functional role occur?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        # Vendors that were not annotated yet are classified on the fly.
        if not roles and v.get("cookies"):
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
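
Because the classifier returns the role of the first pattern that matches, the order of the pattern table determines the result. The following self-contained sketch illustrates that behaviour with a hypothetical three-rule table (`RULES` and `first_match` are illustration names, not part of the module above).

```python
import re

# Hypothetical mini rule table; the specific CSRF rule comes before the
# broad auth rule, whose "token" alternative also matches "csrf_token".
RULES = [
    (r"^csrf|xsrf", "csrf_token"),
    (r"auth|token", "auth_token"),
    (r"^_ga", "tracking_id"),
]


def first_match(name, rules=RULES, default="unknown"):
    """Return the role of the first rule whose pattern matches the lowercased name."""
    n = (name or "").lower().strip()
    for pattern, role in rules:
        if re.search(pattern, n):
            return role
    return default
```

Reversing the table makes the broad rule win for `csrf_token`, which is why more specific patterns belong first.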
@@ -0,0 +1,608 @@
"""
Cookie knowledge database: the maximum extractable knowledge per cookie name.

Each entry records:
- vendor             : setting provider (full company name + country of domicile)
- exact_purpose      : what the cookie does EXACTLY (not just its category)
- data_collected     : which data fields (client ID, timestamp, IP, etc.)
- ip_relevant        : is the IP address collected/transmitted?
- ip_anonymized      : anonymized by default?
- tcf_purpose_ids    : IAB TCF v2.2 purpose IDs (1-11)
- iab_vendor_id      : IAB Global Vendor List ID (for TCF sync)
- typical_lifetime   : how long it persists
- reid_risk          : re-identification risk (low/medium/high)
- technical_necessity: required under §25(2) TDDDG? (none/partial/full)
- schrems_ii_status  : third-country transfer assessment
- eugh_rulings       : relevant CJEU/CNIL/LfDI decisions
- eu_alternative_*   : EU cookie/vendor replacement with the same function
- notes              : other remarks (avoidance, configuration)

Sources: Cookiepedia, IAB Europe TCF v2.2 Vendor List, Cookiebot DB,
CNIL Cookies & Trackers Guidelines 2024, EDPB Cookie Guidelines 2/2023,
DSK-Orientierungshilfe Telemedien 2021, vendors' own documentation.
As of: 2026-05.

Contributing: pull requests welcome; for the format see TEMPLATE_ENTRY at
the end of the file.
"""
from __future__ import annotations
from typing import TypedDict
class CookieKnowledge(TypedDict, total=False):
    vendor: str
    vendor_country: str
    exact_purpose: str
    data_collected: list[str]
    ip_relevant: bool
    ip_anonymized: bool
    tcf_purpose_ids: list[int]
    iab_vendor_id: int | None
    typical_lifetime: str
    reid_risk: str            # 'low' | 'medium' | 'high'
    technical_necessity: str  # 'none' | 'partial' | 'full'
    schrems_ii_status: str
    eugh_rulings: list[str]
    eu_alternative_cookies: list[str]
    eu_alternative_vendor: str
    notes: str
# ─── Google ──────────────────────────────────────────────────────────
_GOOGLE_BASE = {
"vendor": "Google LLC", "vendor_country": "US",
"schrems_ii_status": "Drittlandtransfer in die USA. Mit DPF "
"(EU-US Data Privacy Framework, 2023) wieder zulaessig, "
"aber bereits Klage NOYB anhaengig (Schrems III). "
"Risiko-Bewertung empfohlen.",
"eugh_rulings": [
"EuGH C-311/18 (Schrems II, 16.07.2020) — Privacy Shield gekippt",
"CNIL SAN-2022-002 (10.02.2022) — Google Analytics ohne Anonymisierung "
"unzulaessig",
"Datenschutzkonferenz (DSK) Orientierungshilfe 2024 — GA4 mit "
"Server-Side-Tagging als Mitigation moeglich",
],
}
KB: dict[str, CookieKnowledge] = {
# ─── Google Analytics ─────────────────────────────────────────────
"_ga": {
**_GOOGLE_BASE,
"exact_purpose": "Unterscheidet eindeutig Besucher; persistiert die "
"ueber alle Sessions hinweg gueltige Client-ID.",
"data_collected": ["client_id", "first_visit_timestamp", "ip_address"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [8, 10],
"iab_vendor_id": 755,
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
"eu_alternative_cookies": ["_pk_id"],
"eu_alternative_vendor": "Matomo",
"notes": "Mit IP-Anonymisierung (`anonymizeIp: true`) + Server-Side-Tagging "
"DSGVO-konfigurierbar. Ohne diese Massnahmen einwilligungspflichtig.",
},
"_gid": {
**_GOOGLE_BASE,
"exact_purpose": "Unterscheidet Besucher innerhalb einer Session "
"(24h-Bucket).",
"data_collected": ["session_id", "ip_address"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [8],
"iab_vendor_id": 755,
"typical_lifetime": "24 Stunden",
"reid_risk": "medium",
"technical_necessity": "none",
"eu_alternative_cookies": ["_pk_ses"],
"eu_alternative_vendor": "Matomo",
},
"_gat": {
**_GOOGLE_BASE,
"exact_purpose": "Throttling-Cookie — begrenzt Anzahl Requests an "
"Google Analytics pro Sekunde.",
"data_collected": ["throttle_flag"],
"ip_relevant": False, "ip_anonymized": True,
"tcf_purpose_ids": [],
"iab_vendor_id": 755,
"typical_lifetime": "1 Minute",
"reid_risk": "low",
"technical_necessity": "none",
"notes": "Reines Performance-Steuerungs-Cookie. Trotzdem einwilligungspflichtig "
"da er Teil des GA-Trackings ist.",
},
"_gat_gtag_UA_": {
**_GOOGLE_BASE,
"exact_purpose": "GTM-spezifisches Throttling — pro Universal-Analytics-Property.",
"data_collected": ["throttle_flag"],
"ip_relevant": False,
"typical_lifetime": "1 Minute",
"reid_risk": "low",
"technical_necessity": "none",
"notes": "Suffix nach `UA_` ist die Property-ID. Pattern-Match noetig.",
},
"_ga_*": {
**_GOOGLE_BASE,
"exact_purpose": "GA4-Persistierung — eindeutige Stream-ID + Session-Daten.",
"data_collected": ["stream_id", "session_count", "session_start_ts"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [8, 10],
"iab_vendor_id": 755,
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
"notes": "GA4-Format. Suffix `_<measurement_id>`. Server-Side-GTM "
"ist die einzige praktikable DSGVO-Mitigation.",
},
"NID": {
**_GOOGLE_BASE,
"exact_purpose": "Personalisiert Google-Suche, AdSense, Maps; "
"speichert Praeferenzen + Sicherheits-Token.",
"data_collected": ["user_pref_id", "session_id", "security_token"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 755,
"typical_lifetime": "6 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"notes": "Klassischer Google-Werbe-Cookie. Hohe Re-ID-Gefahr da Cross-Site.",
},
"IDE": {
"vendor": "Google LLC (DoubleClick)", "vendor_country": "US",
"exact_purpose": "Conversion-Tracking + Werbe-Targeting bei "
"Google Display Network / DoubleClick.",
"data_collected": ["doubleclick_id", "ad_interactions"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 755,
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": _GOOGLE_BASE["schrems_ii_status"],
"eugh_rulings": _GOOGLE_BASE["eugh_rulings"],
},
"test_cookie": {
**_GOOGLE_BASE,
"exact_purpose": "DoubleClick-Probe-Cookie — testet ob Browser Cookies akzeptiert.",
"data_collected": ["browser_supports_cookies"],
"ip_relevant": False,
"typical_lifetime": "15 Minuten",
"reid_risk": "low",
"technical_necessity": "none",
},
# ─── Meta / Facebook ──────────────────────────────────────────────
"_fbp": {
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
"exact_purpose": "First-Party-Pixel von Facebook/Meta — identifiziert "
"den Browser fuer Werbe-Conversion-Tracking + Ad-Retargeting.",
"data_collected": ["browser_id", "first_visit_ts"],
"ip_relevant": True, "ip_anonymized": False,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 891,
"typical_lifetime": "90 Tage",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "Daten gehen an Meta US trotz IE-Sitz. "
"Aktuell DPF-abgedeckt, aber Schrems III erwartet.",
"eugh_rulings": [
"EuGH C-311/18 (Schrems II)",
"EDSA Aufsichts-Statement 2023 zu Meta-Pixel",
"LDA Bayern Pruefverfuegung 2024",
],
"eu_alternative_vendor": "Smart AdServer (Equativ, FR)",
"notes": "Conversions API (CAPI) als Server-Side-Variante reduziert Cookie-"
"Abhaengigkeit. Trotzdem werden Daten an Meta gesendet.",
},
"_fbc": {
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
"exact_purpose": "Speichert die ClickID einer Meta-Ad-Kampagne; "
"ordnet Conversion dem urspruenglichen Ad-Klick zu.",
"data_collected": ["fbclid", "ad_campaign_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9],
"iab_vendor_id": 891,
"typical_lifetime": "90 Tage",
"reid_risk": "high",
"technical_necessity": "none",
},
"fr": {
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
"exact_purpose": "Werbe-Targeting + Cross-Site-Tracking auf "
"Facebook-Plattform.",
"data_collected": ["encrypted_user_id", "session_data"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 891,
"typical_lifetime": "3 Monate",
"reid_risk": "high",
"technical_necessity": "none",
},
# ─── Adobe ────────────────────────────────────────────────────────
"s_cc": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Cookie-Test — prueft ob der Browser Cookies "
"akzeptiert (Adobe Analytics Bootstrap).",
"data_collected": ["browser_supports_cookies"],
"ip_relevant": False,
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "partial",
"schrems_ii_status": "Adobe-IE-Hosting, jedoch US-Datentransfer fuer "
"Cloud-Services. DPF-abgedeckt.",
},
"s_sq": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Speichert den letzten Klick (URL + Position) "
"fuer Click-Map-Reports.",
"data_collected": ["last_click_url", "last_click_xy"],
"ip_relevant": False,
"tcf_purpose_ids": [8],
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "none",
},
"AMCV_": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Adobe Marketing Cloud Visitor ID — Cross-Tool-ID fuer "
"Analytics + Target + Audience Manager.",
"data_collected": ["marketing_cloud_visitor_id", "first_visit_ts"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 8, 9, 10],
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
"notes": "Suffix nach `AMCV_` ist die Org-ID. Klassischer Cross-Tool-Track.",
},
"mbox": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Adobe Target — A/B-Testing, Personalisierung, "
"Audience-Targeting.",
"data_collected": ["mbox_visitor_id", "experiment_assignments"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"typical_lifetime": "2 Jahre",
"reid_risk": "high",
"technical_necessity": "none",
},
"s_target_qa": {
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
"exact_purpose": "Adobe Target Premium-Feature — QA-Modus fuer Personalisations-Tests.",
"data_collected": ["target_qa_session"],
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "none",
"notes": "Premium-Feature-Cookie. Anwesenheit = Enterprise-Lizenz Indikator.",
},
# ─── Microsoft / Bing ─────────────────────────────────────────────
"MUID": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Microsoft-User-ID fuer Bing, Microsoft Advertising, "
"Clarity Heatmaps.",
"data_collected": ["microsoft_user_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 8, 9, 10],
"iab_vendor_id": 165,
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "US-Transfer, DPF-zertifiziert.",
},
"_uetsid": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Universal Event Tracking — Session-Cookie fuer "
"Microsoft Advertising Conversion-Tracking.",
"data_collected": ["session_id"],
"ip_relevant": True,
"tcf_purpose_ids": [9],
"typical_lifetime": "30 Minuten",
"reid_risk": "medium",
"technical_necessity": "none",
},
"_uetvid": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Universal Event Tracking — persistente Visitor-ID.",
"data_collected": ["visitor_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9],
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
},
# ─── LinkedIn ─────────────────────────────────────────────────────
"bcookie": {
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
"exact_purpose": "Browser-Identifikation; wird genutzt fuer Anmelde-"
"Vorgang + LinkedIn Insight-Tag-Tracking.",
"data_collected": ["browser_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 8, 9],
"iab_vendor_id": 14,
"typical_lifetime": "1 Jahr",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "US-Datenuebermittlung (Mutter Microsoft).",
},
"lidc": {
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
"exact_purpose": "Daten-Routing zwischen LinkedIn-Datacentern.",
"data_collected": ["routing_id"],
"ip_relevant": True,
"typical_lifetime": "1 Tag",
"reid_risk": "low",
"technical_necessity": "partial",
},
"li_gc": {
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
"exact_purpose": "Speichert die Cookie-Einwilligung fuer LinkedIn-Embeds.",
"data_collected": ["consent_state"],
"ip_relevant": False,
"typical_lifetime": "6 Monate",
"reid_risk": "low",
"technical_necessity": "full",
},
# ─── Matomo (EU-Alternative) ──────────────────────────────────────
"_pk_id": {
"vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
"exact_purpose": "Matomo Visitor-ID — Pendant zu _ga, aber DSGVO-konform "
"wenn IP-Anonymisierung aktiv.",
"data_collected": ["visitor_id", "first_visit_ts"],
"ip_relevant": True, "ip_anonymized": True,
"tcf_purpose_ids": [8],
"typical_lifetime": "13 Monate",
"reid_risk": "low", # bei aktivierter Anonymisierung
"technical_necessity": "none",
"schrems_ii_status": "Bei On-Premise-Deployment KEIN Drittlandtransfer. "
"Matomo Cloud EU mit Frankfurt-Hosting verfuegbar.",
"notes": "Empfohlener Drop-in-Ersatz fuer Google Analytics.",
},
"_pk_ses": {
"vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
"exact_purpose": "Matomo Session-Cookie.",
"data_collected": ["session_id"],
"ip_relevant": False,
"typical_lifetime": "30 Minuten",
"reid_risk": "low",
"technical_necessity": "none",
},
# ─── Captcha ──────────────────────────────────────────────────────
"hcaptcha": {
"vendor": "Intuition Machines Inc. (hCaptcha)", "vendor_country": "US",
"exact_purpose": "hCaptcha Bot-Erkennung — Session-Token fuer Challenge-Solver.",
"data_collected": ["bot_score", "session_id", "ip_address"],
"ip_relevant": True,
"typical_lifetime": "Session",
"reid_risk": "medium",
"technical_necessity": "full",
"schrems_ii_status": "US-Transfer (Intuition Machines, San Francisco).",
"eu_alternative_vendor": "Friendly Captcha (DE), Turnstile (Cloudflare EU)",
"notes": "Technisch erforderlich fuer Bot-Schutz, aber EU-Alternativen "
"ohne Drittland-Risiko verfuegbar.",
},
"cf_clearance": {
"vendor": "Cloudflare Inc.", "vendor_country": "US",
"exact_purpose": "Cloudflare-Bot-Management Pro — bestaetigt dass User "
"die JS-Challenge bestanden hat.",
"data_collected": ["challenge_token"],
"ip_relevant": True,
"typical_lifetime": "30 Minuten",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Premium-Feature-Cookie. Anwesenheit = Cloudflare Bot-Management "
"Pro im Einsatz.",
},
# ─── CDN / Performance ────────────────────────────────────────────
"__cf_bm": {
"vendor": "Cloudflare Inc.", "vendor_country": "US",
"exact_purpose": "Cloudflare Bot Management — basis Bot-Erkennung.",
"data_collected": ["bot_score", "client_hash"],
"ip_relevant": True,
"typical_lifetime": "30 Minuten",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Strictly necessary nach §25(2) TDDDG (Sicherheit). Keine Einwilligung noetig.",
},
"aws-alb": {
"vendor": "Amazon Web Services Inc.", "vendor_country": "US",
"exact_purpose": "AWS Application Load Balancer Sticky Sessions — "
"routet Anfragen konsistent an dieselbe Backend-Instanz.",
"data_collected": ["target_instance_id"],
"ip_relevant": False,
"typical_lifetime": "1 Stunde",
"reid_risk": "low",
"technical_necessity": "full",
"schrems_ii_status": "AWS Frankfurt verfuegbar — bei korrektem Region-Setup "
"kein US-Transfer.",
},
# ─── Retargeting / Advertising ────────────────────────────────────
"_pin_unauth": {
"vendor": "Pinterest Europe Ltd.", "vendor_country": "IE/US",
"exact_purpose": "Pinterest Tag — Conversion-Tracking + Audience-Aufbau.",
"data_collected": ["pinterest_user_id"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 762,
"typical_lifetime": "1 Jahr",
"reid_risk": "high",
"technical_necessity": "none",
},
"cto_dna": {
"vendor": "Criteo S.A.", "vendor_country": "FR",
"exact_purpose": "Criteo Dynamic Retargeting — produktspezifische "
"Werbeauslieferung basierend auf Browser-History.",
"data_collected": ["criteo_user_id", "product_views"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 91,
"typical_lifetime": "13 Monate",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "Criteo ist FR-basiert, aber Daten gehen auch in die USA. "
"Multi-Region-Setup pruefen.",
"notes": "Trotz FR-Sitz nicht automatisch DSGVO-konform — CNIL-Bussgeld "
"EUR 60M 2022 wegen mangelnder Einwilligungs-Granularitaet.",
},
"afm": {
"vendor": "Adform A/S", "vendor_country": "DK",
"exact_purpose": "Adform Audience Matching — Cross-Device-Identifikation "
"fuer programmatische Werbung.",
"data_collected": ["adform_user_id", "device_signals"],
"ip_relevant": True,
"tcf_purpose_ids": [4, 9, 10],
"iab_vendor_id": 50,
"typical_lifetime": "30 Tage",
"reid_risk": "high",
"technical_necessity": "none",
"schrems_ii_status": "Adform ist DK-basiert, EU-Hosting Standard. Keine "
"Schrems-II-Probleme bei Standard-Setup.",
},
# ─── Consent / Funktional (Strictly Necessary) ────────────────────
"JSESSIONID": {
"vendor": "Java EE / Tomcat (Site-Software)", "vendor_country": "N/A",
"exact_purpose": "Server-Session-ID fuer Java-basierte Anwendungen.",
"data_collected": ["session_id"],
"ip_relevant": False,
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Strictly necessary. Keine Einwilligung nach §25(2) TDDDG.",
},
"PHPSESSID": {
"vendor": "PHP (Site-Software)", "vendor_country": "N/A",
"exact_purpose": "PHP-Session-ID — serverseitige Sitzungs-Persistierung.",
"data_collected": ["session_id"],
"ip_relevant": False,
"typical_lifetime": "Session",
"reid_risk": "low",
"technical_necessity": "full",
},
"cookie_consent": {
"vendor": "BreakPilot Consent-Banner (own)", "vendor_country": "DE",
"exact_purpose": "Speichert die Einwilligungsentscheidung des Nutzers "
"pro Kategorie.",
"data_collected": ["consent_state_per_category", "timestamp"],
"ip_relevant": False,
"typical_lifetime": "180 Tage",
"reid_risk": "low",
"technical_necessity": "full",
"notes": "Strictly necessary nach Art. 7(1) DSGVO Nachweispflicht.",
},
# ─── Templated / pattern-based entries (variable suffix) ──────────
# These are caught via regex lookup; the entry serves as a fallback.
"_uet_": {
"vendor": "Microsoft Corp.", "vendor_country": "US",
"exact_purpose": "Microsoft Universal Event Tracking — Suffix nach `_uet_`.",
"data_collected": ["event_id"],
"ip_relevant": True,
"typical_lifetime": "30 Minuten",
"reid_risk": "medium",
"technical_necessity": "none",
},
}
# ─── Pattern lookup for cookies with a variable suffix ───────────────
_PATTERN_LOOKUPS: list[tuple[str, str]] = [
(r"^_ga_[A-Z0-9_]+$", "_ga_*"),
(r"^_gat_gtag_UA_", "_gat_gtag_UA_"),
(r"^AMCV_", "AMCV_"),
(r"^_uet[a-z]+", "_uet_"),
(r"^aws-alb", "aws-alb"),
(r"^_pk_id\.", "_pk_id"),
(r"^_pk_ses\.", "_pk_ses"),
]
def lookup_cookie(name: str) -> CookieKnowledge | None:
    """Return rich knowledge for a cookie name, or None if unknown."""
    import re
    if not name:
        return None
    # Direct hit
    if name in KB:
        return KB[name]
    # Pattern-based
    for pattern, kb_key in _PATTERN_LOOKUPS:
        if re.search(pattern, name):
            return KB.get(kb_key)
    # Strip common suffixes (.bmw.de, .domain etc.)
    base = name.split(".", 1)[0]
    if base != name and base in KB:
        return KB[base]
    return None
def enrich_vendor_with_knowledge(vendor: dict) -> dict:
    """Add per-cookie knowledge to each cookie in vendor['cookies']."""
    cookies = vendor.get("cookies") or []
    enriched = []
    for c in cookies:
        info = lookup_cookie(c.get("name", ""))
        if info:
            enriched.append({**c, "knowledge": info})
        else:
            enriched.append(c)
    return {**vendor, "cookies": enriched}
# ─── Helpers for aggregate reports ──────────────────────────────────
def summarize_compliance_risk(vendor: dict) -> dict:
    """Aggregate re-ID risk + third-country status across all cookies of a vendor."""
    cookies = vendor.get("cookies") or []
    risk_counts = {"high": 0, "medium": 0, "low": 0}
    schrems_affected = 0
    technical_only = 0
    for c in cookies:
        k = c.get("knowledge") or lookup_cookie(c.get("name", ""))
        if not k:
            continue
        risk = k.get("reid_risk", "low")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1
        # Match the "US" country token only. A substring test on the free-text
        # schrems_ii_status would also flag entries that state "Keine
        # Schrems-II-Probleme" (e.g. Adform, DK).
        if "US" in k.get("vendor_country", "").upper().split("/"):
            schrems_affected += 1
        if k.get("technical_necessity") == "full":
            technical_only += 1
    return {
        "reid_risk_distribution": risk_counts,
        "high_risk_cookie_count": risk_counts["high"],
        "schrems_ii_affected_cookies": schrems_affected,
        "strictly_necessary_cookies": technical_only,
        "total_classified": sum(risk_counts.values()),
    }
# ─── TEMPLATE_ENTRY (fuer Kuratierung neuer Cookies) ───────────────
TEMPLATE_ENTRY: CookieKnowledge = {
"vendor": "<Voller Firmenname>",
"vendor_country": "<ISO-2 Code oder 'IE/US' bei Doppel-Standort>",
"exact_purpose": "<1-2 Saetze was der Cookie GENAU tut>",
"data_collected": ["<feldname_1>", "<feldname_2>"],
"ip_relevant": False,
"ip_anonymized": False,
"tcf_purpose_ids": [], # TCF v2.2: 1-11
"iab_vendor_id": None, # Aus https://iabeurope.eu/tcf-vendor-list/
"typical_lifetime": "<Session | XX Tage | XX Monate | XX Jahre>",
"reid_risk": "low", # low | medium | high
"technical_necessity": "none", # none | partial | full
"schrems_ii_status": "<Drittlandtransfer-Bewertung>",
"eugh_rulings": [],
"eu_alternative_cookies": [],
"eu_alternative_vendor": "",
"notes": "",
}
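The lookup order used by `lookup_cookie` (exact name, then regex pattern, then the name with a dotted suffix stripped) can be sketched in isolation. This is a minimal illustration, assuming a hypothetical two-entry `KB`; the names `PATTERNS` and `lookup` are stand-ins and not part of the module.

```python
from __future__ import annotations

import re

# Hypothetical two-entry knowledge base; the real KB is much larger.
KB = {
    "_uet_": {"vendor": "Microsoft Corp.", "reid_risk": "medium"},
    "JSESSIONID": {"vendor": "first party", "reid_risk": "low"},
}

# (compiled regex, KB key) pairs for cookie names with a variable suffix.
PATTERNS = [(re.compile(r"^_uet[a-z]+"), "_uet_")]


def lookup(name: str) -> dict | None:
    """Resolve a cookie name: exact hit, then pattern, then stripped suffix."""
    if not name:
        return None
    if name in KB:
        return KB[name]
    for pattern, key in PATTERNS:
        if pattern.search(name):
            return KB.get(key)
    # "JSESSIONID.bmw.de" -> "JSESSIONID"
    base = name.split(".", 1)[0]
    if base != name and base in KB:
        return KB[base]
    return None
```

Compiling the patterns once at import time is a design choice of this sketch; it avoids re-parsing each regex on every call.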
@@ -220,10 +220,16 @@ def score_vendors(vendors: list[dict]) -> list[dict]:
flags.append("no_purpose")
# Country — only for external processors / controllers
# If country is empty, infer it from the vendor name (known vendor, country token, legal-form suffix).
if country_required:
max_score += 10
if v.get("country"):
score += 10
elif (inferred := _country_from_name(v.get("name", ""))):
v["country"] = inferred
v["country_inferred"] = True
score += 10
else:
flags.append("no_country")
@@ -321,3 +327,153 @@ def build_check_items(validated: list[LinkCheck]) -> list[dict]:
"hint": hint,
})
return items
# ─── Country inference from the legal-form suffix ──────────────────
#
# If a vendor has an empty "country" field, it can often be derived from
# the company name:
# Adform A/S → DK (Denmark, Aktieselskab)
# Pinterest Europe Ltd. → IE (Ireland, Limited)
# Salesforce Inc. → US (Incorporated)
# Adobe ... Ireland Limited → IE
# Genesys ... B.V. → NL (Netherlands, Besloten Vennootschap)
# Equativ S.A. → FR (Société Anonyme)
# SAP SE → DE (Societas Europaea, mostly registered in DE)
#
# Combined strategy, in the order _country_from_name applies it:
# 1) Known vendor (e.g. "Google Inc", "Meta Platforms Ireland")
# 2) Country name inside the company name ('Ireland', 'Deutschland')
# 3) Legal-form suffix pattern
import re as _re
_SUFFIX_COUNTRY: list[tuple[str, str]] = [
    # Pattern (at the end of the name or before further tokens) → ISO code.
    # First match wins, so multi-token forms precede the generic suffixes
    # they contain. Dotted abbreviations end in (?!\w) instead of \b:
    # \b cannot match between a trailing "." and a space or end of string.
    (r"\bS\.A\.\sde\sC\.V\.(?!\w)", "MX"),
    (r"\bPte\.?\sLtd\.?\b", "SG"),
    (r"\bA/S\b", "DK"),  # Aktieselskab
    (r"\bApS\b", "DK"),  # Anpartsselskab
    (r"\bAB\b", "SE"),  # Aktiebolag
    (r"\bAS\b", "NO"),  # Aksjeselskap
    (r"\bOy\b", "FI"),  # Osakeyhtiö
    (r"\bAG\b", "DE"),  # CH/AT also possible, default DE
    (r"\bGmbH\b", "DE"),
    (r"\bUG\b", "DE"),
    (r"\beG\b", "DE"),
    (r"\bKG\b", "DE"),
    (r"\bOHG\b", "DE"),
    (r"\bSE\b", "DE"),  # Societas Europaea; verify for cases like SAP SE
    (r"\bS\.A\.S\.(?!\w)", "FR"),
    (r"\bS\.A\.(?!\w)", "FR"),  # also used in other countries, default FR
    (r"\bSAS\b", "FR"),
    (r"\bSARL\b", "FR"),
    (r"\bS\.r\.l\.(?!\w)", "IT"),
    (r"\bS\.p\.A\.(?!\w)", "IT"),
    (r"\bSpA\b", "IT"),
    (r"\bB\.V\.(?!\w)", "NL"),
    (r"\bN\.V\.(?!\w)", "NL"),
    (r"\bSL\b", "ES"),
    (r"\bd\.o\.o\.(?!\w)", "SI"),  # Slovenia
    (r"\bd\.d\.(?!\w)", "HR"),  # Croatia
    (r"\bz\s?o\.o\.(?!\w)", "PL"),
    (r"\bInc\.?\b", "US"),
    (r"\bIncorporated\b", "US"),
    (r"\bCorp\.?\b", "US"),
    (r"\bCorporation\b", "US"),
    (r"\bLLC\b", "US"),
    (r"\bL\.L\.C\.(?!\w)", "US"),
    (r"\bLtd\.?\b", "GB"),  # UK Limited, default
    (r"\bLimited\b", "GB"),
    (r"\bPLC\b", "GB"),
    (r"\bPty\b", "AU"),
    (r"\bK\.K\.(?!\w)", "JP"),  # Kabushiki-Kaisha
]
# Country names inside the company name (e.g. "Adobe Systems Software Ireland Limited")
_COUNTRY_NAME_TOKENS: list[tuple[str, str]] = [
("ireland", "IE"),
("deutschland", "DE"),
("germany", "DE"),
("netherlands", "NL"),
("france", "FR"),
("united kingdom", "GB"),
("uk", "GB"),
("usa", "US"),
("united states", "US"),
("austria", "AT"),
("oesterreich", "AT"),
("schweiz", "CH"),
("switzerland", "CH"),
("luxembourg", "LU"),
("luxemburg", "LU"),
("denmark", "DK"),
("daenemark", "DK"),
("sweden", "SE"),
("schweden", "SE"),
("norway", "NO"),
("norwegen", "NO"),
("finland", "FI"),
("finnland", "FI"),
]
# Bekannte Vendors mit eindeutigem Sitz (override)
_KNOWN_VENDOR_COUNTRY: dict[str, str] = {
"google inc": "US",
"google llc": "US",
"google ireland": "IE",
"meta platforms ireland": "IE",
"facebook ireland": "IE",
"amazon.com inc": "US",
"amazon web services": "US",
"amazon web services inc": "US",
"linkedin inc": "US",
"salesforce inc": "US",
"salesforce.com": "US",
"outbrain inc": "US",
"taboola inc": "US",
"pinterest europe ltd": "IE",
"intuition machines inc": "US",
"akamai technologies inc": "US",
"criteo s.a": "FR",
"criteo sa": "FR",
"adform a/s": "DK",
"speedcurve limited": "GB",
"longtail ad solutions": "US",
"genesys cloud services b.v": "NL",
"qualtrics": "US",
"teads sa": "FR",
"teads s.a": "FR",
"salesviewer gmbh": "DE",
"baqend gmbh": "DE",
"zenweshare sas": "FR",
"nayoki gmbh": "DE",
"psyma": "DE",
"matomo": "NZ", # InnoCraft NZ aber EU-hostbar
"adobe systems software ireland": "IE",
"microsoft corporation": "US",
"microsoft corp": "US",
}
def _country_from_name(vendor_name: str) -> str:
    """Best effort: derive the ISO-2 country code from the vendor name."""
    if not vendor_name:
        return ""
    # Vendor names often look like "<company> — <tool>"; only the company
    # part is inspected. (An empty separator would raise ValueError.)
    firm = vendor_name.split("—")[0].strip()
    firm_l = firm.lower()
    # 1) Known vendor lookup (most specific)
    for k, v in _KNOWN_VENDOR_COUNTRY.items():
        if k in firm_l:
            return v
    # 2) Country name inside the company name. Match whole words only, so
    #    that short tokens such as "uk" or "usa" do not hit inside other words.
    for token, code in _COUNTRY_NAME_TOKENS:
        if _re.search(rf"\b{_re.escape(token)}\b", firm_l):
            return code
    # 3) Legal-form suffix
    for pattern, code in _SUFFIX_COUNTRY:
        if _re.search(pattern, firm):
            return code
    return ""
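One subtlety when matching dotted legal-form suffixes with Python's `re`: a trailing `\b` only matches next to a word character, so it never matches after a final "." at the end of a name. The sketch below illustrates this with a negative lookahead as one possible alternative. The table `SUFFIXES` and the function `infer_country` are hypothetical and heavily shortened; they are not the module's own code.

```python
from __future__ import annotations

import re

# Hypothetical mini-table. First match wins, so the more specific
# "S.A. de C.V." is listed before the plain "S.A.".
SUFFIXES: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"\bS\.A\.\sde\sC\.V\.(?!\w)"), "MX"),
    (re.compile(r"\bS\.A\.(?!\w)"), "FR"),
    (re.compile(r"\bB\.V\.(?!\w)"), "NL"),
    (re.compile(r"\bInc\.?\b"), "US"),
]


def infer_country(name: str) -> str:
    """Return the ISO code of the first matching suffix, or ""."""
    for pattern, code in SUFFIXES:
        if pattern.search(name):
            return code
    return ""


# The pitfall: "\b" after the final "\." has no word character on either
# side at the end of the string, so this pattern finds nothing there.
naive = re.compile(r"\bS\.A\.\b")
```

`Inc\.?\b` still works for "Salesforce Inc." because the optional dot can backtrack to empty, leaving `\b` between "c" and ".".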
@@ -0,0 +1,350 @@
"""
Doc anchor locator: for a finding, find the best insertion point in the
existing document.
Primary strategy: BGE-M3 embedding match between a per-finding anchor
query and all paragraphs of the doc. This is genuine semantic matching
(BMW writes "Verarbeiter" instead of "Auftragsverarbeiter"; a keyword
match would miss it, the embedding catches it).
Fallback: static per-finding position hint (when the embedding service is
unreachable or the best match is too weak).
Output per anchor:
- anchor_phrase : excerpt of the original text (None without a match)
- position_hint : "Nach Absatz X von Y: '...'" or the static hint
- confidence : 'high' | 'medium' | 'low'
- score : float (cosine similarity incl. heading bonus; absent for 'fallback')
- method : 'embedding' | 'embedding-no-match' | 'fallback'
"""
from __future__ import annotations
import logging
import math
import os
import re
import threading
from typing import Iterable
import httpx
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
# For each finding type a semantically rich anchor query that the embedding
# matcher runs against the doc text. Richer than the short MC check_question.
# It does NOT look for the mandatory text itself (that is what is missing) but
# for the paragraph where the fix belongs, i.e. the thematically related context.
_ANCHOR_QUERIES: list[tuple[str, str, str]] = [
# (finding_label_partial, anchor_query, fallback_hint)
(
"Auftragsverarbeiter erwaehnt",
"Empfaenger der Daten Verarbeiter Dienstleister Cloud Hosting CRM "
"Auftragsverarbeitung Weitergabe Datenuebermittlung an Dritte",
"Im Abschnitt 'Empfaenger' oder 'Datenuebermittlung'",
),
(
"Automatisierte Entscheidungen",
"Betroffenenrechte automatisierte Entscheidung Profiling Logik "
"Tragweite Auswirkung Art. 22 DSGVO",
"Am Ende des Abschnitts 'Betroffenenrechte'",
),
(
"Konkrete Aufsichtsbehoerde",
"Beschwerderecht Datenschutzaufsicht Aufsichtsbehoerde Beschwerde "
"bei der Behoerde einreichen Recht auf Beschwerde",
"Im Abschnitt 'Beschwerderecht'",
),
(
"Angemessenheitsbeschluss",
"Drittlandtransfer USA Standardvertragsklauseln SCC DPF Data Privacy "
"Framework Angemessenheitsbeschluss internationale Datenuebermittlung",
"Im Abschnitt 'Drittlandtransfer'",
),
(
"Anschrift des Verantwortlichen",
"Verantwortlicher Verantwortliche Stelle Datenschutz Betreiber dieser "
"Website Firma Anschrift Kontakt",
"Am Anfang der Datenschutzerklaerung / Cookie-Richtlinie",
),
(
"Konkrete Cookie-Namen",
"Welche Cookies verwenden wir Cookie-Tabelle Liste der Cookies "
"Cookie-Kategorien Auflistung der Cookies Name Anbieter Zweck",
"Im Abschnitt 'Welche Cookies verwenden wir?'",
),
(
"Konkrete Anbieter/Dienste",
"Drittanbieter Dienste Anbieter wir nutzen folgende Dienste "
"Empfaenger der Cookie-Daten Liste der Dienstleister",
"In der Drittanbieter-Liste der Cookie-Richtlinie",
),
(
"Analytics-/Statistik-Tools konkret benannt",
"Statistik Analytics Reichweitenmessung Webanalyse Tracking "
"Google Analytics Matomo Adobe Analytics",
"Im Abschnitt 'Statistik / Analyse-Cookies'",
),
(
"Konkrete Speicherdauer",
"Speicherdauer Lebensdauer wie lange Ablauf Cookie-Tabelle Spalte "
"Speicherdauer pro Cookie",
"In der Cookie-Tabelle pro Eintrag",
),
(
"Opt-Out-Links",
"Widerruf widersprechen deaktivieren Cookie-Einstellungen aendern "
"Opt-Out Einstellungen anpassen",
"Im Abschnitt 'Wie kann ich widersprechen?'",
),
(
"Privacy-Policy-Links",
"Datenschutzerklaerung des Drittanbieters Privacy Policy Link auf "
"Datenschutzhinweise der Drittanbieter",
"Im Drittanbieter-Listing der Cookie-Richtlinie",
),
(
"Verbraucherstreitbeilegung",
"Online-Streitbeilegung OS-Plattform Verbraucherschlichtungsstelle "
"Streitbeilegung Verbraucher",
"Am Ende des Impressums (eigener Abschnitt 'Streitbeilegung')",
),
(
"Rechtswidriger Haftungsausschluss",
"Haftung Disclaimer wir distanzieren uns Links zu externen Webseiten "
"Haftungsausschluss Drittinhalte",
"Am Ende des Impressums (Disclaimer-Absatz)",
),
(
"Name der vertretungsberechtigten",
"Vertreten durch Vorstand Geschaeftsfuehrung Geschaeftsfuehrer "
"vertretungsberechtigt Repraesentant",
"Im Impressum nach Firmenname + Anschrift",
),
(
"Zustaendige Kammer",
"Berufsrecht Kammer Aufsichtsbehoerde berufsrechtliche Angaben "
"zustaendige Kammer",
"Im Impressum im Abschnitt 'Berufsrechtliche Angaben'",
),
(
"Drittlaender",
"Drittland Drittlaender USA Indien China internationale Datenuebermittlung "
"Datenexport in Nicht-EU-Staaten",
"Im Abschnitt 'Drittlandtransfer'",
),
(
"Schutzgarantien",
"Schutzgarantien Schutzvorkehrungen Schutzmassnahmen SCC "
"Standardvertragsklauseln einsehen Anforderung",
"Im Abschnitt 'Drittlandtransfer / Schutzgarantien'",
),
]
# ─── Thread-local cache for doc chunks + embeddings ────────────────
# Per compliance-check run the doc texts are embedded once and kept in
# thread-local storage, so several findings in the same run do not
# re-embed the same document.
_tls = threading.local()
def _get_cache() -> dict:
    if not hasattr(_tls, "cache"):
        _tls.cache = {}
    return _tls.cache
def reset_cache() -> None:
    """Clear the per-request cache (should be called at the start of every
    doc-check run so that data from a previous run cannot leak)."""
    if hasattr(_tls, "cache"):
        _tls.cache = {}
# ─── Helfer ────────────────────────────────────────────────────────
def _normalize(text: str) -> str:
return (text or "").lower().replace("\xad", "").replace("ß", "ss")
def _split_paragraphs(text: str) -> list[str]:
"""Split a doc into paragraphs on blank lines; if that yields fewer than
three parts, fall back to splitting on sentence boundaries."""
if not text:
return []
paras = re.split(r"\n\s*\n", text)
if len(paras) < 3:
paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
return [p.strip() for p in paras if p.strip()]
def _embed_sync(texts: list[str], timeout: float = 60.0,
                batch_size: int = 32) -> list[list[float]]:
    """Synchronous batched embed call (anchor localisation runs inside the
    sync HTML render, not in an async context).

    Always returns exactly one vector per input text. A failed or malformed
    sub-batch yields empty vectors so that indices stay aligned with `texts`.
    """
    if not texts:
        return []
    out: list[list[float]] = []
    with httpx.Client(timeout=timeout) as client:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                r = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
                if len(vecs) != len(batch):
                    raise ValueError(
                        f"expected {len(batch)} embeddings, got {len(vecs)}")
                out.extend(vecs)
            except Exception as e:
                logger.warning("Anchor embed sub-batch [%d-%d] failed: %s",
                               i, i + len(batch), e)
                out.extend([[] for _ in batch])
    return out
def _cosine(a: list[float], b: list[float]) -> float:
if not a or not b or len(a) != len(b):
return 0.0
dot = sum(x * y for x, y in zip(a, b))
na = math.sqrt(sum(x * x for x in a))
nb = math.sqrt(sum(y * y for y in b))
if na == 0 or nb == 0:
return 0.0
return dot / (na * nb)
def _doc_paragraphs_and_vectors(
doc_id: str, doc_text: str,
) -> tuple[list[str], list[list[float]]]:
"""Lazy-cache: Absaetze + Vektoren pro Doc-ID. Wird genau einmal pro
Doc und Run berechnet."""
cache = _get_cache()
if doc_id in cache:
return cache[doc_id]
paras = _split_paragraphs(doc_text)
if not paras:
cache[doc_id] = ([], [])
return cache[doc_id]
vecs = _embed_sync(paras)
cache[doc_id] = (paras, vecs)
return cache[doc_id]
def _keyword_fallback(fl: str, doc_text: str) -> dict | None:
"""Fallback wenn Embedding-Service ausfaellt oder zu wenig Trefferqualitaet."""
# Use the old _ANCHOR_QUERIES list — extract just the fallback hint
for label_partial, _query, fallback_hint in _ANCHOR_QUERIES:
if _normalize(label_partial) in fl:
return {
"anchor_phrase": None,
"position_hint": fallback_hint,
"confidence": "low",
"method": "fallback",
}
return None
def locate_anchor(
finding_label: str,
doc_text: str,
doc_id: str | None = None,
) -> dict | None:
"""Find the best-matching anchor paragraph in the doc text for a finding.
Primary: embedding match with a small bonus for heading-like paragraphs.
Fallback: the static position hint of the matched anchor query, used when
the embedding service is unreachable or the best score is below 0.40.
`doc_id` is a cache key (e.g. doc_type or URL). If None, it is derived
from a hash of doc_text.
"""
if not doc_text or not finding_label:
return None
fl = _normalize(finding_label)
# Welche Anchor-Query matched dieses Finding?
query = None
fallback_hint = None
matched_label = None
for label_partial, q, fb in _ANCHOR_QUERIES:
if _normalize(label_partial) in fl:
query, fallback_hint, matched_label = q, fb, label_partial
break
if not query:
return None
doc_id = doc_id or f"doc-{hash(doc_text) & 0xffffffff:08x}"
# 1) Embedding-Match
paras, doc_vecs = _doc_paragraphs_and_vectors(doc_id, doc_text)
if not paras:
return None
embeddings_available = any(v for v in doc_vecs)
if not embeddings_available:
return _keyword_fallback(fl, doc_text)
try:
q_vec = _embed_sync([query])[0] if query else None
except Exception:
q_vec = None
if not q_vec:
return _keyword_fallback(fl, doc_text)
# Per-Absatz Score = cosine + Heading-Bonus
best_idx = -1
best_score = 0.0
for i, (p, dv) in enumerate(zip(paras, doc_vecs)):
if not dv:
continue
sim = _cosine(q_vec, dv)
# Heading-Bonus: kurze Absaetze + Markdown-Heading-Marker
if len(p.split()) <= 8 or p.strip().startswith("#"):
sim += 0.05
if sim > best_score:
best_score = sim
best_idx = i
# Konfidenz-Schwellen — kalibriert anhand BMW-Run
if best_idx < 0 or best_score < 0.40:
# Zu schwacher Match — Fallback verwenden
return {
"anchor_phrase": None,
"position_hint": fallback_hint,
"confidence": "low",
"score": round(best_score, 3) if best_idx >= 0 else 0,
"method": "embedding-no-match",
}
if best_score >= 0.62:
confidence = "high"
elif best_score >= 0.50:
confidence = "medium"
else:
confidence = "low"
anchor = paras[best_idx]
words = anchor.split()
snippet = " ".join(words[:30]) + ("…" if len(words) > 30 else "")
return {
"anchor_phrase": snippet,
"anchor_index": best_idx,
"total_paragraphs": len(paras),
"position_hint": f"Nach Absatz {best_idx + 1} von {len(paras)}: '{snippet}'",
"confidence": confidence,
"score": round(best_score, 3),
"method": "embedding",
}
def annotate_findings_with_anchors(
findings: Iterable[dict], doc_text: str, doc_id: str | None = None,
) -> list[dict]:
"""Pro Finding den Anchor suchen und das Dict um 'anchor' erweitern."""
out = []
for f in findings:
a = locate_anchor(f.get("label") or f.get("title") or "", doc_text, doc_id)
out.append({**f, "anchor": a})
return out
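The scoring in `locate_anchor` (cosine similarity per paragraph, a 0.05 bonus for heading-like paragraphs, then the 0.40 / 0.50 / 0.62 thresholds) can be tried without the embedding service. The following is a minimal sketch using hand-written toy vectors; the function names `best_paragraph` and `confidence` are hypothetical and only mirror the logic shown in the diff.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for empty, mismatched, or zero vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_paragraph(query_vec, paras, vecs, heading_bonus=0.05):
    """Return (index, score) of the best paragraph, or (-1, 0.0) if none."""
    best_idx, best_score = -1, 0.0
    for i, (para, vec) in enumerate(zip(paras, vecs)):
        if not vec:  # embedding failed for this paragraph
            continue
        score = cosine(query_vec, vec)
        # Short paragraphs and Markdown headings get a small bonus.
        if len(para.split()) <= 8 or para.lstrip().startswith("#"):
            score += heading_bonus
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score


def confidence(score: float) -> str | None:
    """Map a score to a confidence label; None means 'no usable match'."""
    if score < 0.40:
        return None
    if score >= 0.62:
        return "high"
    if score >= 0.50:
        return "medium"
    return "low"
```

Because the bonus is added before the thresholds are applied, a heading with a raw similarity of 0.58 ends up in the "high" band; whether that is desired depends on how the thresholds were calibrated.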
@@ -0,0 +1,353 @@
"""
Action recipes: for each finding type an actionable instruction:
WHAT to do, WHY (legal basis), HOW to phrase it (concrete text block),
WHERE to insert it (hint at the doc section).
Without recipes a finding is only a diagnosis. With a recipe the customer
knows immediately which sentence to put where.
Usage:
from compliance.services.finding_action_recipes import recipe_for
rec = recipe_for("no_cookies_listed") # → dict with what/why/fix_text/where/example/severity
"""
from __future__ import annotations
from typing import TypedDict
class ActionRecipe(TypedDict, total=False):
what: str # 1-Satz Diagnose
why: str # Rechtsgrundlage / Risiko
fix_text: str # konkreter Textbaustein zum Einfuegen
where: str # in welchem Doc-Abschnitt
example: str # echtes Anwendungsbeispiel
severity: str # 'critical' | 'high' | 'medium' | 'low'
# ─── Vendor-/Cookie-Findings (im VVT-Block) ────────────────────────
VENDOR_FINDINGS: dict[str, ActionRecipe] = {
"no_cookies_listed": {
"what": "Anbieter ist erfasst, aber es sind keine konkreten Cookies "
"dokumentiert.",
"why": "Die DSK-Orientierungshilfe Telemedien 2024 fordert pro Anbieter "
"eine Auflistung aller gesetzten Cookies mit Name + Zweck + "
"Speicherdauer. 'Verwendet Cookies' ohne Details erfuellt "
"Art. 13 Abs. 1 lit. e DSGVO nicht.",
"fix_text": "Ergaenzen Sie pro Cookie eine Zeile mit folgenden Feldern:\n"
" • Cookie-Name (z.B. _ga, _fbp, NID)\n"
" • Setzender Anbieter (Firma + Sitzland)\n"
" • Zweck (1 Satz, z.B. 'Reichweitenmessung pseudonym')\n"
" • Speicherdauer (z.B. '2 Jahre', '24h', 'Session')",
"where": "Cookie-Richtlinie unter dem Abschnitt 'Kategorien' "
"(Notwendig / Marketing / Statistik / ...).",
"example": "_ga (Google Ireland Ltd.) — Statistik — eindeutige "
"Besucher-ID — Speicherdauer 2 Jahre",
"severity": "high",
},
"no_country": {
"what": "Anbieter-Sitzland ist nicht dokumentiert.",
"why": "Art. 30 Abs. 1 lit. d DSGVO fordert die Kategorien der Empfaenger "
"inkl. Sitz. Bei Drittland-Empfaengern (US, IN, CN ...) ist das "
"zusaetzlich Pflicht nach Art. 13 Abs. 1 lit. f DSGVO.",
"fix_text": "Fuegen Sie hinter dem Anbieternamen das Sitzland in "
"Klammern ein. Bei Drittland (ausserhalb EU/EWR) zusaetzlich "
"den Transfer-Mechanismus (SCC, DPF, Angemessenheitsbeschluss).",
"where": "VVT-Eintrag bei 'Empfaenger' oder im Cookie-Anbieter-Listing.",
"example": "Google Ireland Ltd., Dublin, Irland (EU). Bei US-Transfer: "
"'Google LLC, Mountain View, US — DPF-zertifiziert'.",
"severity": "high",
},
"no_privacy_url": {
"what": "Kein direkter Link zur Datenschutzerklaerung des Anbieters.",
"why": "Art. 13 Abs. 2 lit. f DSGVO Transparenzgebot: der Nutzer muss "
"die Datenverarbeitung beim Drittanbieter eigenverantwortlich "
"nachvollziehen koennen.",
"fix_text": "Ergaenzen Sie einen klickbaren Link zur Datenschutzerklaerung "
"des Anbieters direkt neben dem Anbieternamen.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag. Idealerweise als "
"letzter Spalteneintrag oder Inline-Link.",
"example": "Google Analytics — Datenschutz: https://policies.google.com/privacy",
"severity": "medium",
},
"broken_privacy_url": {
"what": "Der angegebene Privacy-Link gibt einen Fehler zurueck "
"(404 / 403 / Timeout).",
"why": "Ein toter Link erfuellt Art. 13(2)(f) DSGVO nicht — die "
"Transparenz-Pflicht laeuft ins Leere.",
"fix_text": "1. Pruefen Sie den Link aus einem normalen Browser. "
"Funktioniert er dort? → Anbieter blockt automatisierte Pruefer (kein Mangel).\n"
"2. Funktioniert er nicht? → Aktualisieren Sie den Link in Ihrer "
"Cookie-Richtlinie auf die heute gueltige Privacy-URL des Anbieters.",
"where": "Cookie-Richtlinie / Drittanbieter-Liste.",
"example": "Wenn Adobe-Privacy-Link 404 gibt, ersetzen durch "
"https://www.adobe.com/privacy/policy.html",
"severity": "high",
},
"no_opt_out_url": {
"what": "Kein Opt-Out-Link fuer diesen Anbieter dokumentiert.",
"why": "Art. 7 Abs. 3 DSGVO: der Widerruf der Einwilligung muss so "
"einfach wie die Erteilung sein. Pro Drittanbieter muss eine "
"Opt-Out-Moeglichkeit angeboten werden.",
"fix_text": "Recherchieren Sie den offiziellen Opt-Out-Link des "
"Anbieters und ergaenzen Sie ihn. Falls Ihr Cookie-Banner "
"ein 'Einstellungen aendern' anbietet, ist das oft "
"ausreichend — der Link sollte trotzdem als Backup "
"dokumentiert sein.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
"example": "Google Analytics Opt-Out: https://tools.google.com/dlpage/gaoptout",
"severity": "high",
},
"broken_opt_out": {
"what": "Der angegebene Opt-Out-Link funktioniert nicht "
"(404 / 403 / Timeout).",
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit ohne funktionierenden "
"Link ist nicht gegeben.",
"fix_text": "1. Pruefen Sie ob der Link aus einem Browser funktioniert. "
"403 kommt oft von Bot-Schutz und ist kein realer Mangel.\n"
"2. Bei echtem 404: aktualisieren Sie auf den heute gueltigen "
"Opt-Out-Link.\n"
"3. Alternativ verlinken Sie auf Ihren eigenen Cookie-Banner "
"'Einstellungen aendern'-Trigger.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
"example": "Wenn Criteo-Opt-Out 403 gibt, ist das oft Cloudflare-Schutz — "
"Link aus dem Browser klickbar → kein Mangel. Alternativ: "
"https://www.youronlinechoices.com/de/",
"severity": "medium",
},
}
# ─── Doc-Pruefungs-Findings (im DSE/Impressum/Cookie-Pruefblock) ───
DOC_CHECK_FINDINGS: dict[str, ActionRecipe] = {
"Auftragsverarbeiter erwaehnt": {
"what": "Auftragsverarbeiter (Hosting, CRM, Newsletter) sind nicht "
"explizit erwaehnt + es fehlt Hinweis auf AVVs nach Art. 28 DSGVO.",
"why": "Art. 28 DSGVO Auftragsverarbeitung erfordert dokumentierte AVVs. "
"Art. 13(1)(e) DSGVO erfordert die Nennung der Empfaenger-"
"Kategorien. Fehlt der Hinweis = klassischer Pruefpunkt der "
"Aufsichtsbehoerden.",
"fix_text": "Wir setzen sorgfaeltig ausgewaehlte Auftragsverarbeiter "
"(z.B. fuer Hosting, Webanalyse, Customer Service) ein. Mit "
"allen Auftragsverarbeitern haben wir Vertraege zur "
"Auftragsverarbeitung nach Art. 28 DSGVO geschlossen. Die "
"Auftragsverarbeiter handeln ausschliesslich auf unsere "
"Weisung und sind vertraglich zu angemessenen technischen "
"und organisatorischen Massnahmen verpflichtet.",
"where": "Datenschutzerklaerung im Abschnitt 'Empfaenger' oder "
"'Datenuebermittlung'. Direkt nach der Aufzaehlung der "
"Empfaenger-Kategorien.",
"example": "Empfaenger: BMW Niederlassungen, Vertragshaendler, "
"Auftragsverarbeiter (Cloud-Hosting AWS, CRM Salesforce, "
"Webanalyse Adobe Analytics — mit allen sind AVVs nach "
"Art. 28 DSGVO geschlossen).",
"severity": "high",
},
"Automatisierte Entscheidungen / Profiling": {
"what": "Keine Aussage zu automatisierten Einzelentscheidungen "
"oder Profiling nach Art. 22 DSGVO.",
"why": "Art. 13 Abs. 2 lit. f DSGVO: bei automatisierten "
"Einzelentscheidungen muessen Logik, Tragweite und Auswirkungen "
"erklaert werden. Bei KEINEM Profiling muss das explizit "
"verneint werden — sonst lassen Aufsichtsbehoerden Unsicherheit "
"offen.",
"fix_text": "Variante A (kein Profiling):\n"
" 'Es findet keine automatisierte Entscheidungsfindung "
"im Sinne des Art. 22 DSGVO statt. Sofern wir Profiling "
"zur Anzeige personalisierter Inhalte einsetzen, erfolgt "
"dies ausschliesslich auf Basis Ihrer Einwilligung und "
"wird im Abschnitt [X] erlaeutert.'\n\n"
"Variante B (Profiling vorhanden, z.B. fuer Werbung):\n"
" 'Wir nutzen Profiling zur Anzeige personalisierter "
"Werbung. Die Logik basiert auf [Klick-Historie / "
"Besuchsverhalten / Praeferenzen]. Tragweite: "
"Anpassung der angezeigten Anzeigen. Auswirkung: keine "
"rechtlichen oder erheblichen Auswirkungen — Sie koennen "
"jederzeit widersprechen unter [Link/Kontakt].'",
"where": "Datenschutzerklaerung am Ende des Abschnitts "
"'Betroffenenrechte' oder als eigener Absatz unter "
"'Automatisierte Entscheidungen'.",
"example": "Standardformulierung Variante A — falls Sie KEIN Profiling "
"betreiben, ist das der sichere Default-Text.",
"severity": "high",
},
"Konkrete Aufsichtsbehoerde benannt": {
"what": "Aufsichtsbehoerde wird nicht namentlich genannt.",
"why": "Art. 13(2)(d) DSGVO: das Beschwerderecht muss konkret "
"kommuniziert werden — nicht 'die zustaendige Behoerde', sondern "
"Name + Anschrift + Website.",
"fix_text": "Sie haben das Recht, sich bei einer Datenschutz-"
"Aufsichtsbehoerde zu beschweren. Zustaendig ist:\n\n"
" [Aufsichtsbehoerde mit Namen + Strasse + PLZ + Ort + Web]\n\n"
"Beispiel: 'Bayerisches Landesamt fuer Datenschutzaufsicht "
"(BayLDA), Promenade 18, 91522 Ansbach, www.lda.bayern.de'",
"where": "Datenschutzerklaerung im Abschnitt 'Betroffenenrechte' oder "
"'Beschwerderecht'.",
"example": "Fuer BMW AG (Sitz Muenchen, Bayern): BayLDA, Promenade 18, "
"91522 Ansbach, www.lda.bayern.de",
"severity": "high",
},
"Angemessenheitsbeschluss der Kommission": {
"what": "Drittlandtransfer erwaehnt, aber kein Hinweis auf den "
"konkreten Angemessenheitsbeschluss / DPF / SCC.",
"why": "Art. 13(1)(f) DSGVO: bei Drittlandtransfer muss der "
"Transfer-Mechanismus benannt sein (Angemessenheitsbeschluss, "
"Standardvertragsklauseln, BCR, Ausnahmen nach Art. 49).",
"fix_text": "Fuer Datenuebermittlungen in die USA stuetzen wir uns auf "
"den Angemessenheitsbeschluss der EU-Kommission vom "
"10.07.2023 (EU-US Data Privacy Framework / DPF). Sofern "
"der Empfaenger DPF-zertifiziert ist, ist die Uebermittlung "
"rechtlich abgesichert. Bei nicht-zertifizierten Empfaengern "
"ergaenzen wir EU-Standardvertragsklauseln (SCC) gemaess "
"Durchfuehrungsbeschluss 2021/914.",
"where": "Datenschutzerklaerung im Abschnitt 'Drittlandtransfer' oder "
"'Internationale Datenuebermittlung'.",
"example": "Konkret: Google LLC (USA) ist DPF-zertifiziert "
"(Zertifikat einsehbar unter dataprivacyframework.gov).",
"severity": "high",
},
"Anschrift des Verantwortlichen": {
"what": "In der Cookie-Richtlinie fehlt die vollstaendige Anschrift.",
"why": "Art. 13(1)(a) DSGVO: der Verantwortliche muss eindeutig "
"identifizierbar sein. Cookie-Richtlinie + DSE muessen "
"konsistente Angaben enthalten.",
"fix_text": "Verantwortlich fuer die Datenverarbeitung im Sinne der "
"DSGVO ist:\n [Firmenname]\n [Strasse + Hausnummer]\n "
"[PLZ + Ort]\n [Land]\n E-Mail: [...]",
"where": "Cookie-Richtlinie am Anfang ODER Verweis-Link zur DSE.",
"example": "Bayerische Motoren Werke Aktiengesellschaft, Petuelring 130, "
"80809 Muenchen, Deutschland",
"severity": "high",
},
"Konkrete Cookie-Namen aufgelistet": {
"what": "Cookies werden allgemein erwaehnt aber ohne konkrete Namen + "
"Speicherdauer.",
"why": "DSK-Orientierungshilfe Telemedien: Auflistung jedes einzelnen "
"Cookies mit Name. Generische Aussagen ('wir nutzen "
"Werbe-Cookies') sind unzureichend.",
"fix_text": "Erweitern Sie die Cookie-Tabelle um die Spalten:\n"
" Name | Anbieter | Zweck | Speicherdauer\n\n"
"Browser-Devtools (Application > Cookies) zeigt die "
"tatsaechlich gesetzten Namen — bitte Cookie-Liste "
"regelmaessig synchronisieren.",
"where": "Cookie-Richtlinie im Abschnitt 'Verwendete Cookies'.",
"example": "_ga | Google Ireland | Statistik (Besucher-ID) | 2 Jahre\n"
"_fbp | Meta Platforms IE | Werbung (Browser-ID) | 90 Tage",
"severity": "high",
},
"Konkrete Speicherdauern pro Cookie": {
"what": "Speicherdauer nur pauschal oder als generischer Bereich.",
"why": "Art. 13(2)(a) DSGVO: konkrete Speicherdauer oder Kriterien "
"fuer die Festlegung. DSK fordert Speicherdauer PRO Cookie.",
"fix_text": "Pro Cookie eine konkrete Zeitangabe in der Cookie-Tabelle "
"ergaenzen: 'Session', '24 Stunden', '90 Tage', '2 Jahre'.",
"where": "Cookie-Richtlinie in der Cookie-Tabelle.",
"example": "JSESSIONID: Session — _ga: 2 Jahre — _fbp: 90 Tage",
"severity": "high",
},
"Opt-Out-Links pro Drittanbieter": {
"what": "Pro Drittanbieter ist kein direkter Opt-Out-Link angegeben.",
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit. EuGH Planet49 "
"(C-673/17) bestaetigt: Widerruf muss so einfach wie Einwilligung sein.",
"fix_text": "Pro Drittanbieter eine Spalte 'Opt-Out' ergaenzen mit "
"direktem Link. Alternativ: zentralen 'Cookie-"
"Einstellungen aendern'-Button im Footer der Webseite + "
"Hinweis darauf in der Cookie-Richtlinie.",
"where": "Cookie-Richtlinie pro Vendor-Eintrag oder als zentraler "
"Abschnitt 'Wie kann ich widersprechen?'.",
"example": "Google Analytics: https://tools.google.com/dlpage/gaoptout\n"
"Meta Pixel: ueber Facebook-Konto-Einstellungen",
"severity": "high",
},
"Privacy-Policy-Links pro Drittanbieter": {
"what": "Pro Drittanbieter ist kein direkter Datenschutz-Link gesetzt.",
"why": "Art. 13(2)(f) DSGVO Transparenzgebot — Nutzer muss die "
"Datenverarbeitung beim Drittanbieter eigenverantwortlich "
"nachvollziehen koennen.",
"fix_text": "Pro Drittanbieter Link auf dessen Datenschutzerklaerung "
"ergaenzen. Tabelle 'Anbieter | Datenschutz-Link'.",
"where": "Cookie-Richtlinie im Drittanbieter-Listing.",
"example": "Adobe Analytics: https://www.adobe.com/privacy/policy.html",
"severity": "medium",
},
"Rechtswidriger Haftungsausschluss fuer Links": {
"what": "Klassischer Link-Disclaimer ('Wir distanzieren uns von verlinkten "
"Inhalten') ist im Impressum.",
"why": "BGH I ZR 317/01: solche Disclaimer sind rechtlich wirkungslos. "
"Sie befreien NICHT von der Stoererhaftung und koennen sogar "
"den gegenteiligen Effekt haben (Anerkennung der eigenen "
"Pruefpflicht).",
"fix_text": "Entfernen Sie pauschale Link-Disclaimer vollstaendig aus "
"dem Impressum. Bei Bedarf ein knapper, zutreffender Satz:\n"
" 'Fuer den Inhalt verlinkter externer Webseiten ist "
"ausschliesslich deren Betreiber verantwortlich.'",
"where": "Impressum am Ende des Dokuments.",
"example": "Statt 'Wir distanzieren uns ausdruecklich von allen "
"Inhalten verlinkter Seiten' — einfach nichts schreiben.",
"severity": "low",
},
"Verbraucherstreitbeilegung / OS-Plattform": {
"what": "Kein Link zur EU-OS-Plattform bzw. keine Aussage zur "
"Streitbeilegung.",
"why": "Art. 14 EU-VO 524/2013 (B2C-Anbieter, Online-Verkauf): "
"klickbarer Link auf https://ec.europa.eu/consumers/odr "
"PFLICHT. §36 VSBG: Aussage ob teilnehmend oder nicht.",
"fix_text": "Die EU-Kommission stellt eine Plattform zur Online-"
"Streitbeilegung (OS) bereit, die Sie unter "
"<a href='https://ec.europa.eu/consumers/odr'>"
"https://ec.europa.eu/consumers/odr</a> erreichen.\n\n"
"Wir sind nicht bereit oder verpflichtet, an "
"Streitbeilegungsverfahren vor einer "
"Verbraucherschlichtungsstelle teilzunehmen.",
"where": "Impressum am Ende, eigener Abschnitt 'Streitbeilegung'.",
"example": "Standardformulierung anwendbar fuer alle B2C-Anbieter ohne "
"ODR-Teilnahme.",
"severity": "high",
},
"Name der vertretungsberechtigten Person": {
"what": "Vertretungsberechtigte Person ist nicht namentlich mit "
"Funktionsbezeichnung genannt.",
"why": "§5 Abs. 1 Nr. 1 TMG: bei juristischen Personen sind die "
"Vertretungsberechtigten namentlich zu nennen.",
"fix_text": "Ergaenzen Sie nach der Firmenbezeichnung:\n"
" 'Vertreten durch: [Vorstand / Geschaeftsfuehrung] "
"[Vorname Nachname]'",
"where": "Impressum direkt nach Firmenname + Anschrift.",
"example": "Vertreten durch: Vorstandsvorsitzender Oliver Zipse",
"severity": "high",
},
}
def recipe_for(finding_key: str) -> ActionRecipe | None:
    """Look up a recipe by finding key; vendor and doc findings share one lookup."""
    if finding_key in VENDOR_FINDINGS:
        return VENDOR_FINDINGS[finding_key]
    if finding_key in DOC_CHECK_FINDINGS:
        return DOC_CHECK_FINDINGS[finding_key]
    # Fuzzy match on doc findings (labels can vary). An empty key must be
    # rejected first: "" is a substring of every label and would otherwise
    # return the first recipe.
    fk = (finding_key or "").strip().lower()
    if not fk:
        return None
    for k, v in DOC_CHECK_FINDINGS.items():
        if k.lower() in fk or fk in k.lower():
            return v
    return None
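A two-stage lookup of this kind (exact key first, then a case-insensitive substring match in either direction) can be sketched as follows. The tables `VENDOR` and `DOC` and the function `find_recipe` are hypothetical, heavily shortened stand-ins. The sketch also guards against an empty key, since the empty string is a substring of every label.

```python
from __future__ import annotations

# Hypothetical, heavily shortened recipe tables.
VENDOR = {"no_country": {"severity": "high"}}
DOC = {"Konkrete Aufsichtsbehoerde benannt": {"severity": "high"}}


def find_recipe(key: str) -> dict | None:
    """Exact match first, then case-insensitive substring in either direction."""
    if key in VENDOR:
        return VENDOR[key]
    if key in DOC:
        return DOC[key]
    needle = key.strip().lower()
    if not needle:
        # "" is contained in every string and would match the first label.
        return None
    for label, recipe in DOC.items():
        hay = label.lower()
        if hay in needle or needle in hay:
            return recipe
    return None
```

Substring matching in both directions is permissive: a very short key such as "a" would still match the first label containing that letter, so callers may want to pass full labels.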
@@ -0,0 +1,309 @@
"""
MC Embedding Match — semantic fallback for the regex-based doc_check.
The Sonnet classifier filtered MCs to `check_type='text'` (matchable
against doc text). But the regex matcher is still too strict — BMW
writes "Speicherdauer 2 Jahre", the MC pattern expects
"\\d+\\s*(Tag|Jahr)". We catch these via BGE-M3 embeddings + cosine
similarity:
1. Embed the MC's title + check_question (once, cached in sidecar)
2. Embed the doc text in overlapping 50-word chunks
3. max over chunks of cosine(MC, chunk) ≥ threshold → MC passes via "semantic"
This recovers ~50% of failed MCs at BMW-scale (estimated).
Embeddings come from bp-core-embedding-service (BGE-M3, 1024-dim,
multilingual). Sidecar SQLite stores 1024 × 4 bytes = 4KB per MC.
"""
from __future__ import annotations
import logging
import math
import os
import re
import sqlite3
import struct
from typing import Iterable
import httpx
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
DIM = 1024 # BGE-M3
SIMILARITY_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD", "0.55"))
CHUNK_SIZE_WORDS = 50
CHUNK_STRIDE = 30 # overlap so multi-sentence MCs aren't cut
# Short mandatory fields (Impressum: HRB no., VAT ID, postal address) get
# lost in 50-word chunks. We ADDITIONALLY scan the doc with finer
# 15-word windows and a looser threshold for Impressum/AVV types.
SHORT_FIELD_CHUNK_WORDS = 15
SHORT_FIELD_STRIDE = 8
SHORT_FIELD_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD_SHORT", "0.50"))
SHORT_FIELD_DOC_TYPES = {"impressum", "avv"}
# Doc-type-specific threshold overrides, calibrated on the BMW v7 run:
# at 0.55, DSE + Cookie sat systematically at 93% (over-firing because
# 8000-word texts vaguely match everything). 0.60 yields the real ~80%.
# Impressum has only 6 real MCs + short-field rescue → 0.50 is ok.
THRESHOLD_OVERRIDE = {
"impressum": 0.50,
"avv": 0.55,
"dse": 0.60,
"cookie": 0.60,
"widerruf": 0.58,
"loeschkonzept": 0.55,
"dsfa": 0.55,
}
def _ensure_schema() -> None:
"""Add embedding column to mc_classification if not present."""
try:
with sqlite3.connect(SIDECAR_DB) as c:
cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
if "embedding" not in cols:
c.execute("ALTER TABLE mc_classification ADD COLUMN embedding BLOB")
logger.info("Added embedding column to mc_classification")
except Exception as e:
logger.warning("Embedding schema migration skipped: %s", e)
def _vec_to_blob(v: list[float]) -> bytes:
return struct.pack(f"{len(v)}f", *v)
def _blob_to_vec(b: bytes) -> list[float]:
return list(struct.unpack(f"{len(b)//4}f", b))
EMBED_BATCH_SIZE = 32


async def _embed_texts(texts: list[str], timeout: float = 120.0) -> list[list[float]]:
    """Call the central embedding-service in batches; returns one vector per input.

    BGE-M3 hangs / times out on >100 inputs at once on a CPU-only host,
    so we send batches of EMBED_BATCH_SIZE and collect the results.
    Failed or malformed sub-batches yield empty vectors so the result
    stays index-aligned with `texts`.
    """
    if not texts:
        return []
    out: list[list[float]] = []
    async with httpx.AsyncClient(timeout=timeout) as client:
        for i in range(0, len(texts), EMBED_BATCH_SIZE):
            batch = texts[i:i + EMBED_BATCH_SIZE]
            try:
                r = await client.post(
                    f"{EMBEDDING_URL}/embed", json={"texts": batch},
                )
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
            except (httpx.HTTPError, ValueError) as e:
                logger.warning("Embed sub-batch [%d-%d] failed: %s %s",
                               i, i + len(batch), type(e).__name__, e)
                vecs = []
            if len(vecs) != len(batch):
                # Pad with empty vectors so the caller can still align by index
                vecs = [[] for _ in batch]
            out.extend(vecs)
    return out
async def ensure_mc_embeddings(batch_size: int = 64, force: bool = False) -> int:
    """One-shot: embed every text-MC missing an embedding. Returns the number filled.

    Embeds the title + (rough) check_question for each MC to give
    BGE-M3 enough context. Title alone is too terse for the model to
    discriminate against full-paragraph doc text.

    Idempotent: only fills NULL rows unless force=True. Safe to call on
    every run.
    """
    _ensure_schema()
    # Pull check_question from the PG source table once per call (needs
    # context that's not in the sidecar)
    try:
        import psycopg2
        pg = psycopg2.connect(os.environ["DATABASE_URL"])
        with pg.cursor() as c:
            c.execute("SELECT control_id, doc_type, title, check_question "
                      "FROM compliance.doc_check_controls")
            pg_rows = c.fetchall()
        pg.close()
        pg_lookup = {(r[0], r[1] or ""): (r[2] or "", r[3] or "") for r in pg_rows}
    except Exception as e:
        logger.warning("ensure_mc_embeddings PG load failed: %s", e)
        pg_lookup = {}
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            where = "WHERE check_type = 'text'" + ("" if force else " AND embedding IS NULL")
            rows = c.execute(
                f"SELECT control_id, doc_type, title FROM mc_classification {where}"
            ).fetchall()
    except Exception as e:
        logger.warning("ensure_mc_embeddings query failed: %s", e)
        return 0
    if not rows:
        return 0
    logger.info("Embedding %d text-MCs (force=%s) via %s ...",
                len(rows), force, EMBEDDING_URL)
    done = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Compose "title. check_question" so the embedding captures both
        # the topic (title) and the concrete check phrasing (question).
        # That helps BMW's actual policy language land in the same vector
        # neighbourhood as our control wording.
        texts: list[str] = []
        for cid, dt, t in batch:
            title_text, question = pg_lookup.get((cid, dt or ""), (t or "", ""))
            combined = f"{title_text}. {question}".strip()
            texts.append(combined[:600])
        try:
            embs = await _embed_texts(texts)
        except Exception as e:
            logger.warning("Embed batch failed (i=%d): %s", i, e)
            continue
        with sqlite3.connect(SIDECAR_DB) as c:
            for (cid, dt, _t), vec in zip(batch, embs):
                if not vec or len(vec) != DIM:
                    continue
                c.execute(
                    "UPDATE mc_classification SET embedding = ? "
                    "WHERE control_id = ? AND doc_type = ?",
                    (_vec_to_blob(vec), cid, dt),
                )
                done += 1  # count only rows that actually received a vector
            c.commit()
    logger.info("ensure_mc_embeddings: filled %d/%d", done, len(rows))
    return done
def _chunk_text(text: str, size: int = CHUNK_SIZE_WORDS,
stride: int = CHUNK_STRIDE) -> list[str]:
"""Sliding-window chunking — overlap helps catch MCs that span 2 sentences."""
words = re.findall(r"\S+", text or "")
if len(words) <= size:
return [" ".join(words)] if words else []
out: list[str] = []
i = 0
while i < len(words):
out.append(" ".join(words[i:i + size]))
i += stride
return out
def _cosine(a: list[float], b: list[float]) -> float:
"""Plain Python cosine — fast enough for our scale, no numpy import."""
if not a or not b or len(a) != len(b):
return 0.0
dot = sum(x * y for x, y in zip(a, b))
na = math.sqrt(sum(x * x for x in a))
nb = math.sqrt(sum(y * y for y in b))
if na == 0 or nb == 0:
return 0.0
return dot / (na * nb)
async def embedding_match(
doc_text: str,
mc_records: Iterable[dict],
doc_type: str | None = None,
threshold: float | None = None,
) -> set[str]:
"""Return the subset of MC control_ids that semantically match doc_text.
For Impressum/AVV-types we ADDITIONALLY scan the doc with smaller
15-word windows and a looser threshold so that short Pflichtfelder
(HRB, USt-IdNr, postal address) land in their own chunk and aren't
diluted by 50-word neighbourhoods of unrelated text.
"""
if not doc_text or not mc_records:
return set()
candidates = list(mc_records)
if not candidates:
return set()
cid_set = {c.get("control_id") for c in candidates if c.get("control_id")}
if not cid_set:
return set()
try:
with sqlite3.connect(SIDECAR_DB) as c:
placeholders = ",".join("?" * len(cid_set))
q = ("SELECT control_id, embedding FROM mc_classification "
f"WHERE control_id IN ({placeholders}) "
"AND check_type='text' AND embedding IS NOT NULL")
params = list(cid_set)
if doc_type:
q += " AND doc_type = ?"
params.append(doc_type)
rows = c.execute(q, params).fetchall()
except Exception as e:
logger.warning("embedding lookup failed: %s", e)
return set()
if not rows:
return set()
mc_embeddings = {cid: _blob_to_vec(blob) for cid, blob in rows}
effective_threshold = threshold if threshold is not None else THRESHOLD_OVERRIDE.get(
(doc_type or "").lower(), SIMILARITY_THRESHOLD)
chunks = _chunk_text(doc_text)
if not chunks:
return set()
try:
chunk_vecs = await _embed_texts(chunks)
except Exception as e:
logger.warning("doc chunk embedding failed: %s %s",
type(e).__name__, e or "(empty msg)", exc_info=True)
return set()
# Filter empty vectors (failed sub-batches return [] placeholders)
chunk_vecs = [v for v in chunk_vecs if v and len(v) == DIM]
if not chunk_vecs:
logger.warning("doc chunk embedding: no usable vectors (all batches failed)")
return set()
matched: set[str] = set()
for cid, mc_vec in mc_embeddings.items():
best = max((_cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
if best >= effective_threshold:
matched.add(cid)
# Short-field rescue pass for Impressum-type docs: small windows +
# looser threshold catch one-line Pflichtfelder that 50-word chunks
# dilute (HRB-Nr, USt-IdNr, postal address). Only runs for MCs not
# yet matched in the main pass.
if (doc_type or "").lower() in SHORT_FIELD_DOC_TYPES:
unmatched = {cid: vec for cid, vec in mc_embeddings.items() if cid not in matched}
if unmatched:
short_chunks = _chunk_text(doc_text, size=SHORT_FIELD_CHUNK_WORDS,
stride=SHORT_FIELD_STRIDE)
try:
short_vecs = await _embed_texts(short_chunks)
except Exception as e:
logger.warning("short-chunk embedding failed: %s", e)
short_vecs = []
if short_vecs:
short_passes = 0
for cid, mc_vec in unmatched.items():
best = max((_cosine(mc_vec, cv) for cv in short_vecs), default=0.0)
if best >= SHORT_FIELD_THRESHOLD:
matched.add(cid)
short_passes += 1
if short_passes:
logger.info(
"embedding short-field rescue for %s: +%d MCs (threshold %.2f, %d chunks)",
doc_type, short_passes, SHORT_FIELD_THRESHOLD, len(short_chunks),
)
logger.info(
"embedding match for %s: %d/%d MCs passed semantic threshold (main=%.2f)",
doc_type or "?", len(matched), len(mc_embeddings), effective_threshold,
)
return matched
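The core of the semantic pass (overlapping word windows, cosine similarity, best chunk compared against a threshold) can be tried without the embedding service. This is a minimal sketch under stated assumptions: toy 2-dimensional vectors stand in for the 1024-dim BGE-M3 embeddings, and `chunk_words`, `cosine` and `passes` are illustrative names, not the module's functions.

```python
from __future__ import annotations

import math
import re


def chunk_words(text: str, size: int = 50, stride: int = 30) -> list[str]:
    """Sliding word windows; consecutive windows overlap by size - stride words."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(0, len(words), stride)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for empty, mismatched, or zero-length vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def passes(mc_vec: list[float], chunk_vecs: list[list[float]], threshold: float) -> bool:
    """An MC passes when its best-matching chunk reaches the threshold."""
    best = max((cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
    return best >= threshold


# 100 words with size=50 / stride=30 start windows at words 0, 30, 60 and 90.
chunks = chunk_words(" ".join(f"w{i}" for i in range(100)))
print(len(chunks), len(chunks[-1].split()))

# Toy vectors: similarities to mc are 0.0 and 0.6.
mc = [1.0, 0.0]
doc_chunks = [[0.0, 1.0], [3.0, 4.0]]
print(passes(mc, doc_chunks, 0.55), passes(mc, doc_chunks, 0.70))
```

Note that the last windows of a long text are shorter than `size`, so they behave a little like the short-field pass: a brief trailing passage is compared on its own rather than diluted by neighbours.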
@@ -101,11 +101,36 @@ def build_scorecard(check_results: list[dict]) -> dict:
}
_DEDUP_KEYWORDS = [
"einfache sprache", "verstaendliche sprache", "verständliche sprache",
"klare sprache", "einwilligungstexte", "einwilligungsaufforderung",
"einwilligungserklaerung", "einwilligungserklärung",
"mehrdeutige", "verstaendliche form", "verständliche form",
"fachbegriffe erklaeren", "fachbegriffe erklären",
]
def _dedup_key(label: str) -> str:
"""Cluster label to a stable dedup-key: if it contains one of the
well-known repetitive Sprache/Einwilligungs-Aufforderungs-Concepts,
collapse them all to that single concept. Otherwise return original."""
l = (label or "").lower()
for kw in _DEDUP_KEYWORDS:
if kw in l:
return f"_dup:{kw}"
return label
def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
"""Return top-N failing MCs sorted by severity then label.
Skipped + passed MCs are excluded. INFO severity is excluded by
default since those are guidance, not findings.
Near-duplicates (multiple MCs that all complain about "einfache
Sprache" / "Einwilligungsaufforderung" / ...) are collapsed to ONE
representative entry; otherwise UI-language hints dominate the
top list and real leaks get buried.
"""
fails = [
r for r in (check_results or [])
@@ -116,7 +141,17 @@ def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
_SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
r.get("label", ""),
))
return fails[:n]
seen_keys: set[str] = set()
deduped: list[dict] = []
for r in fails:
k = _dedup_key(r.get("label", ""))
if k in seen_keys:
continue
seen_keys.add(k)
deduped.append(r)
if len(deduped) >= n:
break
return deduped
def full_audit_records(
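The near-duplicate collapsing in `top_fails` can be sketched on its own. This is a minimal sketch: the two keywords are taken from `_DEDUP_KEYWORDS`, while `top_unique` and the sample findings are hypothetical, and the input is assumed to be sorted by severity already.

```python
from __future__ import annotations

# Two of the clustering keywords from the diff; the findings below are hypothetical.
DEDUP_KEYWORDS = ["einfache sprache", "einwilligungsaufforderung"]


def dedup_key(label: str) -> str:
    """Labels containing a known keyword share one key; others keep their label."""
    low = (label or "").lower()
    for kw in DEDUP_KEYWORDS:
        if kw in low:
            return f"_dup:{kw}"
    return label


def top_unique(sorted_fails: list[dict], n: int) -> list[dict]:
    """Keep the first finding per dedup key, up to n entries."""
    seen: set[str] = set()
    out: list[dict] = []
    for r in sorted_fails:
        key = dedup_key(r.get("label", ""))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
        if len(out) >= n:
            break
    return out


fails = [
    {"label": "Einwilligung in einfache Sprache", "severity": "HIGH"},
    {"label": "Cookie-Banner: einfache Sprache verwenden", "severity": "HIGH"},
    {"label": "Speicherdauer fehlt", "severity": "MEDIUM"},
]
print([r["label"] for r in top_unique(fails, n=10)])
```

Since the first entry per key survives, the representative is always the highest-ranked finding of its cluster.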
@@ -37,6 +37,7 @@ async def check_document_with_controls(
db_url: str = "",
max_controls: int = 0, # 0 = no limit, check ALL
use_agent: bool = False, # Use LLM agent for intelligent evaluation
business_scope: set[str] | None = None,
) -> list[dict]:
"""Check document against ALL doc_check_controls for this doc_type.
@@ -56,7 +57,7 @@ async def check_document_with_controls(
mapped_type = _map_doc_type(doc_type)
# Load ALL controls for this doc_type
controls = await _load_controls(mapped_type, db_url, max_controls)
controls = await _load_controls(mapped_type, db_url, max_controls, business_scope)
if not controls:
logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
return []
@@ -71,6 +72,31 @@ async def check_document_with_controls(
if result:
results.append(result)
# Semantic fallback (Phase 3): MCs that failed via regex get a second
# chance via BGE-M3 cosine similarity. BMW writes "Speicherdauer 2
# Jahre" — the regex misses, embedding catches it.
failed_ids = {r.get("control_id") for r in results
if not r.get("passed") and r.get("control_id")}
if failed_ids:
try:
from compliance.services.mc_embedding_matcher import (
ensure_mc_embeddings, embedding_match,
)
await ensure_mc_embeddings() # idempotent: only embeds new MCs
failed_mcs = [c for c in controls if c.get("control_id") in failed_ids]
semantic_passes = await embedding_match(
text, failed_mcs, doc_type=mapped_type,
)
if semantic_passes:
for r in results:
cid = r.get("control_id")
if cid and cid in semantic_passes and not r.get("passed"):
r["passed"] = True
r["matched_text"] = "[semantischer Treffer via Embedding]"
r["hint"] = (r.get("hint") or "") + " (passed via Embedding-Match, BGE-M3 cosine)"
except Exception as e:
logger.warning("Embedding fallback skipped: %s", e, exc_info=True)
passed = sum(1 for r in results if r["passed"])
failed_results = [r for r in results if not r["passed"]]
logger.info("MC results: %d passed, %d failed out of %d for '%s'",
@@ -161,6 +187,7 @@ def _check_mc_deterministic(text_lower: str, mc: dict) -> Optional[dict]:
return {
"id": f"mc-{control_id}",
"control_id": control_id,
"label": mc.get("title", "")[:80],
"passed": passed,
"severity": severity,
@@ -266,11 +293,72 @@ _MC_ALIAS_FALLBACK = {
}
async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
def _load_text_only_ids(
doc_type: str | None = None,
business_scope: set[str] | None = None,
) -> set[str]:
"""Return control_ids that the Sonnet-classifier flagged as 'text'.
Filters applied:
1. check_type='text' (only doc-text-matchable MCs)
2. doc_type matches (per-doc-type variant from v2-Sidecar)
3. fits_doc_type=1 (LLM auditor approved this MC for this doc_type)
4. scope_requires NULL or contained in business_scope
(e.g. MCs with scope_requires='biometric_processing' are skipped
on sites that don't do biometric processing; the Art. 22 FRT MC was
a false positive for BMW)
`business_scope` comes from the business_profiler (set of detected
site characteristics like 'b2c', 'shop', 'biometric_processing',
'ai_decision_making', 'child_targeting').
Returns empty set if the sidecar doesn't exist yet.
"""
import sqlite3
db_path = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
try:
with sqlite3.connect(db_path) as c:
cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
has_fit = "fits_doc_type" in cols
has_scope = "scope_requires" in cols
fit_clause = " AND (fits_doc_type IS NULL OR fits_doc_type = 1)" if has_fit else ""
base = ("SELECT control_id, scope_requires FROM mc_classification "
"WHERE check_type = 'text'" + fit_clause) if has_scope else (
"SELECT control_id, NULL FROM mc_classification "
"WHERE check_type = 'text'" + fit_clause)
params: list = []
if doc_type:
base += " AND doc_type = ?"
params.append(doc_type)
rows = c.execute(base, params).fetchall()
scope = business_scope or set()
keep: set[str] = set()
for cid, req in rows:
if not req:
keep.add(cid)
else:
# Multiple requirements separated by '|' — ALL must
# be in scope to include. Empty req tokens are skipped.
needed = {r.strip().lower() for r in req.split("|") if r.strip()}
if needed.issubset({s.lower() for s in scope}):
keep.add(cid)
return keep
except sqlite3.OperationalError:
return set()
except Exception as e:
logger.warning("MC classification lookup failed: %s", e)
return set()
async def _load_controls(doc_type: str, db_url: str, limit: int,
business_scope: set[str] | None = None) -> list[dict]:
"""Load all doc_check_controls for a doc_type from PostgreSQL.
Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
type (e.g. 'nutzungsbedingungen' -> 'agb').
Filters to only check_type='text' MCs when the classification sidecar
is present; process/review MCs are routed to other modules.
"""
try:
import asyncpg
@@ -297,7 +385,17 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
fallback = _MC_ALIAS_FALLBACK[doc_type]
logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
rows = await conn.fetch(query, fallback)
return [dict(r) for r in rows]
controls = [dict(r) for r in rows]
text_only = _load_text_only_ids(doc_type, business_scope)
if text_only:
before = len(controls)
controls = [c for c in controls if c.get("control_id") in text_only]
logger.info(
"MC filter (text only) for %s: %d/%d MCs after Sonnet check_type filter",
doc_type, len(controls), before,
)
return controls
except Exception as e:
logger.warning("MC query failed: %s", e)
return []
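The `scope_requires` rule used by `_load_text_only_ids` (all `|`-separated requirements must be present in the business scope, compared case-insensitively) can be isolated as a small predicate. This is a minimal sketch; `scope_allows` is a hypothetical helper name, not a function from the diff.

```python
from __future__ import annotations


def scope_allows(scope_requires: str | None, business_scope: set[str] | None) -> bool:
    """True when the MC has no requirement or every requirement is in scope."""
    if not scope_requires:
        return True
    # Empty tokens (e.g. from a trailing '|') are ignored.
    needed = {t.strip().lower() for t in scope_requires.split("|") if t.strip()}
    have = {s.lower() for s in (business_scope or set())}
    return needed.issubset(have)


print(scope_allows(None, {"b2c"}))
print(scope_allows("biometric_processing", {"b2c", "shop"}))
print(scope_allows("biometric_processing|ai_decision_making",
                   {"Biometric_Processing", "ai_decision_making"}))
```

A requirement string that contains only separators yields an empty requirement set, which is a subset of any scope, so such an MC is kept.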
@@ -0,0 +1,407 @@
"""
Vendor cost estimator: derives a pricing tier per vendor from cookie
signals and returns an intensity-based yearly cost estimate.
Cookie signals we evaluate:
- Number of cookies per vendor (proxy for module depth)
- Premium-feature cookies (e.g. 's_target_qa', '_ab_test' → enterprise add-on)
- Edge/region cookies (multi-region → premier-tier CDN)
- Cookie persistence (multi-year → heavy-tracking license)
Plus business_profile for company-tier inference.
Output per vendor:
- inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
- tier_signals: list of the indicators that led to the tier
- cost_year_eur_range: (low, high) based on tier × vendor pricing
- confidence: 'low' | 'medium' | 'high' ('none' if the vendor has no pricing entry)
This module complements vendor_redundancy.py: the simple low/high flat
rates there are replaced here by dynamic, signal-based values.
"""
from __future__ import annotations
import logging
import re
from typing import Iterable
logger = logging.getLogger(__name__)
# ─── Premium-feature cookies: indicator for enterprise add-ons ──────
#
# If a vendor sets these cookies, the customer is very likely on an
# enterprise plan.
_PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
# (regex, vendor_key, premium_feature_label)
(r"^s_target_qa$", "adobe analytics", "Adobe Target Add-on"),
(r"adobe.*target", "adobe target", "Personalization Enterprise"),
(r"^aam_uuid", "adobe analytics", "Audience Manager Enterprise"),
(r"^s_ecid", "adobe analytics", "Experience Cloud ID Service"),
(r"^_pcid_", "adobe analytics", "People-Based Destinations"),
(r"^_gat_gtag_UA", "google analytics", "GA360 Multi-Tracker"),
(r"^_ga_[A-Z0-9]+_[A-Z0-9]+", "google analytics", "GA4 Enterprise Stream"),
(r"^_uetmsdns", "microsoft advertising", "Custom Conversion Tracking"),
(r"^_fbp.*test", "meta pixel", "Conversions API Premium"),
(r"^_pin_unauth_premium", "pinterest", "Pinterest Premium-API"),
(r"^afm", "adform", "Affinity-Module"),
(r"^cto_dna", "criteo", "Dynamic Retargeting Premium"),
# CDN / Infra Premium
(r"^aws-alb-[a-z0-9]+", "amazon web services", "ALB + Multi-Region"),
(r"^aws-waf", "amazon web services", "WAF Enterprise"),
(r"^cf_clearance", "cloudflare", "Bot-Management Pro"),
(r"^akm_[a-z]+", "akamai", "Adaptive Media Delivery Enterprise"),
# Salesforce Customer-360
(r"^bid_n_", "salesforce", "Marketing Cloud Personalization"),
(r"^_cs_", "salesforce", "CDP Premium"),
]
# ─── Tier pricing per vendor (yearly, EUR) ──────────────────────────
#
# 4 tiers: starter (SME), professional (mid-market), enterprise (large
# corporation), premier (global brand / heavy user).
_TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
"adobe analytics": {
"starter": ( 10_000, 30_000),
"professional": ( 60_000, 150_000),
"enterprise": (200_000, 500_000),
"premier": (500_000, 900_000),
},
"adobe target": {
"starter": ( 8_000, 25_000),
"professional": ( 40_000, 100_000),
"enterprise": (120_000, 300_000),
"premier": (300_000, 600_000),
},
"adobe campaign": {
"starter": ( 10_000, 30_000),
"professional": ( 40_000, 100_000),
"enterprise": (120_000, 280_000),
"premier": (280_000, 500_000),
},
"google analytics": {
"starter": ( 0, 0), # GA4 free
"professional": ( 0, 0),
"enterprise": ( 80_000, 150_000), # GA360
"premier": (150_000, 300_000),
},
"matomo": {
"starter": ( 0, 3_000), # On-prem free / Cloud Starter
"professional": ( 6_000, 20_000),
"enterprise": ( 20_000, 80_000),
"premier": ( 60_000, 150_000),
},
"content square": {
"starter": ( 12_000, 40_000),
"professional": ( 60_000, 150_000),
"enterprise": (150_000, 350_000),
"premier": (350_000, 700_000),
},
"contentsquare": {
"starter": ( 12_000, 40_000),
"professional": ( 60_000, 150_000),
"enterprise": (150_000, 350_000),
"premier": (350_000, 700_000),
},
"dynatrace": {
"starter": ( 5_000, 15_000),
"professional": ( 30_000, 80_000),
"enterprise": (100_000, 300_000),
"premier": (300_000, 800_000),
},
"qualtrics": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 200_000),
"premier": (200_000, 500_000),
},
# Advertising / Retargeting (license + self-service; media spend is SEPARATE)
"criteo": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 250_000),
"premier": (250_000, 600_000),
},
"adform": {
"starter": ( 12_000, 40_000),
"professional": ( 60_000, 150_000),
"enterprise": (150_000, 400_000),
"premier": (400_000, 800_000),
},
"outbrain": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 200_000),
"premier": (200_000, 500_000),
},
"taboola": {
"starter": ( 6_000, 20_000),
"professional": ( 30_000, 80_000),
"enterprise": ( 80_000, 200_000),
"premier": (200_000, 500_000),
},
"teads": {
"starter": ( 6_000, 18_000),
"professional": ( 20_000, 60_000),
"enterprise": ( 60_000, 150_000),
"premier": (150_000, 350_000),
},
"pinterest": {
"starter": ( 3_000, 15_000),
"professional": ( 15_000, 50_000),
"enterprise": ( 50_000, 150_000),
"premier": (150_000, 400_000),
},
"linkedin insight": {
"starter": ( 3_000, 12_000),
"professional": ( 12_000, 40_000),
"enterprise": ( 40_000, 120_000),
"premier": (120_000, 300_000),
},
# CDN / Cloud
"akamai": {
"starter": ( 20_000, 60_000),
"professional": ( 80_000, 200_000),
"enterprise": (200_000, 500_000),
"premier": (500_000, 1_500_000),
},
"amazon web services": {
"starter": ( 12_000, 60_000),
"professional": ( 60_000, 300_000),
"enterprise": (300_000, 1_500_000),
"premier": (1_500_000, 8_000_000),
},
"baqend": {
"starter": ( 3_000, 12_000),
"professional": ( 12_000, 40_000),
"enterprise": ( 40_000, 120_000),
"premier": (120_000, 300_000),
},
"speedkit": {
"starter": ( 3_000, 12_000),
"professional": ( 12_000, 40_000),
"enterprise": ( 40_000, 120_000),
"premier": (120_000, 300_000),
},
"speedcurve": {
"starter": ( 1_200, 4_800),
"professional": ( 6_000, 18_000),
"enterprise": ( 18_000, 60_000),
"premier": ( 60_000, 120_000),
},
# CRM / Marketing
"salesforce": {
"starter": ( 20_000, 60_000),
"professional": ( 80_000, 250_000),
"enterprise": (250_000, 800_000),
"premier": (800_000, 2_500_000),
},
"genesys": {
"starter": ( 24_000, 80_000),
"professional": ( 80_000, 250_000),
"enterprise": (250_000, 800_000),
"premier": (800_000, 2_000_000),
},
# Captcha
"hcaptcha": {
"starter": ( 0, 2_400),
"professional": ( 2_400, 12_000),
"enterprise": ( 12_000, 40_000),
"premier": ( 40_000, 100_000),
},
# Lead-Tracking
"salesviewer": {
"starter": ( 1_200, 3_600),
"professional": ( 3_600, 12_000),
"enterprise": ( 12_000, 40_000),
"premier": ( 40_000, 100_000),
},
}
def _vendor_key(vendor_name: str) -> str | None:
"""Map a vendor name to a known pricing-table key."""
n = (vendor_name or "").lower()
for k in _TIER_PRICING:
if k in n:
return k
return None
def infer_company_tier(business_profile: dict | None) -> str:
"""Coarse company-tier from business profile.
Used as the baseline when vendor-specific signals are weak.
"""
if not business_profile:
return "professional"
bp = business_profile
features = {f.lower() for f in (bp.get("features") or [])}
btype = (bp.get("type") or "").lower()
# Heavy enterprise-only signals
if any(f in features for f in ("multi_country", "konzern", "enterprise",
"international", "automotive", "banking",
"luxury", "premium")):
return "premier"
# Large but maybe single-country
if "shop" in features or "konfigurator" in features or btype == "b2c":
return "enterprise"
return "professional"
def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
"""Infer pricing tier for a single vendor from its cookie footprint.
Signals (each hit is recorded in tier_signals; the tier is capped at 'premier'):
- cookie_count >= 60 → +1 tier
- cookie_count >= 30 → signal only, no tier change
- premium-feature cookie hit → +1 tier per matching pattern
- >= 60% third-party cookies (with >= 10 cookies) → signal only (heavy-tracking signal)
- >= 3 cookies with expiry >= 2 years → signal only
"""
cookies = vendor.get("cookies") or []
n_cookies = len(cookies)
cookie_names = [c.get("name", "").lower() for c in cookies]
signals: list[str] = []
base_tiers = ["starter", "professional", "enterprise", "premier"]
# Start at company-tier as baseline
idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1
if n_cookies >= 60:
idx = min(len(base_tiers) - 1, idx + 1)
signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
elif n_cookies >= 30:
signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")
# Premium feature detection
vk = _vendor_key(vendor.get("name", ""))
for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
if vk and vk != expected_key and expected_key not in (vendor.get("name") or "").lower():
continue
for cn in cookie_names:
if re.search(pattern, cn, re.IGNORECASE):  # names are lower-cased, some patterns contain upper case
idx = min(len(base_tiers) - 1, idx + 1)
signals.append(f"Premium-Feature-Cookie: {feature_label}")
break
# Heavy third-party tracking
third_party_ratio = sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
if third_party_ratio >= 0.6 and n_cookies >= 10:
signals.append(f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies — Tracking-Heavy")
# Long-lived cookies
long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
if long_lived >= 3:
signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")
return base_tiers[idx], signals
def _expiry_years(expiry_str: str) -> float:
"""Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → 0.25"""
s = (expiry_str or "").lower()
m = re.search(r"(\d+)\s*(jahr|year)", s)
if m: return float(m.group(1))
m = re.search(r"(\d+)\s*(monat|month)", s)
if m: return float(m.group(1)) / 12.0
m = re.search(r"(\d+)\s*(tag|day)", s)
if m: return float(m.group(1)) / 365.0
return 0.0
def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
"""Return cost estimation for one vendor incl. tier inference + signals."""
vk = _vendor_key(vendor.get("name", ""))
company_tier = infer_company_tier(business_profile)
if not vk:
return {
"vendor": vendor.get("name", ""),
"matched_pricing_key": None,
"inferred_tier": None,
"tier_signals": [],
"company_tier_baseline": company_tier,
"cost_year_eur_range": (0, 0),
"confidence": "none",
"note": "Kein Pricing-Eintrag fuer diesen Anbieter — Saving-Schaetzung uebergangen.",
}
tier, signals = infer_vendor_tier(vendor, company_tier)
pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")
return {
"vendor": vendor.get("name", ""),
"matched_pricing_key": vk,
"inferred_tier": tier,
"tier_signals": signals,
"company_tier_baseline": company_tier,
"cost_year_eur_range": pricing,
"confidence": confidence,
}
def estimate_total_stack_cost(
vendors: Iterable[dict],
business_profile: dict | None = None,
) -> dict:
"""Aggregate cost estimation over all vendors.
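Returns a dict with:
- per_vendor: list of estimates (one entry per vendor)
- total_year_eur_range: summed (low, high) range
- master_contracts_counted: number of distinct (pricing key, internal/external) pairs counted
- disclaimer: user-facing note on how to read the estimate
Master-contract dedup: vendors with recipient_type 'INTERNAL' (lines
prefixed with the site owner's name) are bundled into ONE master
contract per pricing key and are not double-counted.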
"""
per_vendor: list[dict] = []
seen_master_keys: set[tuple[str, str]] = set()
total_low = 0
total_high = 0
for v in vendors:
est = estimate_vendor_cost(v, business_profile)
per_vendor.append(est)
if not est["matched_pricing_key"]:
continue
rtype = (v.get("recipient_type") or "").upper()
master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
if rtype == "INTERNAL" and master_key in seen_master_keys:
# Same Adobe contract serves many "BMW AG — Adobe XYZ" lines —
# count cost only ONCE per (key, internal).
est["bundled_into_master_contract"] = True
continue
seen_master_keys.add(master_key)
lo, hi = est["cost_year_eur_range"]
total_low += lo
total_high += hi
return {
"per_vendor": per_vendor,
"total_year_eur_range": (total_low, total_high),
"master_contracts_counted": len(seen_master_keys),
"disclaimer": (
"Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
"Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
"koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
"Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
),
}
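The master-contract rule in `estimate_total_stack_cost` (several INTERNAL lines with the same pricing key count once, external recipients always count) can be checked with a small standalone aggregation. This is a minimal sketch: `total_range` and the flat `pricing_key` / `range` fields are hypothetical simplifications of the estimate dicts, and the figures are sample values from the enterprise rows of the pricing table.

```python
from __future__ import annotations


def total_range(estimates: list[dict]) -> tuple[int, int]:
    """Sum (low, high) ranges, counting each INTERNAL pricing key only once."""
    seen: set[tuple[str, str]] = set()
    low = high = 0
    for est in estimates:
        internal = (est.get("recipient_type") or "").upper() == "INTERNAL"
        key = (est["pricing_key"], "INTERNAL" if internal else "EXT")
        if internal and key in seen:
            continue  # same master contract, already counted
        seen.add(key)
        lo, hi = est["range"]
        low += lo
        high += hi
    return low, high


estimates = [
    {"pricing_key": "adobe analytics", "recipient_type": "INTERNAL", "range": (200_000, 500_000)},
    {"pricing_key": "adobe analytics", "recipient_type": "INTERNAL", "range": (200_000, 500_000)},
    {"pricing_key": "criteo", "recipient_type": "PROCESSOR", "range": (80_000, 250_000)},
]
print(total_range(estimates))  # (280000, 750000)
```

The second Adobe line is skipped, so the total reflects one Adobe contract plus the external Criteo entry.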
@@ -0,0 +1,727 @@
"""
Vendor Redundancy + EU-Alternatives Analyzer.
Input: list of vendors from the CMP capture (e.g. BMW: 90 vendors).
Output: four structured lists that are rendered in the email and the
migration modal:
1. functional_categories : vendor → functional class (analytics,
advertising, cdn, captcha, chat, …)
2. redundancies : categories with ≥ 2 vendors that do the same thing
→ consolidation potential
3. eu_alternatives : per US vendor, a matching EU replacement from a
curated lookup table (Matomo instead of
Adobe Analytics, IONOS instead of AWS, etc.)
4. multi_function_tools : EU tools that cover several categories
(e.g. SAP CX = analytics + CRM + marketing)
"""
from __future__ import annotations
import logging
import re
from collections import defaultdict
from typing import Iterable
logger = logging.getLogger(__name__)
# ─── Categorisation ───────────────────────────────────────────────────
# Substring match (lowercase) → category. First match wins.
_CATEGORY_RULES: list[tuple[str, str]] = [
# Web Analytics / Behavior
("adobe analytics", "web_analytics"),
("adobe target", "personalisation"),
("adobe campaign", "marketing_automation"),
("adobe staging library", "tag_management"),
("adobelaunch", "tag_management"),
("google analytics", "web_analytics"),
("matomo", "web_analytics"),
("hotjar", "web_analytics"),
("content square", "web_analytics"),
("contentsquare", "web_analytics"),
("dynatrace", "monitoring"),
("performance analytics", "web_analytics"),
("form analytics", "web_analytics"),
("form campaign analytics","web_analytics"),
("psyma", "survey"),
("qualtrics", "survey"),
# Tag Management
("google tag manager", "tag_management"),
("gtm", "tag_management"),
# Advertising / Retargeting
("google ads", "advertising"),
("google advertising", "advertising"),
("doubleclick", "advertising"),
("googleads", "advertising"),
("meta pixel", "advertising"),
("meta platforms", "advertising"),
("facebook", "advertising"),
("adform", "advertising"),
("criteo", "advertising"),
("outbrain", "advertising"),
("taboola", "advertising"),
("teads", "advertising"),
("pinterest", "advertising"),
("linkedin insight", "advertising"),
("youtube performance", "advertising"),
("youtube player", "external_media"),
("amazon advertising", "advertising"),
("instagram", "advertising"),
("dotaki", "advertising"),
# Video / Embeds
("youtube", "external_media"),
("vimeo", "external_media"),
("jw player", "external_media"),
("jw video", "external_media"),
("jwplayer", "external_media"),
("jwconnatix", "external_media"),
# Maps / Geo
("google maps", "maps"),
("google geolocation", "maps"),
("geolocation", "maps"),
# CDN / Infrastructure
("akamai", "cdn"),
("amazon web services", "cloud_infra"),
("aws", "cloud_infra"),
("baqend", "cdn"),
("speedkit", "cdn"),
("speedcurve", "monitoring"),
("salesforce", "crm"),
# Chat / Support
("genesys", "chat"),
("ckm", "chat"),
("chat widget", "chat"),
# Captcha / Bot-Protection
("hcaptcha", "captcha"),
("recaptcha", "captcha"),
# Sales / Lead-Tracking
("salesviewer", "lead_tracking"),
# Marketing/Sales overlay
("nayoki", "social_aggregator"),
# Site-internal functions
("infrastructure", "site_infra"),
("infrastrukturbereit", "site_infra"),
("javaserverpages", "site_infra"),
("single sign-on", "auth"),
("mybmw account", "auth"),
("sso", "auth"),
("consent", "consent_management"),
("session", "site_infra"),
("scroll", "site_infra"),
("sticky", "site_infra"),
("sidebar", "site_infra"),
("dealer search", "site_feature"),
("test drive", "site_feature"),
("vehicle configurator", "site_feature"),
("stocklocator", "site_feature"),
("eshop", "site_feature"),
("shop", "site_feature"),
("language", "site_infra"),
("sprach", "site_infra"),
("region", "site_infra"),
("ip popup", "site_infra"),
("popup", "site_infra"),
]
def classify_vendor(name: str) -> str:
    """Map a vendor name to a functional category."""
    n = (name or "").lower()
    for needle, cat in _CATEGORY_RULES:
        if needle in n:
            return cat
    return "other"
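# Illustrative sketch (hypothetical names, not part of this module's API):
# because the first matching needle wins, rule order decides the category.
# Assuming a tiny two-entry rule list, a generic needle placed before a
# specific one shadows it.
def _demo_classify(name: str, rules: list[tuple[str, str]]) -> str:
    """First-match substring lookup over an ordered rule list."""
    lowered = (name or "").lower()
    for needle, category in rules:
        if needle in lowered:
            return category
    return "other"


_DEMO_SPECIFIC_FIRST = [
    ("youtube performance", "advertising"),
    ("youtube", "external_media"),
]
_DEMO_GENERIC_FIRST = list(reversed(_DEMO_SPECIFIC_FIRST))
# With _DEMO_SPECIFIC_FIRST, "YouTube Performance Ads" is classified as
# "advertising"; with _DEMO_GENERIC_FIRST the same name ends up in
# "external_media" because the generic "youtube" needle matches first.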
# ─── EU alternatives ─────────────────────────────────────────────────
# Curated list: suitable EU replacement(s) for each US/non-EU vendor.
# Sources: Matomo comparison, etracker SoMo study, IONOS packages,
# Friendly Captcha whitepaper, SAP CX suite, Brevo / CleverReach DE lists.
_EU_ALTERNATIVES: dict[str, list[dict]] = {
"adobe analytics": [
{"name": "Matomo (On-Premise)", "vendor": "InnoCraft", "country": "DE-self-hosted",
"license": "GPL", "notes": "100% DSGVO, keine 3rd-Country, gleicher Funktionsumfang"},
{"name": "etracker Analytics", "vendor": "etracker GmbH", "country": "DE",
"license": "Commercial", "notes": "DSGVO-konform aus Hamburg, IP-Anonymisierung"},
{"name": "Mapp Intelligence", "vendor": "Mapp Digital", "country": "DE",
"license": "Commercial", "notes": "Enterprise-Alternative, Server in DE"},
],
"google analytics": [
{"name": "Matomo", "vendor": "InnoCraft", "country": "DE-self-hosted",
"license": "GPL", "notes": "Direkter Drop-in-Ersatz mit GA-Migrationspfad"},
{"name": "Plausible Analytics", "vendor": "Plausible Insights", "country": "EE",
"license": "AGPL/Commercial", "notes": "Cookielos, ohne Einwilligung nutzbar"},
{"name": "Fathom Analytics EU", "vendor": "Fathom", "country": "DE-Region",
"license": "Commercial", "notes": "Cookielos, EU-Hosting"},
],
"content square": [
{"name": "Mouseflow EU", "vendor": "Mouseflow ApS", "country": "DK",
"license": "Commercial", "notes": "Session-Recording + Heatmaps EU-Hosting"},
{"name": "Hotjar EU", "vendor": "Hotjar Ltd", "country": "MT",
"license": "Commercial", "notes": "EU-DataCenter (Frankfurt), Einwilligung erforderlich"},
],
"dynatrace": [
{"name": "Dynatrace EU", "vendor": "Dynatrace", "country": "AT",
"license": "Commercial", "notes": "Bereits EU (Linz). Cluster auf EU einstellen"},
],
"speedcurve": [
{"name": "SpeedCurve EU", "vendor": "SpeedCurve", "country": "EU-tenant",
"license": "Commercial", "notes": "Region-Tenant explizit konfigurieren"},
{"name": "Calibre", "vendor": "Calibre", "country": "AU/EU",
"license": "Commercial", "notes": "Performance Monitoring, EU-Region"},
],
"akamai": [
{"name": "Bunny CDN", "vendor": "BunnyWay d.o.o.", "country": "SI",
"license": "Commercial", "notes": "Slowenischer CDN, EU-Backbone"},
{"name": "Cloudflare EU-Only", "vendor": "Cloudflare", "country": "Multi",
"license": "Commercial", "notes": "EU-Datacenter erzwingbar via 'Geo Steering'"},
{"name": "IONOS CDN", "vendor": "IONOS SE", "country": "DE",
"license": "Commercial", "notes": "100% DE-Hosting"},
],
"amazon web services": [
{"name": "IONOS Cloud", "vendor": "IONOS SE", "country": "DE",
"license": "Commercial", "notes": "DE-Hosting, BSI C5-zertifiziert"},
{"name": "OVHcloud", "vendor": "OVH SAS", "country": "FR",
"license": "Commercial", "notes": "FR-Hosting, SecNumCloud-zertifiziert"},
{"name": "Hetzner Cloud", "vendor": "Hetzner Online GmbH", "country": "DE",
"license": "Commercial", "notes": "DE/FI-Hosting, sehr kostenguenstig"},
{"name": "STACKIT", "vendor": "Schwarz IT (Lidl-Gruppe)", "country": "DE",
"license": "Commercial", "notes": "Souveraener DE-Cloud, fuer Enterprise"},
],
"salesforce": [
{"name": "SAP Customer Experience", "vendor": "SAP SE", "country": "DE",
"license": "Commercial", "notes": "Vollstaendige CRM-Suite EU-Hosting"},
{"name": "weclapp", "vendor": "weclapp SE", "country": "DE",
"license": "Commercial", "notes": "Cloud-CRM aus Marburg"},
],
"adobe campaign": [
{"name": "CleverReach", "vendor": "CleverReach GmbH", "country": "DE",
"license": "Commercial", "notes": "E-Mail-Marketing DE-Hosting"},
{"name": "Brevo (Sendinblue)", "vendor": "Brevo", "country": "FR",
"license": "Commercial", "notes": "Marketing-Automation EU-Hosting"},
{"name": "Inxmail", "vendor": "Inxmail GmbH", "country": "DE",
"license": "Commercial", "notes": "Enterprise-E-Mail-Marketing aus Freiburg"},
],
"google ads": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "FR-Hosting, Programmatic + Direct-Sold"},
{"name": "Bing Ads (Microsoft Advertising EU)", "vendor": "Microsoft", "country": "Multi",
"license": "Commercial", "notes": "EU-Datacenter optional"},
],
"google maps": [
{"name": "HERE Maps", "vendor": "HERE Technologies", "country": "DE",
"license": "Commercial", "notes": "Berliner Anbieter, professionelle Karten + Routing"},
{"name": "OpenStreetMap (self-host)", "vendor": "OSM Foundation", "country": "DE-self-host",
"license": "ODbL", "notes": "Frei, OSM-Tiles self-hosted oder via Maptiler EU"},
{"name": "Maptiler Cloud EU", "vendor": "MapTiler", "country": "CH",
"license": "Commercial", "notes": "Schweizer Anbieter, EU-Tiles"},
],
"criteo": [ # criteo IS EU but use as example for retargeting alts
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "Retargeting + Display, FR-Hosting"},
],
"hcaptcha": [
{"name": "Friendly Captcha", "vendor": "Friendly Captcha GmbH", "country": "DE",
"license": "Commercial", "notes": "100% DSGVO, ohne Cookie, Hosting in DE"},
{"name": "Turnstile (Cloudflare EU-Only)", "vendor": "Cloudflare", "country": "Multi",
"license": "Commercial", "notes": "Ohne Cookie, EU-Region erzwingbar"},
],
"qualtrics": [
{"name": "LamaPoll", "vendor": "Lamano GmbH", "country": "DE",
"license": "Commercial", "notes": "DSGVO-Surveys aus Berlin"},
{"name": "evasys", "vendor": "evasys GmbH", "country": "DE",
"license": "Commercial", "notes": "Enterprise-Survey-Plattform aus Lueneburg"},
],
"meta pixel": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "EU-Alternative fuer Conversion-Tracking"},
],
"facebook": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "Programmatic ohne Meta"},
],
"linkedin insight": [
{"name": "Xing Insights", "vendor": "New Work SE", "country": "DE",
"license": "Commercial", "notes": "DE/AT/CH B2B-Targeting aus Hamburg"},
],
"outbrain": [
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
],
"taboola": [
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
],
"genesys": [
{"name": "Userlike", "vendor": "Userlike UG", "country": "DE",
"license": "Commercial", "notes": "Live-Chat aus Koeln, BSI-konform"},
{"name": "LiveZilla / EasyChat EU", "vendor": "LiveZilla GmbH", "country": "DE",
"license": "Commercial", "notes": "DSGVO-Live-Chat"},
],
"salesviewer": [
{"name": "Leadinfo", "vendor": "Leadinfo BV", "country": "NL",
"license": "Commercial", "notes": "B2B-Webvisitor-Tracking EU"},
{"name": "Albacross EU", "vendor": "Albacross", "country": "SE",
"license": "Commercial", "notes": "EU-Tenant verfuegbar"},
],
"youtube": [
{"name": "Vimeo Pro EU", "vendor": "Vimeo", "country": "Multi",
"license": "Commercial", "notes": "EU-Region waehlbar, weniger Tracking"},
{"name": "Self-hosted video (BunnyStream)", "vendor": "BunnyWay", "country": "SI",
"license": "Commercial", "notes": "Eigene Player + CDN ohne Drittanbieter"},
],
"amazon advertising": [
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
"license": "Commercial", "notes": "Retail-Media-Alternative FR"},
],
"instagram": [
{"name": "Pinterest EU + Owned-Channels", "vendor": "Mix", "country": "Multi",
"license": "Commercial", "notes": "Owned-Channels (Newsletter via CleverReach)"},
],
}
# ─── Cost assumptions (public list prices, estimates) ──────
#
# Format: (low_year_eur, high_year_eur, tier_assumption)
# Tier: 'sme' = <100 employees, 'mid' = 100-1000, 'ent' = >1000.
# Sources: public list prices + industry benchmarks (Gartner,
# Forrester 2025). Actual contract terms can deviate by 30-70%
# (volume discounts, bundling). The output explicitly labels these
# figures as an estimate range ('Schaetzbereich').
_COST_LOOKUP: dict[str, tuple[int, int, str]] = {
"adobe analytics": (120_000, 600_000, "ent"),
"adobe target": ( 80_000, 350_000, "ent"),
"adobe campaign": ( 60_000, 250_000, "ent"),
"adobe staging library":( 0, 0, "ent"), # bundled
"google analytics": ( 0, 150_000, "ent"), # GA4 free, GA360 ~150k
"matomo": ( 6_000, 30_000, "mid"), # Cloud/On-Prem
"hotjar": ( 3_600, 18_000, "mid"),
"content square": ( 60_000, 300_000, "ent"),
"contentsquare": ( 60_000, 300_000, "ent"),
"dynatrace": ( 50_000, 400_000, "ent"), # per-host pricing
"performance analytics":( 5_000, 40_000, "mid"),
"qualtrics": ( 25_000, 150_000, "ent"),
# Self-service advertising: NO tool licence, only media spend (tracked separately).
# Counted as 0 here because the licence saving potential is 0.
# Consolidation would only reduce media spend, which is a separate topic.
"google ads": ( 0, 0, "ent"),
"google advertising": ( 0, 0, "ent"),
"doubleclick": ( 0, 0, "ent"),
"meta pixel": ( 0, 0, "ent"),
"facebook": ( 0, 0, "ent"),
"amazon advertising": ( 0, 0, "ent"),
"youtube performance": ( 0, 0, "ent"),
"youtube player": ( 0, 0, "ent"),
"instagram": ( 0, 0, "ent"),
# Genuine DSP/platform licences: here the customer pays a SaaS fee
# ON TOP of the media spend. Range deliberately kept narrow (high is at most about 4-5x low).
"adform": ( 80_000, 300_000, "ent"),
"criteo": ( 50_000, 200_000, "ent"),
"outbrain": ( 30_000, 120_000, "ent"),
"taboola": ( 30_000, 120_000, "ent"),
"teads": ( 25_000, 100_000, "ent"),
"pinterest": ( 15_000, 60_000, "ent"),
"linkedin insight": ( 10_000, 50_000, "ent"),
"google maps": ( 2_000, 30_000, "mid"),
"akamai": ( 50_000, 500_000, "ent"),
"amazon web services": (100_000, 3_000_000, "ent"),
"baqend": ( 6_000, 60_000, "mid"),
"speedkit": ( 6_000, 60_000, "mid"),
"speedcurve": ( 2_400, 24_000, "mid"),
"salesforce": (100_000, 1_500_000, "ent"), # CRM seats
"genesys": ( 80_000, 800_000, "ent"), # contact-center seats
"ckm": ( 15_000, 120_000, "mid"),
"hcaptcha": ( 0, 12_000, "sme"), # free tier OR pro
"salesviewer": ( 3_600, 18_000, "mid"),
"youtube": ( 0, 50_000, "ent"), # embedding is free, production costs vary
}
# ─── EU alternative costs (same tier logic) ───────────────────
_EU_ALT_COSTS: dict[str, tuple[int, int]] = {
"Matomo (On-Premise)": ( 3_000, 15_000),
"Matomo (Pro / Cloud EU)": ( 6_000, 30_000),
"Matomo": ( 6_000, 30_000),
"etracker Analytics": ( 10_000, 60_000),
"Mapp Intelligence": ( 40_000, 200_000),
"Plausible Analytics": ( 240, 6_000),
"Fathom Analytics EU": ( 240, 6_000),
"Mouseflow EU": ( 12_000, 60_000),
"Hotjar EU": ( 3_600, 18_000),
"Dynatrace EU": ( 50_000, 400_000), # same price, only the region changes
"SpeedCurve EU": ( 2_400, 24_000),
"Calibre": ( 3_600, 30_000),
"Bunny CDN": ( 1_200, 12_000),
"Cloudflare EU-Only": ( 6_000, 80_000),
"IONOS CDN": ( 3_000, 30_000),
"IONOS Cloud": ( 30_000, 600_000),
"OVHcloud": ( 30_000, 600_000),
"Hetzner Cloud": ( 6_000, 120_000),
"STACKIT": ( 50_000, 800_000),
"SAP Customer Experience": ( 80_000, 1_200_000),
"weclapp": ( 12_000, 80_000),
"CleverReach": ( 2_400, 24_000),
"Brevo (Sendinblue)": ( 600, 24_000),
"Inxmail": ( 8_000, 60_000),
"Smart AdServer (Equativ)": ( 30_000, 300_000),
"Bing Ads (Microsoft Advertising EU)": ( 30_000, 3_000_000),
"HERE Maps": ( 1_200, 24_000),
"OpenStreetMap (self-host)": ( 0, 6_000), # server costs only
"Maptiler Cloud EU": ( 600, 12_000),
"Friendly Captcha": ( 600, 9_600),
"Turnstile (Cloudflare EU-Only)": ( 0, 6_000),
"LamaPoll": ( 1_200, 24_000),
"evasys": ( 6_000, 60_000),
"Xing Insights": ( 6_000, 60_000),
"Plista": ( 20_000, 150_000),
"Userlike": ( 1_200, 30_000),
"LiveZilla / EasyChat EU": ( 600, 12_000),
"Leadinfo": ( 1_200, 12_000),
"Albacross EU": ( 3_600, 24_000),
"Vimeo Pro EU": ( 900, 6_000),
"Self-hosted video (BunnyStream)": ( 600, 12_000),
"Pinterest EU + Owned-Channels": ( 600, 24_000),
}
# ─── Known reasons for duplicates (consolidation should NOT be recommended) ─
_DUPLICATION_CAVEATS = {
"web_analytics": [
"A/B-Vergleich verschiedener Anbieter waehrend Migration",
"Marketing nutzt Adobe, Produkt nutzt Matomo — Inhouse-Politik",
"Regional split (Adobe fuer DE, GA fuer International)",
],
"advertising": [
"Brand-Kampagne vs Performance-Kampagne (verschiedene DSPs)",
"Saisonal: Black Friday/Super Bowl nutzt mehr Kanaele",
"Markenspezifisch: BMW M-Modelle anders targetet als 1er-Serie",
],
"cdn": [
"Multi-CDN-Strategie fuer Ausfallsicherheit (Akamai + Cloudflare)",
"Event-CDN-Spike (Auto-Show, Modell-Launch) braucht Skalierung",
"Regionale Latenz-Optimierung (Akamai APAC, AWS US)",
],
"marketing_automation": [
"Salesforce Marketing Cloud fuer B2C, Adobe Campaign fuer B2B",
"Lead-Generierung (Adobe) vs Loyalitaet (Salesforce)",
],
"monitoring": [
"APM (Dynatrace) misst Backend, RUM (SpeedCurve) misst Frontend",
],
"captcha": [
"Stufenweise Migration zu cookieless Captcha",
],
}
def _company_tier_bounds(company_tier: str | None) -> tuple[float, float]:
    """Return which slice of the list-price range to use for a company tier.

    For 'enterprise' / 'premier' we use the UPPER part of the range
    (0.40-0.85 and 0.70-1.00 respectively) instead of the full
    starter-to-premier range.
    """
    t = (company_tier or "professional").lower()
    if t == "premier":
        return (0.70, 1.00)
    if t == "enterprise":
        return (0.40, 0.85)
    if t == "professional":
        return (0.20, 0.60)
    return (0.05, 0.40)  # 'sme' / starter
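# Illustrative sketch (hypothetical helper, not part of this module's API):
# how a pair of tier fractions narrows a list-price range. It assumes the
# enterprise fractions (0.40, 0.85) and the Adobe Analytics range from
# _COST_LOOKUP (120_000 to 600_000 EUR/year); the slice is taken linearly
# between the low and the high list price.
def _demo_tier_slice(lo: int, hi: int, low_frac: float, high_frac: float) -> tuple[int, int]:
    """Return the (low, high) sub-range selected by the two fractions."""
    span = hi - lo
    return (round(lo + span * low_frac), round(lo + span * high_frac))


_DEMO_ENTERPRISE_SLICE = _demo_tier_slice(120_000, 600_000, 0.40, 0.85)
# The module itself truncates with int(); round() is used in this sketch
# only to stay robust against floating-point noise.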
def _estimate_savings_for_redundancy(
    redundancy: dict, vendors: Iterable[dict],
    company_tier: str = "enterprise",
) -> dict:
    """Estimate range per redundancy: current cost plus EU consolidation saving.

    Takes company_tier into account: for a corporation like BMW we do not
    want to show the starter range as well. The realistic range is the
    tier_bounds slice of each vendor's (low, high) list-price range.
    If no price is known for the suggested tool, the suggested estimate
    and the saving are reported as None.
    """
    low_frac, high_frac = _company_tier_bounds(company_tier)
    current_low = current_high = 0
    matched_vendors = []
    cat_vendors = [v for v in vendors if v.get("name") in redundancy.get("vendors", [])]
    for v in cat_vendors:
        name = (v.get("name") or "").lower()
        for k, (lo, hi, _tier) in _COST_LOOKUP.items():
            if k in name:
                # Tier-aware: take low_frac..high_frac of the pricing range
                span = hi - lo
                current_low += int(lo + span * low_frac)
                current_high += int(lo + span * high_frac)
                matched_vendors.append(v.get("name"))
                break
    # Consolidation: a single EU tool replaces every vendor in the category
    suggested_eu = None
    suggested_cost = None  # stays None when _EU_ALT_COSTS has no entry
    # 1. Multi-function tool that covers this category
    for tool in _MULTI_FUNCTION_TOOLS:
        if redundancy["category"] in tool["covers"]:
            suggested_eu = tool["name"]
            suggested_cost = _EU_ALT_COSTS.get(tool["name"])
            break
    # 2. Otherwise: EU alternative from the lookup, BUT ONLY FOR VENDORS
    #    IN THE CURRENT CATEGORY (otherwise Userlike shows up for advertising)
    if not suggested_eu:
        for v in cat_vendors:
            n = (v.get("name") or "").lower()
            for k, alts in _EU_ALTERNATIVES.items():
                if k in n and alts:
                    suggested_eu = alts[0]["name"]
                    suggested_cost = _EU_ALT_COSTS.get(alts[0]["name"])
                    break
            if suggested_eu:
                break
    if suggested_cost is None:
        # No price known for the suggestion (or no suggestion at all):
        # report no saving instead of treating the replacement as free.
        # Several _MULTI_FUNCTION_TOOLS names have no _EU_ALT_COSTS key.
        suggested_estimate = saving_estimate = None
    else:
        suggested_low, suggested_high = suggested_cost
        suggested_estimate = [suggested_low, suggested_high]
        saving_estimate = [
            max(0, current_low - suggested_high),
            max(0, current_high - suggested_low),
        ]
    return {
        "current_estimate_year_eur": [current_low, current_high],
        "suggested_eu_tool": suggested_eu,
        "suggested_estimate_year_eur": suggested_estimate,
        "estimated_saving_year_eur": saving_estimate,
        "caveats": _DUPLICATION_CAVEATS.get(redundancy["category"], []),
        "cost_disclaimer": (
            "Schaetzbereich auf Basis oeffentlicher Listenpreise. Tatsaechliche "
            "Vertragspreise koennen 30-70% niedriger liegen (Volumen, Bundling, "
            "Konzern-Konditionen). Bitte mit der jeweiligen Einkaufsabteilung verifizieren."
        ),
    }
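# Illustrative sketch (hypothetical helper, made-up figures): the saving is
# an interval difference. The pessimistic bound pairs the lowest current
# cost with the highest replacement cost, the optimistic bound does the
# opposite, and both bounds are clamped at zero.
def _demo_saving_interval(
    current: tuple[int, int], replacement: tuple[int, int]
) -> tuple[int, int]:
    """Return (pessimistic, optimistic) yearly saving for two cost ranges."""
    current_low, current_high = current
    repl_low, repl_high = replacement
    return (max(0, current_low - repl_high), max(0, current_high - repl_low))


_DEMO_SAVING = _demo_saving_interval((50_000, 200_000), (6_000, 30_000))
# A replacement that costs more than the current stack yields (0, 0)
# rather than a negative saving.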
# ─── Multi-function tools (consolidation anchor points) ───────────
_MULTI_FUNCTION_TOOLS = [
{
"name": "Matomo (Pro / Cloud EU)",
"vendor": "InnoCraft",
"country": "DE-self-host / EU",
"covers": ["web_analytics", "tag_management", "personalisation"],
"notes": "Ersetzt Adobe Analytics + GTM + Adobe Target in einem Tool. "
"100% DSGVO ohne Einwilligung wenn IP anonymisiert.",
},
{
"name": "SAP Customer Experience Suite",
"vendor": "SAP SE",
"country": "DE",
"covers": ["crm", "marketing_automation", "personalisation", "survey"],
"notes": "Ersetzt Salesforce + Adobe Campaign + Qualtrics. EU-Hosting, "
"tiefe ERP-Integration.",
},
{
"name": "IONOS Cloud (Compute + CDN + Storage + DNS)",
"vendor": "IONOS SE",
"country": "DE",
"covers": ["cloud_infra", "cdn", "monitoring"],
"notes": "Ersetzt AWS + Akamai + zusaetzliches Monitoring in einer "
"DE-Cloud (BSI C5).",
},
{
"name": "Userlike Suite",
"vendor": "Userlike UG",
"country": "DE",
"covers": ["chat", "consent_management"],
"notes": "Ersetzt Genesys Chat. Bietet eigenes Consent-Modul.",
},
{
"name": "Smart AdServer (Equativ)",
"vendor": "Equativ",
"country": "FR",
"covers": ["advertising"],
"notes": "Ersetzt Mehrfach-DSPs (Adform/Criteo/Outbrain/Taboola/Meta) "
"durch Programmatic+Direct-Sold EU-Stack.",
},
{
"name": "HERE Maps",
"vendor": "HERE Technologies",
"country": "DE",
"covers": ["maps"],
"notes": "Berliner Anbieter, professionelle Karten + Routing.",
},
{
"name": "Vimeo Pro EU (oder self-hosted BunnyStream)",
"vendor": "Vimeo / BunnyWay",
"country": "Multi / SI",
"covers": ["external_media"],
"notes": "Ersetzt YouTube-Embeds + JW Player in einem Player.",
},
{
"name": "LamaPoll",
"vendor": "Lamano GmbH",
"country": "DE",
"covers": ["survey"],
"notes": "DSGVO-Surveys aus Berlin. Ersetzt Qualtrics / Psyma.",
},
]
# ─── Analysis ────────────────────────────────────────────────────────
def analyze(vendors: Iterable[dict], company_tier: str = "enterprise") -> dict:
    """Main entry. Returns categorised view + redundancies + EU options.

    `company_tier` (starter|professional|enterprise|premier) controls the
    cost range so that, for example, a DAX corporation does not end up
    with starter prices as the lower bound.
    """
    # Materialise once: `vendors` may be a one-shot iterator, and it is
    # traversed several times below.
    all_vendors_list = list(vendors)
    by_cat: dict[str, list[dict]] = defaultdict(list)
    for v in all_vendors_list:
        cat = classify_vendor(v.get("name", ""))
        by_cat[cat].append(v)
    # Redundancies: any category with ≥2 vendors (excl. site-internal cats)
    skip_redundancy_cats = {"site_infra", "site_feature", "consent_management",
                            "auth", "other"}
    redundancies: list[dict] = []
    for cat, vs in by_cat.items():
        if cat in skip_redundancy_cats or len(vs) < 2:
            continue
        red = {
            "category": cat,
            "category_label": _CATEGORY_LABEL.get(cat, cat),
            "count": len(vs),
            "vendors": [v.get("name", "") for v in vs],
            "consolidation_hint": _CONSOLIDATION_HINT.get(cat, ""),
        }
        red.update(_estimate_savings_for_redundancy(
            red, all_vendors_list, company_tier))
        redundancies.append(red)
    redundancies.sort(key=lambda r: -(r.get("estimated_saving_year_eur") or [0, 0])[1])
    # EU alternatives lookup
    eu_alternatives: list[dict] = []
    seen = set()
    for v in all_vendors_list:
        name = v.get("name") or ""
        n_lower = name.lower()
        for k, alts in _EU_ALTERNATIVES.items():
            if k in n_lower and k not in seen:
                eu_alternatives.append({
                    "current_vendor": name,
                    "current_recipient_type": v.get("recipient_type", ""),
                    "matched_key": k,
                    "alternatives": alts,
                })
                seen.add(k)
                break
    # Multi-function tool recommendations: only if the customer has vendors
    # across the categories the tool covers
    present_cats = set(by_cat.keys())
    multi_function = []
    for tool in _MULTI_FUNCTION_TOOLS:
        covered_here = [c for c in tool["covers"] if c in present_cats]
        if len(covered_here) >= 2:
            # Collect vendor names instead of summing counts (deduplicated)
            unique_vendors: set[str] = set()
            for c in covered_here:
                for v in by_cat[c]:
                    unique_vendors.add(v.get("name", ""))
            multi_function.append({
                **tool,
                "replaces_categories": covered_here,
                "potential_replacements": len(unique_vendors),
            })
    multi_function.sort(key=lambda t: -t["potential_replacements"])
    total_current_low = sum((r.get("current_estimate_year_eur") or [0, 0])[0] for r in redundancies)
    total_current_high = sum((r.get("current_estimate_year_eur") or [0, 0])[1] for r in redundancies)
    total_saving_low = sum((r.get("estimated_saving_year_eur") or [0, 0])[0] for r in redundancies)
    total_saving_high = sum((r.get("estimated_saving_year_eur") or [0, 0])[1] for r in redundancies)
    # Both bounds use the same denominator (midpoint of the current
    # estimate); otherwise the upper bound explodes when current_low is
    # small. Each bound is capped at 95%.
    if total_current_high:
        mid = (total_current_low + total_current_high) / 2
        saving_pct = (
            f"{min(95, int(100 * total_saving_low / mid))}-"
            f"{min(95, int(100 * total_saving_high / mid))}%"
        )
    else:
        saving_pct = "n/a"
    return {
        "summary": {
            "total_vendors": len(all_vendors_list),
            "distinct_categories": len([c for c in by_cat if c != "other"]),
            "redundancy_count": len(redundancies),
            "eu_alternative_count": len(eu_alternatives),
            "consolidation_potential": sum(r["count"] - 1 for r in redundancies),
            "estimated_current_year_eur": [total_current_low, total_current_high],
            "estimated_saving_year_eur": [total_saving_low, total_saving_high],
            "estimated_saving_pct": saving_pct,
            "cost_disclaimer": (
                "Schaetzbereich auf Basis oeffentlicher Listenpreise (Gartner, Forrester 2025). "
                "Vertragspreise koennen 30-70% niedriger liegen (Volumen-Rabatte, Konzern-Konditionen, "
                "Bundling). Werte dienen als Diskussionsgrundlage mit dem Einkauf, NICHT als Angebot."
            ),
        },
        "by_category": {cat: [v.get("name", "") for v in vs]
                        for cat, vs in by_cat.items()},
        "redundancies": redundancies,
        "eu_alternatives": eu_alternatives,
        "multi_function_tools": multi_function,
    }
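# Illustrative sketch (hypothetical helper, made-up figures): formatting a
# saving range as percentages of one shared denominator, the midpoint of
# the current estimate, with each bound capped at 95. The hyphen separator
# is an assumption of this sketch.
def _demo_saving_pct(saving: tuple[int, int], current: tuple[int, int]) -> str:
    """Return e.g. '25-75%', or 'n/a' when there is no current estimate."""
    if not current[1]:
        return "n/a"
    mid = (current[0] + current[1]) / 2
    low_pct = min(95, int(100 * saving[0] / mid))
    high_pct = min(95, int(100 * saving[1] / mid))
    return f"{low_pct}-{high_pct}%"


_DEMO_PCT = _demo_saving_pct((100, 300), (200, 600))
# Dividing each bound by its own counterpart instead (saving_high by
# current_low) would inflate the upper percentage when current_low is small.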
_CATEGORY_LABEL = {
"web_analytics": "Web-Analytics",
"advertising": "Werbung / Retargeting",
"tag_management": "Tag-Management",
"marketing_automation": "Marketing-Automation",
"personalisation": "Personalisierung",
"external_media": "Externe Medien (Video)",
"maps": "Karten / Geo",
"cdn": "CDN",
"cloud_infra": "Cloud-Infrastruktur",
"monitoring": "Performance-Monitoring",
"crm": "CRM",
"chat": "Chat / Support",
"captcha": "Bot-Schutz",
"lead_tracking": "Lead-Tracking",
"survey": "Umfragen",
"social_aggregator": "Social-Media-Aggregation",
"consent_management": "Consent-Management",
"auth": "Authentifizierung",
"site_infra": "Eigene Infrastruktur",
"site_feature": "Eigene Features",
"other": "Sonstige",
}
_CONSOLIDATION_HINT = {
"web_analytics": "Mehrere Analytics-Tools sammeln meist redundante Daten. Ein Tool genuegt — Matomo (DE) ist DSGVO-Standard.",
"advertising": "Werbe-/Retargeting-Pixel sind oft austauschbar. Konzentration auf 2-3 Kanaele senkt Drittland-Risiko.",
"external_media": "Mehrere Video-Embeds nur wenn fachlich noetig. Self-hosted (BunnyStream/Vimeo) reduziert Tracking.",
"maps": "Eine Karten-Loesung reicht. HERE Maps (DE) als EU-Alternative zu Google Maps.",
"cdn": "Ein CDN+Performance-Stack genuegt. IONOS oder Bunny vereinen mehrere Funktionen.",
"marketing_automation": "Marketing-Cloud + separates E-Mail-Tool sind oft Dopplung — SAP CX oder CleverReach allein moeglich.",
"chat": "Ein Chat-System genuegt. Userlike (DE) ersetzt Genesys-Stack.",
"monitoring": "RUM + APM koennen in einem Tool gebuendelt werden (Dynatrace EU oder Sentry-Self-host).",
"survey": "Eine Survey-Plattform genuegt — LamaPoll (DE) oder Mapp.",
}