feat(compliance-check): MC-Classification + Embedding + Vendor-Redundancy + Action-Recipes + Borlabs-Features
CI / nodejs-build (push) Successful in 2m47s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Major update based on the BMW test iterations (v1→v9):

Core Compliance-Check
- Sonnet check_type classification: text/process/review for all 1874 MCs in
  compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check. Plus
  fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under
  the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered per business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type
  threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue
  (15-word chunks) for mandatory fields in the Impressum (legal notice).
  Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP
  reconstruct; the backend prefers it over the DOM extraction when it is
  richer (BMW: 1824 vs 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorisation of the CMP vendors,
  detection of multiple providers per category, EU alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie
  count + premium-feature cookies + third-party share →
  starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 licence cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100%
  band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike,
  Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with
  vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces
  false-positive no_country flags for unambiguous EU vendors (Adform DK,
  Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: for each finding type (no_cookies_listed,
  no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen" (mention
  processors), "Art. 22 Profiling", ...) one structured instruction with
  what/why/fix_text/where/example, meant to be pasted 1:1 into customer
  documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching
  paragraph in the existing customer document for each finding. Per-run
  thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check, plus
  a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (Compliance-Check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories
  + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  (record of processing activities) + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). The cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, a YouTube/Maps/video
  placeholder until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control-Library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master-controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE (privacy policy): 7.5% -> 81-83%
  Impressum: 4% -> 100% (all 6 real MCs fulfilled)
  Cookie: 0% -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated savings 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
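
The country inference described for cookie_link_validator can be pictured roughly as follows. This is only a sketch under assumptions: the function name `infer_country`, both table names and the regular expressions are hypothetical; only the suffix/country pairs (A/S, GmbH, Inc, B.V.) and the Pinterest IE example come from the commit message.

```python
from __future__ import annotations

import re

# Suffix/country pairs named in the commit message. The regexes are
# assumptions made for this sketch, not the project's actual patterns.
_LEGAL_FORM_COUNTRY: list[tuple[str, str]] = [
    (r"\bA/S$", "DK"),
    (r"\bGmbH$", "DE"),
    (r"\bInc\.?$", "US"),
    (r"\bB\.V\.?$", "NL"),
]

# Hypothetical fallback table for names whose legal form is ambiguous.
_VENDOR_COUNTRY: dict[str, str] = {"pinterest europe": "IE"}


def infer_country(vendor_name: str) -> str | None:
    """Return a country code, or None if nothing can be inferred."""
    name = (vendor_name or "").strip()
    # 1) Try the legal-form suffix at the end of the vendor name.
    for pattern, country in _LEGAL_FORM_COUNTRY:
        if re.search(pattern, name, flags=re.IGNORECASE):
            return country
    # 2) Fall back to the lookup table (substring match on the lowered name).
    lowered = name.lower()
    for key, country in _VENDOR_COUNTRY.items():
        if key in lowered:
            return country
    return None


print(infer_country("Adform A/S"))             # DK
print(infer_country("Pinterest Europe Ltd."))  # IE
print(infer_country("Acme Ltd."))              # None
```

Returning `None` rather than a guessed default keeps the no_country flag meaningful for vendors that neither table can resolve.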
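
The embedding batching fix listed in the commit message (batches of 32 instead of 165 texts in one call) could look roughly like this. A minimal sketch, assuming a hypothetical `embed_fn` callable that takes a list of texts and returns one vector per text; `batched`, `embed_in_batches` and `fake_embed` are illustrative names, not the project's API.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    size: int = 32,
) -> List[List[float]]:
    """Call the (hypothetical) embed_fn once per batch and concatenate results."""
    vectors: List[List[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Stand-in for the real embedding service: one 1-dim vector per text.
    return [[float(len(text))] for text in batch]


texts = [f"control {i}" for i in range(165)]
print([len(b) for b in batched(texts)])          # [32, 32, 32, 32, 32, 5]
print(len(embed_in_batches(texts, fake_embed)))  # 165
```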
@@ -0,0 +1,167 @@
"""
Cookie function classifier: determines the functional purpose of each cookie.

Today we have one category per vendor (analytics/advertising/...).
However, a vendor often sets 10-50 different cookies. Not every cookie
of a marketing platform serves advertising; many handle session management,
language preference, scroll position, etc.

This module classifies per cookie:
  - functional_role : what the cookie does technically (session_id,
                      csrf_token, ab_test, user_id, ad_id, …)
  - data_collected  : which data it carries (visitor_id,
                      page_view, click, conversion_event, …)
  - blocking_impact : what happens when the cookie is blocked
                      (none, no_personalization, no_tracking, site_breaks)

This lets the vendor redundancy analyzer state precisely:
"Adobe Analytics sets 55 cookies, 12 of them for tracking, 8 for A/B tests
and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B test
cookies, so 55 Adobe cookies become 20 Matomo cookies."
"""

from __future__ import annotations

import re
from typing import Iterable

# Pattern → (functional_role, blocking_impact)
# Order matters: more specific patterns first.
_PATTERNS: list[tuple[str, str, str]] = [
    # Session / authentication
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),

    # Language setting / region
    (r"lang|locale|culture|region", "preference", "no_personalization"),

    # User preferences (theme, view, bookmark)
    (r"theme|dark|mode|view|sort|filter", "ui_preference", "no_personalization"),
    (r"bookmark|favorite|favorit", "user_data", "no_personalization"),

    # The consent cookie itself
    (r"consent|gdpr|tcf|euconsent", "consent_state", "site_breaks"),

    # Tracking IDs (most analytics)
    (r"^_ga|gid|gat|google_analytic", "tracking_id", "no_tracking"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
    (r"^s_|s\.cc|adobesite|aam", "tracking_id", "no_tracking"),  # Adobe
    (r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
    (r"_uetsid|_uetvid|microsoft", "tracking_id", "no_tracking"),

    # Visitor identification
    (r"visitor|uid|user_id|customer_id", "visitor_id", "no_personalization"),

    # A/B test / personalisation
    (r"ab_test|abtest|variant|experiment|target|target_qa", "ab_test", "no_personalization"),
    (r"personalization|personalisation|adobe_target", "personalisation", "no_personalization"),

    # Advertising / retargeting
    (r"fbp|fbc|fb_id|facebook|meta_pixel|fr$", "ad_pixel", "no_tracking"),
    (r"adform|criteo|outbrain|taboola|tapad|adsrvr", "ad_pixel", "no_tracking"),
    (r"doubleclick|test_cookie|ide|nid|exchange_uid", "ad_pixel", "no_tracking"),
    (r"google_ad|gads|gcl", "ad_pixel", "no_tracking"),
    (r"^li_|linkedin|bcookie|bscookie", "ad_pixel", "no_tracking"),
    (r"pinterest|_pinterest_|_pin_unauth", "ad_pixel", "no_tracking"),

    # Affiliate / conversion
    (r"conversion|orderid|order_id|transaction|purchase", "conversion_event", "no_tracking"),
    (r"campaign|utm|source|medium|term", "campaign_attribution", "no_tracking"),

    # Scroll position / form helper
    (r"scroll|position|form_|form_state", "ui_state", "no_personalization"),

    # Load balancer / sticky sessions
    (r"affinity|sticky|lb_|alb-|aws-alb", "load_balancer", "site_breaks"),

    # Chat / support
    (r"chat|widget|genesys|livechat", "chat_session", "no_personalization"),

    # Captcha
    (r"hcaptcha|recaptcha|cf_|cloudflare", "bot_protection", "site_breaks"),
]

_FUNCTIONAL_LABEL = {
    "session_id": "Session ID",
    "auth_token": "Auth token",
    "csrf_token": "CSRF protection",
    "preference": "Language / region",
    "ui_preference": "UI preference",
    "user_data": "User data",
    "consent_state": "Consent storage",
    "tracking_id": "Tracking ID",
    "session_recording": "Session recording",
    "visitor_id": "Visitor ID",
    "ab_test": "A/B test",
    "personalisation": "Personalisation",
    "ad_pixel": "Ad pixel",
    "conversion_event": "Conversion tracking",
    "campaign_attribution": "Campaign attribution",
    "ui_state": "UI state (scroll position etc.)",
    "load_balancer": "Load balancer",
    "chat_session": "Chat session",
    "bot_protection": "Bot protection",
    "unknown": "Unknown",
}

# Which functional_roles overlap functionally; used by
# vendor_redundancy.analyze() to detect real consolidation opportunities
# instead of merely counting duplicate providers.
OVERLAPPING_ROLES = {
    "tracking_id": "tracking",
    "session_recording": "tracking",
    "ab_test": "personalisation",
    "personalisation": "personalisation",
    "ad_pixel": "advertising",
    "conversion_event": "advertising",
    "campaign_attribution": "advertising",
}


def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }


def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: which functional roles occur how often?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        if not roles and v.get("cookies"):
            # Vendor not annotated yet: classify its cookies on the fly.
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
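
The comment on `_PATTERNS` notes that order decides the result, because `classify_cookie` returns on the first matching row. The sketch below illustrates that first-match-wins behaviour with three rows copied from `_PATTERNS` in their original order. `first_match`, `ROWS` and `REORDERED` are hypothetical names used only here, and the reordering is an illustration, not a change made by the commit.

```python
import re

# Three rows copied from _PATTERNS, in the order they appear in the diff.
ROWS = [
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
]


def first_match(name, rows):
    """Same loop as classify_cookie, but over an explicit row list."""
    n = (name or "").lower().strip()
    for pattern, role, impact in rows:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


print(first_match("PHPSESSID", ROWS))   # ('session_id', 'site_breaks')
# "csrf_token" contains "token", so the broader auth row wins in this order.
print(first_match("csrf_token", ROWS))  # ('auth_token', 'site_breaks')

# Hypothetical reordering: the narrower CSRF row is tried before the auth row.
REORDERED = [ROWS[0], ROWS[2], ROWS[1]]
print(first_match("csrf_token", REORDERED))  # ('csrf_token', 'site_breaks')
```

Because both rows carry the same blocking_impact, only the reported role differs in this case; for rows with different impacts the order would also change the blocking assessment.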
@@ -0,0 +1,608 @@
"""
Cookie knowledge database: the maximum extractable knowledge per cookie name.

For each entry we record:
  - vendor             : vendor setting the cookie (full company name + country of domicile)
  - exact_purpose      : what the cookie does EXACTLY (not just its category)
  - data_collected     : which data fields (client ID, timestamp, IP, etc.)
  - ip_relevant        : is the IP address collected/transmitted?
  - ip_anonymized      : anonymised by default?
  - tcf_purpose_ids    : IAB TCF v2.2 purpose IDs (1-11)
  - iab_vendor_id      : IAB Global Vendor List ID (for TCF sync)
  - typical_lifetime   : how long the cookie persists
  - reid_risk          : re-identification risk (low/medium/high)
  - technical_necessity: required under §25(2) TDDDG? (none/partial/full)
  - schrems_ii_status  : third-country transfer assessment
  - eugh_rulings       : relevant CJEU/CNIL/LfDI decisions
  - eu_alternative_*   : EU cookie/vendor replacement with the same function
  - notes              : other remarks (avoidance, configuration)

Sources: Cookiepedia, IAB Europe TCF v2.2 Vendor List, Cookiebot DB,
CNIL Cookies & Trackers Guidelines 2024, EDPB Cookie Guidelines 2/2023,
DSK-Orientierungshilfe Telemedien 2021, vendors' own documentation.

As of: 2026-05.

Extending: pull requests welcome; for the format see TEMPLATE_ENTRY at the
end of the file.
"""

from __future__ import annotations

from typing import TypedDict


class CookieKnowledge(TypedDict, total=False):
    vendor: str
    vendor_country: str
    exact_purpose: str
    data_collected: list[str]
    ip_relevant: bool
    ip_anonymized: bool
    tcf_purpose_ids: list[int]
    iab_vendor_id: int | None
    typical_lifetime: str
    reid_risk: str              # 'low' | 'medium' | 'high'
    technical_necessity: str    # 'none' | 'partial' | 'full'
    schrems_ii_status: str
    eugh_rulings: list[str]
    eu_alternative_cookies: list[str]
    eu_alternative_vendor: str
    notes: str


# ─── Google ──────────────────────────────────────────────────────────

_GOOGLE_BASE = {
    "vendor": "Google LLC", "vendor_country": "US",
    "schrems_ii_status": "Third-country transfer to the USA. Permitted again under the DPF "
                         "(EU-US Data Privacy Framework, 2023), "
                         "but a NOYB legal challenge is already pending (Schrems III). "
                         "Risk assessment recommended.",
    "eugh_rulings": [
        "CJEU C-311/18 (Schrems II, 16.07.2020): Privacy Shield invalidated",
        "CNIL SAN-2022-002 (10.02.2022): Google Analytics without anonymisation "
        "unlawful",
        "Datenschutzkonferenz (DSK) Orientierungshilfe 2024: GA4 with "
        "server-side tagging possible as mitigation",
    ],
}

KB: dict[str, CookieKnowledge] = {

    # ─── Google Analytics ─────────────────────────────────────────────
    "_ga": {
        **_GOOGLE_BASE,
        "exact_purpose": "Uniquely distinguishes visitors; persists the "
                         "client ID that stays valid across all sessions.",
        "data_collected": ["client_id", "first_visit_timestamp", "ip_address"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [8, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "2 years",
        "reid_risk": "high",
        "technical_necessity": "none",
        "eu_alternative_cookies": ["_pk_id"],
        "eu_alternative_vendor": "Matomo",
        "notes": "Can be configured in a GDPR-compliant way with IP anonymisation "
                 "(`anonymizeIp: true`) + server-side tagging. "
                 "Without these measures consent is required.",
    },
    "_gid": {
        **_GOOGLE_BASE,
        "exact_purpose": "Distinguishes visitors within a session "
                         "(24h bucket).",
        "data_collected": ["session_id", "ip_address"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [8],
        "iab_vendor_id": 755,
        "typical_lifetime": "24 hours",
        "reid_risk": "medium",
        "technical_necessity": "none",
        "eu_alternative_cookies": ["_pk_ses"],
        "eu_alternative_vendor": "Matomo",
    },
    "_gat": {
        **_GOOGLE_BASE,
        "exact_purpose": "Throttling cookie: limits the number of requests to "
                         "Google Analytics per second.",
        "data_collected": ["throttle_flag"],
        "ip_relevant": False, "ip_anonymized": True,
        "tcf_purpose_ids": [],
        "iab_vendor_id": 755,
        "typical_lifetime": "1 minute",
        "reid_risk": "low",
        "technical_necessity": "none",
        "notes": "Pure performance-control cookie. Still requires consent "
                 "because it is part of GA tracking.",
    },
    "_gat_gtag_UA_": {
        **_GOOGLE_BASE,
        "exact_purpose": "GTM-specific throttling, per Universal Analytics property.",
        "data_collected": ["throttle_flag"],
        "ip_relevant": False,
        "typical_lifetime": "1 minute",
        "reid_risk": "low",
        "technical_necessity": "none",
        "notes": "The suffix after `UA_` is the property ID. Pattern matching required.",
    },
    "_ga_*": {
        **_GOOGLE_BASE,
        "exact_purpose": "GA4 persistence: unique stream ID + session data.",
        "data_collected": ["stream_id", "session_count", "session_start_ts"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [8, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "2 years",
        "reid_risk": "high",
        "technical_necessity": "none",
        "notes": "GA4 format. Suffix `_<measurement_id>`. Server-side GTM "
                 "is the only practicable GDPR mitigation.",
    },
    "NID": {
        **_GOOGLE_BASE,
        "exact_purpose": "Personalises Google Search, AdSense, Maps; "
                         "stores preferences + security token.",
        "data_collected": ["user_pref_id", "session_id", "security_token"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "6 months",
        "reid_risk": "high",
        "technical_necessity": "none",
        "notes": "Classic Google advertising cookie. High re-identification risk "
                 "because it is cross-site.",
    },
    "IDE": {
        "vendor": "Google LLC (DoubleClick)", "vendor_country": "US",
        "exact_purpose": "Conversion tracking + ad targeting on "
                         "Google Display Network / DoubleClick.",
        "data_collected": ["doubleclick_id", "ad_interactions"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "13 months",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": _GOOGLE_BASE["schrems_ii_status"],
        "eugh_rulings": _GOOGLE_BASE["eugh_rulings"],
    },
    "test_cookie": {
        **_GOOGLE_BASE,
        "exact_purpose": "DoubleClick probe cookie: tests whether the browser accepts cookies.",
        "data_collected": ["browser_supports_cookies"],
        "ip_relevant": False,
        "typical_lifetime": "15 minutes",
        "reid_risk": "low",
        "technical_necessity": "none",
    },

    # ─── Meta / Facebook ──────────────────────────────────────────────
    "_fbp": {
        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "First-party pixel from Facebook/Meta: identifies "
                         "the browser for ad conversion tracking + ad retargeting.",
        "data_collected": ["browser_id", "first_visit_ts"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 891,
        "typical_lifetime": "90 days",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "Data goes to Meta US despite the IE domicile. "
                             "Currently covered by the DPF, but Schrems III is expected.",
        "eugh_rulings": [
            "CJEU C-311/18 (Schrems II)",
            "EDPB supervisory statement 2023 on the Meta Pixel",
            "LDA Bayern audit order 2024",
        ],
        "eu_alternative_vendor": "Smart AdServer (Equativ, FR)",
        "notes": "The Conversions API (CAPI) as a server-side variant reduces cookie "
                 "dependency. Data is still sent to Meta.",
    },
    "_fbc": {
        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "Stores the click ID of a Meta ad campaign; "
                         "attributes the conversion to the original ad click.",
        "data_collected": ["fbclid", "ad_campaign_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9],
        "iab_vendor_id": 891,
        "typical_lifetime": "90 days",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    "fr": {
        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "Ad targeting + cross-site tracking on the "
                         "Facebook platform.",
        "data_collected": ["encrypted_user_id", "session_data"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 891,
        "typical_lifetime": "3 months",
        "reid_risk": "high",
        "technical_necessity": "none",
    },

    # ─── Adobe ────────────────────────────────────────────────────────
    "s_cc": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Cookie test: checks whether the browser accepts "
                         "cookies (Adobe Analytics bootstrap).",
        "data_collected": ["browser_supports_cookies"],
        "ip_relevant": False,
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "partial",
        "schrems_ii_status": "Adobe IE hosting, but US data transfer for "
                             "cloud services. Covered by the DPF.",
    },
    "s_sq": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Stores the last click (URL + position) "
                         "for click-map reports.",
        "data_collected": ["last_click_url", "last_click_xy"],
        "ip_relevant": False,
        "tcf_purpose_ids": [8],
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "none",
    },
    "AMCV_": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Adobe Marketing Cloud Visitor ID: cross-tool ID for "
                         "Analytics + Target + Audience Manager.",
        "data_collected": ["marketing_cloud_visitor_id", "first_visit_ts"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 8, 9, 10],
        "typical_lifetime": "2 years",
        "reid_risk": "high",
        "technical_necessity": "none",
        "notes": "The suffix after `AMCV_` is the org ID. Classic cross-tool tracking.",
    },
    "mbox": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Adobe Target: A/B testing, personalisation, "
                         "audience targeting.",
        "data_collected": ["mbox_visitor_id", "experiment_assignments"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "typical_lifetime": "2 years",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    "s_target_qa": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Adobe Target premium feature: QA mode for personalisation tests.",
        "data_collected": ["target_qa_session"],
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "none",
        "notes": "Premium-feature cookie. Presence = indicator of an enterprise licence.",
    },

    # ─── Microsoft / Bing ─────────────────────────────────────────────
    "MUID": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Microsoft user ID for Bing, Microsoft Advertising, "
                         "Clarity heatmaps.",
        "data_collected": ["microsoft_user_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 8, 9, 10],
        "iab_vendor_id": 165,
        "typical_lifetime": "13 months",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "US transfer, DPF-certified.",
    },
    "_uetsid": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Universal Event Tracking: session cookie for "
                         "Microsoft Advertising conversion tracking.",
        "data_collected": ["session_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [9],
        "typical_lifetime": "30 minutes",
        "reid_risk": "medium",
        "technical_necessity": "none",
    },
    "_uetvid": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Universal Event Tracking: persistent visitor ID.",
        "data_collected": ["visitor_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9],
        "typical_lifetime": "13 months",
        "reid_risk": "high",
        "technical_necessity": "none",
    },

    # ─── LinkedIn ─────────────────────────────────────────────────────
    "bcookie": {
        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
        "exact_purpose": "Browser identification; used for the sign-in "
                         "process + LinkedIn Insight Tag tracking.",
        "data_collected": ["browser_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 8, 9],
        "iab_vendor_id": 14,
        "typical_lifetime": "1 year",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "Data transfer to the US (parent company Microsoft).",
    },
    "lidc": {
        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
        "exact_purpose": "Data routing between LinkedIn data centres.",
        "data_collected": ["routing_id"],
        "ip_relevant": True,
        "typical_lifetime": "1 day",
        "reid_risk": "low",
        "technical_necessity": "partial",
    },
    "li_gc": {
        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
        "exact_purpose": "Stores the cookie consent for LinkedIn embeds.",
        "data_collected": ["consent_state"],
        "ip_relevant": False,
        "typical_lifetime": "6 months",
        "reid_risk": "low",
        "technical_necessity": "full",
    },

    # ─── Matomo (EU alternative) ──────────────────────────────────────
    "_pk_id": {
        "vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-with-EU-self-hosting",
        "exact_purpose": "Matomo visitor ID: counterpart to _ga, but GDPR-compliant "
                         "when IP anonymisation is active.",
        "data_collected": ["visitor_id", "first_visit_ts"],
        "ip_relevant": True, "ip_anonymized": True,
        "tcf_purpose_ids": [8],
        "typical_lifetime": "13 months",
        "reid_risk": "low",  # with anonymisation enabled
        "technical_necessity": "none",
        "schrems_ii_status": "With on-premise deployment NO third-country transfer. "
                             "Matomo Cloud EU with Frankfurt hosting available.",
        "notes": "Recommended drop-in replacement for Google Analytics.",
    },
    "_pk_ses": {
        "vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-with-EU-self-hosting",
        "exact_purpose": "Matomo session cookie.",
        "data_collected": ["session_id"],
        "ip_relevant": False,
        "typical_lifetime": "30 minutes",
        "reid_risk": "low",
        "technical_necessity": "none",
    },

    # ─── Captcha ──────────────────────────────────────────────────────
    "hcaptcha": {
        "vendor": "Intuition Machines Inc. (hCaptcha)", "vendor_country": "US",
        "exact_purpose": "hCaptcha bot detection: session token for the challenge solver.",
        "data_collected": ["bot_score", "session_id", "ip_address"],
        "ip_relevant": True,
        "typical_lifetime": "Session",
        "reid_risk": "medium",
        "technical_necessity": "full",
        "schrems_ii_status": "US transfer (Intuition Machines, San Francisco).",
        "eu_alternative_vendor": "Friendly Captcha (DE), Turnstile (Cloudflare EU)",
        "notes": "Technically required for bot protection, but EU alternatives "
                 "without third-country risk are available.",
    },
    "cf_clearance": {
        "vendor": "Cloudflare Inc.", "vendor_country": "US",
        "exact_purpose": "Cloudflare Bot Management Pro: confirms that the user "
                         "has passed the JS challenge.",
        "data_collected": ["challenge_token"],
        "ip_relevant": True,
        "typical_lifetime": "30 minutes",
        "reid_risk": "low",
        "technical_necessity": "full",
        "notes": "Premium-feature cookie. Presence = Cloudflare Bot Management "
                 "Pro in use.",
    },

    # ─── CDN / Performance ────────────────────────────────────────────
    "__cf_bm": {
        "vendor": "Cloudflare Inc.", "vendor_country": "US",
        "exact_purpose": "Cloudflare Bot Management: basic bot detection.",
        "data_collected": ["bot_score", "client_hash"],
        "ip_relevant": True,
        "typical_lifetime": "30 minutes",
        "reid_risk": "low",
        "technical_necessity": "full",
        "notes": "Strictly necessary under §25(2) TDDDG (security). No consent required.",
    },
    "aws-alb": {
        "vendor": "Amazon Web Services Inc.", "vendor_country": "US",
        "exact_purpose": "AWS Application Load Balancer sticky sessions: "
                         "routes requests consistently to the same backend instance.",
        "data_collected": ["target_instance_id"],
        "ip_relevant": False,
        "typical_lifetime": "1 hour",
        "reid_risk": "low",
        "technical_necessity": "full",
        "schrems_ii_status": "AWS Frankfurt available; with a correct region setup "
                             "no US transfer.",
    },
|
||||
|
||||
# ─── Retargeting / Advertising ────────────────────────────────────
|
||||
"_pin_unauth": {
|
||||
"vendor": "Pinterest Europe Ltd.", "vendor_country": "IE/US",
|
||||
"exact_purpose": "Pinterest Tag — Conversion-Tracking + Audience-Aufbau.",
|
||||
"data_collected": ["pinterest_user_id"],
|
||||
"ip_relevant": True,
|
||||
"tcf_purpose_ids": [4, 9, 10],
|
||||
"iab_vendor_id": 762,
|
||||
"typical_lifetime": "1 Jahr",
|
||||
"reid_risk": "high",
|
||||
"technical_necessity": "none",
|
||||
},
|
||||
"cto_dna": {
|
||||
"vendor": "Criteo S.A.", "vendor_country": "FR",
|
||||
"exact_purpose": "Criteo Dynamic Retargeting — produktspezifische "
|
||||
"Werbeauslieferung basierend auf Browser-History.",
|
||||
"data_collected": ["criteo_user_id", "product_views"],
|
||||
"ip_relevant": True,
|
||||
"tcf_purpose_ids": [4, 9, 10],
|
||||
"iab_vendor_id": 91,
|
||||
"typical_lifetime": "13 Monate",
|
||||
"reid_risk": "high",
|
||||
"technical_necessity": "none",
|
||||
"schrems_ii_status": "Criteo ist FR-basiert, aber Daten gehen auch in die USA. "
|
||||
"Multi-Region-Setup pruefen.",
|
"notes": "Trotz FR-Sitz nicht automatisch DSGVO-konform: CNIL-Bussgeld "
"EUR 40M (Juni 2023; 2022 waren EUR 60M vorgeschlagen) wegen mangelnder "
"Einwilligungs-Granularitaet.",
|
||||
},
|
||||
"afm": {
|
||||
"vendor": "Adform A/S", "vendor_country": "DK",
|
||||
"exact_purpose": "Adform Audience Matching — Cross-Device-Identifikation "
|
||||
"fuer programmatische Werbung.",
|
||||
"data_collected": ["adform_user_id", "device_signals"],
|
||||
"ip_relevant": True,
|
||||
"tcf_purpose_ids": [4, 9, 10],
|
||||
"iab_vendor_id": 50,
|
||||
"typical_lifetime": "30 Tage",
|
||||
"reid_risk": "high",
|
||||
"technical_necessity": "none",
|
||||
"schrems_ii_status": "Adform ist DK-basiert, EU-Hosting Standard. Keine "
|
||||
"Schrems-II-Probleme bei Standard-Setup.",
|
||||
},
|
||||
|
||||
# ─── Consent / Funktional (Strictly Necessary) ────────────────────
|
||||
"JSESSIONID": {
|
||||
"vendor": "Java EE / Tomcat (Site-Software)", "vendor_country": "N/A",
|
||||
"exact_purpose": "Server-Session-ID fuer Java-basierte Anwendungen.",
|
||||
"data_collected": ["session_id"],
|
||||
"ip_relevant": False,
|
||||
"typical_lifetime": "Session",
|
||||
"reid_risk": "low",
|
||||
"technical_necessity": "full",
|
||||
"notes": "Strictly necessary. Keine Einwilligung nach §25(2) TDDDG.",
|
||||
},
|
||||
"PHPSESSID": {
|
||||
"vendor": "PHP (Site-Software)", "vendor_country": "N/A",
|
||||
"exact_purpose": "PHP-Session-ID — serverseitige Sitzungs-Persistierung.",
|
||||
"data_collected": ["session_id"],
|
||||
"ip_relevant": False,
|
||||
"typical_lifetime": "Session",
|
||||
"reid_risk": "low",
|
||||
"technical_necessity": "full",
|
||||
},
|
||||
"cookie_consent": {
|
||||
"vendor": "BreakPilot Consent-Banner (own)", "vendor_country": "DE",
|
||||
"exact_purpose": "Speichert die Einwilligungsentscheidung des Nutzers "
|
||||
"pro Kategorie.",
|
||||
"data_collected": ["consent_state_per_category", "timestamp"],
|
||||
"ip_relevant": False,
|
||||
"typical_lifetime": "180 Tage",
|
||||
"reid_risk": "low",
|
||||
"technical_necessity": "full",
|
||||
"notes": "Strictly necessary nach Art. 7(1) DSGVO Nachweispflicht.",
|
||||
},
|
||||
|
||||
# ─── Templated / pattern-based entries (Suffix variabel) ──────────
|
||||
# Diese werden via regex-Lookup gefangen, der Eintrag dient als Fallback.
|
||||
"_uet_": {
|
||||
"vendor": "Microsoft Corp.", "vendor_country": "US",
|
||||
"exact_purpose": "Microsoft Universal Event Tracking — Suffix nach `_uet_`.",
|
||||
"data_collected": ["event_id"],
|
||||
"ip_relevant": True,
|
||||
"typical_lifetime": "30 Minuten",
|
||||
"reid_risk": "medium",
|
||||
"technical_necessity": "none",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
# ─── Pattern-Lookup fuer Cookies mit variablem Suffix ───────────────
|
||||
|
||||
_PATTERN_LOOKUPS: list[tuple[str, str]] = [
|
||||
(r"^_ga_[A-Z0-9_]+$", "_ga_*"),
|
||||
(r"^_gat_gtag_UA_", "_gat_gtag_UA_"),
|
||||
(r"^AMCV_", "AMCV_"),
|
||||
(r"^_uet[a-z]+", "_uet_"),
|
||||
(r"^aws-alb", "aws-alb"),
|
||||
(r"^_pk_id\.", "_pk_id"),
|
||||
(r"^_pk_ses\.", "_pk_ses"),
|
||||
]
|
||||
|
||||
|
||||
def lookup_cookie(name: str) -> CookieKnowledge | None:
    """Return rich knowledge for a cookie name, or None if unknown."""
    import re

    if not name:
        return None
    # 1) Direct hit on the exact cookie name.
    if name in KB:
        return KB[name]
    # 2) Pattern-based lookup for names with a variable suffix.
    for pattern, kb_key in _PATTERN_LOOKUPS:
        if re.search(pattern, name):
            return KB.get(kb_key)
    # 3) Strip a dotted suffix (for example ".bmw.de") and retry the base name.
    base = name.split(".", 1)[0]
    if base != name and base in KB:
        return KB[base]
    return None
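
The three lookup stages above (exact name, regex pattern, dotted-suffix strip) can be exercised in isolation. The following is a minimal sketch under assumptions: `DEMO_KB`, `DEMO_PATTERNS` and `demo_lookup` are hypothetical stand-ins for illustration, not the module's real `KB` or `lookup_cookie`.

```python
import re

# Hypothetical miniature knowledge base; the real KB holds full CookieKnowledge dicts.
DEMO_KB = {
    "_ga_*": {"vendor": "demo-analytics"},
    "_uet_": {"vendor": "demo-uet"},
    "PHPSESSID": {"vendor": "demo-php"},
}

# Same shape as _PATTERN_LOOKUPS: (regex, key into the knowledge base).
DEMO_PATTERNS = [
    (r"^_ga_[A-Z0-9_]+$", "_ga_*"),
    (r"^_uet[a-z]+", "_uet_"),
]


def demo_lookup(name):
    """Exact match first, then regex patterns, then the name before the first dot."""
    if not name:
        return None
    if name in DEMO_KB:
        return DEMO_KB[name]
    for pattern, key in DEMO_PATTERNS:
        if re.search(pattern, name):
            return DEMO_KB.get(key)
    base = name.split(".", 1)[0]
    if base != name and base in DEMO_KB:
        return DEMO_KB[base]
    return None


for cookie in ("PHPSESSID", "_ga_ABC123", "_uetsid", "PHPSESSID.bmw.de", "unknown"):
    print(cookie, "->", demo_lookup(cookie))
```

Because the pattern stage returns as soon as a regex matches, a pattern whose key is missing from the knowledge base yields `None` without reaching the suffix-strip stage.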
|
||||
|
||||
|
||||
def enrich_vendor_with_knowledge(vendor: dict) -> dict:
|
||||
"""Add per-cookie knowledge to each cookie in vendor['cookies']."""
|
||||
cookies = vendor.get("cookies") or []
|
||||
enriched = []
|
||||
for c in cookies:
|
||||
info = lookup_cookie(c.get("name", ""))
|
||||
if info:
|
||||
enriched.append({**c, "knowledge": info})
|
||||
else:
|
||||
enriched.append(c)
|
||||
return {**vendor, "cookies": enriched}
|
||||
|
||||
|
||||
# ─── Hilfen fuer Aggregate-Reports ──────────────────────────────────
|
||||
|
||||
def summarize_compliance_risk(vendor: dict) -> dict:
    """Aggregate re-ID risk and third-country status across all cookies of a vendor."""
    cookies = vendor.get("cookies") or []
    risk_counts = {"high": 0, "medium": 0, "low": 0}
    schrems_affected = 0
    technical_only = 0
    for c in cookies:
        k = c.get("knowledge") or lookup_cookie(c.get("name", ""))
        if not k:
            continue
        risk = k.get("reid_risk", "low")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1
        # Compare whole country codes ("IE/US" -> {"ie", "us"}), not substrings.
        countries = {p.strip().lower() for p in k.get("vendor_country", "").split("/")}
        status = k.get("schrems_ii_status", "").lower()
        # Heuristic: a status such as "Keine Schrems-II-Probleme ..." mentions
        # Schrems only to rule it out and must not count as affected.
        status_flags_transfer = "schrems" in status and "keine schrems" not in status
        if "us" in countries or status_flags_transfer:
            schrems_affected += 1
        if k.get("technical_necessity") == "full":
            technical_only += 1
    return {
        "reid_risk_distribution": risk_counts,
        "high_risk_cookie_count": risk_counts["high"],
        "schrems_ii_affected_cookies": schrems_affected,
        "strictly_necessary_cookies": technical_only,
        "total_classified": sum(risk_counts.values()),
    }
|
||||
|
||||
|
||||
# ─── TEMPLATE_ENTRY (fuer Kuratierung neuer Cookies) ───────────────
|
||||
|
||||
TEMPLATE_ENTRY: CookieKnowledge = {
|
||||
"vendor": "<Voller Firmenname>",
|
||||
"vendor_country": "<ISO-2 Code oder 'IE/US' bei Doppel-Standort>",
|
||||
"exact_purpose": "<1-2 Saetze was der Cookie GENAU tut>",
|
||||
"data_collected": ["<feldname_1>", "<feldname_2>"],
|
||||
"ip_relevant": False,
|
||||
"ip_anonymized": False,
|
||||
"tcf_purpose_ids": [], # TCF v2.2: 1-11
|
||||
"iab_vendor_id": None, # Aus https://iabeurope.eu/tcf-vendor-list/
|
||||
"typical_lifetime": "<Session | XX Tage | XX Monate | XX Jahre>",
|
||||
"reid_risk": "low", # low | medium | high
|
||||
"technical_necessity": "none", # none | partial | full
|
||||
"schrems_ii_status": "<Drittlandtransfer-Bewertung>",
|
||||
"eugh_rulings": [],
|
||||
"eu_alternative_cookies": [],
|
||||
"eu_alternative_vendor": "",
|
||||
"notes": "",
|
||||
}
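
When a new entry is curated from `TEMPLATE_ENTRY`, the enumerated fields are the easiest ones to get wrong. Below is a minimal sketch of a pre-merge check, assuming the allowed values are exactly those named in the template comments; `validate_entry` and the two `ALLOWED_*` sets are hypothetical helpers, not part of the module.

```python
# Allowed values as listed in the TEMPLATE_ENTRY comments (assumed to be exhaustive).
ALLOWED_REID_RISK = {"low", "medium", "high"}
ALLOWED_NECESSITY = {"none", "partial", "full"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry passed."""
    problems: list[str] = []
    risk = entry.get("reid_risk")
    if risk not in ALLOWED_REID_RISK:
        problems.append(f"invalid reid_risk: {risk}")
    necessity = entry.get("technical_necessity")
    if necessity not in ALLOWED_NECESSITY:
        problems.append(f"invalid technical_necessity: {necessity}")
    # TCF purposes are numbered 1 to 11 (see the comment in TEMPLATE_ENTRY).
    bad_ids = [p for p in entry.get("tcf_purpose_ids", []) if p not in range(1, 12)]
    if bad_ids:
        problems.append(f"invalid tcf_purpose_ids: {bad_ids}")
    return problems


good = {"reid_risk": "high", "technical_necessity": "none", "tcf_purpose_ids": [4, 9, 10]}
bad = {"reid_risk": "extreme", "technical_necessity": "full", "tcf_purpose_ids": [4, 12]}
print(validate_entry(good))
print(validate_entry(bad))
```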
|
||||
@@ -220,10 +220,16 @@ def score_vendors(vendors: list[dict]) -> list[dict]:
|
||||
flags.append("no_purpose")
|
||||
|
||||
# Country — only for external processors / controllers
|
||||
# Falls country leer ist, ableiten aus Rechtsform-Suffix im Namen.
|
||||
if country_required:
|
||||
max_score += 10
|
||||
if v.get("country"):
|
||||
score += 10
|
||||
elif _country_from_name(v.get("name", "")):
|
||||
inferred = _country_from_name(v.get("name", ""))
|
||||
v["country"] = inferred
|
||||
v["country_inferred"] = True
|
||||
score += 10
|
||||
else:
|
||||
flags.append("no_country")
|
||||
|
||||
@@ -321,3 +327,153 @@ def build_check_items(validated: list[LinkCheck]) -> list[dict]:
|
||||
"hint": hint,
|
||||
})
|
||||
return items
|
||||
|
||||
|
||||
# ─── Country inference from the legal-form suffix ──────────────────
#
# If a vendor's "country" field is empty, it can often be derived from the
# company name:
#   Adform A/S                 → DK (Aktieselskab)
#   Salesforce Inc.            → US (Incorporated)
#   Genesys ... B.V.           → NL (Besloten Vennootschap)
#   Equativ S.A.               → FR (Société Anonyme)
#   SAP SE                     → DE (Societas Europaea, mostly registered in DE)
#   Adobe ... Ireland Limited  → IE (country name inside the company name)
#   Pinterest Europe Ltd.      → IE (known-vendor override; "Ltd." alone defaults to GB)
#
# Lookup order, most specific first (see _country_from_name):
#   1) Known vendor (e.g. Google LLC, Meta Platforms Ireland)
#   2) Country name inside the company name ('Ireland', 'Deutschland')
#   3) Legal-form suffix pattern
|
||||
|
||||
import re as _re
|
||||
|
||||
_SUFFIX_COUNTRY: list[tuple[str, str]] = [
    # (pattern, ISO code). The first match wins, so multi-token suffixes come
    # before the shorter suffixes they contain. A suffix that ends in "." is
    # closed with (?!\w): \b never matches between "." and a space or the end
    # of the string, so a trailing \b would keep these patterns from firing.
    (r"\bA/S\b", "DK"),  # Aktieselskab
    (r"\bApS\b", "DK"),  # Anpartsselskab
    (r"\bAB\b", "SE"),  # Aktiebolag
    (r"\bAS\b", "NO"),  # Aksjeselskap
    (r"\bOy\b", "FI"),  # Osakeyhtiö
    (r"\bAG\b", "DE"),  # CH/AT also possible, default DE
    (r"\bGmbH\b", "DE"),
    (r"\bUG\b", "DE"),
    (r"\beG\b", "DE"),
    (r"\bKG\b", "DE"),
    (r"\bOHG\b", "DE"),
    (r"\bSE\b", "DE"),  # Societas Europaea, mostly registered in DE (e.g. SAP SE)
    (r"\bS\.A\.\sde C\.V\.(?!\w)", "MX"),  # must precede plain S.A.
    (r"\bS\.A\.(?!\w)", "FR"),  # also used in other countries, default FR
    (r"\bSAS\b", "FR"),
    (r"\bS\.A\.S\.(?!\w)", "FR"),
    (r"\bSARL\b", "FR"),
    (r"\bS\.r\.l\.(?!\w)", "IT"),
    (r"\bS\.p\.A\.(?!\w)", "IT"),
    (r"\bSpA\b", "IT"),
    (r"\bB\.V\.(?!\w)", "NL"),
    (r"\bN\.V\.(?!\w)", "NL"),
    (r"\bSL\b", "ES"),
    (r"\bd\.o\.o\.(?!\w)", "SI"),  # Slovenia
    (r"\bd\.d\.(?!\w)", "HR"),  # Croatia
    (r"\bz\s?o\.o\.(?!\w)", "PL"),
    (r"\bInc\.?\b", "US"),
    (r"\bIncorporated\b", "US"),
    (r"\bCorp\.?\b", "US"),
    (r"\bCorporation\b", "US"),
    (r"\bLLC\b", "US"),
    (r"\bL\.L\.C\.(?!\w)", "US"),
    (r"\bPte\.?\sLtd\.?\b", "SG"),  # must precede plain Ltd
    (r"\bPty\b", "AU"),  # must precede plain Ltd ("Pty Ltd")
    (r"\bLtd\.?\b", "GB"),  # UK Limited, default
    (r"\bLimited\b", "GB"),
    (r"\bPLC\b", "GB"),
    (r"\bK\.K\.(?!\w)", "JP"),  # Kabushiki-Kaisha
]
|
||||
|
||||
# Country-Namen im Firmen-Namen (z.B. "Adobe Systems Software Ireland Limited")
|
||||
_COUNTRY_NAME_TOKENS: list[tuple[str, str]] = [
|
||||
("ireland", "IE"),
|
||||
("deutschland", "DE"),
|
||||
("germany", "DE"),
|
||||
("netherlands", "NL"),
|
||||
("france", "FR"),
|
||||
("united kingdom", "GB"),
|
||||
("uk", "GB"),
|
||||
("usa", "US"),
|
||||
("united states", "US"),
|
||||
("austria", "AT"),
|
||||
("oesterreich", "AT"),
|
||||
("schweiz", "CH"),
|
||||
("switzerland", "CH"),
|
||||
("luxembourg", "LU"),
|
||||
("luxemburg", "LU"),
|
||||
("denmark", "DK"),
|
||||
("daenemark", "DK"),
|
||||
("sweden", "SE"),
|
||||
("schweden", "SE"),
|
||||
("norway", "NO"),
|
||||
("norwegen", "NO"),
|
||||
("finland", "FI"),
|
||||
("finnland", "FI"),
|
||||
]
|
||||
|
||||
# Bekannte Vendors mit eindeutigem Sitz (override)
|
||||
_KNOWN_VENDOR_COUNTRY: dict[str, str] = {
|
||||
"google inc": "US",
|
||||
"google llc": "US",
|
||||
"google ireland": "IE",
|
||||
"meta platforms ireland": "IE",
|
||||
"facebook ireland": "IE",
|
||||
"amazon.com inc": "US",
|
||||
"amazon web services": "US",
|
||||
"amazon web services inc": "US",
|
||||
"linkedin inc": "US",
|
||||
"salesforce inc": "US",
|
||||
"salesforce.com": "US",
|
||||
"outbrain inc": "US",
|
||||
"taboola inc": "US",
|
||||
"pinterest europe ltd": "IE",
|
||||
"intuition machines inc": "US",
|
||||
"akamai technologies inc": "US",
|
||||
"criteo s.a": "FR",
|
||||
"criteo sa": "FR",
|
||||
"adform a/s": "DK",
|
||||
"speedcurve limited": "GB",
|
||||
"longtail ad solutions": "US",
|
||||
"genesys cloud services b.v": "NL",
|
||||
"qualtrics": "US",
|
||||
"teads sa": "FR",
|
||||
"teads s.a": "FR",
|
||||
"salesviewer gmbh": "DE",
|
||||
"baqend gmbh": "DE",
|
||||
"zenweshare sas": "FR",
|
||||
"nayoki gmbh": "DE",
|
||||
"psyma": "DE",
|
||||
"matomo": "NZ", # InnoCraft NZ aber EU-hostbar
|
||||
"adobe systems software ireland": "IE",
|
||||
"microsoft corporation": "US",
|
||||
"microsoft corp": "US",
|
||||
}
|
||||
|
||||
|
||||
def _country_from_name(vendor_name: str) -> str:
    """Best effort: derive the ISO-2 country code from the vendor name."""
    if not vendor_name:
        return ""
    # Vendor names often have the form "company — tool"; only the company part is used.
    firm = vendor_name.split(" — ")[0].strip()
    firm_l = firm.lower()

    # 1) Known vendor lookup (most specific)
    for k, v in _KNOWN_VENDOR_COUNTRY.items():
        if k in firm_l:
            return v
    # 2) Country name inside the company name. Whole-word match, so short
    #    tokens such as "uk" or "usa" do not fire inside longer words
    #    (for example "Usabilla").
    for token, code in _COUNTRY_NAME_TOKENS:
        if _re.search(rf"\b{_re.escape(token)}\b", firm_l):
            return code
    # 3) Legal-form suffix
    for pattern, code in _SUFFIX_COUNTRY:
        if _re.search(pattern, firm):
            return code
    return ""
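
One detail worth checking when maintaining suffix patterns like those in `_SUFFIX_COUNTRY`: a `\b` placed after a literal dot behaves differently from a `\b` placed after a letter. The sketch below uses only the standard `re` module; the vendor names are sample inputs chosen for illustration.

```python
import re

name = "Criteo S.A."

# \b needs a word character on exactly one side. Between "." and the end of
# the string (or a space) there is none, so this trailing \b cannot match.
with_boundary = re.search(r"\bS\.A\.\b", name)

# A negative lookahead only requires that no word character follows.
with_lookahead = re.search(r"\bS\.A\.(?!\w)", name)

# The lookahead still rejects the longer suffix "S.A.S.".
longer_suffix = re.search(r"\bS\.A\.(?!\w)", "Zenweshare S.A.S.")

print(with_boundary, with_lookahead, longer_suffix)
```

Suffixes that end in a letter (`GmbH`, `LLC`) or allow an optional dot (`Inc\.?`) are unaffected, because the regex engine can place the boundary directly after the last letter.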
|
||||
|
||||
@@ -0,0 +1,350 @@
|
"""
Doc anchor locator: for a finding, find the best insertion point in the
existing document.

Primary strategy: BGE-M3 embedding match between a per-finding anchor query
and all paragraphs of the doc. This is real semantic matching: BMW writes
"Verarbeiter" instead of "Auftragsverarbeiter", which a keyword match would
miss and the embedding match still catches.

Fallback: a static position hint per finding type (used when the embedding
service is unreachable or no paragraph scores high enough).

Output per anchor:
- anchor_phrase : excerpt of the original text (None for fallbacks)
- position_hint : "Nach Absatz X von Y: '...'"
- confidence : 'high' | 'medium' | 'low'
- score : float (cosine similarity plus heading bonus)
- method : 'embedding' | 'embedding-no-match' | 'fallback'
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import threading
|
||||
from typing import Iterable
|
||||
|
||||
import httpx
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
|
||||
|
||||
# Pro Finding-Typ eine semantisch reiche Anchor-Query, die der Embedding-
|
||||
# Matcher gegen den Doc-Text wirft. Reicher als die kurze MC-check_question.
|
||||
# Sucht NICHT den Pflichttext selbst (der fehlt ja) sondern den Absatz wo
|
||||
# der Fix HINEIN-soll — also den thematisch verwandten Kontext.
|
||||
_ANCHOR_QUERIES: list[tuple[str, str, str]] = [
|
||||
# (finding_label_partial, anchor_query, fallback_hint)
|
||||
(
|
||||
"Auftragsverarbeiter erwaehnt",
|
||||
"Empfaenger der Daten Verarbeiter Dienstleister Cloud Hosting CRM "
|
||||
"Auftragsverarbeitung Weitergabe Datenuebermittlung an Dritte",
|
||||
"Im Abschnitt 'Empfaenger' oder 'Datenuebermittlung'",
|
||||
),
|
||||
(
|
||||
"Automatisierte Entscheidungen",
|
||||
"Betroffenenrechte automatisierte Entscheidung Profiling Logik "
|
||||
"Tragweite Auswirkung Art. 22 DSGVO",
|
||||
"Am Ende des Abschnitts 'Betroffenenrechte'",
|
||||
),
|
||||
(
|
||||
"Konkrete Aufsichtsbehoerde",
|
||||
"Beschwerderecht Datenschutzaufsicht Aufsichtsbehoerde Beschwerde "
|
||||
"bei der Behoerde einreichen Recht auf Beschwerde",
|
||||
"Im Abschnitt 'Beschwerderecht'",
|
||||
),
|
||||
(
|
||||
"Angemessenheitsbeschluss",
|
||||
"Drittlandtransfer USA Standardvertragsklauseln SCC DPF Data Privacy "
|
||||
"Framework Angemessenheitsbeschluss internationale Datenuebermittlung",
|
||||
"Im Abschnitt 'Drittlandtransfer'",
|
||||
),
|
||||
(
|
||||
"Anschrift des Verantwortlichen",
|
||||
"Verantwortlicher Verantwortliche Stelle Datenschutz Betreiber dieser "
|
||||
"Website Firma Anschrift Kontakt",
|
||||
"Am Anfang der Datenschutzerklaerung / Cookie-Richtlinie",
|
||||
),
|
||||
(
|
||||
"Konkrete Cookie-Namen",
|
||||
"Welche Cookies verwenden wir Cookie-Tabelle Liste der Cookies "
|
||||
"Cookie-Kategorien Auflistung der Cookies Name Anbieter Zweck",
|
||||
"Im Abschnitt 'Welche Cookies verwenden wir?'",
|
||||
),
|
||||
(
|
||||
"Konkrete Anbieter/Dienste",
|
||||
"Drittanbieter Dienste Anbieter wir nutzen folgende Dienste "
|
||||
"Empfaenger der Cookie-Daten Liste der Dienstleister",
|
||||
"In der Drittanbieter-Liste der Cookie-Richtlinie",
|
||||
),
|
||||
(
|
||||
"Analytics-/Statistik-Tools konkret benannt",
|
||||
"Statistik Analytics Reichweitenmessung Webanalyse Tracking "
|
||||
"Google Analytics Matomo Adobe Analytics",
|
||||
"Im Abschnitt 'Statistik / Analyse-Cookies'",
|
||||
),
|
||||
(
|
||||
"Konkrete Speicherdauer",
|
||||
"Speicherdauer Lebensdauer wie lange Ablauf Cookie-Tabelle Spalte "
|
||||
"Speicherdauer pro Cookie",
|
||||
"In der Cookie-Tabelle pro Eintrag",
|
||||
),
|
||||
(
|
||||
"Opt-Out-Links",
|
||||
"Widerruf widersprechen deaktivieren Cookie-Einstellungen aendern "
|
||||
"Opt-Out Einstellungen anpassen",
|
||||
"Im Abschnitt 'Wie kann ich widersprechen?'",
|
||||
),
|
||||
(
|
||||
"Privacy-Policy-Links",
|
||||
"Datenschutzerklaerung des Drittanbieters Privacy Policy Link auf "
|
||||
"Datenschutzhinweise der Drittanbieter",
|
||||
"Im Drittanbieter-Listing der Cookie-Richtlinie",
|
||||
),
|
||||
(
|
||||
"Verbraucherstreitbeilegung",
|
||||
"Online-Streitbeilegung OS-Plattform Verbraucherschlichtungsstelle "
|
||||
"Streitbeilegung Verbraucher",
|
||||
"Am Ende des Impressums (eigener Abschnitt 'Streitbeilegung')",
|
||||
),
|
||||
(
|
||||
"Rechtswidriger Haftungsausschluss",
|
||||
"Haftung Disclaimer wir distanzieren uns Links zu externen Webseiten "
|
||||
"Haftungsausschluss Drittinhalte",
|
||||
"Am Ende des Impressums (Disclaimer-Absatz)",
|
||||
),
|
||||
(
|
||||
"Name der vertretungsberechtigten",
|
||||
"Vertreten durch Vorstand Geschaeftsfuehrung Geschaeftsfuehrer "
|
||||
"vertretungsberechtigt Repraesentant",
|
||||
"Im Impressum nach Firmenname + Anschrift",
|
||||
),
|
||||
(
|
||||
"Zustaendige Kammer",
|
||||
"Berufsrecht Kammer Aufsichtsbehoerde berufsrechtliche Angaben "
|
||||
"zustaendige Kammer",
|
||||
"Im Impressum im Abschnitt 'Berufsrechtliche Angaben'",
|
||||
),
|
||||
(
|
||||
"Drittlaender",
|
||||
"Drittland Drittlaender USA Indien China internationale Datenuebermittlung "
|
||||
"Datenexport in Nicht-EU-Staaten",
|
||||
"Im Abschnitt 'Drittlandtransfer'",
|
||||
),
|
||||
(
|
||||
"Schutzgarantien",
|
||||
"Schutzgarantien Schutzvorkehrungen Schutzmassnahmen SCC "
|
||||
"Standardvertragsklauseln einsehen Anforderung",
|
||||
"Im Abschnitt 'Drittlandtransfer / Schutzgarantien'",
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
# ─── Thread-local Cache fuer Doc-Chunks + Embeddings ───────────────
|
||||
# Pro Compliance-Check-Run werden die Doc-Texte einmal embedded und im
|
||||
# Thread-Local-Storage gehalten, damit mehrere Findings im selben Run
|
||||
# nicht jeweils neu embedded werden.
|
||||
|
||||
_tls = threading.local()
|
||||
|
||||
|
||||
def _get_cache() -> dict:
|
||||
if not hasattr(_tls, "cache"):
|
||||
_tls.cache = {}
|
||||
return _tls.cache
|
||||
|
||||
|
||||
def reset_cache() -> None:
|
||||
"""Per-request-cache leeren (sollte am Start jedes Doc-Check-Runs aufgerufen
|
||||
werden, damit Vorgaenger-Daten kein Leak verursachen)."""
|
||||
if hasattr(_tls, "cache"):
|
||||
_tls.cache = {}
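
The cache above relies on `threading.local`, which gives every thread its own attribute namespace. A minimal sketch of that isolation, using only the standard library (`get_cache` mirrors `_get_cache`; the `worker` function and the stored value are illustrative assumptions):

```python
import threading

_tls = threading.local()


def get_cache() -> dict:
    """Create the per-thread dict lazily, as _get_cache does."""
    if not hasattr(_tls, "cache"):
        _tls.cache = {}
    return _tls.cache


# Fill the cache in the main thread.
get_cache()["dse"] = "cached in main thread"

seen_in_worker = {}


def worker() -> None:
    # A different thread starts with its own, empty cache.
    seen_in_worker["keys"] = list(get_cache().keys())


t = threading.Thread(target=worker)
t.start()
t.join()

print(seen_in_worker["keys"], list(get_cache().keys()))
```

The flip side is that a pooled worker thread keeps its cache between requests, which is why `reset_cache()` is meant to be called at the start of each run.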
|
||||
|
||||
|
||||
# ─── Helfer ────────────────────────────────────────────────────────
|
||||
|
||||
def _normalize(text: str) -> str:
|
||||
return (text or "").lower().replace("\xad", "").replace("ß", "ss")
|
||||
|
||||
|
||||
def _split_paragraphs(text: str) -> list[str]:
|
||||
"""Split a doc into paragraphs (by double newline, fallback single)."""
|
||||
if not text:
|
||||
return []
|
||||
paras = re.split(r"\n\s*\n", text)
|
||||
if len(paras) < 3:
|
||||
paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
|
||||
return [p.strip() for p in paras if p.strip()]
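
The two-stage split is easiest to see on sample input. This sketch repeats the function under the name `split_paragraphs` so it runs on its own; the two sample texts are made up for illustration.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Blank-line split first; sentence split if that gives fewer than 3 parts."""
    if not text:
        return []
    paras = re.split(r"\n\s*\n", text)
    if len(paras) < 3:
        # Split after ., ? or ! when the next token starts with a capital letter.
        paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
    return [p.strip() for p in paras if p.strip()]


structured = "Impressum\n\nAnbieter: Beispiel GmbH\n\nKontakt: info@example.org"
flat = "Wir nutzen Cookies. Details stehen unten. Ihre Rechte bleiben unberuehrt."

print(split_paragraphs(structured))
print(split_paragraphs(flat))
```

A text with blank lines keeps its paragraphs, while a flat text without them falls back to one chunk per sentence, so the embedding matcher still receives several candidates.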
|
||||
|
||||
|
||||
def _embed_sync(texts: list[str], timeout: float = 60.0,
                batch_size: int = 32) -> list[list[float]]:
    """Synchronous batch embed call (anchor localisation runs inside the
    sync HTML render, not in an async context).

    Always returns exactly one vector per input text. A failed or malformed
    sub-batch yields empty vectors, so indices stay aligned with `texts`."""
    if not texts:
        return []
    out: list[list[float]] = []
    with httpx.Client(timeout=timeout) as client:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                r = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
                # A short or long response would shift every later paragraph.
                if len(vecs) != len(batch):
                    raise ValueError(
                        f"expected {len(batch)} embeddings, got {len(vecs)}")
                out.extend(vecs)
            except Exception as e:
                logger.warning("Anchor embed sub-batch [%d-%d] failed: %s",
                               i, i + len(batch), e)
                out.extend([[] for _ in batch])
    return out
|
||||
|
||||
|
||||
def _cosine(a: list[float], b: list[float]) -> float:
|
||||
if not a or not b or len(a) != len(b):
|
||||
return 0.0
|
||||
dot = sum(x * y for x, y in zip(a, b))
|
||||
na = math.sqrt(sum(x * x for x in a))
|
||||
nb = math.sqrt(sum(y * y for y in b))
|
||||
if na == 0 or nb == 0:
|
||||
return 0.0
|
||||
return dot / (na * nb)
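
The score computed above is the usual cosine similarity, the dot product divided by the product of the two vector lengths. A small worked sketch with toy two-dimensional vectors (real BGE-M3 vectors are much longer; `cosine` repeats `_cosine` so the block runs on its own):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for empty, mismatched or zero-length vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


parallel = cosine([1.0, 2.0], [2.0, 4.0])       # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])     # no overlap
failed_embedding = cosine([], [0.3, 0.4])       # empty vector from a failed batch

print(parallel, orthogonal, failed_embedding)
```

Returning 0.0 for an empty vector is what lets a failed embedding sub-batch drop out of the ranking instead of raising.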
|
||||
|
||||
|
||||
def _doc_paragraphs_and_vectors(
|
||||
doc_id: str, doc_text: str,
|
||||
) -> tuple[list[str], list[list[float]]]:
|
||||
"""Lazy-cache: Absaetze + Vektoren pro Doc-ID. Wird genau einmal pro
|
||||
Doc und Run berechnet."""
|
||||
cache = _get_cache()
|
||||
if doc_id in cache:
|
||||
return cache[doc_id]
|
||||
|
||||
paras = _split_paragraphs(doc_text)
|
||||
if not paras:
|
||||
cache[doc_id] = ([], [])
|
||||
return cache[doc_id]
|
||||
|
||||
vecs = _embed_sync(paras)
|
||||
cache[doc_id] = (paras, vecs)
|
||||
return cache[doc_id]
|
||||
|
||||
|
||||
def _keyword_fallback(fl: str, doc_text: str) -> dict | None:
    """Fallback when the embedding service is unavailable: return the static
    position hint of the first anchor query whose label occurs in `fl`.
    `doc_text` is currently not inspected."""
    for label_partial, _query, fallback_hint in _ANCHOR_QUERIES:
        if _normalize(label_partial) in fl:
            return {
                "anchor_phrase": None,
                "position_hint": fallback_hint,
                "confidence": "low",
                "method": "fallback",
            }
    return None
|
||||
|
||||
|
||||
def locate_anchor(
|
||||
finding_label: str,
|
||||
doc_text: str,
|
||||
doc_id: str | None = None,
|
||||
) -> dict | None:
|
||||
"""Fuer ein Finding den passendsten Anker-Absatz im Doc-Text finden.
|
||||
|
||||
Primary: Embedding-Match. Sekundaer: Keyword-Hits als Bonus. Fallback:
|
||||
rein keyword-basiert wenn Embedding-Service nicht erreichbar ist.
|
||||
|
||||
`doc_id` ist ein cache-key (z.B. doc_type oder url). Wenn None, wird
|
||||
aus dem doc_text-Hash abgeleitet.
|
||||
"""
|
||||
if not doc_text or not finding_label:
|
||||
return None
|
||||
|
||||
fl = _normalize(finding_label)
|
||||
|
||||
# Welche Anchor-Query matched dieses Finding?
|
||||
query = None
|
||||
fallback_hint = None
|
||||
matched_label = None
|
||||
for label_partial, q, fb in _ANCHOR_QUERIES:
|
||||
if _normalize(label_partial) in fl:
|
||||
query, fallback_hint, matched_label = q, fb, label_partial
|
||||
break
|
||||
if not query:
|
||||
return None
|
||||
|
||||
doc_id = doc_id or f"doc-{hash(doc_text) & 0xffffffff:08x}"
|
||||
|
||||
# 1) Embedding-Match
|
||||
paras, doc_vecs = _doc_paragraphs_and_vectors(doc_id, doc_text)
|
||||
if not paras:
|
||||
return None
|
||||
|
||||
embeddings_available = any(v for v in doc_vecs)
|
||||
if not embeddings_available:
|
||||
return _keyword_fallback(fl, doc_text)
|
||||
|
||||
try:
|
||||
q_vec = _embed_sync([query])[0] if query else None
|
||||
except Exception:
|
||||
q_vec = None
|
||||
|
||||
if not q_vec:
|
||||
return _keyword_fallback(fl, doc_text)
|
||||
|
||||
# Per-Absatz Score = cosine + Heading-Bonus
|
||||
best_idx = -1
|
||||
best_score = 0.0
|
||||
for i, (p, dv) in enumerate(zip(paras, doc_vecs)):
|
||||
if not dv:
|
||||
continue
|
||||
sim = _cosine(q_vec, dv)
|
||||
# Heading-Bonus: kurze Absaetze + Markdown-Heading-Marker
|
||||
if len(p.split()) <= 8 or p.strip().startswith("#"):
|
||||
sim += 0.05
|
||||
if sim > best_score:
|
||||
best_score = sim
|
||||
best_idx = i
|
||||
|
||||
# Konfidenz-Schwellen — kalibriert anhand BMW-Run
|
||||
if best_idx < 0 or best_score < 0.40:
|
||||
# Zu schwacher Match — Fallback verwenden
|
||||
return {
|
||||
"anchor_phrase": None,
|
||||
"position_hint": fallback_hint,
|
||||
"confidence": "low",
|
||||
"score": round(best_score, 3) if best_idx >= 0 else 0,
|
||||
"method": "embedding-no-match",
|
||||
}
|
||||
|
||||
if best_score >= 0.62:
|
||||
confidence = "high"
|
||||
elif best_score >= 0.50:
|
||||
confidence = "medium"
|
||||
else:
|
||||
confidence = "low"
|
||||
|
||||
anchor = paras[best_idx]
|
||||
words = anchor.split()
|
||||
snippet = " ".join(words[:30]) + ("…" if len(words) > 30 else "")
|
||||
return {
|
||||
"anchor_phrase": snippet,
|
||||
"anchor_index": best_idx,
|
||||
"total_paragraphs": len(paras),
|
||||
"position_hint": f"Nach Absatz {best_idx + 1} von {len(paras)}: '{snippet}'",
|
||||
"confidence": confidence,
|
||||
"score": round(best_score, 3),
|
||||
"method": "embedding",
|
||||
}
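
The scoring and banding logic in `locate_anchor` can be tried without the embedding service by feeding it precomputed similarities. The following is a sketch under that assumption: `pick_anchor` is a hypothetical helper, the thresholds 0.40 / 0.50 / 0.62 and the 0.05 heading bonus are taken from the function above, and the sample paragraphs and similarity values are invented.

```python
def pick_anchor(paras: list[str], sims: list[float]) -> dict:
    """Pick the best paragraph from precomputed cosine similarities."""
    best_idx, best_score = -1, 0.0
    for i, (p, sim) in enumerate(zip(paras, sims)):
        # Heading bonus: short paragraphs or Markdown headings get +0.05.
        if len(p.split()) <= 8 or p.strip().startswith("#"):
            sim += 0.05
        if sim > best_score:
            best_idx, best_score = i, sim
    if best_idx < 0 or best_score < 0.40:
        return {"index": None, "confidence": "low", "method": "embedding-no-match"}
    if best_score >= 0.62:
        confidence = "high"
    elif best_score >= 0.50:
        confidence = "medium"
    else:
        confidence = "low"
    return {"index": best_idx, "confidence": confidence,
            "score": round(best_score, 3), "method": "embedding"}


paras = [
    "## Empfaenger der Daten",
    "Wir geben Daten an Dienstleister weiter, die uns bei Hosting und "
    "Kundenservice unterstuetzen.",
    "Kontakt",
]
hit = pick_anchor(paras, [0.58, 0.61, 0.10])
miss = pick_anchor(
    ["Ein langer Absatz ohne thematischen Bezug zum gesuchten Finding steht hier."],
    [0.2],
)
print(hit)
print(miss)
```

In the first call the heading wins although the body paragraph has the higher raw similarity, which is the intended effect of the bonus: the fix is anchored at the section heading.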
|
||||
|
||||
|
||||
def annotate_findings_with_anchors(
|
||||
findings: Iterable[dict], doc_text: str, doc_id: str | None = None,
|
||||
) -> list[dict]:
|
||||
"""Pro Finding den Anchor suchen und das Dict um 'anchor' erweitern."""
|
||||
out = []
|
||||
for f in findings:
|
||||
a = locate_anchor(f.get("label") or f.get("title") or "", doc_text, doc_id)
|
||||
out.append({**f, "anchor": a})
|
||||
return out
|
||||
@@ -0,0 +1,353 @@
|
"""
Action recipes: one actionable instruction per finding type, covering
WHAT to do, WHY (legal basis), HOW to word it (concrete text block) and
WHERE to insert it (hint at the doc section).

Without recipes a finding is only a diagnosis. With a recipe the customer
knows immediately which sentence to put where.

Usage:
    from compliance.services.finding_action_recipes import recipe_for
    rec = recipe_for("no_cookies_listed")  # → dict with what/why/fix_text/where/example
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import TypedDict
|
||||
|
||||
|
||||
class ActionRecipe(TypedDict, total=False):
|
||||
what: str # 1-Satz Diagnose
|
||||
why: str # Rechtsgrundlage / Risiko
|
||||
fix_text: str # konkreter Textbaustein zum Einfuegen
|
||||
where: str # in welchem Doc-Abschnitt
|
||||
example: str # echtes Anwendungsbeispiel
|
||||
severity: str # 'critical' | 'high' | 'medium' | 'low'
|
||||
|
||||
|
||||
# ─── Vendor-/Cookie-Findings (im VVT-Block) ────────────────────────
|
||||
|
||||
VENDOR_FINDINGS: dict[str, ActionRecipe] = {
|
||||
|
||||
"no_cookies_listed": {
|
||||
"what": "Anbieter ist erfasst, aber es sind keine konkreten Cookies "
|
||||
"dokumentiert.",
|
||||
"why": "Die DSK-Orientierungshilfe Telemedien 2024 fordert pro Anbieter "
|
||||
"eine Auflistung aller gesetzten Cookies mit Name + Zweck + "
|
||||
"Speicherdauer. 'Verwendet Cookies' ohne Details erfuellt "
|
||||
"Art. 13 Abs. 1 lit. e DSGVO nicht.",
|
||||
"fix_text": "Ergaenzen Sie pro Cookie eine Zeile mit folgenden Feldern:\n"
|
||||
" • Cookie-Name (z.B. _ga, _fbp, NID)\n"
|
||||
" • Setzender Anbieter (Firma + Sitzland)\n"
|
||||
" • Zweck (1 Satz, z.B. 'Reichweitenmessung pseudonym')\n"
|
||||
" • Speicherdauer (z.B. '2 Jahre', '24h', 'Session')",
|
||||
"where": "Cookie-Richtlinie unter dem Abschnitt 'Kategorien' "
|
||||
"(Notwendig / Marketing / Statistik / ...).",
|
||||
"example": "_ga (Google Ireland Ltd.) — Statistik — eindeutige "
|
||||
"Besucher-ID — Speicherdauer 2 Jahre",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"no_country": {
|
||||
"what": "Anbieter-Sitzland ist nicht dokumentiert.",
|
||||
"why": "Art. 30 Abs. 1 lit. d DSGVO fordert die Kategorien der Empfaenger "
|
||||
"inkl. Sitz. Bei Drittland-Empfaengern (US, IN, CN ...) ist das "
|
||||
"zusaetzlich Pflicht nach Art. 13 Abs. 1 lit. f DSGVO.",
|
||||
"fix_text": "Fuegen Sie hinter dem Anbieternamen das Sitzland in "
|
||||
"Klammern ein. Bei Drittland (ausserhalb EU/EWR) zusaetzlich "
|
||||
"den Transfer-Mechanismus (SCC, DPF, Angemessenheitsbeschluss).",
|
||||
"where": "VVT-Eintrag bei 'Empfaenger' oder im Cookie-Anbieter-Listing.",
|
||||
"example": "Google Ireland Ltd., Dublin, Irland (EU). Bei US-Transfer: "
|
||||
"'Google LLC, Mountain View, US — DPF-zertifiziert'.",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"no_privacy_url": {
|
||||
"what": "Kein direkter Link zur Datenschutzerklaerung des Anbieters.",
|
"why": "Transparenzgebot (Art. 5 Abs. 1 lit. a, Art. 12 Abs. 1 DSGVO): der "
"Nutzer muss die Datenverarbeitung beim Drittanbieter eigenverantwortlich "
"nachvollziehen koennen.",
|
||||
"fix_text": "Ergaenzen Sie einen klickbaren Link zur Datenschutzerklaerung "
|
||||
"des Anbieters direkt neben dem Anbieternamen.",
|
||||
"where": "Cookie-Richtlinie pro Vendor-Eintrag. Idealerweise als "
|
||||
"letzter Spalteneintrag oder Inline-Link.",
|
||||
"example": "Google Analytics — Datenschutz: https://policies.google.com/privacy",
|
||||
"severity": "medium",
|
||||
},
|
||||
|
||||
"broken_privacy_url": {
|
||||
"what": "Der angegebene Privacy-Link gibt einen Fehler zurueck "
|
||||
"(404 / 403 / Timeout).",
|
"why": "Ein toter Link erfuellt das Transparenzgebot (Art. 12 Abs. 1 DSGVO) "
"nicht: die Transparenz-Pflicht laeuft ins Leere.",
|
||||
"fix_text": "1. Pruefen Sie den Link aus einem normalen Browser. "
|
||||
"Funktioniert er dort? → Anbieter blockt automatisierte Pruefer (kein Mangel).\n"
|
||||
"2. Funktioniert er nicht? → Aktualisieren Sie den Link in Ihrer "
|
||||
"Cookie-Richtlinie auf die heute gueltige Privacy-URL des Anbieters.",
|
||||
"where": "Cookie-Richtlinie / Drittanbieter-Liste.",
|
||||
"example": "Wenn Adobe-Privacy-Link 404 gibt, ersetzen durch "
|
||||
"https://www.adobe.com/privacy/policy.html",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"no_opt_out_url": {
|
||||
"what": "Kein Opt-Out-Link fuer diesen Anbieter dokumentiert.",
|
||||
"why": "Art. 7 Abs. 3 DSGVO: der Widerruf der Einwilligung muss so "
|
||||
"einfach wie die Erteilung sein. Pro Drittanbieter muss eine "
|
||||
"Opt-Out-Moeglichkeit angeboten werden.",
|
||||
"fix_text": "Recherchieren Sie den offiziellen Opt-Out-Link des "
|
||||
"Anbieters und ergaenzen Sie ihn. Falls Ihr Cookie-Banner "
|
||||
"ein 'Einstellungen aendern' anbietet, ist das oft "
|
||||
"ausreichend — der Link sollte trotzdem als Backup "
|
||||
"dokumentiert sein.",
|
||||
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
|
||||
"example": "Google Analytics Opt-Out: https://tools.google.com/dlpage/gaoptout",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"broken_opt_out": {
|
||||
"what": "Der angegebene Opt-Out-Link funktioniert nicht "
|
||||
"(404 / 403 / Timeout).",
|
||||
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit ohne funktionierenden "
|
||||
"Link ist nicht gegeben.",
|
||||
"fix_text": "1. Pruefen Sie ob der Link aus einem Browser funktioniert. "
|
||||
"403 kommt oft von Bot-Schutz und ist kein realer Mangel.\n"
|
||||
"2. Bei echtem 404: aktualisieren Sie auf den heute gueltigen "
|
||||
"Opt-Out-Link.\n"
|
||||
"3. Alternativ verlinken Sie auf Ihren eigenen Cookie-Banner "
|
||||
"'Einstellungen aendern'-Trigger.",
|
||||
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
|
||||
"example": "Wenn Criteo-Opt-Out 403 gibt, ist das oft Cloudflare-Schutz — "
|
||||
"Link aus dem Browser klickbar → kein Mangel. Alternativ: "
|
||||
"https://www.youronlinechoices.com/de/",
|
||||
"severity": "medium",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
# ─── Doc-check findings (in the DSE/Impressum/Cookie check block) ───
|
||||
|
||||
DOC_CHECK_FINDINGS: dict[str, ActionRecipe] = {
|
||||
|
||||
"Auftragsverarbeiter erwaehnt": {
|
||||
"what": "Auftragsverarbeiter (Hosting, CRM, Newsletter) sind nicht "
|
||||
"explizit erwaehnt + es fehlt Hinweis auf AVVs nach Art. 28 DSGVO.",
|
||||
"why": "Art. 28 DSGVO Auftragsverarbeitung erfordert dokumentierte AVVs. "
|
||||
"Art. 13(1)(e) DSGVO erfordert die Nennung der Empfaenger-"
|
||||
"Kategorien. Fehlt der Hinweis = klassischer Pruefpunkt der "
|
||||
"Aufsichtsbehoerden.",
|
||||
"fix_text": "Wir setzen sorgfaeltig ausgewaehlte Auftragsverarbeiter "
|
||||
"(z.B. fuer Hosting, Webanalyse, Customer Service) ein. Mit "
|
||||
"allen Auftragsverarbeitern haben wir Vertraege zur "
|
||||
"Auftragsverarbeitung nach Art. 28 DSGVO geschlossen. Die "
|
||||
"Auftragsverarbeiter handeln ausschliesslich auf unsere "
|
||||
"Weisung und sind vertraglich zu angemessenen technischen "
|
||||
"und organisatorischen Massnahmen verpflichtet.",
|
||||
"where": "Datenschutzerklaerung im Abschnitt 'Empfaenger' oder "
|
||||
"'Datenuebermittlung'. Direkt nach der Aufzaehlung der "
|
||||
"Empfaenger-Kategorien.",
|
||||
"example": "Empfaenger: BMW Niederlassungen, Vertragshaendler, "
|
||||
"Auftragsverarbeiter (Cloud-Hosting AWS, CRM Salesforce, "
|
||||
"Webanalyse Adobe Analytics — mit allen sind AVVs nach "
|
||||
"Art. 28 DSGVO geschlossen).",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Automatisierte Entscheidungen / Profiling": {
|
||||
"what": "Keine Aussage zu automatisierten Einzelentscheidungen "
|
||||
"oder Profiling nach Art. 22 DSGVO.",
|
||||
"why": "Art. 13 Abs. 2 lit. f DSGVO: bei automatisierten "
|
||||
"Einzelentscheidungen muessen Logik, Tragweite und Auswirkungen "
|
||||
"erklaert werden. Bei KEINEM Profiling muss das explizit "
|
||||
"verneint werden — sonst lassen Aufsichtsbehoerden Unsicherheit "
|
||||
"offen.",
|
||||
"fix_text": "Variante A (kein Profiling):\n"
|
||||
" 'Es findet keine automatisierte Entscheidungsfindung "
|
||||
"im Sinne des Art. 22 DSGVO statt. Sofern wir Profiling "
|
||||
"zur Anzeige personalisierter Inhalte einsetzen, erfolgt "
|
||||
"dies ausschliesslich auf Basis Ihrer Einwilligung und "
|
||||
"wird im Abschnitt [X] erlaeutert.'\n\n"
|
||||
"Variante B (Profiling vorhanden, z.B. fuer Werbung):\n"
|
||||
" 'Wir nutzen Profiling zur Anzeige personalisierter "
|
||||
"Werbung. Die Logik basiert auf [Klick-Historie / "
|
||||
"Besuchsverhalten / Praeferenzen]. Tragweite: "
|
||||
"Anpassung der angezeigten Anzeigen. Auswirkung: keine "
|
||||
"rechtlichen oder erheblichen Auswirkungen — Sie koennen "
|
||||
"jederzeit widersprechen unter [Link/Kontakt].'",
|
||||
"where": "Datenschutzerklaerung am Ende des Abschnitts "
|
||||
"'Betroffenenrechte' oder als eigener Absatz unter "
|
||||
"'Automatisierte Entscheidungen'.",
|
||||
"example": "Standardformulierung Variante A — falls Sie KEIN Profiling "
|
||||
"betreiben, ist das der sichere Default-Text.",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Konkrete Aufsichtsbehoerde benannt": {
|
||||
"what": "Aufsichtsbehoerde wird nicht namentlich genannt.",
|
||||
"why": "Art. 13(2)(d) DSGVO: das Beschwerderecht muss konkret "
|
||||
"kommuniziert werden — nicht 'die zustaendige Behoerde', sondern "
|
||||
"Name + Anschrift + Website.",
|
||||
"fix_text": "Sie haben das Recht, sich bei einer Datenschutz-"
|
||||
"Aufsichtsbehoerde zu beschweren. Zustaendig ist:\n\n"
|
||||
" [Aufsichtsbehoerde mit Namen + Strasse + PLZ + Ort + Web]\n\n"
|
||||
"Beispiel: 'Bayerisches Landesamt fuer Datenschutzaufsicht "
|
||||
"(BayLDA), Promenade 18, 91522 Ansbach, www.lda.bayern.de'",
|
||||
"where": "Datenschutzerklaerung im Abschnitt 'Betroffenenrechte' oder "
|
||||
"'Beschwerderecht'.",
|
||||
"example": "Fuer BMW AG (Sitz Muenchen, Bayern): BayLDA, Promenade 18, "
|
||||
"91522 Ansbach, www.lda.bayern.de",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Angemessenheitsbeschluss der Kommission": {
|
||||
"what": "Drittlandtransfer erwaehnt, aber kein Hinweis auf den "
|
||||
"konkreten Angemessenheitsbeschluss / DPF / SCC.",
|
||||
"why": "Art. 13(1)(f) DSGVO: bei Drittlandtransfer muss der "
|
||||
"Transfer-Mechanismus benannt sein (Angemessenheitsbeschluss, "
|
||||
"Standardvertragsklauseln, BCR, Ausnahmen nach Art. 49).",
|
||||
"fix_text": "Fuer Datenuebermittlungen in die USA stuetzen wir uns auf "
|
||||
"den Angemessenheitsbeschluss der EU-Kommission vom "
|
||||
"10.07.2023 (EU-US Data Privacy Framework / DPF). Sofern "
|
||||
"der Empfaenger DPF-zertifiziert ist, ist die Uebermittlung "
|
||||
"rechtlich abgesichert. Bei nicht-zertifizierten Empfaengern "
|
||||
"ergaenzen wir EU-Standardvertragsklauseln (SCC) gemaess "
|
||||
"Durchfuehrungsbeschluss 2021/914.",
|
||||
"where": "Datenschutzerklaerung im Abschnitt 'Drittlandtransfer' oder "
|
||||
"'Internationale Datenuebermittlung'.",
|
||||
"example": "Konkret: Google LLC (USA) ist DPF-zertifiziert "
|
||||
"(Zertifikat einsehbar unter dataprivacyframework.gov).",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Anschrift des Verantwortlichen": {
|
||||
"what": "In der Cookie-Richtlinie fehlt die vollstaendige Anschrift.",
|
||||
"why": "Art. 13(1)(a) DSGVO: der Verantwortliche muss eindeutig "
|
||||
"identifizierbar sein. Cookie-Richtlinie + DSE muessen "
|
||||
"konsistente Angaben enthalten.",
|
||||
"fix_text": "Verantwortlich fuer die Datenverarbeitung im Sinne der "
|
||||
"DSGVO ist:\n [Firmenname]\n [Strasse + Hausnummer]\n "
|
||||
"[PLZ + Ort]\n [Land]\n E-Mail: [...]",
|
||||
"where": "Cookie-Richtlinie am Anfang ODER Verweis-Link zur DSE.",
|
||||
"example": "Bayerische Motoren Werke Aktiengesellschaft, Petuelring 130, "
|
||||
"80809 Muenchen, Deutschland",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Konkrete Cookie-Namen aufgelistet": {
|
||||
"what": "Cookies werden allgemein erwaehnt aber ohne konkrete Namen + "
|
||||
"Speicherdauer.",
|
||||
"why": "DSK-Orientierungshilfe Telemedien: Auflistung jedes einzelnen "
|
||||
"Cookies mit Name. Generische Aussagen ('wir nutzen "
|
||||
"Werbe-Cookies') sind unzureichend.",
|
||||
"fix_text": "Erweitern Sie die Cookie-Tabelle um die Spalten:\n"
|
||||
" Name | Anbieter | Zweck | Speicherdauer\n\n"
|
||||
"Browser-Devtools (Application > Cookies) zeigt die "
|
||||
"tatsaechlich gesetzten Namen — bitte Cookie-Liste "
|
||||
"regelmaessig synchronisieren.",
|
||||
"where": "Cookie-Richtlinie im Abschnitt 'Verwendete Cookies'.",
|
||||
"example": "_ga | Google Ireland | Statistik (Besucher-ID) | 2 Jahre\n"
|
||||
"_fbp | Meta Platforms IE | Werbung (Browser-ID) | 90 Tage",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Konkrete Speicherdauern pro Cookie": {
|
||||
"what": "Speicherdauer nur pauschal oder als generischer Bereich.",
|
||||
"why": "Art. 13(2)(a) DSGVO: konkrete Speicherdauer oder Kriterien "
|
||||
"fuer die Festlegung. DSK fordert Speicherdauer PRO Cookie.",
|
||||
"fix_text": "Pro Cookie eine konkrete Zeitangabe in der Cookie-Tabelle "
|
||||
"ergaenzen: 'Session', '24 Stunden', '90 Tage', '2 Jahre'.",
|
||||
"where": "Cookie-Richtlinie in der Cookie-Tabelle.",
|
||||
"example": "JSESSIONID: Session — _ga: 2 Jahre — _fbp: 90 Tage",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Opt-Out-Links pro Drittanbieter": {
|
||||
"what": "Pro Drittanbieter ist kein direkter Opt-Out-Link angegeben.",
|
||||
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit. EuGH Planet49 "
|
||||
"(C-673/17) bestaetigt: Widerruf muss so einfach wie Einwilligung sein.",
|
||||
"fix_text": "Pro Drittanbieter eine Spalte 'Opt-Out' ergaenzen mit "
|
||||
"direktem Link. Alternativ: zentralen 'Cookie-"
|
||||
"Einstellungen aendern'-Button im Footer der Webseite + "
|
||||
"Hinweis darauf in der Cookie-Richtlinie.",
|
||||
"where": "Cookie-Richtlinie pro Vendor-Eintrag oder als zentraler "
|
||||
"Abschnitt 'Wie kann ich widersprechen?'.",
|
||||
"example": "Google Analytics: https://tools.google.com/dlpage/gaoptout\n"
|
||||
"Meta Pixel: ueber Facebook-Konto-Einstellungen",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Privacy-Policy-Links pro Drittanbieter": {
|
||||
"what": "Pro Drittanbieter ist kein direkter Datenschutz-Link gesetzt.",
|
||||
"why": "Art. 13(2)(f) DSGVO Transparenzgebot — Nutzer muss die "
|
||||
"Datenverarbeitung beim Drittanbieter eigenverantwortlich "
|
||||
"nachvollziehen koennen.",
|
||||
"fix_text": "Pro Drittanbieter Link auf dessen Datenschutzerklaerung "
|
||||
"ergaenzen. Tabelle 'Anbieter | Datenschutz-Link'.",
|
||||
"where": "Cookie-Richtlinie im Drittanbieter-Listing.",
|
||||
"example": "Adobe Analytics: https://www.adobe.com/privacy/policy.html",
|
||||
"severity": "medium",
|
||||
},
|
||||
|
||||
"Rechtswidriger Haftungsausschluss fuer Links": {
|
||||
"what": "Klassischer Link-Disclaimer ('Wir distanzieren uns von verlinkten "
|
||||
"Inhalten') ist im Impressum.",
|
||||
"why": "BGH I ZR 317/01: solche Disclaimer sind rechtlich wirkungslos. "
|
||||
"Sie befreien NICHT von der Stoererhaftung und koennen sogar "
|
||||
"den gegenteiligen Effekt haben (Anerkennung der eigenen "
|
||||
"Pruefpflicht).",
|
||||
"fix_text": "Entfernen Sie pauschale Link-Disclaimer vollstaendig aus "
|
||||
"dem Impressum. Bei Bedarf ein knapper, zutreffender Satz:\n"
|
||||
" 'Fuer den Inhalt verlinkter externer Webseiten ist "
|
||||
"ausschliesslich deren Betreiber verantwortlich.'",
|
||||
"where": "Impressum am Ende des Dokuments.",
|
||||
"example": "Statt 'Wir distanzieren uns ausdruecklich von allen "
|
||||
"Inhalten verlinkter Seiten' — einfach nichts schreiben.",
|
||||
"severity": "low",
|
||||
},
|
||||
|
||||
"Verbraucherstreitbeilegung / OS-Plattform": {
|
||||
"what": "Kein Link zur EU-OS-Plattform bzw. keine Aussage zur "
|
||||
"Streitbeilegung.",
|
||||
"why": "Art. 14 EU-VO 524/2013 (B2C-Anbieter, Online-Verkauf): "
|
||||
"klickbarer Link auf https://ec.europa.eu/consumers/odr "
|
||||
"PFLICHT. §36 VSBG: Aussage ob teilnehmend oder nicht.",
|
||||
"fix_text": "Die EU-Kommission stellt eine Plattform zur Online-"
|
||||
"Streitbeilegung (OS) bereit, die Sie unter "
|
||||
"<a href='https://ec.europa.eu/consumers/odr'>"
|
||||
"https://ec.europa.eu/consumers/odr</a> erreichen.\n\n"
|
||||
"Wir sind nicht bereit oder verpflichtet, an "
|
||||
"Streitbeilegungsverfahren vor einer "
|
||||
"Verbraucherschlichtungsstelle teilzunehmen.",
|
||||
"where": "Impressum am Ende, eigener Abschnitt 'Streitbeilegung'.",
|
||||
"example": "Standardformulierung anwendbar fuer alle B2C-Anbieter ohne "
|
||||
"ODR-Teilnahme.",
|
||||
"severity": "high",
|
||||
},
|
||||
|
||||
"Name der vertretungsberechtigten Person": {
|
||||
"what": "Vertretungsberechtigte Person ist nicht namentlich mit "
|
||||
"Funktionsbezeichnung genannt.",
|
||||
"why": "§5 Abs. 1 Nr. 1 TMG: bei juristischen Personen sind die "
|
||||
"Vertretungsberechtigten namentlich zu nennen.",
|
||||
"fix_text": "Ergaenzen Sie nach der Firmenbezeichnung:\n"
|
||||
" 'Vertreten durch: [Vorstand / Geschaeftsfuehrung] "
|
||||
"[Vorname Nachname]'",
|
||||
"where": "Impressum direkt nach Firmenname + Anschrift.",
|
||||
"example": "Vertreten durch: Vorstandsvorsitzender Oliver Zipse",
|
||||
"severity": "high",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def recipe_for(finding_key: str) -> ActionRecipe | None:
    """Look up a recipe by finding key. Vendor and doc findings share one lookup."""
    if finding_key in VENDOR_FINDINGS:
        return VENDOR_FINDINGS[finding_key]
    if finding_key in DOC_CHECK_FINDINGS:
        return DOC_CHECK_FINDINGS[finding_key]
    # Fuzzy match on doc findings (the label can vary)
    fk = finding_key.lower()
    if not fk:
        # An empty key is a substring of every label; never fuzzy-match it
        return None
    for k, v in DOC_CHECK_FINDINGS.items():
        if k.lower() in fk or fk in k.lower():
            return v
    return None
|
||||
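The two-stage lookup in `recipe_for` (exact key first, then a case-insensitive substring match on the doc-finding labels) can be sketched in isolation as follows. This is a minimal illustration, not the module itself: the tables `VENDOR` and `DOCS` and the function name `lookup` are hypothetical stand-ins, and the guard against empty keys is an assumption of this sketch.

```python
from typing import Optional

# Hypothetical stand-ins for the two recipe tables in the diff
VENDOR = {"no_opt_out_url": {"severity": "high"}}
DOCS = {"Konkrete Cookie-Namen aufgelistet": {"severity": "high"}}


def lookup(key: str) -> Optional[dict]:
    """Exact match first, then case-insensitive substring match on doc labels."""
    for table in (VENDOR, DOCS):
        if key in table:
            return table[key]
    needle = key.lower()
    if not needle:
        # Assumption of this sketch: an empty key never fuzzy-matches
        return None
    for label, recipe in DOCS.items():
        hay = label.lower()
        if hay in needle or needle in hay:
            return recipe
    return None
```

Because the fuzzy stage returns the first label that matches in dict order, short keys can hit an unintended recipe. A caller that needs stricter behaviour would probably want a minimum key length or a token-based comparison.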
@@ -0,0 +1,309 @@
|
||||
"""
|
||||
MC Embedding Match — semantic fallback for the regex-based doc_check.
|
||||
|
||||
The Sonnet classifier filtered MCs to `check_type='text'` (matchable
|
||||
against doc text). But the regex matcher is still too strict — BMW
|
||||
writes "Speicherdauer 2 Jahre", the MC pattern expects
|
||||
"\\d+\\s*(Tag|Jahr)". We catch these via BGE-M3 embeddings + cosine
|
||||
similarity:
|
||||
|
||||
1. Embed the MC's title + check_question (once, cached in sidecar)
2. Embed the doc text in 50-word chunks (stride 30, so windows overlap)
3. max over chunks of cosine(MC, chunk) ≥ threshold → MC passes via "semantic"
|
||||
|
||||
This recovers ~50% of failed MCs at BMW-scale (estimated).
|
||||
|
||||
Embeddings come from bp-core-embedding-service (BGE-M3, 1024-dim,
|
||||
multilingual). Sidecar SQLite stores 1024 × 4 bytes = 4KB per MC.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import sqlite3
|
||||
import struct
|
||||
from typing import Iterable
|
||||
|
||||
import httpx
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
|
||||
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||
DIM = 1024 # BGE-M3
|
||||
SIMILARITY_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD", "0.55"))
|
||||
CHUNK_SIZE_WORDS = 50
|
||||
CHUNK_STRIDE = 30 # overlap so multi-sentence MCs aren't cut
|
||||
|
||||
# Short mandatory fields (Impressum: HRB no., USt-IdNr, postal address) get
# lost in 50-word chunks. We ADDITIONALLY scan the doc with finer
# 15-word windows and a looser threshold for Impressum/AVV doc types.
|
||||
SHORT_FIELD_CHUNK_WORDS = 15
|
||||
SHORT_FIELD_STRIDE = 8
|
||||
SHORT_FIELD_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD_SHORT", "0.50"))
|
||||
SHORT_FIELD_DOC_TYPES = {"impressum", "avv"}
|
||||
|
||||
# Doc-type-specific threshold overrides, calibrated on the BMW v7
# run: at 0.55, DSE+Cookie sat systematically at 93% (over-firing because
# 8000-word texts vaguely match everything). 0.60 brings that down to the
# realistic ~80%. Impressum has only 6 real MCs plus the short-field
# rescue, so 0.50 is fine there.
|
||||
THRESHOLD_OVERRIDE = {
|
||||
"impressum": 0.50,
|
||||
"avv": 0.55,
|
||||
"dse": 0.60,
|
||||
"cookie": 0.60,
|
||||
"widerruf": 0.58,
|
||||
"loeschkonzept": 0.55,
|
||||
"dsfa": 0.55,
|
||||
}
|
||||
|
||||
|
||||
def _ensure_schema() -> None:
|
||||
"""Add embedding column to mc_classification if not present."""
|
||||
try:
|
||||
with sqlite3.connect(SIDECAR_DB) as c:
|
||||
cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
|
||||
if "embedding" not in cols:
|
||||
c.execute("ALTER TABLE mc_classification ADD COLUMN embedding BLOB")
|
||||
logger.info("Added embedding column to mc_classification")
|
||||
except Exception as e:
|
||||
logger.warning("Embedding schema migration skipped: %s", e)
|
||||
|
||||
|
||||
def _vec_to_blob(v: list[float]) -> bytes:
|
||||
return struct.pack(f"{len(v)}f", *v)
|
||||
|
||||
|
||||
def _blob_to_vec(b: bytes) -> list[float]:
|
||||
return list(struct.unpack(f"{len(b)//4}f", b))
|
||||
|
||||
|
||||
EMBED_BATCH_SIZE = 32
|
||||
|
||||
|
||||
async def _embed_texts(texts: list[str], timeout: float = 120.0) -> list[list[float]]:
    """Call the central embedding-service in batches; returns one vector per input.

    BGE-M3 hangs / times out on >100 inputs at once on a CPU-only host,
    so we split the inputs into batches of EMBED_BATCH_SIZE and collect
    the results.
    """
    if not texts:
        return []
    out: list[list[float]] = []
    async with httpx.AsyncClient(timeout=timeout) as client:
        for i in range(0, len(texts), EMBED_BATCH_SIZE):
            batch = texts[i:i + EMBED_BATCH_SIZE]
            try:
                r = await client.post(
                    f"{EMBEDDING_URL}/embed", json={"texts": batch},
                )
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
                if len(vecs) != len(batch):
                    # Short or missing response: pad so callers can still align by index
                    logger.warning("Embed sub-batch [%d-%d] returned %d vectors",
                                   i, i + len(batch), len(vecs))
                    vecs = [[] for _ in batch]
                out.extend(vecs)
            except httpx.HTTPError as e:
                logger.warning("Embed sub-batch [%d-%d] failed: %s %s",
                               i, i + len(batch), type(e).__name__, e)
                # Pad with empty vectors so caller can still align by index
                out.extend([[] for _ in batch])
    return out
|
||||
|
||||
|
||||
async def ensure_mc_embeddings(batch_size: int = 64, force: bool = False) -> int:
|
||||
"""One-shot: embed every text-MC missing an embedding. Returns count.
|
||||
|
||||
Embeds the title + (rough) check_question for each MC to give the
|
||||
BGE-M3 enough context. Title alone is too terse for the model to
|
||||
discriminate against full-paragraph doc text.
|
||||
|
||||
Idempotent — only fills NULL rows unless force=True. Safe to call on
|
||||
every run.
|
||||
"""
|
||||
_ensure_schema()
|
||||
# Pull check_question from the PG source table once per call (needs
|
||||
# context that's not in the sidecar)
|
||||
try:
|
||||
import psycopg2
|
||||
pg = psycopg2.connect(os.environ["DATABASE_URL"])
|
||||
with pg.cursor() as c:
|
||||
c.execute("SELECT control_id, doc_type, title, check_question "
|
||||
"FROM compliance.doc_check_controls")
|
||||
pg_rows = c.fetchall()
|
||||
pg.close()
|
||||
pg_lookup = {(r[0], r[1] or ""): (r[2] or "", r[3] or "") for r in pg_rows}
|
||||
except Exception as e:
|
||||
logger.warning("ensure_mc_embeddings PG load failed: %s", e)
|
||||
pg_lookup = {}
|
||||
|
||||
try:
|
||||
with sqlite3.connect(SIDECAR_DB) as c:
|
||||
where = ("WHERE check_type = 'text'" + ("" if force else " AND embedding IS NULL"))
|
||||
rows = c.execute(
|
||||
f"SELECT control_id, doc_type, title FROM mc_classification {where}"
|
||||
).fetchall()
|
||||
except Exception as e:
|
||||
logger.warning("ensure_mc_embeddings query failed: %s", e)
|
||||
return 0
|
||||
|
||||
if not rows:
|
||||
return 0
|
||||
|
||||
logger.info("Embedding %d text-MCs (force=%s) via %s ...",
|
||||
len(rows), force, EMBEDDING_URL)
|
||||
done = 0
|
||||
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Compose "title. check_question" so the embedding captures both
        # the topic (title) and the concrete check phrasing (question).
        # That helps BMW's actual policy language land in the same vector
        # neighbourhood as our control wording.
        texts: list[str] = []
        for cid, dt, t in batch:
            title_text, question = pg_lookup.get((cid, dt or ""), (t or "", ""))
            combined = f"{title_text}. {question}".strip()
            texts.append(combined[:600])
        try:
            embs = await _embed_texts(texts)
        except Exception as e:
            logger.warning("Embed batch failed (i=%d): %s", i, e)
            continue
        with sqlite3.connect(SIDECAR_DB) as c:
            for (cid, dt, _t), vec in zip(batch, embs):
                if not vec or len(vec) != DIM:
                    continue
                c.execute(
                    "UPDATE mc_classification SET embedding = ? "
                    "WHERE control_id = ? AND doc_type = ?",
                    (_vec_to_blob(vec), cid, dt),
                )
                done += 1  # count only rows that actually received a vector
            c.commit()
    logger.info("ensure_mc_embeddings: filled %d/%d", done, len(rows))
    return done
|
||||
|
||||
|
||||
def _chunk_text(text: str, size: int = CHUNK_SIZE_WORDS,
                stride: int = CHUNK_STRIDE) -> list[str]:
    """Sliding-window chunking; the overlap helps catch MCs that span 2 sentences."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    out: list[str] = []
    i = 0
    while i < len(words):
        out.append(" ".join(words[i:i + size]))
        i += stride
    return out
|
||||
|
||||
|
||||
def _cosine(a: list[float], b: list[float]) -> float:
    """Plain Python cosine; fast enough for our scale, no numpy import."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
|
||||
|
||||
|
||||
async def embedding_match(
|
||||
doc_text: str,
|
||||
mc_records: Iterable[dict],
|
||||
doc_type: str | None = None,
|
||||
threshold: float | None = None,
|
||||
) -> set[str]:
|
||||
"""Return the subset of MC control_ids that semantically match doc_text.
|
||||
|
||||
For Impressum/AVV-types we ADDITIONALLY scan the doc with smaller
|
||||
15-word windows and a looser threshold so that short Pflichtfelder
|
||||
(HRB, USt-IdNr, postal address) land in their own chunk and aren't
|
||||
diluted by 50-word neighbourhoods of unrelated text.
|
||||
"""
|
||||
if not doc_text or not mc_records:
|
||||
return set()
|
||||
candidates = list(mc_records)
|
||||
if not candidates:
|
||||
return set()
|
||||
|
||||
cid_set = {c.get("control_id") for c in candidates if c.get("control_id")}
|
||||
if not cid_set:
|
||||
return set()
|
||||
|
||||
try:
|
||||
with sqlite3.connect(SIDECAR_DB) as c:
|
||||
placeholders = ",".join("?" * len(cid_set))
|
||||
q = ("SELECT control_id, embedding FROM mc_classification "
|
||||
f"WHERE control_id IN ({placeholders}) "
|
||||
"AND check_type='text' AND embedding IS NOT NULL")
|
||||
params = list(cid_set)
|
||||
if doc_type:
|
||||
q += " AND doc_type = ?"
|
||||
params.append(doc_type)
|
||||
rows = c.execute(q, params).fetchall()
|
||||
except Exception as e:
|
||||
logger.warning("embedding lookup failed: %s", e)
|
||||
return set()
|
||||
if not rows:
|
||||
return set()
|
||||
mc_embeddings = {cid: _blob_to_vec(blob) for cid, blob in rows}
|
||||
|
||||
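    # An explicit threshold wins, including 0.0; `threshold or ...` would drop it
    effective_threshold = (
        threshold if threshold is not None
        else THRESHOLD_OVERRIDE.get((doc_type or "").lower(), SIMILARITY_THRESHOLD)
    )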
|
||||
|
||||
chunks = _chunk_text(doc_text)
|
||||
if not chunks:
|
||||
return set()
|
||||
try:
|
||||
chunk_vecs = await _embed_texts(chunks)
|
||||
except Exception as e:
|
||||
logger.warning("doc chunk embedding failed: %s %s",
|
||||
type(e).__name__, e or "(empty msg)", exc_info=True)
|
||||
return set()
|
||||
# Filter empty vectors (failed sub-batches return [] placeholders)
|
||||
chunk_vecs = [v for v in chunk_vecs if v and len(v) == DIM]
|
||||
if not chunk_vecs:
|
||||
logger.warning("doc chunk embedding: no usable vectors (all batches failed)")
|
||||
return set()
|
||||
|
||||
matched: set[str] = set()
|
||||
for cid, mc_vec in mc_embeddings.items():
|
||||
best = max((_cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
|
||||
if best >= effective_threshold:
|
||||
matched.add(cid)
|
||||
|
||||
# Short-field rescue pass for Impressum-type docs: small windows +
|
||||
# looser threshold catch one-line Pflichtfelder that 50-word chunks
|
||||
# dilute (HRB-Nr, USt-IdNr, postal address). Only runs for MCs not
|
||||
# yet matched in the main pass.
|
||||
if (doc_type or "").lower() in SHORT_FIELD_DOC_TYPES:
|
||||
unmatched = {cid: vec for cid, vec in mc_embeddings.items() if cid not in matched}
|
||||
if unmatched:
|
||||
short_chunks = _chunk_text(doc_text, size=SHORT_FIELD_CHUNK_WORDS,
|
||||
stride=SHORT_FIELD_STRIDE)
|
||||
try:
|
||||
short_vecs = await _embed_texts(short_chunks)
|
||||
except Exception as e:
|
||||
logger.warning("short-chunk embedding failed: %s", e)
|
||||
short_vecs = []
|
||||
if short_vecs:
|
||||
short_passes = 0
|
||||
for cid, mc_vec in unmatched.items():
|
||||
best = max((_cosine(mc_vec, cv) for cv in short_vecs), default=0.0)
|
||||
if best >= SHORT_FIELD_THRESHOLD:
|
||||
matched.add(cid)
|
||||
short_passes += 1
|
||||
if short_passes:
|
||||
logger.info(
|
||||
"embedding short-field rescue for %s: +%d MCs (threshold %.2f, %d chunks)",
|
||||
doc_type, short_passes, SHORT_FIELD_THRESHOLD, len(short_chunks),
|
||||
)
|
||||
|
||||
logger.info(
|
||||
"embedding match for %s: %d/%d MCs passed semantic threshold (main=%.2f)",
|
||||
doc_type or "?", len(matched), len(mc_embeddings), effective_threshold,
|
||||
)
|
||||
return matched
|
||||
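The matching rule used by `embedding_match` (overlapping word windows, then the best cosine score across all windows compared against a threshold) can be illustrated with a small self-contained sketch. The function `toy_embed` is a hypothetical stand-in that counts terms over a tiny vocabulary; the real module calls the BGE-M3 embedding service instead, so the scores here only demonstrate the control flow, not realistic similarity values.

```python
import math
import re


def chunk_words(text: str, size: int = 50, stride: int = 30) -> list[str]:
    """Split text into overlapping word windows."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(0, len(words), stride)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for empty, mismatched, or zero-norm vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text: str, vocab: list[str]) -> list[float]:
    """Hypothetical stand-in for the embedding service: term counts over a vocab."""
    tokens = re.findall(r"\w+", text.lower())
    return [float(tokens.count(term)) for term in vocab]


def semantic_pass(mc_vec: list[float], chunk_vecs: list[list[float]],
                  threshold: float) -> bool:
    """An MC passes if its best-matching chunk reaches the threshold."""
    best = max((cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
    return best >= threshold


vocab = ["speicherdauer", "jahre", "impressum"]
mc_vec = toy_embed("Speicherdauer in Jahre", vocab)
doc = "Die Speicherdauer betraegt 2 Jahre"
chunk_vecs = [toy_embed(c, vocab) for c in chunk_words(doc)]
print(semantic_pass(mc_vec, chunk_vecs, threshold=0.60))
```

Taking the maximum over chunks means a single well-matching window is enough for a pass. That is presumably why long documents over-fire at a low threshold, as the calibration comment on the per-doc-type overrides describes.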
@@ -101,11 +101,36 @@ def build_scorecard(check_results: list[dict]) -> dict:
|
||||
}
|
||||
|
||||
|
||||
_DEDUP_KEYWORDS = [
|
||||
"einfache sprache", "verstaendliche sprache", "verständliche sprache",
|
||||
"klare sprache", "einwilligungstexte", "einwilligungsaufforderung",
|
||||
"einwilligungserklaerung", "einwilligungserklärung",
|
||||
"mehrdeutige", "verstaendliche form", "verständliche form",
|
||||
"fachbegriffe erklaeren", "fachbegriffe erklären",
|
||||
]
|
||||
|
||||
|
||||
def _dedup_key(label: str) -> str:
    """Map a label to a stable dedup key.

    If the label contains one of the well-known repetitive
    plain-language / consent-request concepts, collapse it to that
    single concept. Otherwise return the original label.
    """
    lowered = (label or "").lower()
    for kw in _DEDUP_KEYWORDS:
        if kw in lowered:
            return f"_dup:{kw}"
    return label
|
||||
|
||||
|
||||
def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
|
||||
"""Return top-N failing MCs sorted by severity then label.
|
||||
|
||||
Skipped + passed MCs are excluded. INFO severity is excluded by
|
||||
default since those are guidance, not findings.
|
||||
|
||||
Near-duplicates (multiple MCs that all complain about "einfache
Sprache" / "Einwilligungsaufforderung" / ...) are collapsed to ONE
representative entry. Otherwise UI-language hints dominate the
top list and real leaks get buried.
|
||||
"""
|
||||
fails = [
|
||||
r for r in (check_results or [])
|
||||
@@ -116,7 +141,17 @@ def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
|
||||
_SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
|
||||
r.get("label", ""),
|
||||
))
|
||||
-    return fails[:n]
+    seen_keys: set[str] = set()
+    deduped: list[dict] = []
+    for r in fails:
+        k = _dedup_key(r.get("label", ""))
+        if k in seen_keys:
+            continue
+        seen_keys.add(k)
+        deduped.append(r)
+        if len(deduped) >= n:
+            break
+    return deduped
|
||||
|
||||
|
||||
def full_audit_records(
|
||||
|
||||
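The near-duplicate collapsing in `top_fails` can be sketched on its own as below. The keyword list is shortened and the names `DEDUP_KEYWORDS`, `dedup_key`, and `top_unique` are illustrative, not the module's. One deliberate difference, flagged as an assumption: the limit is checked before appending, so `n=0` yields an empty list.

```python
# Shortened, illustrative keyword list (the real list in the diff is longer)
DEDUP_KEYWORDS = ["einfache sprache", "einwilligungsaufforderung"]


def dedup_key(label: str) -> str:
    """Collapse labels that share a known keyword onto one key."""
    lowered = (label or "").lower()
    for kw in DEDUP_KEYWORDS:
        if kw in lowered:
            return f"_dup:{kw}"
    return label


def top_unique(fails: list[dict], n: int = 10) -> list[dict]:
    """Keep the first entry per dedup key, up to n entries, preserving order."""
    seen: set[str] = set()
    out: list[dict] = []
    for row in fails:
        if len(out) >= n:
            break
        key = dedup_key(row.get("label", ""))
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out


fails = [
    {"label": "Einfache Sprache im Banner"},
    {"label": "Einfache Sprache in der DSE"},
    {"label": "Opt-Out-Link fehlt"},
]
print([r["label"] for r in top_unique(fails)])
```

Since the input is already sorted by severity, keeping the first entry per key means the most severe representative of each cluster survives.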
@@ -37,6 +37,7 @@ async def check_document_with_controls(
|
||||
db_url: str = "",
|
||||
max_controls: int = 0, # 0 = no limit, check ALL
|
||||
use_agent: bool = False, # Use LLM agent for intelligent evaluation
|
||||
business_scope: set[str] | None = None,
|
||||
) -> list[dict]:
|
||||
"""Check document against ALL doc_check_controls for this doc_type.
|
||||
|
||||
@@ -56,7 +57,7 @@ async def check_document_with_controls(
|
||||
mapped_type = _map_doc_type(doc_type)
|
||||
|
||||
# Load ALL controls for this doc_type
|
||||
-    controls = await _load_controls(mapped_type, db_url, max_controls)
+    controls = await _load_controls(mapped_type, db_url, max_controls, business_scope)
|
||||
if not controls:
|
||||
logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
|
||||
return []
|
||||
@@ -71,6 +72,31 @@ async def check_document_with_controls(
|
||||
if result:
|
||||
results.append(result)
|
||||
|
||||
# Semantic fallback (Phase 3): MCs that failed via regex get a second
|
||||
# chance via BGE-M3 cosine similarity. BMW writes "Speicherdauer 2
|
||||
# Jahre" — the regex misses, embedding catches it.
|
||||
failed_ids = {r.get("control_id") for r in results
|
||||
if not r.get("passed") and r.get("control_id")}
|
||||
if failed_ids:
|
||||
try:
|
||||
from compliance.services.mc_embedding_matcher import (
|
||||
ensure_mc_embeddings, embedding_match,
|
||||
)
|
||||
await ensure_mc_embeddings() # idempotent: only embeds new MCs
|
||||
failed_mcs = [c for c in controls if c.get("control_id") in failed_ids]
|
||||
semantic_passes = await embedding_match(
|
||||
text, failed_mcs, doc_type=mapped_type,
|
||||
)
|
||||
if semantic_passes:
|
||||
for r in results:
|
||||
cid = r.get("control_id")
|
||||
if cid and cid in semantic_passes and not r.get("passed"):
|
||||
r["passed"] = True
|
||||
r["matched_text"] = "[semantischer Treffer via Embedding]"
|
||||
r["hint"] = (r.get("hint") or "") + " (passed via Embedding-Match, BGE-M3 cosine)"
|
||||
except Exception as e:
|
||||
logger.warning("Embedding fallback skipped: %s", e, exc_info=True)
|
||||
|
||||
passed = sum(1 for r in results if r["passed"])
|
||||
failed_results = [r for r in results if not r["passed"]]
|
||||
logger.info("MC results: %d passed, %d failed out of %d for '%s'",
|
||||
@@ -161,6 +187,7 @@ def _check_mc_deterministic(text_lower: str, mc: dict) -> Optional[dict]:
|
||||
|
||||
return {
|
||||
"id": f"mc-{control_id}",
|
||||
"control_id": control_id,
|
||||
"label": mc.get("title", "")[:80],
|
||||
"passed": passed,
|
||||
"severity": severity,
|
||||
@@ -266,11 +293,72 @@ _MC_ALIAS_FALLBACK = {
|
||||
}
|
||||
|
||||
|
||||
-async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
|
||||
def _load_text_only_ids(
|
||||
doc_type: str | None = None,
|
||||
business_scope: set[str] | None = None,
|
||||
) -> set[str]:
|
||||
"""Return control_ids that the Sonnet-classifier flagged as 'text'.
|
||||
|
||||
Filters applied:
|
||||
1. check_type='text' (only doc-text-matchable MCs)
|
||||
2. doc_type matches (per-doc-type variant from v2-Sidecar)
|
||||
3. fits_doc_type=1 (LLM auditor approved this MC for this doc_type)
|
||||
4. scope_requires NULL or contained in business_scope
|
||||
(e.g. MCs with scope_requires='biometric_processing' are skipped
on sites that don't do biometric processing; the Art. 22 FRT MC
was a false positive for BMW)
|
||||
|
||||
`business_scope` comes from the business_profiler (set of detected
|
||||
site characteristics like 'b2c', 'shop', 'biometric_processing',
|
||||
'ai_decision_making', 'child_targeting').
|
||||
|
||||
Returns empty set if the sidecar doesn't exist yet.
|
||||
"""
|
||||
import sqlite3
|
||||
db_path = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||
try:
|
||||
with sqlite3.connect(db_path) as c:
|
||||
cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
|
||||
has_fit = "fits_doc_type" in cols
|
||||
has_scope = "scope_requires" in cols
|
||||
fit_clause = " AND (fits_doc_type IS NULL OR fits_doc_type = 1)" if has_fit else ""
|
||||
base = ("SELECT control_id, scope_requires FROM mc_classification "
|
||||
"WHERE check_type = 'text'" + fit_clause) if has_scope else (
|
||||
"SELECT control_id, NULL FROM mc_classification "
|
||||
"WHERE check_type = 'text'" + fit_clause)
|
||||
params: list = []
|
||||
if doc_type:
|
||||
base += " AND doc_type = ?"
|
||||
params.append(doc_type)
|
||||
rows = c.execute(base, params).fetchall()
|
||||
scope = business_scope or set()
|
||||
keep: set[str] = set()
|
||||
for cid, req in rows:
|
||||
if not req:
|
||||
keep.add(cid)
|
||||
else:
|
||||
# Multiple requirements separated by '|' — ALL must
|
||||
# be in scope to include. Empty req tokens are skipped.
|
||||
needed = {r.strip().lower() for r in req.split("|") if r.strip()}
|
||||
if needed.issubset({s.lower() for s in scope}):
|
||||
keep.add(cid)
|
||||
return keep
|
||||
except sqlite3.OperationalError:
|
||||
return set()
|
||||
except Exception as e:
|
||||
logger.warning("MC classification lookup failed: %s", e)
|
||||
return set()
|
||||
|
||||
|
||||
async def _load_controls(doc_type: str, db_url: str, limit: int,
|
||||
business_scope: set[str] | None = None) -> list[dict]:
|
||||
"""Load all doc_check_controls for a doc_type from PostgreSQL.
|
||||
|
||||
Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
|
||||
type (e.g. 'nutzungsbedingungen' -> 'agb').
|
||||
|
||||
Filters to only check_type='text' MCs when the classification sidecar
|
||||
is present — process/review MCs are routed to other modules.
|
||||
"""
|
||||
try:
|
||||
import asyncpg
|
||||
@@ -297,7 +385,17 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
             fallback = _MC_ALIAS_FALLBACK[doc_type]
             logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
             rows = await conn.fetch(query, fallback)
-        return [dict(r) for r in rows]
+
+        controls = [dict(r) for r in rows]
+        text_only = _load_text_only_ids(doc_type, business_scope)
+        if text_only:
+            before = len(controls)
+            controls = [c for c in controls if c.get("control_id") in text_only]
+            logger.info(
+                "MC filter (text only) for %s: %d/%d MCs after Sonnet check_type filter",
+                doc_type, len(controls), before,
+            )
+        return controls
     except Exception as e:
         logger.warning("MC query failed: %s", e)
         return []
|
||||
|
||||
@@ -0,0 +1,407 @@
|
||||
"""
|
||||
Vendor cost estimator: infers a pricing tier per vendor from cookie
signals and returns an intensity-based estimate of annual cost.

Cookie signals evaluated:
  - Number of cookies per vendor (proxy for module depth)
  - Premium-feature cookies (e.g. 's_target_qa', '_ab_test' → enterprise add-on)
  - Edge/region cookies (multi-region → premier-tier CDN)
  - Cookie persistence (multi-year → heavy-tracking license)

Plus business_profile for company-tier inference.

Output per vendor:
  - inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
  - tier_signals : list of the indicators that led to the tier
  - cost_year_eur_range: (low, high) based on tier × vendor pricing
  - confidence: 'low' | 'medium' | 'high' ('none' when the vendor has
    no pricing entry)

This module complements vendor_redundancy.py: the simple low/high flat
rates there are replaced here by dynamic, signal-based values.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import re
|
||||
from typing import Iterable
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ─── Premium-feature cookies: indicator of enterprise add-ons ─────
#
# If a vendor sets these cookies, the customer is very likely
# on an enterprise plan.
|
||||
|
||||
_PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
|
||||
# (regex, vendor_key, premium_feature_label)
|
||||
(r"^s_target_qa$", "adobe analytics", "Adobe Target Add-on"),
|
||||
(r"adobe.*target", "adobe target", "Personalization Enterprise"),
|
||||
(r"^aam_uuid", "adobe analytics", "Audience Manager Enterprise"),
|
||||
(r"^s_ecid", "adobe analytics", "Experience Cloud ID Service"),
|
||||
(r"^_pcid_", "adobe analytics", "People-Based Destinations"),
|
||||
|
||||
(r"^_gat_gtag_UA", "google analytics", "GA360 Multi-Tracker"),
|
||||
(r"^_ga_[A-Z0-9]+_[A-Z0-9]+", "google analytics", "GA4 Enterprise Stream"),
|
||||
|
||||
(r"^_uetmsdns", "microsoft advertising", "Custom Conversion Tracking"),
|
||||
(r"^_fbp.*test", "meta pixel", "Conversions API Premium"),
|
||||
(r"^_pin_unauth_premium", "pinterest", "Pinterest Premium-API"),
|
||||
|
||||
(r"^afm", "adform", "Affinity-Module"),
|
||||
(r"^cto_dna", "criteo", "Dynamic Retargeting Premium"),
|
||||
|
||||
# CDN / Infra Premium
|
||||
(r"^aws-alb-[a-z0-9]+", "amazon web services", "ALB + Multi-Region"),
|
||||
(r"^aws-waf", "amazon web services", "WAF Enterprise"),
|
||||
(r"^cf_clearance", "cloudflare", "Bot-Management Pro"),
|
||||
(r"^akm_[a-z]+", "akamai", "Adaptive Media Delivery Enterprise"),
|
||||
|
||||
# Salesforce Customer-360
|
||||
(r"^bid_n_", "salesforce", "Marketing Cloud Personalization"),
|
||||
(r"^_cs_", "salesforce", "CDP Premium"),
|
||||
]
|
||||
|
||||
|
||||
# ─── Tier pricing per vendor (annual, EUR) ───────────────────────
#
# 4 tiers: starter (SME), professional (mid-market), enterprise (large
# corporation), premier (global brand / heavy user).
|
||||
|
||||
_TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
|
||||
"adobe analytics": {
|
||||
"starter": ( 10_000, 30_000),
|
||||
"professional": ( 60_000, 150_000),
|
||||
"enterprise": (200_000, 500_000),
|
||||
"premier": (500_000, 900_000),
|
||||
},
|
||||
"adobe target": {
|
||||
"starter": ( 8_000, 25_000),
|
||||
"professional": ( 40_000, 100_000),
|
||||
"enterprise": (120_000, 300_000),
|
||||
"premier": (300_000, 600_000),
|
||||
},
|
||||
"adobe campaign": {
|
||||
"starter": ( 10_000, 30_000),
|
||||
"professional": ( 40_000, 100_000),
|
||||
"enterprise": (120_000, 280_000),
|
||||
"premier": (280_000, 500_000),
|
||||
},
|
||||
"google analytics": {
|
||||
"starter": ( 0, 0), # GA4 free
|
||||
"professional": ( 0, 0),
|
||||
"enterprise": ( 80_000, 150_000), # GA360
|
||||
"premier": (150_000, 300_000),
|
||||
},
|
||||
"matomo": {
|
||||
"starter": ( 0, 3_000), # On-prem free / Cloud Starter
|
||||
"professional": ( 6_000, 20_000),
|
||||
"enterprise": ( 20_000, 80_000),
|
||||
"premier": ( 60_000, 150_000),
|
||||
},
|
||||
"content square": {
|
||||
"starter": ( 12_000, 40_000),
|
||||
"professional": ( 60_000, 150_000),
|
||||
"enterprise": (150_000, 350_000),
|
||||
"premier": (350_000, 700_000),
|
||||
},
|
||||
"contentsquare": {
|
||||
"starter": ( 12_000, 40_000),
|
||||
"professional": ( 60_000, 150_000),
|
||||
"enterprise": (150_000, 350_000),
|
||||
"premier": (350_000, 700_000),
|
||||
},
|
||||
"dynatrace": {
|
||||
"starter": ( 5_000, 15_000),
|
||||
"professional": ( 30_000, 80_000),
|
||||
"enterprise": (100_000, 300_000),
|
||||
"premier": (300_000, 800_000),
|
||||
},
|
||||
"qualtrics": {
|
||||
"starter": ( 6_000, 20_000),
|
||||
"professional": ( 30_000, 80_000),
|
||||
"enterprise": ( 80_000, 200_000),
|
||||
"premier": (200_000, 500_000),
|
||||
},
|
||||
|
||||
# Advertising / Retargeting (license + self-service; media spend counted SEPARATELY)
|
||||
"criteo": {
|
||||
"starter": ( 6_000, 20_000),
|
||||
"professional": ( 30_000, 80_000),
|
||||
"enterprise": ( 80_000, 250_000),
|
||||
"premier": (250_000, 600_000),
|
||||
},
|
||||
"adform": {
|
||||
"starter": ( 12_000, 40_000),
|
||||
"professional": ( 60_000, 150_000),
|
||||
"enterprise": (150_000, 400_000),
|
||||
"premier": (400_000, 800_000),
|
||||
},
|
||||
"outbrain": {
|
||||
"starter": ( 6_000, 20_000),
|
||||
"professional": ( 30_000, 80_000),
|
||||
"enterprise": ( 80_000, 200_000),
|
||||
"premier": (200_000, 500_000),
|
||||
},
|
||||
"taboola": {
|
||||
"starter": ( 6_000, 20_000),
|
||||
"professional": ( 30_000, 80_000),
|
||||
"enterprise": ( 80_000, 200_000),
|
||||
"premier": (200_000, 500_000),
|
||||
},
|
||||
"teads": {
|
||||
"starter": ( 6_000, 18_000),
|
||||
"professional": ( 20_000, 60_000),
|
||||
"enterprise": ( 60_000, 150_000),
|
||||
"premier": (150_000, 350_000),
|
||||
},
|
||||
"pinterest": {
|
||||
"starter": ( 3_000, 15_000),
|
||||
"professional": ( 15_000, 50_000),
|
||||
"enterprise": ( 50_000, 150_000),
|
||||
"premier": (150_000, 400_000),
|
||||
},
|
||||
"linkedin insight": {
|
||||
"starter": ( 3_000, 12_000),
|
||||
"professional": ( 12_000, 40_000),
|
||||
"enterprise": ( 40_000, 120_000),
|
||||
"premier": (120_000, 300_000),
|
||||
},
|
||||
|
||||
# CDN / Cloud
|
||||
"akamai": {
|
||||
"starter": ( 20_000, 60_000),
|
||||
"professional": ( 80_000, 200_000),
|
||||
"enterprise": (200_000, 500_000),
|
||||
"premier": (500_000, 1_500_000),
|
||||
},
|
||||
"amazon web services": {
|
||||
"starter": ( 12_000, 60_000),
|
||||
"professional": ( 60_000, 300_000),
|
||||
"enterprise": (300_000, 1_500_000),
|
||||
"premier": (1_500_000, 8_000_000),
|
||||
},
|
||||
"baqend": {
|
||||
"starter": ( 3_000, 12_000),
|
||||
"professional": ( 12_000, 40_000),
|
||||
"enterprise": ( 40_000, 120_000),
|
||||
"premier": (120_000, 300_000),
|
||||
},
|
||||
"speedkit": {
|
||||
"starter": ( 3_000, 12_000),
|
||||
"professional": ( 12_000, 40_000),
|
||||
"enterprise": ( 40_000, 120_000),
|
||||
"premier": (120_000, 300_000),
|
||||
},
|
||||
"speedcurve": {
|
||||
"starter": ( 1_200, 4_800),
|
||||
"professional": ( 6_000, 18_000),
|
||||
"enterprise": ( 18_000, 60_000),
|
||||
"premier": ( 60_000, 120_000),
|
||||
},
|
||||
|
||||
# CRM / Marketing
|
||||
"salesforce": {
|
||||
"starter": ( 20_000, 60_000),
|
||||
"professional": ( 80_000, 250_000),
|
||||
"enterprise": (250_000, 800_000),
|
||||
"premier": (800_000, 2_500_000),
|
||||
},
|
||||
"genesys": {
|
||||
"starter": ( 24_000, 80_000),
|
||||
"professional": ( 80_000, 250_000),
|
||||
"enterprise": (250_000, 800_000),
|
||||
"premier": (800_000, 2_000_000),
|
||||
},
|
||||
|
||||
# Captcha
|
||||
"hcaptcha": {
|
||||
"starter": ( 0, 2_400),
|
||||
"professional": ( 2_400, 12_000),
|
||||
"enterprise": ( 12_000, 40_000),
|
||||
"premier": ( 40_000, 100_000),
|
||||
},
|
||||
|
||||
# Lead-Tracking
|
||||
"salesviewer": {
|
||||
"starter": ( 1_200, 3_600),
|
||||
"professional": ( 3_600, 12_000),
|
||||
"enterprise": ( 12_000, 40_000),
|
||||
"premier": ( 40_000, 100_000),
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def _vendor_key(vendor_name: str) -> str | None:
    """Map a vendor name to a known pricing-table key."""
    n = (vendor_name or "").lower()
    for k in _TIER_PRICING:
        if k in n:
            return k
    return None
|
||||
|
||||
|
||||
def infer_company_tier(business_profile: dict | None) -> str:
    """Coarse company-tier from business profile.

    Used as the baseline when vendor-specific signals are weak.
    """
    if not business_profile:
        return "professional"
    bp = business_profile
    features = {f.lower() for f in (bp.get("features") or [])}
    btype = (bp.get("type") or "").lower()
    # Heavy enterprise-only signals
    if any(f in features for f in ("multi_country", "konzern", "enterprise",
                                   "international", "automotive", "banking",
                                   "luxury", "premium")):
        return "premier"
    # Large but maybe single-country
    if "shop" in features or "konfigurator" in features or btype == "b2c":
        return "enterprise"
    return "professional"
|
||||
|
||||
|
||||
def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
    """Infer pricing tier for a single vendor from its cookie footprint.

    Starts from company_tier and applies these signals (the tier is
    capped at 'premier'):
      - cookie_count >= 60                          → +1 tier
      - cookie_count >= 30                          → recorded as a signal only
      - each matching premium-feature pattern       → +1 tier
      - >= 60% third-party cookies (>= 10 cookies)  → signal only
      - >= 3 cookies with expiry >= 2 years         → signal only

    Signal-only entries do not move the tier, but they are returned in
    the signal list that callers use for the confidence rating.
    """
    cookies = vendor.get("cookies") or []
    n_cookies = len(cookies)
    cookie_names = [(c.get("name") or "").lower() for c in cookies]
    signals: list[str] = []

    base_tiers = ["starter", "professional", "enterprise", "premier"]
    # Start at company-tier as baseline
    idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1

    if n_cookies >= 60:
        idx = min(len(base_tiers) - 1, idx + 1)
        signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
    elif n_cookies >= 30:
        signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")

    # Premium feature detection
    vk = _vendor_key(vendor.get("name", ""))
    for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
        if vk and vk != expected_key and expected_key not in (vendor.get("name") or "").lower():
            continue
        for cn in cookie_names:
            # Cookie names are lowercased above, so match case-insensitively;
            # otherwise patterns containing uppercase letters
            # (e.g. '^_gat_gtag_UA') could never hit.
            if re.search(pattern, cn, re.IGNORECASE):
                idx = min(len(base_tiers) - 1, idx + 1)
                signals.append(f"Premium-Feature-Cookie: {feature_label}")
                break

    # Heavy third-party tracking
    third_party_ratio = sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
    if third_party_ratio >= 0.6 and n_cookies >= 10:
        signals.append(f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies — Tracking-Heavy")

    # Long-lived cookies
    long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
    if long_lived >= 3:
        signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")

    return base_tiers[idx], signals
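The tier arithmetic in `infer_vendor_tier` is an index into an ordered list that is raised step by step and clamped at the top entry. A minimal sketch of just that mechanism follows, assuming a non-negative number of bumps; the names `TIERS` and `bump_tier` are hypothetical and not part of the module.

```python
from __future__ import annotations

# Ordered from cheapest to most expensive, as in the function above.
TIERS = ["starter", "professional", "enterprise", "premier"]


def bump_tier(baseline: str, bumps: int) -> str:
    """Raise baseline by `bumps` steps without going past the top tier.

    Hypothetical helper: unknown baselines fall back to index 1
    ('professional'), which mirrors the fallback used above.
    """
    idx = TIERS.index(baseline) if baseline in TIERS else 1
    return TIERS[min(len(TIERS) - 1, idx + bumps)]


if __name__ == "__main__":
    print(bump_tier("professional", 1))  # enterprise
    print(bump_tier("enterprise", 3))    # premier (clamped)
    print(bump_tier("unknown", 0))       # professional (fallback)
```

Because of the clamp, a vendor that already starts at 'premier' through the company baseline cannot be distinguished by tier from one that reached 'premier' through cookie signals; only the returned signal list differs.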
|
||||
|
||||
|
||||
def _expiry_years(expiry_str: str) -> float:
    """Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → 0.25"""
    s = (expiry_str or "").lower()
    m = re.search(r"(\d+)\s*(jahr|year)", s)
    if m:
        return float(m.group(1))
    m = re.search(r"(\d+)\s*(monat|month)", s)
    if m:
        return float(m.group(1)) / 12.0
    m = re.search(r"(\d+)\s*(tag|day)", s)
    if m:
        return float(m.group(1)) / 365.0
    return 0.0
|
||||
|
||||
|
||||
def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
    """Return cost estimation for one vendor incl. tier inference + signals."""
    vk = _vendor_key(vendor.get("name", ""))
    company_tier = infer_company_tier(business_profile)

    if not vk:
        return {
            "vendor": vendor.get("name", ""),
            "matched_pricing_key": None,
            "inferred_tier": None,
            "tier_signals": [],
            "company_tier_baseline": company_tier,
            "cost_year_eur_range": (0, 0),
            "confidence": "none",
            "note": "Kein Pricing-Eintrag fuer diesen Anbieter — Saving-Schaetzung uebergangen.",
        }

    tier, signals = infer_vendor_tier(vendor, company_tier)
    pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
    confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")

    return {
        "vendor": vendor.get("name", ""),
        "matched_pricing_key": vk,
        "inferred_tier": tier,
        "tier_signals": signals,
        "company_tier_baseline": company_tier,
        "cost_year_eur_range": pricing,
        "confidence": confidence,
    }
|
||||
|
||||
|
||||
def estimate_total_stack_cost(
    vendors: Iterable[dict],
    business_profile: dict | None = None,
) -> dict:
    """Aggregate cost estimation over all vendors.

    Returns a dict with:
      - per_vendor: list with one estimate per vendor
      - total_year_eur_range: summed (low, high) range
      - master_contracts_counted: number of distinct
        (pricing key, INTERNAL/EXT) pairs that were counted
      - disclaimer: note on how the estimate was derived

    Master-contract dedup: vendors with recipient_type 'INTERNAL'
    (site-owner entries such as 'BMW AG — ...') are bundled into ONE
    master contract per pricing key and are not double-counted. Entries
    with any other recipient_type are each added to the total.
    """
    per_vendor: list[dict] = []
    seen_master_keys: set[tuple[str, str]] = set()
    total_low = 0
    total_high = 0

    for v in vendors:
        est = estimate_vendor_cost(v, business_profile)
        per_vendor.append(est)
        if not est["matched_pricing_key"]:
            continue
        rtype = (v.get("recipient_type") or "").upper()
        master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
        if rtype == "INTERNAL" and master_key in seen_master_keys:
            # The same Adobe contract serves many "BMW AG — Adobe XYZ" lines;
            # count cost only ONCE per (key, internal).
            est["bundled_into_master_contract"] = True
            continue
        seen_master_keys.add(master_key)
        lo, hi = est["cost_year_eur_range"]
        total_low += lo
        total_high += hi

    return {
        "per_vendor": per_vendor,
        "total_year_eur_range": (total_low, total_high),
        "master_contracts_counted": len(seen_master_keys),
        "disclaimer": (
            "Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
            "Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
            "koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
            "Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
        ),
    }
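The master-contract rule in `estimate_total_stack_cost` only deduplicates INTERNAL entries; external entries with the same pricing key are each added to the total. The following sketch reduces the loop to that accounting. The function name `total_range`, the tuple layout, and the sample figures are assumptions made for illustration.

```python
from __future__ import annotations


def total_range(entries: list[tuple[str, str, int, int]]) -> tuple[int, int]:
    """Sum (low, high) ranges, counting each INTERNAL pricing key once.

    Hypothetical helper. Each entry is
    (pricing_key, recipient_type, low_eur, high_eur).
    """
    seen: set[tuple[str, str]] = set()
    low_total = 0
    high_total = 0
    for key, rtype, low, high in entries:
        internal = rtype.upper() == "INTERNAL"
        master = (key, "INTERNAL" if internal else "EXT")
        if internal and master in seen:
            continue  # already covered by the master contract
        seen.add(master)
        low_total += low
        high_total += high
    return low_total, high_total


if __name__ == "__main__":
    sample = [
        ("adobe analytics", "INTERNAL", 200_000, 500_000),
        ("adobe analytics", "internal", 200_000, 500_000),  # bundled, skipped
        ("criteo", "PROCESSOR", 80_000, 250_000),
        ("criteo", "PROCESSOR", 80_000, 250_000),            # external, counted again
    ]
    print(total_range(sample))  # (360000, 1000000)
```

Whether external duplicates should also be collapsed is a product decision; the sketch simply reproduces the behavior of the loop as written.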
|
||||
@@ -0,0 +1,727 @@
|
||||
"""
|
||||
Vendor Redundancy + EU-Alternatives Analyzer.
|
||||
|
||||
Input: list of vendors from the CMP capture (e.g. BMW, 90 vendors).
Output: four structured lists that are rendered in the email and the
migration modal:

  1. functional_categories : vendor → functional class (analytics,
                             advertising, cdn, captcha, chat, …)
  2. redundancies          : categories with ≥2 vendors doing the same job
                             → consolidation potential
  3. eu_alternatives       : for each US vendor, a matching EU replacement
                             from a curated lookup table (Matomo instead of
                             Adobe Analytics, IONOS instead of AWS, etc.)
  4. multi_function_tools  : EU tools that cover several categories
                             (e.g. SAP CX = analytics + CRM + marketing)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import re
|
||||
from collections import defaultdict
|
||||
from typing import Iterable
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ─── Categorization ──────────────────────────────────────────────────

# Substring match (lowercase) → category. The first match wins.
|
||||
_CATEGORY_RULES: list[tuple[str, str]] = [
|
||||
# Web Analytics / Behavior
|
||||
("adobe analytics", "web_analytics"),
|
||||
("adobe target", "personalisation"),
|
||||
("adobe campaign", "marketing_automation"),
|
||||
("adobe staging library", "tag_management"),
|
||||
("adobelaunch", "tag_management"),
|
||||
("google analytics", "web_analytics"),
|
||||
("matomo", "web_analytics"),
|
||||
("hotjar", "web_analytics"),
|
||||
("content square", "web_analytics"),
|
||||
("contentsquare", "web_analytics"),
|
||||
("dynatrace", "monitoring"),
|
||||
("performance analytics", "web_analytics"),
|
||||
("form analytics", "web_analytics"),
|
||||
("form campaign analytics","web_analytics"),
|
||||
("psyma", "survey"),
|
||||
("qualtrics", "survey"),
|
||||
|
||||
# Tag Management
|
||||
("google tag manager", "tag_management"),
|
||||
("gtm", "tag_management"),
|
||||
|
||||
# Advertising / Retargeting
|
||||
("google ads", "advertising"),
|
||||
("google advertising", "advertising"),
|
||||
("doubleclick", "advertising"),
|
||||
("googleads", "advertising"),
|
||||
("meta pixel", "advertising"),
|
||||
("meta platforms", "advertising"),
|
||||
("facebook", "advertising"),
|
||||
("adform", "advertising"),
|
||||
("criteo", "advertising"),
|
||||
("outbrain", "advertising"),
|
||||
("taboola", "advertising"),
|
||||
("teads", "advertising"),
|
||||
("pinterest", "advertising"),
|
||||
("linkedin insight", "advertising"),
|
||||
("youtube performance", "advertising"),
|
||||
("youtube player", "external_media"),
|
||||
("amazon advertising", "advertising"),
|
||||
("instagram", "advertising"),
|
||||
("dotaki", "advertising"),
|
||||
|
||||
# Video / Embeds
|
||||
("youtube", "external_media"),
|
||||
("vimeo", "external_media"),
|
||||
("jw player", "external_media"),
|
||||
("jw video", "external_media"),
|
||||
("jwplayer", "external_media"),
|
||||
("jwconnatix", "external_media"),
|
||||
|
||||
# Maps / Geo
|
||||
("google maps", "maps"),
|
||||
("google geolocation", "maps"),
|
||||
("geolocation", "maps"),
|
||||
|
||||
# CDN / Infrastructure
|
||||
("akamai", "cdn"),
|
||||
("amazon web services", "cloud_infra"),
|
||||
("aws", "cloud_infra"),
|
||||
("baqend", "cdn"),
|
||||
("speedkit", "cdn"),
|
||||
("speedcurve", "monitoring"),
|
||||
("salesforce", "crm"),
|
||||
|
||||
# Chat / Support
|
||||
("genesys", "chat"),
|
||||
("ckm", "chat"),
|
||||
("chat widget", "chat"),
|
||||
|
||||
# Captcha / Bot-Protection
|
||||
("hcaptcha", "captcha"),
|
||||
("recaptcha", "captcha"),
|
||||
|
||||
# Sales / Lead-Tracking
|
||||
("salesviewer", "lead_tracking"),
|
||||
|
||||
# Marketing/Sales overlay
|
||||
("nayoki", "social_aggregator"),
|
||||
|
||||
# Site-owned functions
|
||||
("infrastructure", "site_infra"),
|
||||
("infrastrukturbereit", "site_infra"),
|
||||
("javaserverpages", "site_infra"),
|
||||
("single sign-on", "auth"),
|
||||
("mybmw account", "auth"),
|
||||
("sso", "auth"),
|
||||
("consent", "consent_management"),
|
||||
("session", "site_infra"),
|
||||
("scroll", "site_infra"),
|
||||
("sticky", "site_infra"),
|
||||
("sidebar", "site_infra"),
|
||||
("dealer search", "site_feature"),
|
||||
("test drive", "site_feature"),
|
||||
("vehicle configurator", "site_feature"),
|
||||
("stocklocator", "site_feature"),
|
||||
("eshop", "site_feature"),
|
||||
("shop", "site_feature"),
|
||||
("language", "site_infra"),
|
||||
("sprach", "site_infra"),
|
||||
("region", "site_infra"),
|
||||
("ip popup", "site_infra"),
|
||||
("popup", "site_infra"),
|
||||
("dynatrace", "monitoring"),
|
||||
]
|
||||
|
||||
|
||||
def classify_vendor(name: str) -> str:
    """Map a vendor name to a functional category."""
    n = (name or "").lower()
    for needle, cat in _CATEGORY_RULES:
        if needle in n:
            return cat
    return "other"
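Because `classify_vendor` returns on the first substring hit, the order of `_CATEGORY_RULES` decides the outcome whenever one needle is contained in another. The sketch below uses a hypothetical three-rule excerpt (`RULES`) and a stand-alone `classify` function to show why the specific "youtube performance" rule has to precede the generic "youtube" rule.

```python
from __future__ import annotations

# Hypothetical three-rule excerpt; the order is what matters.
RULES: list[tuple[str, str]] = [
    ("youtube performance", "advertising"),
    ("youtube player", "external_media"),
    ("youtube", "external_media"),
]


def classify(name: str, rules: list[tuple[str, str]] = RULES) -> str:
    """Return the category of the first rule whose needle occurs in name."""
    lowered = (name or "").lower()
    for needle, category in rules:
        if needle in lowered:
            return category
    return "other"


if __name__ == "__main__":
    print(classify("YouTube Performance Ads"))               # advertising
    print(classify("YouTube"))                               # external_media
    # With the generic rule first, the specific rule is never reached.
    print(classify("YouTube Performance Ads", RULES[::-1]))  # external_media
    print(classify("Unknown Tool"))                          # other
```

The same reasoning applies to short needles such as "shop" or "sso": they match any vendor name containing those letters, so more specific rules need to appear earlier in the list.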
|
||||
|
||||
|
||||
# ─── EU alternatives ─────────────────────────────────────────────────

# Curated list: for each US/non-EU vendor, one or more suitable EU replacements.
# Sources: Matomo comparison, etracker SoMo study, IONOS packages,
# Friendly Captcha whitepaper, SAP CX suite, Brevo / CleverReach DE lists.
|
||||
_EU_ALTERNATIVES: dict[str, list[dict]] = {
|
||||
"adobe analytics": [
|
||||
{"name": "Matomo (On-Premise)", "vendor": "InnoCraft", "country": "DE-self-hosted",
|
||||
"license": "GPL", "notes": "100% DSGVO, keine 3rd-Country, gleicher Funktionsumfang"},
|
||||
{"name": "etracker Analytics", "vendor": "etracker GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "DSGVO-konform aus Hamburg, IP-Anonymisierung"},
|
||||
{"name": "Mapp Intelligence", "vendor": "Mapp Digital", "country": "DE",
|
||||
"license": "Commercial", "notes": "Enterprise-Alternative, Server in DE"},
|
||||
],
|
||||
"google analytics": [
|
||||
{"name": "Matomo", "vendor": "InnoCraft", "country": "DE-self-hosted",
|
||||
"license": "GPL", "notes": "Direkter Drop-in-Ersatz mit GA-Migrationspfad"},
|
||||
{"name": "Plausible Analytics", "vendor": "Plausible Insights", "country": "EE",
|
||||
"license": "AGPL/Commercial", "notes": "Cookielos, ohne Einwilligung nutzbar"},
|
||||
{"name": "Fathom Analytics EU", "vendor": "Fathom", "country": "DE-Region",
|
||||
"license": "Commercial", "notes": "Cookielos, EU-Hosting"},
|
||||
],
|
||||
"content square": [
|
||||
{"name": "Mouseflow EU", "vendor": "Mouseflow ApS", "country": "DK",
|
||||
"license": "Commercial", "notes": "Session-Recording + Heatmaps EU-Hosting"},
|
||||
{"name": "Hotjar EU", "vendor": "Hotjar Ltd", "country": "MT",
|
||||
"license": "Commercial", "notes": "EU-DataCenter (Frankfurt), Einwilligung erforderlich"},
|
||||
],
|
||||
"dynatrace": [
|
||||
{"name": "Dynatrace EU", "vendor": "Dynatrace", "country": "AT",
|
||||
"license": "Commercial", "notes": "Bereits EU (Linz). Cluster auf EU einstellen"},
|
||||
],
|
||||
"speedcurve": [
|
||||
{"name": "SpeedCurve EU", "vendor": "SpeedCurve", "country": "EU-tenant",
|
||||
"license": "Commercial", "notes": "Region-Tenant explizit konfigurieren"},
|
||||
{"name": "Calibre", "vendor": "Calibre", "country": "AU/EU",
|
||||
"license": "Commercial", "notes": "Performance Monitoring, EU-Region"},
|
||||
],
|
||||
"akamai": [
|
||||
{"name": "Bunny CDN", "vendor": "BunnyWay d.o.o.", "country": "SI",
|
||||
"license": "Commercial", "notes": "Slowenischer CDN, EU-Backbone"},
|
||||
{"name": "Cloudflare EU-Only", "vendor": "Cloudflare", "country": "Multi",
|
||||
"license": "Commercial", "notes": "EU-Datacenter erzwingbar via 'Geo Steering'"},
|
||||
{"name": "IONOS CDN", "vendor": "IONOS SE", "country": "DE",
|
||||
"license": "Commercial", "notes": "100% DE-Hosting"},
|
||||
],
|
||||
"amazon web services": [
|
||||
{"name": "IONOS Cloud", "vendor": "IONOS SE", "country": "DE",
|
||||
"license": "Commercial", "notes": "DE-Hosting, BSI C5-zertifiziert"},
|
||||
{"name": "OVHcloud", "vendor": "OVH SAS", "country": "FR",
|
||||
"license": "Commercial", "notes": "FR-Hosting, SecNumCloud-zertifiziert"},
|
||||
{"name": "Hetzner Cloud", "vendor": "Hetzner Online GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "DE/FI-Hosting, sehr kostenguenstig"},
|
||||
{"name": "STACKIT", "vendor": "Schwarz IT (Lidl-Gruppe)", "country": "DE",
|
||||
"license": "Commercial", "notes": "Souveraener DE-Cloud, fuer Enterprise"},
|
||||
],
|
||||
"salesforce": [
|
||||
{"name": "SAP Customer Experience", "vendor": "SAP SE", "country": "DE",
|
||||
"license": "Commercial", "notes": "Vollstaendige CRM-Suite EU-Hosting"},
|
||||
{"name": "weclapp", "vendor": "weclapp SE", "country": "DE",
|
||||
"license": "Commercial", "notes": "Cloud-CRM aus Marburg"},
|
||||
],
|
||||
"adobe campaign": [
|
||||
{"name": "CleverReach", "vendor": "CleverReach GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "E-Mail-Marketing DE-Hosting"},
|
||||
{"name": "Brevo (Sendinblue)", "vendor": "Brevo", "country": "FR",
|
||||
"license": "Commercial", "notes": "Marketing-Automation EU-Hosting"},
|
||||
{"name": "Inxmail", "vendor": "Inxmail GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "Enterprise-E-Mail-Marketing aus Freiburg"},
|
||||
],
|
||||
"google ads": [
|
||||
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||
"license": "Commercial", "notes": "FR-Hosting, Programmatic + Direct-Sold"},
|
||||
{"name": "Bing Ads (Microsoft Advertising EU)", "vendor": "Microsoft", "country": "Multi",
|
||||
"license": "Commercial", "notes": "EU-Datacenter optional"},
|
||||
],
|
||||
"google maps": [
|
||||
{"name": "HERE Maps", "vendor": "HERE Technologies", "country": "DE",
|
||||
"license": "Commercial", "notes": "Berliner Anbieter, professionelle Karten + Routing"},
|
||||
{"name": "OpenStreetMap (self-host)", "vendor": "OSM Foundation", "country": "DE-self-host",
|
||||
"license": "ODbL", "notes": "Frei, OSM-Tiles self-hosted oder via Maptiler EU"},
|
||||
{"name": "Maptiler Cloud EU", "vendor": "MapTiler", "country": "CH",
|
||||
"license": "Commercial", "notes": "Schweizer Anbieter, EU-Tiles"},
|
||||
],
|
||||
"criteo": [ # criteo IS EU but use as example for retargeting alts
|
||||
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||
"license": "Commercial", "notes": "Retargeting + Display, FR-Hosting"},
|
||||
],
|
||||
"hcaptcha": [
|
||||
{"name": "Friendly Captcha", "vendor": "Friendly Captcha GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "100% DSGVO, ohne Cookie, Hosting in DE"},
|
||||
{"name": "Turnstile (Cloudflare EU-Only)", "vendor": "Cloudflare", "country": "Multi",
|
||||
"license": "Commercial", "notes": "Ohne Cookie, EU-Region erzwingbar"},
|
||||
],
|
||||
"qualtrics": [
|
||||
{"name": "LamaPoll", "vendor": "Lamano GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "DSGVO-Surveys aus Berlin"},
|
||||
{"name": "evasys", "vendor": "evasys GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "Enterprise-Survey-Plattform aus Lueneburg"},
|
||||
],
|
||||
"meta pixel": [
|
||||
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||
"license": "Commercial", "notes": "EU-Alternative fuer Conversion-Tracking"},
|
||||
],
|
||||
"facebook": [
|
||||
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||
"license": "Commercial", "notes": "Programmatic ohne Meta"},
|
||||
],
|
||||
"linkedin insight": [
|
||||
{"name": "Xing Insights", "vendor": "New Work SE", "country": "DE",
|
||||
"license": "Commercial", "notes": "DE/AT/CH B2B-Targeting aus Hamburg"},
|
||||
],
|
||||
"outbrain": [
|
||||
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
|
||||
],
|
||||
"taboola": [
|
||||
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
|
||||
],
|
||||
"genesys": [
|
||||
{"name": "Userlike", "vendor": "Userlike UG", "country": "DE",
|
||||
"license": "Commercial", "notes": "Live-Chat aus Koeln, BSI-konform"},
|
||||
{"name": "LiveZilla / EasyChat EU", "vendor": "LiveZilla GmbH", "country": "DE",
|
||||
"license": "Commercial", "notes": "DSGVO-Live-Chat"},
|
||||
],
|
||||
"salesviewer": [
|
||||
{"name": "Leadinfo", "vendor": "Leadinfo BV", "country": "NL",
|
||||
"license": "Commercial", "notes": "B2B-Webvisitor-Tracking EU"},
|
||||
{"name": "Albacross EU", "vendor": "Albacross", "country": "SE",
|
||||
"license": "Commercial", "notes": "EU-Tenant verfuegbar"},
|
||||
],
|
||||
"youtube": [
|
||||
{"name": "Vimeo Pro EU", "vendor": "Vimeo", "country": "Multi",
|
||||
"license": "Commercial", "notes": "EU-Region waehlbar, weniger Tracking"},
|
||||
{"name": "Self-hosted video (BunnyStream)", "vendor": "BunnyWay", "country": "SI",
|
||||
"license": "Commercial", "notes": "Eigene Player + CDN ohne Drittanbieter"},
|
||||
],
|
||||
"amazon advertising": [
|
||||
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||
"license": "Commercial", "notes": "Retail-Media-Alternative FR"},
|
||||
],
|
||||
"instagram": [
|
||||
{"name": "Pinterest EU + Owned-Channels", "vendor": "Mix", "country": "Multi",
|
||||
"license": "Commercial", "notes": "Owned-Channels (Newsletter via CleverReach)"},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
# ─── Cost assumptions (public list prices, estimates) ──────
#
# Format: (low_year_eur, high_year_eur, tier_assumption)
# Tier: 'sme' = <100 employees, 'mid' = 100-1000, 'ent' = >1000.
# Sources: public list prices + industry benchmarks (Gartner,
# Forrester 2025). Actual contract terms can deviate by 30-70%
# (volume discounts, bundling). The values are explicitly labeled
# 'Schaetzbereich' (estimated range) in the output.
|
||||
|
||||
_COST_LOOKUP: dict[str, tuple[int, int, str]] = {
|
||||
"adobe analytics": (120_000, 600_000, "ent"),
|
||||
"adobe target": ( 80_000, 350_000, "ent"),
|
||||
"adobe campaign": ( 60_000, 250_000, "ent"),
|
||||
"adobe staging library":( 0, 0, "ent"), # bundled
|
||||
"google analytics": ( 0, 150_000, "ent"), # GA4 free, GA360 ~150k
|
||||
"matomo": ( 6_000, 30_000, "mid"), # Cloud/On-Prem
|
||||
"hotjar": ( 3_600, 18_000, "mid"),
|
||||
"content square": ( 60_000, 300_000, "ent"),
|
||||
"contentsquare": ( 60_000, 300_000, "ent"),
|
||||
"dynatrace": ( 50_000, 400_000, "ent"), # per-host pricing
|
||||
"performance analytics":( 5_000, 40_000, "mid"),
|
||||
"qualtrics": ( 25_000, 150_000, "ent"),
|
||||
|
||||
# Self-service advertising: NO tool license, only media spend (separate).
# We count 0 here because the saving potential on the license is 0.
# Consolidation would only reduce media spend, which is a separate topic.
|
||||
"google ads": ( 0, 0, "ent"),
|
||||
"google advertising": ( 0, 0, "ent"),
|
||||
"doubleclick": ( 0, 0, "ent"),
|
||||
"meta pixel": ( 0, 0, "ent"),
|
||||
"facebook": ( 0, 0, "ent"),
|
||||
"amazon advertising": ( 0, 0, "ent"),
|
||||
"youtube performance": ( 0, 0, "ent"),
|
||||
"youtube player": ( 0, 0, "ent"),
|
||||
"instagram": ( 0, 0, "ent"),
|
||||
# Real DSP/platform licenses: here the customer pays a SaaS fee
# ON TOP of the media spend. Range deliberately kept narrow
# (high is 4x to 5x the low value).
|
||||
"adform": ( 80_000, 300_000, "ent"),
|
||||
"criteo": ( 50_000, 200_000, "ent"),
|
||||
"outbrain": ( 30_000, 120_000, "ent"),
|
||||
"taboola": ( 30_000, 120_000, "ent"),
|
||||
"teads": ( 25_000, 100_000, "ent"),
|
||||
"pinterest": ( 15_000, 60_000, "ent"),
|
||||
"linkedin insight": ( 10_000, 50_000, "ent"),
|
||||
|
||||
"google maps": ( 2_000, 30_000, "mid"),
|
||||
"akamai": ( 50_000, 500_000, "ent"),
|
||||
"amazon web services": (100_000, 3_000_000, "ent"),
|
||||
"baqend": ( 6_000, 60_000, "mid"),
|
||||
"speedkit": ( 6_000, 60_000, "mid"),
|
||||
"speedcurve": ( 2_400, 24_000, "mid"),
|
||||
|
||||
"salesforce": (100_000, 1_500_000, "ent"), # CRM seats
|
||||
"genesys": ( 80_000, 800_000, "ent"), # contact-center seats
|
||||
"ckm": ( 15_000, 120_000, "mid"),
|
||||
"hcaptcha": ( 0, 12_000, "sme"), # free tier OR pro
|
||||
|
||||
"salesviewer": ( 3_600, 18_000, "mid"),
|
||||
"youtube": ( 0, 50_000, "ent"), # embed kostenlos, Production-Kosten variieren
|
||||
}
|
||||
|
||||
|
||||
# ─── EU alternative costs (same tier logic) ────────────────────────

_EU_ALT_COSTS: dict[str, tuple[int, int]] = {
    "Matomo (On-Premise)":                 (  3_000,    15_000),
    "Matomo (Pro / Cloud EU)":             (  6_000,    30_000),
    "Matomo":                              (  6_000,    30_000),
    "etracker Analytics":                  ( 10_000,    60_000),
    "Mapp Intelligence":                   ( 40_000,   200_000),
    "Plausible Analytics":                 (    240,     6_000),
    "Fathom Analytics EU":                 (    240,     6_000),
    "Mouseflow EU":                        ( 12_000,    60_000),
    "Hotjar EU":                           (  3_600,    18_000),
    "Dynatrace EU":                        ( 50_000,   400_000),  # same price, different region
    "SpeedCurve EU":                       (  2_400,    24_000),
    "Calibre":                             (  3_600,    30_000),
    "Bunny CDN":                           (  1_200,    12_000),
    "Cloudflare EU-Only":                  (  6_000,    80_000),
    "IONOS CDN":                           (  3_000,    30_000),
    "IONOS Cloud":                         ( 30_000,   600_000),
    "OVHcloud":                            ( 30_000,   600_000),
    "Hetzner Cloud":                       (  6_000,   120_000),
    "STACKIT":                             ( 50_000,   800_000),
    "SAP Customer Experience":             ( 80_000, 1_200_000),
    "weclapp":                             ( 12_000,    80_000),
    "CleverReach":                         (  2_400,    24_000),
    "Brevo (Sendinblue)":                  (    600,    24_000),
    "Inxmail":                             (  8_000,    60_000),
    "Smart AdServer (Equativ)":            ( 30_000,   300_000),
    "Bing Ads (Microsoft Advertising EU)": ( 30_000, 3_000_000),
    "HERE Maps":                           (  1_200,    24_000),
    "OpenStreetMap (self-host)":           (      0,     6_000),  # server costs only
    "Maptiler Cloud EU":                   (    600,    12_000),
    "Friendly Captcha":                    (    600,     9_600),
    "Turnstile (Cloudflare EU-Only)":      (      0,     6_000),
    "LamaPoll":                            (  1_200,    24_000),
    "evasys":                              (  6_000,    60_000),
    "Xing Insights":                       (  6_000,    60_000),
    "Plista":                              ( 20_000,   150_000),
    "Userlike":                            (  1_200,    30_000),
    "LiveZilla / EasyChat EU":             (    600,    12_000),
    "Leadinfo":                            (  1_200,    12_000),
    "Albacross EU":                        (  3_600,    24_000),
    "Vimeo Pro EU":                        (    900,     6_000),
    "Self-hosted video (BunnyStream)":     (    600,    12_000),
    "Pinterest EU + Owned-Channels":       (    600,    24_000),
}


# ─── Known reasons for duplicates (consolidation should NOT be recommended) ─

_DUPLICATION_CAVEATS = {
    "web_analytics": [
        "A/B comparison of different vendors during a migration",
        "Marketing uses Adobe, product uses Matomo (internal politics)",
        "Regional split (Adobe for DE, GA for international)",
    ],
    "advertising": [
        "Brand campaign vs performance campaign (different DSPs)",
        "Seasonal: Black Friday/Super Bowl uses more channels",
        "Brand-specific: BMW M models are targeted differently from the 1 Series",
    ],
    "cdn": [
        "Multi-CDN strategy for resilience (Akamai + Cloudflare)",
        "Event CDN spike (auto show, model launch) needs scaling",
        "Regional latency optimisation (Akamai APAC, AWS US)",
    ],
    "marketing_automation": [
        "Salesforce Marketing Cloud for B2C, Adobe Campaign for B2B",
        "Lead generation (Adobe) vs loyalty (Salesforce)",
    ],
    "monitoring": [
        "APM (Dynatrace) measures the backend, RUM (SpeedCurve) measures the frontend",
    ],
    "captcha": [
        "Gradual migration to a cookieless captcha",
    ],
}


def _company_tier_bounds(company_tier: str | None) -> tuple[float, float]:
    """How much of the list-price range to actually use, depending on
    the company tier. For 'enterprise' / 'premier' we use the UPPER
    part (40-100%) instead of starter→premier.
    """
    t = (company_tier or "professional").lower()
    if t == "premier":
        return (0.70, 1.00)
    if t == "enterprise":
        return (0.40, 0.85)
    if t == "professional":
        return (0.20, 0.60)
    return (0.05, 0.40)  # 'sme' / starter


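# Worked example for the tier fractions above: a minimal sketch, assuming the
# Dynatrace list-price range (50_000, 400_000) from the lookup table. The
# helper name `tier_range` is hypothetical and not part of this module; it only
# mirrors the `lo + span * frac` arithmetic used in the savings estimate.

```python
def tier_range(lo: int, hi: int, low_frac: float, high_frac: float) -> tuple[int, int]:
    """Cut a sub-range out of the list-price range [lo, hi]."""
    span = hi - lo
    return int(lo + span * low_frac), int(lo + span * high_frac)


# Enterprise fractions (0.40, 0.85) vs. starter fractions (0.05, 0.40).
enterprise = tier_range(50_000, 400_000, 0.40, 0.85)
starter = tier_range(50_000, 400_000, 0.05, 0.40)
print("enterprise:", enterprise)
print("starter:   ", starter)
```

# The enterprise band starts where the starter band ends, so a large
# corporation is never shown the cheapest end of the list prices.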
def _eu_alt_cost(name: str) -> tuple[int, int] | None:
    """Price range for an EU tool. Exact key first, then the longest key
    that is a prefix of `name` (e.g. "Userlike Suite" -> "Userlike"), so
    that the longer display names in _MULTI_FUNCTION_TOOLS still resolve
    instead of silently falling back to a cost of 0.
    """
    if name in _EU_ALT_COSTS:
        return _EU_ALT_COSTS[name]
    prefixes = [k for k in _EU_ALT_COSTS if name.startswith(k)]
    return _EU_ALT_COSTS[max(prefixes, key=len)] if prefixes else None


def _estimate_savings_for_redundancy(
    redundancy: dict, vendors: Iterable[dict],
    company_tier: str = "enterprise",
) -> dict:
    """Estimate range per redundancy: current costs + EU consolidation saving.

    Takes company_tier into account: for a corporation like BMW we do
    not want to show the starter range as well. The realistic range
    results from tier_bounds × (low, high).
    """
    low_frac, high_frac = _company_tier_bounds(company_tier)
    current_low = current_high = 0
    matched_vendors = []
    cat_vendors = [v for v in vendors if v.get("name") in redundancy.get("vendors", [])]
    for v in cat_vendors:
        name = (v.get("name") or "").lower()
        for k, (lo, hi, _tier) in _COST_LOOKUP.items():
            if k in name:
                # Tier-aware: take low_frac..high_frac of the pricing range
                span = hi - lo
                current_low += int(lo + span * low_frac)
                current_high += int(lo + span * high_frac)
                matched_vendors.append(v.get("name"))
                break

    # Consolidation: a single EU tool replaces all vendors in the category
    suggested_eu = None
    suggested_low = suggested_high = 0
    # 1. Multi-function tool that covers this category
    for tool in _MULTI_FUNCTION_TOOLS:
        if redundancy["category"] in tool["covers"]:
            suggested_eu = tool["name"]
            cost = _eu_alt_cost(tool["name"])
            if cost:
                suggested_low, suggested_high = cost
            break
    # 2. Otherwise: EU alternative from the entries, BUT ONLY FOR VENDORS
    #    FROM THE CURRENT CATEGORY (otherwise Userlike shows up for advertising)
    if not suggested_eu:
        for v in cat_vendors:
            n = (v.get("name") or "").lower()
            for k, alts in _EU_ALTERNATIVES.items():
                if k in n and alts:
                    suggested_eu = alts[0]["name"]
                    cost = _eu_alt_cost(alts[0]["name"])
                    if cost:
                        suggested_low, suggested_high = cost
                    break
            if suggested_eu:
                break

    saving_low = max(0, current_low - suggested_high)
    saving_high = max(0, current_high - suggested_low)

    return {
        "current_estimate_year_eur": [current_low, current_high],
        "suggested_eu_tool": suggested_eu,
        "suggested_estimate_year_eur": [suggested_low, suggested_high],
        "estimated_saving_year_eur": [saving_low, saving_high],
        "caveats": _DUPLICATION_CAVEATS.get(redundancy["category"], []),
        "cost_disclaimer": (
            "Estimate range based on public list prices. Actual "
            "contract prices can be 30-70% lower (volume, bundling, "
            "group-wide terms). Please verify with the relevant procurement department."
        ),
    }


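# The saving range above pairs opposite ends of the two estimates: the low
# bound assumes the cheapest current stack and the most expensive replacement,
# the high bound the reverse, and neither may go below zero. A minimal sketch
# of just that step; `saving_range` and the sample numbers are hypothetical.

```python
def saving_range(current: tuple[int, int], suggested: tuple[int, int]) -> tuple[int, int]:
    """Pessimistic and optimistic yearly saving, both floored at 0."""
    current_low, current_high = current
    suggested_low, suggested_high = suggested
    pessimistic = max(0, current_low - suggested_high)
    optimistic = max(0, current_high - suggested_low)
    return pessimistic, optimistic


# Replacement is cheaper than the current stack at both ends.
cheaper = saving_range((100_000, 250_000), (6_000, 30_000))
# Replacement is more expensive: the floor keeps the saving at 0.
pricier = saving_range((10_000, 20_000), (30_000, 60_000))
print(cheaper, pricier)
```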
# ─── Multi-function tools (consolidation anchor points) ────────────

_MULTI_FUNCTION_TOOLS = [
    {
        "name": "Matomo (Pro / Cloud EU)",
        "vendor": "InnoCraft",
        "country": "DE-self-host / EU",
        "covers": ["web_analytics", "tag_management", "personalisation"],
        "notes": "Replaces Adobe Analytics + GTM + Adobe Target in one tool. "
                 "100% GDPR without consent if the IP is anonymised.",
    },
    {
        "name": "SAP Customer Experience Suite",
        "vendor": "SAP SE",
        "country": "DE",
        "covers": ["crm", "marketing_automation", "personalisation", "survey"],
        "notes": "Replaces Salesforce + Adobe Campaign + Qualtrics. EU hosting, "
                 "deep ERP integration.",
    },
    {
        "name": "IONOS Cloud (Compute + CDN + Storage + DNS)",
        "vendor": "IONOS SE",
        "country": "DE",
        "covers": ["cloud_infra", "cdn", "monitoring"],
        "notes": "Replaces AWS + Akamai + additional monitoring in one "
                 "German cloud (BSI C5).",
    },
    {
        "name": "Userlike Suite",
        "vendor": "Userlike UG",
        "country": "DE",
        "covers": ["chat", "consent_management"],
        "notes": "Replaces Genesys Chat. Offers its own consent module.",
    },
    {
        "name": "Smart AdServer (Equativ)",
        "vendor": "Equativ",
        "country": "FR",
        "covers": ["advertising"],
        "notes": "Replaces multiple DSPs (Adform/Criteo/Outbrain/Taboola/Meta) "
                 "with a programmatic + direct-sold EU stack.",
    },
    {
        "name": "HERE Maps",
        "vendor": "HERE Technologies",
        "country": "DE",
        "covers": ["maps"],
        "notes": "Berlin-based provider, professional maps + routing.",
    },
    {
        "name": "Vimeo Pro EU (or self-hosted BunnyStream)",
        "vendor": "Vimeo / BunnyWay",
        "country": "Multi / SI",
        "covers": ["external_media"],
        "notes": "Replaces YouTube embeds + JW Player in one player.",
    },
    {
        "name": "LamaPoll",
        "vendor": "Lamano GmbH",
        "country": "DE",
        "covers": ["survey"],
        "notes": "GDPR surveys from Berlin. Replaces Qualtrics / Psyma.",
    },
]


# ─── Analysis ────────────────────────────────────────────────────────

def analyze(vendors: Iterable[dict], company_tier: str = "enterprise") -> dict:
    """Main entry. Returns categorised view + redundancies + EU options.

    `company_tier` (starter|professional|enterprise|premier) controls the
    cost range so that, e.g., starter prices do not end up in the lower
    bound for a DAX corporation.
    """
    # Materialise once: the vendors are iterated several times below, and
    # a generator would be exhausted after the first pass.
    all_vendors_list = list(vendors)
    by_cat: dict[str, list[dict]] = defaultdict(list)
    for v in all_vendors_list:
        cat = classify_vendor(v.get("name", ""))
        by_cat[cat].append(v)

    # Redundancies: any category with ≥2 vendors (excl. site-internal cats)
    skip_redundancy_cats = {"site_infra", "site_feature", "consent_management",
                            "auth", "other"}
    redundancies: list[dict] = []
    for cat, vs in by_cat.items():
        if cat in skip_redundancy_cats or len(vs) < 2:
            continue
        red = {
            "category": cat,
            "category_label": _CATEGORY_LABEL.get(cat, cat),
            "count": len(vs),
            "vendors": [v.get("name", "") for v in vs],
            "consolidation_hint": _CONSOLIDATION_HINT.get(cat, ""),
        }
        red.update(_estimate_savings_for_redundancy(
            red, all_vendors_list, company_tier))
        redundancies.append(red)
    redundancies.sort(key=lambda r: -(r.get("estimated_saving_year_eur") or [0, 0])[1])

    # EU alternatives lookup
    eu_alternatives: list[dict] = []
    seen = set()
    for v in all_vendors_list:
        name = v.get("name") or ""
        n_lower = name.lower()
        for k, alts in _EU_ALTERNATIVES.items():
            if k in n_lower and k not in seen:
                eu_alternatives.append({
                    "current_vendor": name,
                    "current_recipient_type": v.get("recipient_type", ""),
                    "matched_key": k,
                    "alternatives": alts,
                })
                seen.add(k)
                break

    # Multi-function tool recommendations: only if the customer has vendors
    # across the categories the tool covers
    present_cats = set(by_cat.keys())
    multi_function = []
    for tool in _MULTI_FUNCTION_TOOLS:
        covered_here = [c for c in tool["covers"] if c in present_cats]
        if len(covered_here) >= 2:
            # Collect vendor names instead of just summing, so they are deduplicated
            unique_vendors: set[str] = set()
            for c in covered_here:
                for v in by_cat[c]:
                    unique_vendors.add(v.get("name", ""))
            multi_function.append({
                **tool,
                "replaces_categories": covered_here,
                "potential_replacements": len(unique_vendors),
            })
    multi_function.sort(key=lambda t: -t["potential_replacements"])

    total_current_low = sum((r.get("current_estimate_year_eur") or [0, 0])[0] for r in redundancies)
    total_current_high = sum((r.get("current_estimate_year_eur") or [0, 0])[1] for r in redundancies)
    total_saving_low = sum((r.get("estimated_saving_year_eur") or [0, 0])[0] for r in redundancies)
    total_saving_high = sum((r.get("estimated_saving_year_eur") or [0, 0])[1] for r in redundancies)

    # Both bounds use the same denominator (midpoint of the current
    # estimate); otherwise the upper bound explodes when current_low is
    # small. Capped at 95%.
    if total_current_high:
        mid = (total_current_low + total_current_high) / 2
        saving_pct = (
            f"{min(95, int(100 * total_saving_low / mid))}–"
            f"{min(95, int(100 * total_saving_high / mid))}%"
        )
    else:
        saving_pct = "n/a"

    return {
        "summary": {
            "total_vendors": len(all_vendors_list),
            "distinct_categories": len([c for c in by_cat if c != "other"]),
            "redundancy_count": len(redundancies),
            "eu_alternative_count": len(eu_alternatives),
            "consolidation_potential": sum(r["count"] - 1 for r in redundancies),
            "estimated_current_year_eur": [total_current_low, total_current_high],
            "estimated_saving_year_eur": [total_saving_low, total_saving_high],
            "estimated_saving_pct": saving_pct,
            "cost_disclaimer": (
                "Estimate range based on public list prices (Gartner, Forrester 2025). "
                "Contract prices can be 30-70% lower (volume discounts, group-wide terms, "
                "bundling). Values serve as a basis for discussion with procurement, NOT as a quote."
            ),
        },
        "by_category": {cat: [v.get("name", "") for v in vs]
                        for cat, vs in by_cat.items()},
        "redundancies": redundancies,
        "eu_alternatives": eu_alternatives,
        "multi_function_tools": multi_function,
    }


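# The summary percentage above divides both saving bounds by the midpoint of
# the current estimate and caps the result at 95. A minimal standalone sketch
# of that arithmetic; `saving_pct_label` and the sample totals are hypothetical
# and only illustrate why a shared denominator keeps the upper bound bounded.

```python
def saving_pct_label(current: tuple[int, int], saving: tuple[int, int], cap: int = 95) -> str:
    """Format the saving range as a percentage of the current midpoint."""
    current_low, current_high = current
    if not current_high:
        return "n/a"  # nothing matched, so no meaningful percentage
    mid = (current_low + current_high) / 2
    low_pct = min(cap, int(100 * saving[0] / mid))
    high_pct = min(cap, int(100 * saving[1] / mid))
    return f"{low_pct}–{high_pct}%"


# Midpoint is 200_000: 50_000 is 25 %, 280_000 would be 140 % and is capped.
label = saving_pct_label((100_000, 300_000), (50_000, 280_000))
empty = saving_pct_label((0, 0), (0, 0))
print(label, empty)
```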
_CATEGORY_LABEL = {
    "web_analytics": "Web analytics",
    "advertising": "Advertising / retargeting",
    "tag_management": "Tag management",
    "marketing_automation": "Marketing automation",
    "personalisation": "Personalisation",
    "external_media": "External media (video)",
    "maps": "Maps / geo",
    "cdn": "CDN",
    "cloud_infra": "Cloud infrastructure",
    "monitoring": "Performance monitoring",
    "crm": "CRM",
    "chat": "Chat / support",
    "captcha": "Bot protection",
    "lead_tracking": "Lead tracking",
    "survey": "Surveys",
    "social_aggregator": "Social media aggregation",
    "consent_management": "Consent management",
    "auth": "Authentication",
    "site_infra": "Own infrastructure",
    "site_feature": "Own features",
    "other": "Other",
}

_CONSOLIDATION_HINT = {
    "web_analytics": "Multiple analytics tools usually collect redundant data. One tool is enough; Matomo (DE) is the GDPR standard.",
    "advertising": "Advertising/retargeting pixels are often interchangeable. Focusing on 2-3 channels lowers third-country risk.",
    "external_media": "Use multiple video embeds only when functionally necessary. Self-hosting (BunnyStream/Vimeo) reduces tracking.",
    "maps": "One map solution is enough. HERE Maps (DE) as an EU alternative to Google Maps.",
    "cdn": "One CDN + performance stack is enough. IONOS or Bunny combine several functions.",
    "marketing_automation": "A marketing cloud plus a separate e-mail tool is often duplication; SAP CX or CleverReach alone may suffice.",
    "chat": "One chat system is enough. Userlike (DE) replaces the Genesys stack.",
    "monitoring": "RUM + APM can be bundled in one tool (Dynatrace EU or self-hosted Sentry).",
    "survey": "One survey platform is enough: LamaPoll (DE) or Mapp.",
}