feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered per business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the Impressum. Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over DOM extraction when it is richer (BMW: 1824 vs 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of the list prices, not the full starter→premier range.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for clearly EU-based vendors (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: one structured instruction with what/why/fix_text/where/example per finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...). Intended for 1:1 insertion into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check, plus the vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register (records of processing activities) + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). cmp_vendors are persisted in the check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video placeholder until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb, similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run summary, v1 (broken) -> v9 (all fixes):
  DSE (privacy policy)  7.5% -> 81-83%
  Impressum             4%   -> 100% (all 6 real MCs fulfilled)
  Cookie                0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
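The country inference from the legal form described in the commit message can be pictured roughly as follows. This is a minimal sketch, not the code in cookie_link_validator: the function name `infer_country`, the suffix table and the shape of the lookup table are assumptions made for illustration. Only the four suffix-to-country pairs (A/S, GmbH, Inc, B.V.) and the Adform/Pinterest examples come from the message itself.

```python
import re

# Hypothetical suffix table. Only these four pairs are named in the commit
# message; the real module may cover more legal forms.
_LEGAL_FORM_COUNTRY = [
    (r"(?:^|\s)A/S$", "DK"),
    (r"(?:^|\s)GmbH$", "DE"),
    (r"(?:^|\s)Inc\.?$", "US"),
    (r"(?:^|\s)B\.V\.?$", "NL"),
]


def infer_country(vendor_name, lookup=None):
    """Return an ISO country code for a vendor name, or None if unknown.

    An explicit lookup table wins over the suffix heuristic, because some
    legal forms (for example "Ltd.") are used in several countries.
    """
    name = (vendor_name or "").strip()
    if lookup and name in lookup:
        return lookup[name]
    for pattern, country in _LEGAL_FORM_COUNTRY:
        if re.search(pattern, name, flags=re.IGNORECASE):
            return country
    return None


if __name__ == "__main__":
    table = {"Pinterest Europe Ltd.": "IE"}  # assumed lookup entry
    for vendor in ("Adform A/S", "Pinterest Europe Ltd.", "Some Vendor"):
        print(vendor, "->", infer_country(vendor, table))
```

Checking the lookup table before the suffix rules keeps ambiguous legal forms from producing a wrong country; returning None leaves it to the caller whether to raise a no_country flag.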
@@ -0,0 +1,167 @@
+"""
+Cookie-Function-Classifier: determines what each individual cookie does.
+
+Today we have one category per vendor (analytics/advertising/...).
+But a vendor often has 10-50 different cookies. Not every cookie of a
+marketing platform does advertising; many handle session management,
+language preference, scroll position, etc.
+
+This module classifies per cookie:
+  - functional_role : what the cookie does technically (session_id,
+    csrf_token, ab_test, user_id, ad_id, …)
+  - data_collected  : which data sits behind it (visitor_id,
+    page_view, click, conversion_event, …)
+  - blocking_impact : what happens when the cookie is blocked
+    (none, no_personalization, no_tracking, site_breaks)
+
+This lets the vendor redundancy analyzer state precisely:
+  "Adobe Analytics sets 55 cookies, 12 of them for tracking, 8 for A/B tests
+   and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B test
+   cookies, so 55 Adobe cookies become 20 Matomo cookies."
+"""
+
+from __future__ import annotations
+
+import re
+from typing import Iterable
+
+# Pattern → (functional_role, blocking_impact)
+# Order matters: more specific patterns first.
+_PATTERNS: list[tuple[str, str, str]] = [
+    # Session / authentication
+    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
+    # CSRF must precede the generic auth rule, otherwise "csrf_token" hits "token".
+    (r"^csrf|xsrf|antiforgery",                              "csrf_token", "site_breaks"),
+    (r"sso|signon|auth|login|token|jwt|bearer",              "auth_token", "site_breaks"),
+
+    # Language setting / region
+    (r"lang|locale|culture|region",                          "preference", "no_personalization"),
+
+    # User preferences (theme, view, bookmark)
+    (r"theme|dark|mode|view|sort|filter",                    "ui_preference", "no_personalization"),
+    (r"bookmark|favorite|favorit",                           "user_data", "no_personalization"),
+
+    # The consent cookie itself
+    (r"consent|gdpr|tcf|euconsent",                          "consent_state", "site_breaks"),
+
+    # Tracking IDs (most analytics)
+    # Anchored so that names merely containing "gid"/"gat" (e.g. "navigation") do not match.
+    (r"^_(ga|gid|gat)|google_analytic",                      "tracking_id", "no_tracking"),
+    (r"^_pk_|matomo|piwik",                                  "tracking_id", "no_tracking"),
+    (r"^s_|s\.cc|adobesite|aam",                             "tracking_id", "no_tracking"),  # Adobe
+    (r"hjid|hjsession|hotjar",                               "session_recording", "no_tracking"),
+    (r"_uetsid|_uetvid|microsoft",                           "tracking_id", "no_tracking"),
+
+    # Visitor identification
+    (r"visitor|uid|user_id|customer_id",                     "visitor_id", "no_personalization"),
+
+    # A/B-Test / Personalisation
+    (r"ab_test|abtest|variant|experiment|target|target_qa",  "ab_test", "no_personalization"),
+    (r"personalization|personalisation|adobe_target",        "personalisation", "no_personalization"),
+
+    # Advertising / retargeting
+    (r"fbp|fbc|fb_id|facebook|meta_pixel|fr$",               "ad_pixel", "no_tracking"),
+    (r"adform|criteo|outbrain|taboola|tapad|adsrvr",         "ad_pixel", "no_tracking"),
+    # IDE and NID are exact cookie names; unanchored they would also hit e.g. "video".
+    (r"doubleclick|test_cookie|^ide$|^nid$|exchange_uid",    "ad_pixel", "no_tracking"),
+    (r"google_ad|gads|gcl",                                  "ad_pixel", "no_tracking"),
+    (r"^li_|linkedin|bcookie|bscookie",                      "ad_pixel", "no_tracking"),
+    (r"pinterest|_pinterest_|_pin_unauth",                   "ad_pixel", "no_tracking"),
+
+    # Affiliate / Conversion
+    (r"conversion|orderid|order_id|transaction|purchase",    "conversion_event", "no_tracking"),
+    (r"campaign|utm|source|medium|term",                     "campaign_attribution", "no_tracking"),
+
+    # ScrollPosition / Form-Helper
+    (r"scroll|position|form_|form_state",                    "ui_state", "no_personalization"),
+
+    # Loadbalancer / Sticky
+    (r"affinity|sticky|lb_|alb-|aws-alb",                    "load_balancer", "site_breaks"),
+
+    # Chat / Support
+    (r"chat|widget|genesys|livechat",                        "chat_session", "no_personalization"),
+
+    # Captcha
+    (r"hcaptcha|recaptcha|cf_|cloudflare",                   "bot_protection", "site_breaks"),
+]
+
+_FUNCTIONAL_LABEL = {
+    "session_id":          "Sitzungs-ID",
+    "auth_token":          "Auth-Token",
+    "csrf_token":          "CSRF-Schutz",
+    "preference":          "Sprache / Region",
+    "ui_preference":       "UI-Praeferenz",
+    "user_data":           "Nutzer-Daten",
+    "consent_state":       "Consent-Speicher",
+    "tracking_id":         "Tracking-ID",
+    "session_recording":   "Session-Recording",
+    "visitor_id":          "Besucher-ID",
+    "ab_test":             "A/B-Test",
+    "personalisation":     "Personalisierung",
+    "ad_pixel":            "Werbe-Pixel",
+    "conversion_event":    "Konversions-Tracking",
+    "campaign_attribution":"Kampagnen-Attribution",
+    "ui_state":            "UI-Zustand (ScrollPos etc.)",
+    "load_balancer":       "Load-Balancer",
+    "chat_session":        "Chat-Session",
+    "bot_protection":      "Bot-Schutz",
+    "unknown":             "Unbekannt",
+}
+
+# Which functional_roles overlap functionally. Used by
+# vendor_redundancy.analyze() to detect real consolidation
+# opportunities instead of merely counting duplicate providers.
+OVERLAPPING_ROLES = {
+    "tracking_id":         "tracking",
+    "session_recording":   "tracking",
+    "ab_test":             "personalisation",
+    "personalisation":     "personalisation",
+    "ad_pixel":            "advertising",
+    "conversion_event":    "advertising",
+    "campaign_attribution":"advertising",
+}
+
+
+def classify_cookie(cookie_name: str) -> tuple[str, str]:
+    """Return (functional_role, blocking_impact) for a cookie name."""
+    n = (cookie_name or "").lower().strip()
+    for pattern, role, impact in _PATTERNS:
+        if re.search(pattern, n):
+            return role, impact
+    return "unknown", "no_tracking"
+
+
+def annotate_vendor_cookies(vendor: dict) -> dict:
+    """Enrich a vendor record with functional_role per cookie."""
+    cookies = vendor.get("cookies") or []
+    annotated = []
+    role_counts: dict[str, int] = {}
+    for c in cookies:
+        role, impact = classify_cookie(c.get("name", ""))
+        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
+        role_counts[role] = role_counts.get(role, 0) + 1
+    return {
+        **vendor,
+        "cookies": annotated,
+        "role_distribution": role_counts,
+        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
+    }
+
+
+def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
+    """Tenant-wide distribution: which functional roles occur how often?"""
+    total: dict[str, int] = {}
+    by_vendor: dict[str, dict[str, int]] = {}
+    for v in vendors:
+        roles = v.get("role_distribution") or {}
+        if not roles and v.get("cookies"):
+            v = annotate_vendor_cookies(v)
+            roles = v["role_distribution"]
+        for r, n in roles.items():
+            total[r] = total.get(r, 0) + n
+        by_vendor[v.get("name", "")] = roles
+    return {
+        "total_per_role": total,
+        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
+        "vendors_per_role": {
+            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
+            for r in total
+        },
+    }
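The first-match-wins lookup used by the classifier can be tried out in isolation. The sketch below is a reduced, self-contained illustration rather than the module itself: it copies three rows of the pattern table (ordered here so that the specific CSRF rule precedes the generic token rule), and the names `PATTERNS`, `classify` and `role_distribution` are stand-ins chosen for this example.

```python
import re

# Reduced, illustrative subset of the pattern table. Rows are checked top to
# bottom, so the specific CSRF rule must come before the generic token rule.
PATTERNS = [
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
]


def classify(cookie_name):
    """Return (functional_role, blocking_impact) of the first matching row."""
    name = (cookie_name or "").lower().strip()
    for pattern, role, impact in PATTERNS:
        if re.search(pattern, name):
            return role, impact
    return "unknown", "no_tracking"


def role_distribution(cookie_names):
    """Count how often each functional role occurs in a list of cookie names."""
    counts = {}
    for cookie_name in cookie_names:
        role, _impact = classify(cookie_name)
        counts[role] = counts.get(role, 0) + 1
    return counts


if __name__ == "__main__":
    names = ["csrf_token", "access_token", "_pk_id.1.abcd", "_pk_ses.1.abcd"]
    for cookie_name in names:
        print(cookie_name, "->", classify(cookie_name))
    print(role_distribution(names))
```

With the two upper rows swapped, "csrf_token" would be reported as auth_token because it contains "token". That is why the table order is part of the classifier's behaviour and not just layout.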
@@ -0,0 +1,608 @@
+"""
+Cookie knowledge database: the maximum extractable knowledge per cookie name.
+
+Per entry we record:
+  - vendor             : vendor setting the cookie (full company name + country of domicile)
+  - exact_purpose      : what the cookie does EXACTLY (not just a category)
+  - data_collected     : which data fields (client ID, timestamp, IP, etc.)
+  - ip_relevant        : is the IP address collected/transmitted?
+  - ip_anonymized      : anonymized by default?
+  - tcf_purpose_ids    : IAB TCF v2.2 purpose IDs (1-11)
+  - iab_vendor_id      : IAB Global Vendor List ID (for TCF sync)
+  - typical_lifetime   : how long the cookie persists
+  - reid_risk          : re-identification risk (low/medium/high)
+  - technical_necessity: required under §25(2) TDDDG? (none/partial/full)
+  - schrems_ii_status  : third-country transfer assessment
+  - eugh_rulings       : relevant CJEU/CNIL/LfDI decisions
+  - eu_alternative_*   : EU cookie/vendor replacement with the same function
+  - notes              : other remarks (avoidance, configuration)
+
+Sources: Cookiepedia, IAB Europe TCF v2.2 Vendor List, Cookiebot DB,
+CNIL Cookies & Trackers Guidelines 2024, EDPB Cookie Guidelines 2/2023,
+DSK-Orientierungshilfe Telemedien 2021, vendors' own documentation.
+
+As of: 2026-05.
+
+Extending: pull requests welcome. For the format see TEMPLATE_ENTRY at
+the end of the file.
+"""
+
+from __future__ import annotations
+
+from typing import TypedDict
+
+
+class CookieKnowledge(TypedDict, total=False):
+    vendor: str
+    vendor_country: str
+    exact_purpose: str
+    data_collected: list[str]
+    ip_relevant: bool
+    ip_anonymized: bool
+    tcf_purpose_ids: list[int]
+    iab_vendor_id: int | None
+    typical_lifetime: str
+    reid_risk: str  # 'low' | 'medium' | 'high'
+    technical_necessity: str  # 'none' | 'partial' | 'full'
+    schrems_ii_status: str
+    eugh_rulings: list[str]
+    eu_alternative_cookies: list[str]
+    eu_alternative_vendor: str
+    notes: str
+
+
+# ─── Google ──────────────────────────────────────────────────────────
+
+_GOOGLE_BASE = {
+    "vendor": "Google LLC", "vendor_country": "US",
+    "schrems_ii_status": "Drittlandtransfer in die USA. Mit DPF "
+                         "(EU-US Data Privacy Framework, 2023) wieder zulaessig, "
+                         "aber bereits Klage NOYB anhaengig (Schrems III). "
+                         "Risiko-Bewertung empfohlen.",
+    "eugh_rulings": [
+        "EuGH C-311/18 (Schrems II, 16.07.2020) — Privacy Shield gekippt",
+        "CNIL SAN-2022-002 (10.02.2022) — Google Analytics ohne Anonymisierung "
+        "unzulaessig",
+        "Datenschutzkonferenz (DSK) Orientierungshilfe 2024 — GA4 mit "
+        "Server-Side-Tagging als Mitigation moeglich",
+    ],
+}
+
+KB: dict[str, CookieKnowledge] = {
+
+    # ─── Google Analytics ─────────────────────────────────────────────
+    "_ga": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "Unterscheidet eindeutig Besucher; persistiert die "
+                         "ueber alle Sessions hinweg gueltige Client-ID.",
+        "data_collected": ["client_id", "first_visit_timestamp", "ip_address"],
+        "ip_relevant": True, "ip_anonymized": False,
+        "tcf_purpose_ids": [8, 10],
+        "iab_vendor_id": 755,
+        "typical_lifetime": "2 Jahre",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "eu_alternative_cookies": ["_pk_id"],
+        "eu_alternative_vendor": "Matomo",
+        "notes": "Mit IP-Anonymisierung (`anonymizeIp: true`) + Server-Side-Tagging "
+                 "DSGVO-konfigurierbar. Ohne diese Massnahmen einwilligungspflichtig.",
+    },
+    "_gid": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "Unterscheidet Besucher innerhalb einer Session "
+                         "(24h-Bucket).",
+        "data_collected": ["session_id", "ip_address"],
+        "ip_relevant": True, "ip_anonymized": False,
+        "tcf_purpose_ids": [8],
+        "iab_vendor_id": 755,
+        "typical_lifetime": "24 Stunden",
+        "reid_risk": "medium",
+        "technical_necessity": "none",
+        "eu_alternative_cookies": ["_pk_ses"],
+        "eu_alternative_vendor": "Matomo",
+    },
+    "_gat": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "Throttling-Cookie — begrenzt Anzahl Requests an "
+                         "Google Analytics pro Sekunde.",
+        "data_collected": ["throttle_flag"],
+        "ip_relevant": False, "ip_anonymized": True,
+        "tcf_purpose_ids": [],
+        "iab_vendor_id": 755,
+        "typical_lifetime": "1 Minute",
+        "reid_risk": "low",
+        "technical_necessity": "none",
+        "notes": "Reines Performance-Steuerungs-Cookie. Trotzdem einwilligungspflichtig "
+                 "da er Teil des GA-Trackings ist.",
+    },
+    "_gat_gtag_UA_": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "gtag.js-spezifisches Throttling, pro Universal-Analytics-Property.",
+        "data_collected": ["throttle_flag"],
+        "ip_relevant": False,
+        "typical_lifetime": "1 Minute",
+        "reid_risk": "low",
+        "technical_necessity": "none",
+        "notes": "Suffix nach `UA_` ist die Property-ID. Pattern-Match noetig.",
+    },
+    "_ga_*": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "GA4-Persistierung — eindeutige Stream-ID + Session-Daten.",
+        "data_collected": ["stream_id", "session_count", "session_start_ts"],
+        "ip_relevant": True, "ip_anonymized": False,
+        "tcf_purpose_ids": [8, 10],
+        "iab_vendor_id": 755,
+        "typical_lifetime": "2 Jahre",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "notes": "GA4-Format. Suffix `_<measurement_id>`. Server-Side-GTM "
+                 "ist die einzige praktikable DSGVO-Mitigation.",
+    },
+    "NID": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "Personalisiert Google-Suche, AdSense, Maps; "
+                         "speichert Praeferenzen + Sicherheits-Token.",
+        "data_collected": ["user_pref_id", "session_id", "security_token"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 755,
+        "typical_lifetime": "6 Monate",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "notes": "Klassischer Google-Werbe-Cookie. Hohe Re-ID-Gefahr da Cross-Site.",
+    },
+    "IDE": {
+        "vendor": "Google LLC (DoubleClick)", "vendor_country": "US",
+        "exact_purpose": "Conversion-Tracking + Werbe-Targeting bei "
+                         "Google Display Network / DoubleClick.",
+        "data_collected": ["doubleclick_id", "ad_interactions"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 755,
+        "typical_lifetime": "13 Monate",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "schrems_ii_status": _GOOGLE_BASE["schrems_ii_status"],
+        "eugh_rulings": _GOOGLE_BASE["eugh_rulings"],
+    },
+    "test_cookie": {
+        **_GOOGLE_BASE,
+        "exact_purpose": "DoubleClick-Probe-Cookie — testet ob Browser Cookies akzeptiert.",
+        "data_collected": ["browser_supports_cookies"],
+        "ip_relevant": False,
+        "typical_lifetime": "15 Minuten",
+        "reid_risk": "low",
+        "technical_necessity": "none",
+    },
+
+    # ─── Meta / Facebook ──────────────────────────────────────────────
+    "_fbp": {
+        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
+        "exact_purpose": "First-Party-Pixel von Facebook/Meta — identifiziert "
+                         "den Browser fuer Werbe-Conversion-Tracking + Ad-Retargeting.",
+        "data_collected": ["browser_id", "first_visit_ts"],
+        "ip_relevant": True, "ip_anonymized": False,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 891,
+        "typical_lifetime": "90 Tage",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "schrems_ii_status": "Daten gehen an Meta US trotz IE-Sitz. "
+                             "Aktuell DPF-abgedeckt, aber Schrems III erwartet.",
+        "eugh_rulings": [
+            "EuGH C-311/18 (Schrems II)",
+            "EDSA Aufsichts-Statement 2023 zu Meta-Pixel",
+            "LDA Bayern Pruefverfuegung 2024",
+        ],
+        "eu_alternative_vendor": "Smart AdServer (Equativ, FR)",
+        "notes": "Conversions API (CAPI) als Server-Side-Variante reduziert Cookie-"
+                 "Abhaengigkeit. Trotzdem werden Daten an Meta gesendet.",
+    },
+    "_fbc": {
+        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
+        "exact_purpose": "Speichert die ClickID einer Meta-Ad-Kampagne; "
+                         "ordnet Conversion dem urspruenglichen Ad-Klick zu.",
+        "data_collected": ["fbclid", "ad_campaign_id"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9],
+        "iab_vendor_id": 891,
+        "typical_lifetime": "90 Tage",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+    },
+    "fr": {
+        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
+        "exact_purpose": "Werbe-Targeting + Cross-Site-Tracking auf "
+                         "Facebook-Plattform.",
+        "data_collected": ["encrypted_user_id", "session_data"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 891,
+        "typical_lifetime": "3 Monate",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+    },
+
+    # ─── Adobe ────────────────────────────────────────────────────────
+    "s_cc": {
+        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
+        "exact_purpose": "Cookie-Test — prueft ob der Browser Cookies "
+                         "akzeptiert (Adobe Analytics Bootstrap).",
+        "data_collected": ["browser_supports_cookies"],
+        "ip_relevant": False,
+        "typical_lifetime": "Session",
+        "reid_risk": "low",
+        "technical_necessity": "partial",
+        "schrems_ii_status": "Adobe-IE-Hosting, jedoch US-Datentransfer fuer "
+                             "Cloud-Services. DPF-abgedeckt.",
+    },
+    "s_sq": {
+        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
+        "exact_purpose": "Speichert den letzten Klick (URL + Position) "
+                         "fuer Click-Map-Reports.",
+        "data_collected": ["last_click_url", "last_click_xy"],
+        "ip_relevant": False,
+        "tcf_purpose_ids": [8],
+        "typical_lifetime": "Session",
+        "reid_risk": "low",
+        "technical_necessity": "none",
+    },
+    "AMCV_": {
+        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
+        "exact_purpose": "Adobe Marketing Cloud Visitor ID — Cross-Tool-ID fuer "
+                         "Analytics + Target + Audience Manager.",
+        "data_collected": ["marketing_cloud_visitor_id", "first_visit_ts"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 8, 9, 10],
+        "typical_lifetime": "2 Jahre",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "notes": "Suffix nach `AMCV_` ist die Org-ID. Klassischer Cross-Tool-Track.",
+    },
+    "mbox": {
+        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
+        "exact_purpose": "Adobe Target — A/B-Testing, Personalisierung, "
+                         "Audience-Targeting.",
+        "data_collected": ["mbox_visitor_id", "experiment_assignments"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "typical_lifetime": "2 Jahre",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+    },
+    "s_target_qa": {
+        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
+        "exact_purpose": "Adobe Target Premium-Feature — QA-Modus fuer Personalisations-Tests.",
+        "data_collected": ["target_qa_session"],
+        "typical_lifetime": "Session",
+        "reid_risk": "low",
+        "technical_necessity": "none",
+        "notes": "Premium-Feature-Cookie. Anwesenheit = Enterprise-Lizenz Indikator.",
+    },
+
+    # ─── Microsoft / Bing ─────────────────────────────────────────────
+    "MUID": {
+        "vendor": "Microsoft Corp.", "vendor_country": "US",
+        "exact_purpose": "Microsoft-User-ID fuer Bing, Microsoft Advertising, "
+                         "Clarity Heatmaps.",
+        "data_collected": ["microsoft_user_id"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 8, 9, 10],
+        "iab_vendor_id": 165,
+        "typical_lifetime": "13 Monate",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "schrems_ii_status": "US-Transfer, DPF-zertifiziert.",
+    },
+    "_uetsid": {
+        "vendor": "Microsoft Corp.", "vendor_country": "US",
+        "exact_purpose": "Universal Event Tracking — Session-Cookie fuer "
+                         "Microsoft Advertising Conversion-Tracking.",
+        "data_collected": ["session_id"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [9],
+        "typical_lifetime": "30 Minuten",
+        "reid_risk": "medium",
+        "technical_necessity": "none",
+    },
+    "_uetvid": {
+        "vendor": "Microsoft Corp.", "vendor_country": "US",
+        "exact_purpose": "Universal Event Tracking — persistente Visitor-ID.",
+        "data_collected": ["visitor_id"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9],
+        "typical_lifetime": "13 Monate",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+    },
+
+    # ─── LinkedIn ─────────────────────────────────────────────────────
+    "bcookie": {
+        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
+        "exact_purpose": "Browser-Identifikation; wird genutzt fuer Anmelde-"
+                         "Vorgang + LinkedIn Insight-Tag-Tracking.",
+        "data_collected": ["browser_id"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 8, 9],
+        "iab_vendor_id": 14,
+        "typical_lifetime": "1 Jahr",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "schrems_ii_status": "US-Datenuebermittlung (Mutter Microsoft).",
+    },
+    "lidc": {
+        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
+        "exact_purpose": "Daten-Routing zwischen LinkedIn-Datacentern.",
+        "data_collected": ["routing_id"],
+        "ip_relevant": True,
+        "typical_lifetime": "1 Tag",
+        "reid_risk": "low",
+        "technical_necessity": "partial",
+    },
+    "li_gc": {
+        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
+        "exact_purpose": "Speichert die Cookie-Einwilligung fuer LinkedIn-Embeds.",
+        "data_collected": ["consent_state"],
+        "ip_relevant": False,
+        "typical_lifetime": "6 Monate",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+    },
+
+    # ─── Matomo (EU-Alternative) ──────────────────────────────────────
+    "_pk_id": {
+        "vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
+        "exact_purpose": "Matomo Visitor-ID — Pendant zu _ga, aber DSGVO-konform "
+                         "wenn IP-Anonymisierung aktiv.",
+        "data_collected": ["visitor_id", "first_visit_ts"],
+        "ip_relevant": True, "ip_anonymized": True,
+        "tcf_purpose_ids": [8],
+        "typical_lifetime": "13 Monate",
+        "reid_risk": "low",  # with anonymization enabled
+        "technical_necessity": "none",
+        "schrems_ii_status": "Bei On-Premise-Deployment KEIN Drittlandtransfer. "
+                             "Matomo Cloud EU mit Frankfurt-Hosting verfuegbar.",
+        "notes": "Empfohlener Drop-in-Ersatz fuer Google Analytics.",
+    },
+    "_pk_ses": {
+        "vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
+        "exact_purpose": "Matomo Session-Cookie.",
+        "data_collected": ["session_id"],
+        "ip_relevant": False,
+        "typical_lifetime": "30 Minuten",
+        "reid_risk": "low",
+        "technical_necessity": "none",
+    },
+
+    # ─── Captcha ──────────────────────────────────────────────────────
+    "hcaptcha": {
+        "vendor": "Intuition Machines Inc. (hCaptcha)", "vendor_country": "US",
+        "exact_purpose": "hCaptcha Bot-Erkennung — Session-Token fuer Challenge-Solver.",
+        "data_collected": ["bot_score", "session_id", "ip_address"],
+        "ip_relevant": True,
+        "typical_lifetime": "Session",
+        "reid_risk": "medium",
+        "technical_necessity": "full",
+        "schrems_ii_status": "US-Transfer (Intuition Machines, San Francisco).",
+        "eu_alternative_vendor": "Friendly Captcha (DE), Turnstile (Cloudflare EU)",
+        "notes": "Technisch erforderlich fuer Bot-Schutz, aber EU-Alternativen "
+                 "ohne Drittland-Risiko verfuegbar.",
+    },
+    "cf_clearance": {
+        "vendor": "Cloudflare Inc.", "vendor_country": "US",
+        "exact_purpose": "Cloudflare-Bot-Management Pro — bestaetigt dass User "
+                         "die JS-Challenge bestanden hat.",
+        "data_collected": ["challenge_token"],
+        "ip_relevant": True,
+        "typical_lifetime": "30 Minuten",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+        "notes": "Premium-Feature-Cookie. Anwesenheit = Cloudflare Bot-Management "
+                 "Pro im Einsatz.",
+    },
+
+    # ─── CDN / Performance ────────────────────────────────────────────
+    "__cf_bm": {
+        "vendor": "Cloudflare Inc.", "vendor_country": "US",
+        "exact_purpose": "Cloudflare Bot Management — basis Bot-Erkennung.",
+        "data_collected": ["bot_score", "client_hash"],
+        "ip_relevant": True,
+        "typical_lifetime": "30 Minuten",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+        "notes": "Strictly necessary nach §25(2) TDDDG (Sicherheit). Keine Einwilligung noetig.",
+    },
+    "aws-alb": {
+        "vendor": "Amazon Web Services Inc.", "vendor_country": "US",
+        "exact_purpose": "AWS Application Load Balancer Sticky Sessions — "
+                         "routet Anfragen konsistent an dieselbe Backend-Instanz.",
+        "data_collected": ["target_instance_id"],
+        "ip_relevant": False,
+        "typical_lifetime": "1 Stunde",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+        "schrems_ii_status": "AWS Frankfurt verfuegbar — bei korrektem Region-Setup "
+                             "kein US-Transfer.",
+    },
+
+    # ─── Retargeting / Advertising ────────────────────────────────────
+    "_pin_unauth": {
+        "vendor": "Pinterest Europe Ltd.", "vendor_country": "IE/US",
+        "exact_purpose": "Pinterest Tag — Conversion-Tracking + Audience-Aufbau.",
+        "data_collected": ["pinterest_user_id"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 762,
+        "typical_lifetime": "1 Jahr",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+    },
+    "cto_dna": {
+        "vendor": "Criteo S.A.", "vendor_country": "FR",
+        "exact_purpose": "Criteo Dynamic Retargeting — produktspezifische "
+                         "Werbeauslieferung basierend auf Browser-History.",
+        "data_collected": ["criteo_user_id", "product_views"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 91,
+        "typical_lifetime": "13 Monate",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "schrems_ii_status": "Criteo ist FR-basiert, aber Daten gehen auch in die USA. "
+                             "Multi-Region-Setup pruefen.",
+        "notes": "Trotz FR-Sitz nicht automatisch DSGVO-konform — CNIL-Bussgeld "
+                 "EUR 40M (Juni 2023) u.a. wegen fehlenden Einwilligungs-Nachweises.",
+    },
+    "afm": {
+        "vendor": "Adform A/S", "vendor_country": "DK",
+        "exact_purpose": "Adform Audience Matching — Cross-Device-Identifikation "
+                         "fuer programmatische Werbung.",
+        "data_collected": ["adform_user_id", "device_signals"],
+        "ip_relevant": True,
+        "tcf_purpose_ids": [4, 9, 10],
+        "iab_vendor_id": 50,
+        "typical_lifetime": "30 Tage",
+        "reid_risk": "high",
+        "technical_necessity": "none",
+        "schrems_ii_status": "Adform ist DK-basiert, EU-Hosting Standard. Keine "
+                             "Schrems-II-Probleme bei Standard-Setup.",
+    },
+
+    # ─── Consent / Funktional (Strictly Necessary) ────────────────────
+    "JSESSIONID": {
+        "vendor": "Java EE / Tomcat (Site-Software)", "vendor_country": "N/A",
+        "exact_purpose": "Server-Session-ID fuer Java-basierte Anwendungen.",
+        "data_collected": ["session_id"],
+        "ip_relevant": False,
+        "typical_lifetime": "Session",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+        "notes": "Strictly necessary. Keine Einwilligung noetig nach §25(2) TDDDG.",
+    },
+    "PHPSESSID": {
+        "vendor": "PHP (Site-Software)", "vendor_country": "N/A",
+        "exact_purpose": "PHP-Session-ID — serverseitige Sitzungs-Persistierung.",
+        "data_collected": ["session_id"],
+        "ip_relevant": False,
+        "typical_lifetime": "Session",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+    },
+    "cookie_consent": {
+        "vendor": "BreakPilot Consent-Banner (own)", "vendor_country": "DE",
+        "exact_purpose": "Speichert die Einwilligungsentscheidung des Nutzers "
+                         "pro Kategorie.",
+        "data_collected": ["consent_state_per_category", "timestamp"],
+        "ip_relevant": False,
+        "typical_lifetime": "180 Tage",
+        "reid_risk": "low",
+        "technical_necessity": "full",
+        "notes": "Strictly necessary nach Art. 7(1) DSGVO Nachweispflicht.",
+    },
+
+    # ─── Templated / pattern-based entries (variable suffix) ──────────
+    # Reached through _PATTERN_LOOKUPS below: the regex maps a concrete name to this key.
+    "_uet_": {
+        "vendor": "Microsoft Corp.", "vendor_country": "US",
+        "exact_purpose": "Microsoft Universal Event Tracking (z.B. _uetsid, _uetvid).",
+        "data_collected": ["event_id"],
+        "ip_relevant": True,
+        "typical_lifetime": "30 Minuten",
+        "reid_risk": "medium",
+        "technical_necessity": "none",
+    },
+}
+
+
+# ─── Pattern lookup for cookies with a variable suffix ─────────────
+
+_PATTERN_LOOKUPS: list[tuple[str, str]] = [
+    (r"^_ga_[A-Z0-9_]+$",     "_ga_*"),
+    (r"^_gat_gtag_UA_",       "_gat_gtag_UA_"),
+    (r"^AMCV_",               "AMCV_"),
+    (r"^_uet[a-z]+",          "_uet_"),
+    (r"^(?:AWSALB|aws-alb)",  "aws-alb"),  # AWS sets AWSALB, AWSALBCORS, AWSALBTG
+    (r"^_pk_id\.",            "_pk_id"),
+    (r"^_pk_ses\.",           "_pk_ses"),
+]
+
+
+def lookup_cookie(name: str) -> CookieKnowledge | None:
+    """Return rich knowledge for a cookie name, or None if unknown."""
+    import re
+    if not name:
+        return None
+    # Direct hit
+    if name in KB:
+        return KB[name]
+    # Pattern-based
+    for pattern, kb_key in _PATTERN_LOOKUPS:
+        if re.search(pattern, name):
+            return KB.get(kb_key)
+    # Strip common suffixes (.bmw.de, .domain etc.)
+    base = name.split(".", 1)[0]
+    if base != name and base in KB:
+        return KB[base]
+    return None
+
+
+def enrich_vendor_with_knowledge(vendor: dict) -> dict:
+    """Add per-cookie knowledge to each cookie in vendor['cookies']."""
+    cookies = vendor.get("cookies") or []
+    enriched = []
+    for c in cookies:
+        info = lookup_cookie(c.get("name", ""))
+        if info:
+            enriched.append({**c, "knowledge": info})
+        else:
+            enriched.append(c)
+    return {**vendor, "cookies": enriched}
+
+
+# ─── Helpers for aggregate reports ──────────────────────────────────
+
+def summarize_compliance_risk(vendor: dict) -> dict:
+    """Aggregate re-ID risk and third-country status across all cookies of a vendor."""
+    cookies = vendor.get("cookies") or []
+    risk_counts = {"high": 0, "medium": 0, "low": 0}
+    schrems_affected = 0
+    technical_only = 0
+    for c in cookies:
+        k = c.get("knowledge") or lookup_cookie(c.get("name", ""))
+        if not k:
+            continue
+        risk = k.get("reid_risk", "low")
+        risk_counts[risk] = risk_counts.get(risk, 0) + 1
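+        countries = {c.strip() for c in k.get("vendor_country", "").upper().split("/")}
+        status = k.get("schrems_ii_status", "").lower()
+        # A status such as "Keine Schrems-II-Probleme" mentions Schrems but is not a hit.
+        if "US" in countries or ("schrems" in status and "keine schrems" not in status):
+            schrems_affected += 1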
+        if k.get("technical_necessity") == "full":
+            technical_only += 1
+    return {
+        "reid_risk_distribution": risk_counts,
+        "high_risk_cookie_count": risk_counts["high"],
+        "schrems_ii_affected_cookies": schrems_affected,
+        "strictly_necessary_cookies": technical_only,
+        "total_classified": sum(risk_counts.values()),
+    }
+
+
+# ─── TEMPLATE_ENTRY (fuer Kuratierung neuer Cookies) ───────────────
+
+TEMPLATE_ENTRY: CookieKnowledge = {
+    "vendor": "<Voller Firmenname>",
+    "vendor_country": "<ISO-2 Code oder 'IE/US' bei Doppel-Standort>",
+    "exact_purpose": "<1-2 Saetze was der Cookie GENAU tut>",
+    "data_collected": ["<feldname_1>", "<feldname_2>"],
+    "ip_relevant": False,
+    "ip_anonymized": False,
+    "tcf_purpose_ids": [],   # TCF v2.2: 1-11
+    "iab_vendor_id": None,   # Aus https://iabeurope.eu/tcf-vendor-list/
+    "typical_lifetime": "<Session | XX Tage | XX Monate | XX Jahre>",
+    "reid_risk": "low",      # low | medium | high
+    "technical_necessity": "none",  # none | partial | full
+    "schrems_ii_status": "<Drittlandtransfer-Bewertung>",
+    "eugh_rulings": [],
+    "eu_alternative_cookies": [],
+    "eu_alternative_vendor": "",
+    "notes": "",
+}
@@ -220,10 +220,16 @@ def score_vendors(vendors: list[dict]) -> list[dict]:
            flags.append("no_purpose")

        # Country — only for external processors / controllers
+        # If country is empty, infer it from the company name (see _country_from_name).
        if country_required:
            max_score += 10
            if v.get("country"):
                score += 10
+            elif inferred := _country_from_name(v.get("name", "")):
+                # Country derived from the company name; flagged so reports can tell.
+                v["country"] = inferred
+                v["country_inferred"] = True
+                score += 10
            else:
                flags.append("no_country")

@@ -321,3 +327,153 @@ def build_check_items(validated: list[LinkCheck]) -> list[dict]:
            "hint": hint,
        })
    return items
+
+
+# ─── Country inference from the legal-form suffix ──────────────────
+#
+# If a vendor's "country" field is empty, it can often be derived from
+# the company name:
+#   Adform A/S          → DK (Denmark, Aktieselskab)
+#   Pinterest Europe Ltd. → IE (via known-vendor override; "Ltd." alone maps to GB)
+#   Salesforce Inc.     → US (Incorporated)
+#   Adobe ... Ireland Limited → IE (country name is checked before the suffix)
+#   Genesys ... B.V.    → NL (Netherlands, Besloten Vennootschap)
+#   Equativ S.A.        → FR (Société Anonyme)
+#   SAP SE              → DE (Societas Europaea, usually registered in DE)
+#
+# Lookup order in _country_from_name (most specific first):
+#   1) Known vendor override (Google LLC / Meta Platforms Ireland Ltd ...)
+#   2) Country name in the company name ('Ireland', 'Deutschland')
+#   3) Legal-form suffix pattern
+
+import re as _re
+
+_SUFFIX_COUNTRY: list[tuple[str, str]] = [
+    # Pattern → ISO code; first match wins. Dotted forms end in (?!\w) because \b fails after a final "."
+    (r"\bA/S\b",                          "DK"),  # Aktieselskab
+    (r"\bApS\b",                          "DK"),  # Anpartsselskab
+    (r"\bAB\b",                           "SE"),  # Aktiebolag
+    (r"\bAS\b",                           "NO"),  # Aksjeselskap
+    (r"\bOy\b",                           "FI"),  # Osakeyhtiö
+    (r"\bAG\b",                           "DE"),  # also possible for CH/AT, default DE
+    (r"\bGmbH\b",                         "DE"),
+    (r"\bUG\b",                           "DE"),
+    (r"\beG\b",                           "DE"),
+    (r"\bKG\b",                           "DE"),
+    (r"\bOHG\b",                          "DE"),
+    (r"\bSE\b",                           "DE"),  # Societas Europaea (e.g. SAP SE), default DE
+    (r"\bS\.A\.\sde\sC\.V\.(?!\w)",       "MX"),  # must precede the plain S.A. pattern
+    (r"\bS\.A\.(?!\w)",                   "FR"),  # also used in ES/BE/LU, default FR
+    (r"\bSAS\b",                          "FR"),
+    (r"\bS\.A\.S\.(?!\w)",                "FR"),
+    (r"\bSARL\b",                         "FR"),
+    (r"\bS\.r\.l\.(?!\w)",                "IT"),
+    (r"\bS\.p\.A\.(?!\w)",                "IT"),
+    (r"\bSpA\b",                          "IT"),
+    (r"\bB\.V\.(?!\w)",                   "NL"),
+    (r"\bN\.V\.(?!\w)",                   "NL"),
+    (r"\bSL\b",                           "ES"),
+    (r"\bd\.o\.o\.(?!\w)",                "SI"),  # Slovenia
+    (r"\bd\.d\.(?!\w)",                   "HR"),  # Croatia
+    (r"\bz\s?o\.o\.(?!\w)",               "PL"),
+    (r"\bInc\.?\b",                       "US"),
+    (r"\bIncorporated\b",                 "US"),
+    (r"\bCorp\.?\b",                      "US"),
+    (r"\bCorporation\b",                  "US"),
+    (r"\bLLC\b",                          "US"),
+    (r"\bL\.L\.C\.(?!\w)",                "US"),
+    (r"\bPte\.?\sLtd\.?\b",               "SG"),  # must precede the plain Ltd pattern
+    (r"\bPty\b",                          "AU"),  # "Pty Ltd": must precede the plain Ltd pattern
+    (r"\bLtd\.?\b",                       "GB"),  # UK Limited, default
+    (r"\bLimited\b",                      "GB"),
+    (r"\bPLC\b",                          "GB"),
+    (r"\bK\.K\.(?!\w)",                   "JP"),  # Kabushiki-Kaisha
+]
+
+# Country names inside the company name (e.g. "Adobe Systems Software Ireland Limited")
+_COUNTRY_NAME_TOKENS: list[tuple[str, str]] = [
+    ("ireland",          "IE"),
+    ("deutschland",      "DE"),
+    ("germany",          "DE"),
+    ("netherlands",      "NL"),
+    ("france",           "FR"),
+    ("united kingdom",   "GB"),
+    ("uk",               "GB"),
+    ("usa",              "US"),
+    ("united states",    "US"),
+    ("austria",          "AT"),
+    ("oesterreich",      "AT"),
+    ("schweiz",          "CH"),
+    ("switzerland",      "CH"),
+    ("luxembourg",       "LU"),
+    ("luxemburg",        "LU"),
+    ("denmark",          "DK"),
+    ("daenemark",        "DK"),
+    ("sweden",           "SE"),
+    ("schweden",         "SE"),
+    ("norway",           "NO"),
+    ("norwegen",         "NO"),
+    ("finland",          "FI"),
+    ("finnland",         "FI"),
+]
+
+# Known vendors with an unambiguous seat (override)
+_KNOWN_VENDOR_COUNTRY: dict[str, str] = {
+    "google inc":                      "US",
+    "google llc":                      "US",
+    "google ireland":                  "IE",
+    "meta platforms ireland":          "IE",
+    "facebook ireland":                "IE",
+    "amazon.com inc":                  "US",
+    "amazon web services":             "US",
+    "amazon web services inc":         "US",
+    "linkedin inc":                    "US",
+    "salesforce inc":                  "US",
+    "salesforce.com":                  "US",
+    "outbrain inc":                    "US",
+    "taboola inc":                     "US",
+    "pinterest europe ltd":            "IE",
+    "intuition machines inc":          "US",
+    "akamai technologies inc":         "US",
+    "criteo s.a":                      "FR",
+    "criteo sa":                       "FR",
+    "adform a/s":                      "DK",
+    "speedcurve limited":              "GB",
+    "longtail ad solutions":           "US",
+    "genesys cloud services b.v":      "NL",
+    "qualtrics":                       "US",
+    "teads sa":                        "FR",
+    "teads s.a":                       "FR",
+    "salesviewer gmbh":                "DE",
+    "baqend gmbh":                     "DE",
+    "zenweshare sas":                  "FR",
+    "nayoki gmbh":                     "DE",
+    "psyma":                           "DE",
+    "matomo":                          "NZ",   # InnoCraft NZ aber EU-hostbar
+    "adobe systems software ireland":  "IE",
+    "microsoft corporation":           "US",
+    "microsoft corp":                  "US",
+}
+
+
+def _country_from_name(vendor_name: str) -> str:
+    """Best effort: derive the ISO-2 country code from the vendor name."""
+    if not vendor_name:
+        return ""
+    # Vendor names often look like "<company> — <tool>"; only inspect the company part
+    firm = vendor_name.split(" — ")[0].strip()
+    firm_l = firm.lower()
+
+    # 1) Known vendor lookup (most specific)
+    for k, v in _KNOWN_VENDOR_COUNTRY.items():
+        if k in firm_l:
+            return v
+    # 2) Country name in the company name (whole words only, so "uk" misses "Suzuki")
+    for token, code in _COUNTRY_NAME_TOKENS:
+        if _re.search(rf"\b{_re.escape(token)}\b", firm_l):
+            return code
+    # 3) Legal-form suffix
+    for pattern, code in _SUFFIX_COUNTRY:
+        if _re.search(pattern, firm):
+            return code
+    return ""
@@ -0,0 +1,350 @@
+"""
+Doc anchor locator: for a given finding, find the best insertion point in
+the existing document.
+
+Primary strategy: BGE-M3 embedding match between a per-finding anchor
+query and every paragraph of the doc. This is real semantic matching
+(BMW writes "Verarbeiter" instead of "Auftragsverarbeiter"; a keyword
+match would miss that, the embedding catches it).
+
+Fallback: static section hint (if no embedding service is reachable).
+
+Output per anchor:
+  - anchor_phrase     : excerpt of the original text (None without a match)
+  - position_hint     : "Nach Absatz X von Y: '...'" or the static section hint
+  - confidence        : 'high' | 'medium' | 'low'
+  - score             : float (cosine similarity plus heading bonus; absent for 'fallback')
+  - method            : 'embedding' | 'embedding-no-match' | 'fallback'
+"""
+
+from __future__ import annotations
+
+import logging
+import math
+import os
+import re
+import threading
+from typing import Iterable
+
+import httpx
+
+logger = logging.getLogger(__name__)
+
+EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
+
+# One semantically rich anchor query per finding type, which the embedding
+# matcher runs against the doc text. Richer than the short MC check_question.
+# It does NOT look for the mandatory text itself (which is missing) but for the
+# paragraph the fix belongs IN, i.e. the thematically related context.
+_ANCHOR_QUERIES: list[tuple[str, str, str]] = [
+    # (finding_label_partial, anchor_query, fallback_hint)
+    (
+        "Auftragsverarbeiter erwaehnt",
+        "Empfaenger der Daten Verarbeiter Dienstleister Cloud Hosting CRM "
+        "Auftragsverarbeitung Weitergabe Datenuebermittlung an Dritte",
+        "Im Abschnitt 'Empfaenger' oder 'Datenuebermittlung'",
+    ),
+    (
+        "Automatisierte Entscheidungen",
+        "Betroffenenrechte automatisierte Entscheidung Profiling Logik "
+        "Tragweite Auswirkung Art. 22 DSGVO",
+        "Am Ende des Abschnitts 'Betroffenenrechte'",
+    ),
+    (
+        "Konkrete Aufsichtsbehoerde",
+        "Beschwerderecht Datenschutzaufsicht Aufsichtsbehoerde Beschwerde "
+        "bei der Behoerde einreichen Recht auf Beschwerde",
+        "Im Abschnitt 'Beschwerderecht'",
+    ),
+    (
+        "Angemessenheitsbeschluss",
+        "Drittlandtransfer USA Standardvertragsklauseln SCC DPF Data Privacy "
+        "Framework Angemessenheitsbeschluss internationale Datenuebermittlung",
+        "Im Abschnitt 'Drittlandtransfer'",
+    ),
+    (
+        "Anschrift des Verantwortlichen",
+        "Verantwortlicher Verantwortliche Stelle Datenschutz Betreiber dieser "
+        "Website Firma Anschrift Kontakt",
+        "Am Anfang der Datenschutzerklaerung / Cookie-Richtlinie",
+    ),
+    (
+        "Konkrete Cookie-Namen",
+        "Welche Cookies verwenden wir Cookie-Tabelle Liste der Cookies "
+        "Cookie-Kategorien Auflistung der Cookies Name Anbieter Zweck",
+        "Im Abschnitt 'Welche Cookies verwenden wir?'",
+    ),
+    (
+        "Konkrete Anbieter/Dienste",
+        "Drittanbieter Dienste Anbieter wir nutzen folgende Dienste "
+        "Empfaenger der Cookie-Daten Liste der Dienstleister",
+        "In der Drittanbieter-Liste der Cookie-Richtlinie",
+    ),
+    (
+        "Analytics-/Statistik-Tools konkret benannt",
+        "Statistik Analytics Reichweitenmessung Webanalyse Tracking "
+        "Google Analytics Matomo Adobe Analytics",
+        "Im Abschnitt 'Statistik / Analyse-Cookies'",
+    ),
+    (
+        "Konkrete Speicherdauer",
+        "Speicherdauer Lebensdauer wie lange Ablauf Cookie-Tabelle Spalte "
+        "Speicherdauer pro Cookie",
+        "In der Cookie-Tabelle pro Eintrag",
+    ),
+    (
+        "Opt-Out-Links",
+        "Widerruf widersprechen deaktivieren Cookie-Einstellungen aendern "
+        "Opt-Out Einstellungen anpassen",
+        "Im Abschnitt 'Wie kann ich widersprechen?'",
+    ),
+    (
+        "Privacy-Policy-Links",
+        "Datenschutzerklaerung des Drittanbieters Privacy Policy Link auf "
+        "Datenschutzhinweise der Drittanbieter",
+        "Im Drittanbieter-Listing der Cookie-Richtlinie",
+    ),
+    (
+        "Verbraucherstreitbeilegung",
+        "Online-Streitbeilegung OS-Plattform Verbraucherschlichtungsstelle "
+        "Streitbeilegung Verbraucher",
+        "Am Ende des Impressums (eigener Abschnitt 'Streitbeilegung')",
+    ),
+    (
+        "Rechtswidriger Haftungsausschluss",
+        "Haftung Disclaimer wir distanzieren uns Links zu externen Webseiten "
+        "Haftungsausschluss Drittinhalte",
+        "Am Ende des Impressums (Disclaimer-Absatz)",
+    ),
+    (
+        "Name der vertretungsberechtigten",
+        "Vertreten durch Vorstand Geschaeftsfuehrung Geschaeftsfuehrer "
+        "vertretungsberechtigt Repraesentant",
+        "Im Impressum nach Firmenname + Anschrift",
+    ),
+    (
+        "Zustaendige Kammer",
+        "Berufsrecht Kammer Aufsichtsbehoerde berufsrechtliche Angaben "
+        "zustaendige Kammer",
+        "Im Impressum im Abschnitt 'Berufsrechtliche Angaben'",
+    ),
+    (
+        "Drittlaender",
+        "Drittland Drittlaender USA Indien China internationale Datenuebermittlung "
+        "Datenexport in Nicht-EU-Staaten",
+        "Im Abschnitt 'Drittlandtransfer'",
+    ),
+    (
+        "Schutzgarantien",
+        "Schutzgarantien Schutzvorkehrungen Schutzmassnahmen SCC "
+        "Standardvertragsklauseln einsehen Anforderung",
+        "Im Abschnitt 'Drittlandtransfer / Schutzgarantien'",
+    ),
+]
+
+
+# ─── Thread-local cache for doc chunks + embeddings ────────────────
+# Per compliance-check run, the doc texts are embedded once and kept in
+# thread-local storage, so that several findings in the same run do not
+# each trigger a new embedding call.
+
+_tls = threading.local()
+
+
+def _get_cache() -> dict:
+    if not hasattr(_tls, "cache"):
+        _tls.cache = {}
+    return _tls.cache
+
+
+def reset_cache() -> None:
+    """Clear the per-request cache (call this at the start of every doc-check
+    run so data from a previous run cannot leak into the next one)."""
+    if hasattr(_tls, "cache"):
+        _tls.cache = {}
+
+
+# ─── Helfer ────────────────────────────────────────────────────────
+
+def _normalize(text: str) -> str:
+    return (text or "").lower().replace("\xad", "").replace("ß", "ss")
+
+
+def _split_paragraphs(text: str) -> list[str]:
+    """Split a doc into paragraphs (by double newline, fallback single)."""
+    if not text:
+        return []
+    paras = re.split(r"\n\s*\n", text)
+    if len(paras) < 3:
+        paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
+    return [p.strip() for p in paras if p.strip()]
+
+
+def _embed_sync(texts: list[str], timeout: float = 60.0,
+                batch_size: int = 32) -> list[list[float]]:
+    """Synchronous batch embed call (anchor localisation runs inside the
+    sync HTML render, not in an async context)."""
+    if not texts:
+        return []
+    out: list[list[float]] = []
+    with httpx.Client(timeout=timeout) as client:
+        for i in range(0, len(texts), batch_size):
+            batch = texts[i:i + batch_size]
+            try:
+                r = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
+                r.raise_for_status()
+                vecs = r.json().get("embeddings") or []  # expected: one vector per text
+                out.extend(vecs if len(vecs) == len(batch) else [[] for _ in batch])
+            except Exception as e:
+                logger.warning("Anchor embed sub-batch [%d-%d] failed: %s", i, i + len(batch), e)
+                out.extend([[] for _ in batch])
+    return out
+
+
+def _cosine(a: list[float], b: list[float]) -> float:
+    if not a or not b or len(a) != len(b):
+        return 0.0
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    if na == 0 or nb == 0:
+        return 0.0
+    return dot / (na * nb)
+
+
+def _doc_paragraphs_and_vectors(
+    doc_id: str, doc_text: str,
+) -> tuple[list[str], list[list[float]]]:
+    """Lazy cache: paragraphs + vectors per doc ID. Computed exactly once per
+    doc and run."""
+    cache = _get_cache()
+    if doc_id in cache:
+        return cache[doc_id]
+
+    paras = _split_paragraphs(doc_text)
+    if not paras:
+        cache[doc_id] = ([], [])
+        return cache[doc_id]
+
+    vecs = _embed_sync(paras)
+    cache[doc_id] = (paras, vecs)
+    return cache[doc_id]
+
+
+def _keyword_fallback(fl: str, doc_text: str) -> dict | None:
+    """Fallback when the embedding service is down: return only the static section hint."""
+    # No keyword search happens here; doc_text is unused and kept for call-site symmetry.
+    for label_partial, _query, fallback_hint in _ANCHOR_QUERIES:
+        if _normalize(label_partial) in fl:
+            return {
+                "anchor_phrase": None,
+                "position_hint": fallback_hint,
+                "confidence": "low",
+                "method": "fallback",
+            }
+    return None
+
+
+def locate_anchor(
+    finding_label: str,
+    doc_text: str,
+    doc_id: str | None = None,
+) -> dict | None:
+    """Find the best-matching anchor paragraph in the doc text for a finding.
+
+    Primary: embedding match (cosine similarity plus a small heading bonus).
+    Fallback: the static section hint when the embedding service is unreachable.
+
+    `doc_id` is a cache key (e.g. doc_type or url). If None, it is derived
+    from a hash of doc_text.
+    """
+    if not doc_text or not finding_label:
+        return None
+
+    fl = _normalize(finding_label)
+
+    # Welche Anchor-Query matched dieses Finding?
+    query = None
+    fallback_hint = None
+    matched_label = None
+    for label_partial, q, fb in _ANCHOR_QUERIES:
+        if _normalize(label_partial) in fl:
+            query, fallback_hint, matched_label = q, fb, label_partial
+            break
+    if not query:
+        return None
+
+    doc_id = doc_id or f"doc-{hash(doc_text) & 0xffffffff:08x}"
+
+    # 1) Embedding-Match
+    paras, doc_vecs = _doc_paragraphs_and_vectors(doc_id, doc_text)
+    if not paras:
+        return None
+
+    embeddings_available = any(v for v in doc_vecs)
+    if not embeddings_available:
+        return _keyword_fallback(fl, doc_text)
+
+    try:
+        q_vec = _embed_sync([query])[0] if query else None
+    except Exception:
+        q_vec = None
+
+    if not q_vec:
+        return _keyword_fallback(fl, doc_text)
+
+    # Per-Absatz Score = cosine + Heading-Bonus
+    best_idx = -1
+    best_score = 0.0
+    for i, (p, dv) in enumerate(zip(paras, doc_vecs)):
+        if not dv:
+            continue
+        sim = _cosine(q_vec, dv)
+        # Heading bonus: short paragraphs (8 words or fewer) or a Markdown heading marker
+        if len(p.split()) <= 8 or p.strip().startswith("#"):
+            sim += 0.05
+        if sim > best_score:
+            best_score = sim
+            best_idx = i
+
+    # Confidence thresholds, calibrated on the BMW run
+    if best_idx < 0 or best_score < 0.40:
+        # Match too weak: return the static fallback hint
+        return {
+            "anchor_phrase": None,
+            "position_hint": fallback_hint,
+            "confidence": "low",
+            "score": round(best_score, 3) if best_idx >= 0 else 0,
+            "method": "embedding-no-match",
+        }
+
+    if best_score >= 0.62:
+        confidence = "high"
+    elif best_score >= 0.50:
+        confidence = "medium"
+    else:
+        confidence = "low"
+
+    anchor = paras[best_idx]
+    words = anchor.split()
+    snippet = " ".join(words[:30]) + ("…" if len(words) > 30 else "")
+    return {
+        "anchor_phrase": snippet,
+        "anchor_index": best_idx,
+        "total_paragraphs": len(paras),
+        "position_hint": f"Nach Absatz {best_idx + 1} von {len(paras)}: '{snippet}'",
+        "confidence": confidence,
+        "score": round(best_score, 3),
+        "method": "embedding",
+    }
+
+
+def annotate_findings_with_anchors(
+    findings: Iterable[dict], doc_text: str, doc_id: str | None = None,
+) -> list[dict]:
+    """Locate the anchor for each finding and add it to the dict as 'anchor'."""
+    out = []
+    for f in findings:
+        a = locate_anchor(f.get("label") or f.get("title") or "", doc_text, doc_id)
+        out.append({**f, "anchor": a})
+    return out
@@ -0,0 +1,353 @@
+"""
+Action recipes. Each finding type gets an actionable instruction:
+WHAT to do, WHY (legal basis), HOW to phrase it (concrete text block),
+WHERE to insert it (doc section hint).
+
+Without a recipe, a finding is only a diagnosis. With a recipe, the
+customer knows right away which sentence belongs in which place.
+
+Usage:
+  from compliance.services.finding_action_recipes import recipe_for
+  rec = recipe_for("no_cookies_listed")   # → dict with what/why/fix_text/where/example
+"""
+
+from __future__ import annotations
+
+from typing import TypedDict
+
+
+class ActionRecipe(TypedDict, total=False):
+    what: str          # one-sentence diagnosis
+    why: str           # legal basis / risk
+    fix_text: str      # concrete text block to insert
+    where: str         # which doc section
+    example: str       # real-world example
+    severity: str      # 'critical' | 'high' | 'medium' | 'low'
+
+
+# ─── Vendor-/Cookie-Findings (im VVT-Block) ────────────────────────
+
+VENDOR_FINDINGS: dict[str, ActionRecipe] = {
+
+    "no_cookies_listed": {
+        "what": "Anbieter ist erfasst, aber es sind keine konkreten Cookies "
+                "dokumentiert.",
+        "why": "Die DSK-Orientierungshilfe Telemedien 2024 fordert pro Anbieter "
+               "eine Auflistung aller gesetzten Cookies mit Name + Zweck + "
+               "Speicherdauer. 'Verwendet Cookies' ohne Details erfuellt "
+               "Art. 13 Abs. 1 lit. e DSGVO nicht.",
+        "fix_text": "Ergaenzen Sie pro Cookie eine Zeile mit folgenden Feldern:\n"
+                    "  • Cookie-Name (z.B. _ga, _fbp, NID)\n"
+                    "  • Setzender Anbieter (Firma + Sitzland)\n"
+                    "  • Zweck (1 Satz, z.B. 'Reichweitenmessung pseudonym')\n"
+                    "  • Speicherdauer (z.B. '2 Jahre', '24h', 'Session')",
+        "where": "Cookie-Richtlinie unter dem Abschnitt 'Kategorien' "
+                 "(Notwendig / Marketing / Statistik / ...).",
+        "example": "_ga (Google Ireland Ltd.) — Statistik — eindeutige "
+                   "Besucher-ID — Speicherdauer 2 Jahre",
+        "severity": "high",
+    },
+
+    "no_country": {
+        "what": "Anbieter-Sitzland ist nicht dokumentiert.",
+        "why": "Art. 30 Abs. 1 lit. d DSGVO fordert die Kategorien der Empfaenger "
+               "inkl. Sitz. Bei Drittland-Empfaengern (US, IN, CN ...) ist das "
+               "zusaetzlich Pflicht nach Art. 13 Abs. 1 lit. f DSGVO.",
+        "fix_text": "Fuegen Sie hinter dem Anbieternamen das Sitzland in "
+                    "Klammern ein. Bei Drittland (ausserhalb EU/EWR) zusaetzlich "
+                    "den Transfer-Mechanismus (SCC, DPF, Angemessenheitsbeschluss).",
+        "where": "VVT-Eintrag bei 'Empfaenger' oder im Cookie-Anbieter-Listing.",
+        "example": "Google Ireland Ltd., Dublin, Irland (EU). Bei US-Transfer: "
+                   "'Google LLC, Mountain View, US — DPF-zertifiziert'.",
+        "severity": "high",
+    },
+
+    "no_privacy_url": {
+        "what": "Kein direkter Link zur Datenschutzerklaerung des Anbieters.",
+        "why": "Transparenzgebot (Art. 12 Abs. 1, Art. 13 Abs. 1 lit. e DSGVO): der Nutzer muss "
+               "die Datenverarbeitung beim Drittanbieter eigenverantwortlich "
+               "nachvollziehen koennen.",
+        "fix_text": "Ergaenzen Sie einen klickbaren Link zur Datenschutzerklaerung "
+                    "des Anbieters direkt neben dem Anbieternamen.",
+        "where": "Cookie-Richtlinie pro Vendor-Eintrag. Idealerweise als "
+                 "letzter Spalteneintrag oder Inline-Link.",
+        "example": "Google Analytics — Datenschutz: https://policies.google.com/privacy",
+        "severity": "medium",
+    },
+
+    "broken_privacy_url": {
+        "what": "Der angegebene Privacy-Link gibt einen Fehler zurueck "
+                "(404 / 403 / Timeout).",
+        "why": "Ein toter Link erfuellt das Transparenzgebot (Art. 12(1) DSGVO) "
+               "nicht; die Informationspflicht laeuft ins Leere.",
+        "fix_text": "1. Pruefen Sie den Link aus einem normalen Browser. "
+                    "Funktioniert er dort? → Anbieter blockt automatisierte Pruefer (kein Mangel).\n"
+                    "2. Funktioniert er nicht? → Aktualisieren Sie den Link in Ihrer "
+                    "Cookie-Richtlinie auf die heute gueltige Privacy-URL des Anbieters.",
+        "where": "Cookie-Richtlinie / Drittanbieter-Liste.",
+        "example": "Wenn Adobe-Privacy-Link 404 gibt, ersetzen durch "
+                   "https://www.adobe.com/privacy/policy.html",
+        "severity": "high",
+    },
+
+    "no_opt_out_url": {
+        "what": "Kein Opt-Out-Link fuer diesen Anbieter dokumentiert.",
+        "why": "Art. 7 Abs. 3 DSGVO: der Widerruf der Einwilligung muss so "
+               "einfach wie die Erteilung sein. Pro Drittanbieter muss eine "
+               "Opt-Out-Moeglichkeit angeboten werden.",
+        "fix_text": "Recherchieren Sie den offiziellen Opt-Out-Link des "
+                    "Anbieters und ergaenzen Sie ihn. Falls Ihr Cookie-Banner "
+                    "ein 'Einstellungen aendern' anbietet, ist das oft "
+                    "ausreichend — der Link sollte trotzdem als Backup "
+                    "dokumentiert sein.",
+        "where": "Cookie-Richtlinie pro Vendor-Eintrag.",
+        "example": "Google Analytics Opt-Out: https://tools.google.com/dlpage/gaoptout",
+        "severity": "high",
+    },
+
+    "broken_opt_out": {
+        "what": "Der angegebene Opt-Out-Link funktioniert nicht "
+                "(404 / 403 / Timeout).",
+        "why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit ohne funktionierenden "
+               "Link ist nicht gegeben.",
+        "fix_text": "1. Pruefen Sie ob der Link aus einem Browser funktioniert. "
+                    "403 kommt oft von Bot-Schutz und ist kein realer Mangel.\n"
+                    "2. Bei echtem 404: aktualisieren Sie auf den heute gueltigen "
+                    "Opt-Out-Link.\n"
+                    "3. Alternativ verlinken Sie auf Ihren eigenen Cookie-Banner "
+                    "'Einstellungen aendern'-Trigger.",
+        "where": "Cookie-Richtlinie pro Vendor-Eintrag.",
+        "example": "Wenn Criteo-Opt-Out 403 gibt, ist das oft Cloudflare-Schutz — "
+                   "Link aus dem Browser klickbar → kein Mangel. Alternativ: "
+                   "https://www.youronlinechoices.com/de/",
+        "severity": "medium",
+    },
+}
+
+
+# ─── Doc-check findings (shown in the DSE/Impressum/cookie check block) ───
+
+DOC_CHECK_FINDINGS: dict[str, ActionRecipe] = {
+
+    "Auftragsverarbeiter erwaehnt": {
+        "what": "Auftragsverarbeiter (Hosting, CRM, Newsletter) sind nicht "
+                "explizit erwaehnt + es fehlt Hinweis auf AVVs nach Art. 28 DSGVO.",
+        "why": "Art. 28 DSGVO Auftragsverarbeitung erfordert dokumentierte AVVs. "
+               "Art. 13(1)(e) DSGVO erfordert die Nennung der Empfaenger-"
+               "Kategorien. Fehlt der Hinweis = klassischer Pruefpunkt der "
+               "Aufsichtsbehoerden.",
+        "fix_text": "Wir setzen sorgfaeltig ausgewaehlte Auftragsverarbeiter "
+                    "(z.B. fuer Hosting, Webanalyse, Customer Service) ein. Mit "
+                    "allen Auftragsverarbeitern haben wir Vertraege zur "
+                    "Auftragsverarbeitung nach Art. 28 DSGVO geschlossen. Die "
+                    "Auftragsverarbeiter handeln ausschliesslich auf unsere "
+                    "Weisung und sind vertraglich zu angemessenen technischen "
+                    "und organisatorischen Massnahmen verpflichtet.",
+        "where": "Datenschutzerklaerung im Abschnitt 'Empfaenger' oder "
+                 "'Datenuebermittlung'. Direkt nach der Aufzaehlung der "
+                 "Empfaenger-Kategorien.",
+        "example": "Empfaenger: BMW Niederlassungen, Vertragshaendler, "
+                   "Auftragsverarbeiter (Cloud-Hosting AWS, CRM Salesforce, "
+                   "Webanalyse Adobe Analytics — mit allen sind AVVs nach "
+                   "Art. 28 DSGVO geschlossen).",
+        "severity": "high",
+    },
+
+    "Automatisierte Entscheidungen / Profiling": {
+        "what": "Keine Aussage zu automatisierten Einzelentscheidungen "
+                "oder Profiling nach Art. 22 DSGVO.",
+        "why": "Art. 13 Abs. 2 lit. f DSGVO: bei automatisierten "
+               "Einzelentscheidungen muessen Logik, Tragweite und Auswirkungen "
+               "erklaert werden. Bei KEINEM Profiling muss das explizit "
+               "verneint werden — sonst lassen Aufsichtsbehoerden Unsicherheit "
+               "offen.",
+        "fix_text": "Variante A (kein Profiling):\n"
+                    "  'Es findet keine automatisierte Entscheidungsfindung "
+                    "im Sinne des Art. 22 DSGVO statt. Sofern wir Profiling "
+                    "zur Anzeige personalisierter Inhalte einsetzen, erfolgt "
+                    "dies ausschliesslich auf Basis Ihrer Einwilligung und "
+                    "wird im Abschnitt [X] erlaeutert.'\n\n"
+                    "Variante B (Profiling vorhanden, z.B. fuer Werbung):\n"
+                    "  'Wir nutzen Profiling zur Anzeige personalisierter "
+                    "Werbung. Die Logik basiert auf [Klick-Historie / "
+                    "Besuchsverhalten / Praeferenzen]. Tragweite: "
+                    "Anpassung der angezeigten Anzeigen. Auswirkung: keine "
+                    "rechtlichen oder erheblichen Auswirkungen — Sie koennen "
+                    "jederzeit widersprechen unter [Link/Kontakt].'",
+        "where": "Datenschutzerklaerung am Ende des Abschnitts "
+                 "'Betroffenenrechte' oder als eigener Absatz unter "
+                 "'Automatisierte Entscheidungen'.",
+        "example": "Standardformulierung Variante A — falls Sie KEIN Profiling "
+                   "betreiben, ist das der sichere Default-Text.",
+        "severity": "high",
+    },
+
+    "Konkrete Aufsichtsbehoerde benannt": {
+        "what": "Aufsichtsbehoerde wird nicht namentlich genannt.",
+        "why": "Art. 13(2)(d) DSGVO: das Beschwerderecht muss konkret "
+               "kommuniziert werden — nicht 'die zustaendige Behoerde', sondern "
+               "Name + Anschrift + Website.",
+        "fix_text": "Sie haben das Recht, sich bei einer Datenschutz-"
+                    "Aufsichtsbehoerde zu beschweren. Zustaendig ist:\n\n"
+                    "  [Aufsichtsbehoerde mit Namen + Strasse + PLZ + Ort + Web]\n\n"
+                    "Beispiel: 'Bayerisches Landesamt fuer Datenschutzaufsicht "
+                    "(BayLDA), Promenade 18, 91522 Ansbach, www.lda.bayern.de'",
+        "where": "Datenschutzerklaerung im Abschnitt 'Betroffenenrechte' oder "
+                 "'Beschwerderecht'.",
+        "example": "Fuer BMW AG (Sitz Muenchen, Bayern): BayLDA, Promenade 18, "
+                   "91522 Ansbach, www.lda.bayern.de",
+        "severity": "high",
+    },
+
+    "Angemessenheitsbeschluss der Kommission": {
+        "what": "Drittlandtransfer erwaehnt, aber kein Hinweis auf den "
+                "konkreten Angemessenheitsbeschluss / DPF / SCC.",
+        "why": "Art. 13(1)(f) DSGVO: bei Drittlandtransfer muss der "
+               "Transfer-Mechanismus benannt sein (Angemessenheitsbeschluss, "
+               "Standardvertragsklauseln, BCR, Ausnahmen nach Art. 49).",
+        "fix_text": "Fuer Datenuebermittlungen in die USA stuetzen wir uns auf "
+                    "den Angemessenheitsbeschluss der EU-Kommission vom "
+                    "10.07.2023 (EU-US Data Privacy Framework / DPF). Sofern "
+                    "der Empfaenger DPF-zertifiziert ist, ist die Uebermittlung "
+                    "rechtlich abgesichert. Bei nicht-zertifizierten Empfaengern "
+                    "ergaenzen wir EU-Standardvertragsklauseln (SCC) gemaess "
+                    "Durchfuehrungsbeschluss 2021/914.",
+        "where": "Datenschutzerklaerung im Abschnitt 'Drittlandtransfer' oder "
+                 "'Internationale Datenuebermittlung'.",
+        "example": "Konkret: Google LLC (USA) ist DPF-zertifiziert "
+                   "(Zertifikat einsehbar unter dataprivacyframework.gov).",
+        "severity": "high",
+    },
+
+    "Anschrift des Verantwortlichen": {
+        "what": "In der Cookie-Richtlinie fehlt die vollstaendige Anschrift.",
+        "why": "Art. 13(1)(a) DSGVO: der Verantwortliche muss eindeutig "
+               "identifizierbar sein. Cookie-Richtlinie + DSE muessen "
+               "konsistente Angaben enthalten.",
+        "fix_text": "Verantwortlich fuer die Datenverarbeitung im Sinne der "
+                    "DSGVO ist:\n  [Firmenname]\n  [Strasse + Hausnummer]\n  "
+                    "[PLZ + Ort]\n  [Land]\n  E-Mail: [...]",
+        "where": "Cookie-Richtlinie am Anfang ODER Verweis-Link zur DSE.",
+        "example": "Bayerische Motoren Werke Aktiengesellschaft, Petuelring 130, "
+                   "80809 Muenchen, Deutschland",
+        "severity": "high",
+    },
+
+    "Konkrete Cookie-Namen aufgelistet": {
+        "what": "Cookies werden allgemein erwaehnt aber ohne konkrete Namen + "
+                "Speicherdauer.",
+        "why": "DSK-Orientierungshilfe Telemedien: Auflistung jedes einzelnen "
+               "Cookies mit Name. Generische Aussagen ('wir nutzen "
+               "Werbe-Cookies') sind unzureichend.",
+        "fix_text": "Erweitern Sie die Cookie-Tabelle um die Spalten:\n"
+                    "  Name | Anbieter | Zweck | Speicherdauer\n\n"
+                    "Browser-Devtools (Application > Cookies) zeigt die "
+                    "tatsaechlich gesetzten Namen — bitte Cookie-Liste "
+                    "regelmaessig synchronisieren.",
+        "where": "Cookie-Richtlinie im Abschnitt 'Verwendete Cookies'.",
+        "example": "_ga | Google Ireland | Statistik (Besucher-ID) | 2 Jahre\n"
+                   "_fbp | Meta Platforms IE | Werbung (Browser-ID) | 90 Tage",
+        "severity": "high",
+    },
+
+    "Konkrete Speicherdauern pro Cookie": {
+        "what": "Speicherdauer nur pauschal oder als generischer Bereich.",
+        "why": "Art. 13(2)(a) DSGVO: konkrete Speicherdauer oder Kriterien "
+               "fuer die Festlegung. DSK fordert Speicherdauer PRO Cookie.",
+        "fix_text": "Pro Cookie eine konkrete Zeitangabe in der Cookie-Tabelle "
+                    "ergaenzen: 'Session', '24 Stunden', '90 Tage', '2 Jahre'.",
+        "where": "Cookie-Richtlinie in der Cookie-Tabelle.",
+        "example": "JSESSIONID: Session — _ga: 2 Jahre — _fbp: 90 Tage",
+        "severity": "high",
+    },
+
+    "Opt-Out-Links pro Drittanbieter": {
+        "what": "Pro Drittanbieter ist kein direkter Opt-Out-Link angegeben.",
+        "why": "Art. 7(3) Satz 4 DSGVO: der Widerruf muss so einfach wie "
+               "die Erteilung der Einwilligung sein.",
+        "fix_text": "Pro Drittanbieter eine Spalte 'Opt-Out' ergaenzen mit "
+                    "direktem Link. Alternativ: zentralen 'Cookie-"
+                    "Einstellungen aendern'-Button im Footer der Webseite + "
+                    "Hinweis darauf in der Cookie-Richtlinie.",
+        "where": "Cookie-Richtlinie pro Vendor-Eintrag oder als zentraler "
+                 "Abschnitt 'Wie kann ich widersprechen?'.",
+        "example": "Google Analytics: https://tools.google.com/dlpage/gaoptout\n"
+                   "Meta Pixel: ueber Facebook-Konto-Einstellungen",
+        "severity": "high",
+    },
+
+    "Privacy-Policy-Links pro Drittanbieter": {
+        "what": "Pro Drittanbieter ist kein direkter Datenschutz-Link gesetzt.",
+        "why": "Art. 12(1) i.V.m. Art. 13(1)(e) DSGVO (Transparenzgebot): "
+               "Nutzer muss die Datenverarbeitung beim Drittanbieter "
+               "eigenverantwortlich nachvollziehen koennen.",
+        "fix_text": "Pro Drittanbieter Link auf dessen Datenschutzerklaerung "
+                    "ergaenzen. Tabelle 'Anbieter | Datenschutz-Link'.",
+        "where": "Cookie-Richtlinie im Drittanbieter-Listing.",
+        "example": "Adobe Analytics: https://www.adobe.com/privacy/policy.html",
+        "severity": "medium",
+    },
+
+    "Rechtswidriger Haftungsausschluss fuer Links": {
+        "what": "Klassischer Link-Disclaimer ('Wir distanzieren uns von verlinkten "
+                "Inhalten') ist im Impressum.",
+        "why": "BGH I ZR 317/01: solche Disclaimer sind rechtlich wirkungslos. "
+               "Sie befreien NICHT von der Stoererhaftung und koennen sogar "
+               "den gegenteiligen Effekt haben (Anerkennung der eigenen "
+               "Pruefpflicht).",
+        "fix_text": "Entfernen Sie pauschale Link-Disclaimer vollstaendig aus "
+                    "dem Impressum. Bei Bedarf ein knapper, zutreffender Satz:\n"
+                    "  'Fuer den Inhalt verlinkter externer Webseiten ist "
+                    "ausschliesslich deren Betreiber verantwortlich.'",
+        "where": "Impressum am Ende des Dokuments.",
+        "example": "Statt 'Wir distanzieren uns ausdruecklich von allen "
+                   "Inhalten verlinkter Seiten' — einfach nichts schreiben.",
+        "severity": "low",
+    },
+
+    "Verbraucherstreitbeilegung / OS-Plattform": {
+        "what": "Kein Link zur EU-OS-Plattform bzw. keine Aussage zur "
+                "Streitbeilegung.",
+        "why": "Art. 14 EU-VO 524/2013 (B2C-Anbieter, Online-Verkauf): "
+               "klickbarer Link auf https://ec.europa.eu/consumers/odr "
+               "PFLICHT. §36 VSBG: Aussage ob teilnehmend oder nicht.",
+        "fix_text": "Die EU-Kommission stellt eine Plattform zur Online-"
+                    "Streitbeilegung (OS) bereit, die Sie unter "
+                    "<a href='https://ec.europa.eu/consumers/odr'>"
+                    "https://ec.europa.eu/consumers/odr</a> erreichen.\n\n"
+                    "Wir sind nicht bereit oder verpflichtet, an "
+                    "Streitbeilegungsverfahren vor einer "
+                    "Verbraucherschlichtungsstelle teilzunehmen.",
+        "where": "Impressum am Ende, eigener Abschnitt 'Streitbeilegung'.",
+        "example": "Standardformulierung anwendbar fuer alle B2C-Anbieter ohne "
+                   "ODR-Teilnahme.",
+        "severity": "high",
+    },
+
+    "Name der vertretungsberechtigten Person": {
+        "what": "Vertretungsberechtigte Person ist nicht namentlich mit "
+                "Funktionsbezeichnung genannt.",
+        "why": "§5 Abs. 1 Nr. 1 DDG (bis 13.05.2024: TMG): bei juristischen "
+               "Personen sind die Vertretungsberechtigten namentlich zu nennen.",
+        "fix_text": "Ergaenzen Sie nach der Firmenbezeichnung:\n"
+                    "  'Vertreten durch: [Vorstand / Geschaeftsfuehrung] "
+                    "[Vorname Nachname]'",
+        "where": "Impressum direkt nach Firmenname + Anschrift.",
+        "example": "Vertreten durch: Vorstandsvorsitzender Oliver Zipse",
+        "severity": "high",
+    },
+}
+
+
+def recipe_for(finding_key: str) -> ActionRecipe | None:
+    """Look up a recipe by finding key; vendor and doc findings share one lookup."""
+    if finding_key in VENDOR_FINDINGS:
+        return VENDOR_FINDINGS[finding_key]
+    if finding_key in DOC_CHECK_FINDINGS:
+        return DOC_CHECK_FINDINGS[finding_key]
+    # Fuzzy substring match on doc findings (labels can vary slightly)
+    fk = finding_key.strip().lower()
+    for k, v in DOC_CHECK_FINDINGS.items():
+        if fk and (k.lower() in fk or fk in k.lower()):
+            return v
+    return None
@@ -0,0 +1,309 @@
+"""
+MC Embedding Match — semantic fallback for the regex-based doc_check.
+
+The Sonnet classifier filtered MCs to `check_type='text'` (matchable
+against doc text). But the regex matcher is still too strict: a policy
+that spells a retention period in words ("zwei Jahre") never hits an
+MC pattern like "\\d+\\s*(Tag|Jahr)". We catch these via BGE-M3
+embeddings + cosine similarity:
+
+  1. Embed each MC's "title. check_question" (once, cached in sidecar)
+  2. Embed the doc text in overlapping 50-word chunks (stride 30)
+  3. max over chunks of cosine(MC, chunk) ≥ threshold → MC passes via "semantic"
+
+This recovers ~50% of failed MCs at BMW-scale (estimated).
+
+Embeddings come from bp-core-embedding-service (BGE-M3, 1024-dim,
+multilingual). Sidecar SQLite stores 1024 × 4 bytes = 4KB per MC.
+"""
+
+from __future__ import annotations
+
+import logging
+import math
+import os
+import re
+import sqlite3
+import struct
+from typing import Iterable
+
+import httpx
+
+logger = logging.getLogger(__name__)
+
+EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
+SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
+DIM = 1024  # BGE-M3
+SIMILARITY_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD", "0.55"))
+CHUNK_SIZE_WORDS = 50
+CHUNK_STRIDE = 30  # overlap so multi-sentence MCs aren't cut
+
+# Short mandatory fields (Impressum: HRB no., USt-IdNr, address) get
+# lost in 50-word chunks. We ADDITIONALLY scan the doc with finer
+# 15-word windows and a looser threshold for Impressum/AVV doc types.
+SHORT_FIELD_CHUNK_WORDS = 15
+SHORT_FIELD_STRIDE = 8
+SHORT_FIELD_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD_SHORT", "0.50"))
+SHORT_FIELD_DOC_TYPES = {"impressum", "avv"}
+
+# Doc-type-specific threshold overrides, calibrated on the BMW v7
+# run: at 0.55, DSE+Cookie sat systematically at 93% (over-firing because
+# 8000-word texts vaguely match everything). 0.60 brings that to the real ~80%.
+# Impressum has only 6 real MCs + short-field rescue → 0.50 is fine.
+THRESHOLD_OVERRIDE = {
+    "impressum": 0.50,
+    "avv":       0.55,
+    "dse":       0.60,
+    "cookie":    0.60,
+    "widerruf":  0.58,
+    "loeschkonzept": 0.55,
+    "dsfa":      0.55,
+}
+
+
+def _ensure_schema() -> None:
+    """Add embedding column to mc_classification if not present."""
+    try:
+        with sqlite3.connect(SIDECAR_DB) as c:
+            cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
+            if "embedding" not in cols:
+                c.execute("ALTER TABLE mc_classification ADD COLUMN embedding BLOB")
+                logger.info("Added embedding column to mc_classification")
+    except Exception as e:
+        logger.warning("Embedding schema migration skipped: %s", e)
+
+
+def _vec_to_blob(v: list[float]) -> bytes:
+    return struct.pack(f"{len(v)}f", *v)
+
+
+def _blob_to_vec(b: bytes) -> list[float]:
+    return list(struct.unpack(f"{len(b)//4}f", b))
+
+
+EMBED_BATCH_SIZE = 32
+
+
+async def _embed_texts(texts: list[str], timeout: float = 120.0) -> list[list[float]]:
+    """Call the central embedding-service in batches; returns one vector per input.
+
+    BGE-M3 hangs / times out on >100 inputs at once on a CPU-only host,
+    so we send batches of EMBED_BATCH_SIZE and concatenate the results.
+    """
+    if not texts:
+        return []
+    out: list[list[float]] = []
+    async with httpx.AsyncClient(timeout=timeout) as client:
+        for i in range(0, len(texts), EMBED_BATCH_SIZE):
+            batch = texts[i:i + EMBED_BATCH_SIZE]
+            try:
+                r = await client.post(
+                    f"{EMBEDDING_URL}/embed", json={"texts": batch},
+                )
+                r.raise_for_status()
+                vecs = r.json().get("embeddings") or []  # may be missing or short
+                out.extend(vecs if len(vecs) == len(batch) else [[] for _ in batch])
+            except httpx.HTTPError as e:
+                logger.warning("Embed sub-batch [%d-%d] failed: %s %s",
+                               i, i + len(batch), type(e).__name__, e)
+                # Pad with empty vectors so caller can still align by index
+                out.extend([[] for _ in batch])
+    return out
+
+
+async def ensure_mc_embeddings(batch_size: int = 64, force: bool = False) -> int:
+    """One-shot: embed every text-MC missing an embedding. Returns count.
+
+    Embeds "title. check_question" for each MC to give BGE-M3 enough
+    context. The title alone is too terse for the model to
+    discriminate against full-paragraph doc text.
+
+    Idempotent — only fills NULL rows unless force=True. Safe to call on
+    every run.
+    """
+    _ensure_schema()
+    # Pull check_question from the PG source table once per call (needs
+    # context that's not in the sidecar)
+    try:
+        import psycopg2
+        pg = psycopg2.connect(os.environ["DATABASE_URL"])
+        with pg.cursor() as c:
+            c.execute("SELECT control_id, doc_type, title, check_question "
+                      "FROM compliance.doc_check_controls")
+            pg_rows = c.fetchall()
+        pg.close()
+        pg_lookup = {(r[0], r[1] or ""): (r[2] or "", r[3] or "") for r in pg_rows}
+    except Exception as e:
+        logger.warning("ensure_mc_embeddings PG load failed: %s", e)
+        pg_lookup = {}
+
+    try:
+        with sqlite3.connect(SIDECAR_DB) as c:
+            where = ("WHERE check_type = 'text'" + ("" if force else " AND embedding IS NULL"))
+            rows = c.execute(
+                f"SELECT control_id, doc_type, title FROM mc_classification {where}"
+            ).fetchall()
+    except Exception as e:
+        logger.warning("ensure_mc_embeddings query failed: %s", e)
+        return 0
+
+    if not rows:
+        return 0
+
+    logger.info("Embedding %d text-MCs (force=%s) via %s ...",
+                len(rows), force, EMBEDDING_URL)
+    done = 0
+    for i in range(0, len(rows), batch_size):
+        batch = rows[i:i + batch_size]
+        # Compose "title. check_question" so the embedding captures both
+        # the topic (title) and the concrete check phrasing (question).
+        # That helps BMW's actual policy language land in the same vector
+        # neighbourhood as our control wording.
+        texts: list[str] = []
+        for cid, dt, t in batch:
+            title_text, question = pg_lookup.get((cid, dt or ""), (t or "", ""))
+            combined = f"{title_text}. {question}".strip()
+            texts.append(combined[:600])
+        try:
+            embs = await _embed_texts(texts)
+        except Exception as e:
+            logger.warning("Embed batch failed (i=%d): %s", i, e)
+            continue
+        with sqlite3.connect(SIDECAR_DB) as c:
+            for (cid, dt, _t), vec in zip(batch, embs):
+                if not vec or len(vec) != DIM:
+                    continue  # failed sub-batch placeholder; retried on next run
+                c.execute(
+                    "UPDATE mc_classification SET embedding = ? "
+                    "WHERE control_id = ? AND doc_type IS ?",  # IS also matches NULL
+                    (_vec_to_blob(vec), cid, dt),
+                )
+                done += 1  # count only rows that actually got a vector
+            c.commit()
+    logger.info("ensure_mc_embeddings: filled %d/%d", done, len(rows))
+    return done
+
+
+def _chunk_text(text: str, size: int = CHUNK_SIZE_WORDS,
+                stride: int = CHUNK_STRIDE) -> list[str]:
+    """Sliding-window chunking — overlap helps catch MCs that span 2 sentences."""
+    words = re.findall(r"\S+", text or "")
+    if len(words) <= size:
+        return [" ".join(words)] if words else []
+    out: list[str] = []
+    i = 0
+    while i < len(words):
+        out.append(" ".join(words[i:i + size]))
+        i = i + stride if i + size < len(words) else len(words)  # stop once the tail is covered
+    return out
+
+
+def _cosine(a: list[float], b: list[float]) -> float:
+    """Plain Python cosine — fast enough for our scale, no numpy import."""
+    if not a or not b or len(a) != len(b):
+        return 0.0
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    if na == 0 or nb == 0:
+        return 0.0
+    return dot / (na * nb)
+
+
+async def embedding_match(
+    doc_text: str,
+    mc_records: Iterable[dict],
+    doc_type: str | None = None,
+    threshold: float | None = None,
+) -> set[str]:
+    """Return the subset of MC control_ids that semantically match doc_text.
+
+    For Impressum/AVV-types we ADDITIONALLY scan the doc with smaller
+    15-word windows and a looser threshold so that short Pflichtfelder
+    (HRB, USt-IdNr, postal address) land in their own chunk and aren't
+    diluted by 50-word neighbourhoods of unrelated text.
+    """
+    if not doc_text or not mc_records:
+        return set()
+    candidates = list(mc_records)
+    if not candidates:
+        return set()
+
+    cid_set = {c.get("control_id") for c in candidates if c.get("control_id")}
+    if not cid_set:
+        return set()
+
+    try:
+        with sqlite3.connect(SIDECAR_DB) as c:
+            placeholders = ",".join("?" * len(cid_set))
+            q = ("SELECT control_id, embedding FROM mc_classification "
+                 f"WHERE control_id IN ({placeholders}) "
+                 "AND check_type='text' AND embedding IS NOT NULL")
+            params = list(cid_set)
+            if doc_type:
+                q += " AND doc_type = ?"
+                params.append(doc_type)
+            rows = c.execute(q, params).fetchall()
+    except Exception as e:
+        logger.warning("embedding lookup failed: %s", e)
+        return set()
+    if not rows:
+        return set()
+    mc_embeddings = {cid: _blob_to_vec(blob) for cid, blob in rows}
+
+    effective_threshold = threshold if threshold is not None else (
+        THRESHOLD_OVERRIDE.get((doc_type or "").lower(), SIMILARITY_THRESHOLD))
+
+    chunks = _chunk_text(doc_text)
+    if not chunks:
+        return set()
+    try:
+        chunk_vecs = await _embed_texts(chunks)
+    except Exception as e:
+        logger.warning("doc chunk embedding failed: %s %s",
+                       type(e).__name__, e or "(empty msg)", exc_info=True)
+        return set()
+    # Filter empty vectors (failed sub-batches return [] placeholders)
+    chunk_vecs = [v for v in chunk_vecs if v and len(v) == DIM]
+    if not chunk_vecs:
+        logger.warning("doc chunk embedding: no usable vectors (all batches failed)")
+        return set()
+
+    matched: set[str] = set()
+    for cid, mc_vec in mc_embeddings.items():
+        best = max((_cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
+        if best >= effective_threshold:
+            matched.add(cid)
+
+    # Short-field rescue pass for Impressum-type docs: small windows +
+    # looser threshold catch one-line Pflichtfelder that 50-word chunks
+    # dilute (HRB-Nr, USt-IdNr, postal address). Only runs for MCs not
+    # yet matched in the main pass.
+    if (doc_type or "").lower() in SHORT_FIELD_DOC_TYPES:
+        unmatched = {cid: vec for cid, vec in mc_embeddings.items() if cid not in matched}
+        if unmatched:
+            short_chunks = _chunk_text(doc_text, size=SHORT_FIELD_CHUNK_WORDS,
+                                       stride=SHORT_FIELD_STRIDE)
+            try:
+                short_vecs = await _embed_texts(short_chunks)
+            except Exception as e:
+                logger.warning("short-chunk embedding failed: %s", e)
+                short_vecs = []
+            if short_vecs:
+                short_passes = 0
+                for cid, mc_vec in unmatched.items():
+                    best = max((_cosine(mc_vec, cv) for cv in short_vecs), default=0.0)
+                    if best >= SHORT_FIELD_THRESHOLD:
+                        matched.add(cid)
+                        short_passes += 1
+                if short_passes:
+                    logger.info(
+                        "embedding short-field rescue for %s: +%d MCs (threshold %.2f, %d chunks)",
+                        doc_type, short_passes, SHORT_FIELD_THRESHOLD, len(short_chunks),
+                    )
+
+    logger.info(
+        "embedding match for %s: %d/%d MCs passed semantic threshold (main=%.2f)",
+        doc_type or "?", len(matched), len(mc_embeddings), effective_threshold,
+    )
+    return matched
@@ -101,11 +101,36 @@ def build_scorecard(check_results: list[dict]) -> dict:
    }


+_DEDUP_KEYWORDS = [
+    "einfache sprache", "verstaendliche sprache", "verständliche sprache",
+    "klare sprache", "einwilligungstexte", "einwilligungsaufforderung",
+    "einwilligungserklaerung", "einwilligungserklärung",
+    "mehrdeutige", "verstaendliche form", "verständliche form",
+    "fachbegriffe erklaeren", "fachbegriffe erklären",
+]
+
+
+def _dedup_key(label: str) -> str:
+    """Map a label to a stable dedup key: if it contains one of the
+    well-known repetitive language/consent-prompt keywords, collapse
+    it to that keyword's key. Otherwise return the label unchanged."""
+    lowered = (label or "").lower()
+    for kw in _DEDUP_KEYWORDS:
+        if kw in lowered:
+            return f"_dup:{kw}"
+    return label
+
+
 def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
    """Return top-N failing MCs sorted by severity then label.

    Skipped + passed MCs are excluded. INFO severity is excluded by
    default since those are guidance, not findings.
+
+    Near-duplicates (multiple MCs that all complain about "einfache
+    Sprache" / "Einwilligungsaufforderung" / ...) are collapsed to ONE
+    representative entry; otherwise UI-language hints dominate the
+    top list and real gaps get buried.
    """
    fails = [
        r for r in (check_results or [])
@@ -116,7 +141,17 @@ def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
        _SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
        r.get("label", ""),
    ))
-    return fails[:n]
+    seen_keys: set[str] = set()
+    deduped: list[dict] = []
+    for r in fails:
+        k = _dedup_key(r.get("label", "")) or f"_id:{r.get('id', id(r))}"
+        if k in seen_keys:
+            continue
+        seen_keys.add(k)
+        deduped.append(r)
+        if len(deduped) >= n:
+            break
+    return deduped


 def full_audit_records(
@@ -37,6 +37,7 @@ async def check_document_with_controls(
    db_url: str = "",
    max_controls: int = 0,  # 0 = no limit, check ALL
    use_agent: bool = False,  # Use LLM agent for intelligent evaluation
+    business_scope: set[str] | None = None,
 ) -> list[dict]:
    """Check document against ALL doc_check_controls for this doc_type.

@@ -56,7 +57,7 @@ async def check_document_with_controls(
    mapped_type = _map_doc_type(doc_type)

    # Load ALL controls for this doc_type
-    controls = await _load_controls(mapped_type, db_url, max_controls)
+    controls = await _load_controls(mapped_type, db_url, max_controls, business_scope)
    if not controls:
        logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
        return []
@@ -71,6 +72,31 @@ async def check_document_with_controls(
        if result:
            results.append(result)

+    # Semantic fallback (Phase 3): MCs that failed via regex get a second
+    # chance via BGE-M3 cosine similarity, e.g. when the doc paraphrases
+    # a requirement that the regex pattern only matches literally.
+    failed_ids = {r.get("control_id") for r in results
+                  if not r.get("passed") and r.get("control_id")}
+    if failed_ids:
+        try:
+            from compliance.services.mc_embedding_matcher import (
+                ensure_mc_embeddings, embedding_match,
+            )
+            await ensure_mc_embeddings()  # idempotent: only embeds new MCs
+            failed_mcs = [c for c in controls if c.get("control_id") in failed_ids]
+            semantic_passes = await embedding_match(
+                text, failed_mcs, doc_type=mapped_type,
+            )
+            if semantic_passes:
+                for r in results:
+                    cid = r.get("control_id")
+                    if cid and cid in semantic_passes and not r.get("passed"):
+                        r["passed"] = True
+                        r["matched_text"] = "[semantischer Treffer via Embedding]"
+                        r["hint"] = (r.get("hint") or "") + " (passed via Embedding-Match, BGE-M3 cosine)"
+        except Exception as e:
+            logger.warning("Embedding fallback skipped: %s", e, exc_info=True)
+
    passed = sum(1 for r in results if r["passed"])
    failed_results = [r for r in results if not r["passed"]]
    logger.info("MC results: %d passed, %d failed out of %d for '%s'",
@@ -161,6 +187,7 @@ def _check_mc_deterministic(text_lower: str, mc: dict) -> Optional[dict]:

    return {
        "id": f"mc-{control_id}",
+        "control_id": control_id,
        "label": mc.get("title", "")[:80],
        "passed": passed,
        "severity": severity,
@@ -266,11 +293,72 @@ _MC_ALIAS_FALLBACK = {
 }


-async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
+def _load_text_only_ids(
+    doc_type: str | None = None,
+    business_scope: set[str] | None = None,
+) -> set[str]:
+    """Return control_ids that the Sonnet-classifier flagged as 'text'.
+
+    Filters applied:
+    1. check_type='text' (only doc-text-matchable MCs)
+    2. doc_type matches (per-doc-type variant from v2-Sidecar)
+    3. fits_doc_type is 1 or NULL (LLM auditor approved, or not yet audited)
+    4. scope_requires NULL or fully contained in business_scope
+       (e.g. MCs with scope_requires='biometric_processing' are skipped
+       on sites that don't do biometric processing; the Art. 22 FRT MC
+       was a false positive for BMW)
+
+    `business_scope` comes from the business_profiler (set of detected
+    site characteristics like 'b2c', 'shop', 'biometric_processing',
+    'ai_decision_making', 'child_targeting').
+
+    Returns an empty set if the sidecar is missing or the lookup fails.
+    """
+    import sqlite3
+    db_path = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
+    try:
+        with sqlite3.connect(db_path) as c:
+            cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
+            has_fit = "fits_doc_type" in cols
+            has_scope = "scope_requires" in cols
+            fit_clause = " AND (fits_doc_type IS NULL OR fits_doc_type = 1)" if has_fit else ""
+            base = ("SELECT control_id, scope_requires FROM mc_classification "
+                    "WHERE check_type = 'text'" + fit_clause) if has_scope else (
+                   "SELECT control_id, NULL FROM mc_classification "
+                   "WHERE check_type = 'text'" + fit_clause)
+            params: list = []
+            if doc_type:
+                base += " AND doc_type = ?"
+                params.append(doc_type)
+            rows = c.execute(base, params).fetchall()
+            scope = business_scope or set()
+            keep: set[str] = set()
+            for cid, req in rows:
+                if not req:
+                    keep.add(cid)
+                else:
+                    # Multiple requirements separated by '|' — ALL must
+                    # be in scope to include. Empty req tokens are skipped.
+                    needed = {r.strip().lower() for r in req.split("|") if r.strip()}
+                    if needed.issubset({s.lower() for s in scope}):
+                        keep.add(cid)
+            return keep
+    except sqlite3.OperationalError:
+        return set()
+    except Exception as e:
+        logger.warning("MC classification lookup failed: %s", e)
+        return set()
+
+
+async def _load_controls(doc_type: str, db_url: str, limit: int,
+                         business_scope: set[str] | None = None) -> list[dict]:
    """Load all doc_check_controls for a doc_type from PostgreSQL.

    Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
    type (e.g. 'nutzungsbedingungen' -> 'agb').
+
+    Filters to only check_type='text' MCs when the classification sidecar
+    is present — process/review MCs are routed to other modules.
    """
    try:
        import asyncpg
@@ -297,7 +385,17 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
            fallback = _MC_ALIAS_FALLBACK[doc_type]
            logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
            rows = await conn.fetch(query, fallback)
-        return [dict(r) for r in rows]
+
+        controls = [dict(r) for r in rows]
+        text_only = _load_text_only_ids(doc_type, business_scope)
+        if text_only:
+            before = len(controls)
+            controls = [c for c in controls if c.get("control_id") in text_only]
+            logger.info(
+                "MC filter (text only) for %s: %d/%d MCs after Sonnet check_type filter",
+                doc_type, len(controls), before,
+            )
+        return controls
    except Exception as e:
        logger.warning("MC query failed: %s", e)
        return []
@@ -0,0 +1,407 @@
+"""
+Vendor cost estimator: derives a pricing tier per vendor from
+cookie signals and returns an intensity-based annual cost
+estimate.
+
+Cookie signals we evaluate:
+  - number of cookies per vendor (proxy for module depth)
+  - premium-feature cookies (e.g. 's_target_qa', '_ab_test' → enterprise add-on)
+  - edge/region cookies (multi-region → premier-tier CDN)
+  - cookie persistence (multi-year → heavy-tracking license)
+
+Plus business_profile for company-tier inference.
+
+Output per vendor:
+  - inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
+  - tier_signals : list of the indicators that led to the tier
+  - cost_year_eur_range: (low, high) based on tier × vendor pricing
+  - confidence: 'low' | 'medium' | 'high' ('none' if no pricing entry)
+
+This module complements vendor_redundancy.py: the simple low/high
+flat rates there are replaced here by dynamic, signal-based
+values.
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+from typing import Iterable
+
+logger = logging.getLogger(__name__)
+
+
+# ─── Premium-feature cookies: indicator for enterprise add-ons ─────
+#
+# If a vendor sets these cookies, the customer is very likely on an
+# enterprise plan.
+
+_PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
+    # (regex, vendor_key, premium_feature_label)
+    (r"^s_target_qa$",             "adobe analytics", "Adobe Target Add-on"),
+    (r"adobe.*target",             "adobe target",    "Personalization Enterprise"),
+    (r"^aam_uuid",                 "adobe analytics", "Audience Manager Enterprise"),
+    (r"^s_ecid",                   "adobe analytics", "Experience Cloud ID Service"),
+    (r"^_pcid_",                   "adobe analytics", "People-Based Destinations"),
+
+    (r"^_gat_gtag_UA",             "google analytics", "GA360 Multi-Tracker"),
+    (r"^_ga_[A-Z0-9]+_[A-Z0-9]+",  "google analytics", "GA4 Enterprise Stream"),
+
+    (r"^_uetmsdns",                "microsoft advertising", "Custom Conversion Tracking"),
+    (r"^_fbp.*test",               "meta pixel",      "Conversions API Premium"),
+    (r"^_pin_unauth_premium",      "pinterest",       "Pinterest Premium-API"),
+
+    (r"^afm",                      "adform",          "Affinity-Module"),
+    (r"^cto_dna",                  "criteo",          "Dynamic Retargeting Premium"),
+
+    # CDN / Infra Premium
+    (r"^aws-alb-[a-z0-9]+",        "amazon web services", "ALB + Multi-Region"),
+    (r"^aws-waf",                  "amazon web services", "WAF Enterprise"),
+    (r"^cf_clearance",             "cloudflare",      "Bot-Management Pro"),
+    (r"^akm_[a-z]+",               "akamai",          "Adaptive Media Delivery Enterprise"),
+
+    # Salesforce Customer-360
+    (r"^bid_n_",                   "salesforce",      "Marketing Cloud Personalization"),
+    (r"^_cs_",                     "salesforce",      "CDP Premium"),
+]
+
+
+# ─── Tier pricing per vendor (annual, EUR) ──────────────────────────
+#
+# 4 tiers: starter (SME), professional (mid-market), enterprise (large
+# corporation), premier (global brand / heavy user).
+
+_TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
+    "adobe analytics": {
+        "starter":      ( 10_000,  30_000),
+        "professional": ( 60_000, 150_000),
+        "enterprise":   (200_000, 500_000),
+        "premier":      (500_000, 900_000),
+    },
+    "adobe target": {
+        "starter":      (  8_000,  25_000),
+        "professional": ( 40_000, 100_000),
+        "enterprise":   (120_000, 300_000),
+        "premier":      (300_000, 600_000),
+    },
+    "adobe campaign": {
+        "starter":      ( 10_000,  30_000),
+        "professional": ( 40_000, 100_000),
+        "enterprise":   (120_000, 280_000),
+        "premier":      (280_000, 500_000),
+    },
+    "google analytics": {
+        "starter":      (      0,      0),  # GA4 free
+        "professional": (      0,      0),
+        "enterprise":   ( 80_000, 150_000),  # GA360
+        "premier":      (150_000, 300_000),
+    },
+    "matomo": {
+        "starter":      (      0,   3_000),  # On-prem free / Cloud Starter
+        "professional": (  6_000,  20_000),
+        "enterprise":   ( 20_000,  80_000),
+        "premier":      ( 60_000, 150_000),
+    },
+    "content square": {
+        "starter":      ( 12_000,  40_000),
+        "professional": ( 60_000, 150_000),
+        "enterprise":   (150_000, 350_000),
+        "premier":      (350_000, 700_000),
+    },
+    "contentsquare": {
+        "starter":      ( 12_000,  40_000),
+        "professional": ( 60_000, 150_000),
+        "enterprise":   (150_000, 350_000),
+        "premier":      (350_000, 700_000),
+    },
+    "dynatrace": {
+        "starter":      (  5_000,  15_000),
+        "professional": ( 30_000,  80_000),
+        "enterprise":   (100_000, 300_000),
+        "premier":      (300_000, 800_000),
+    },
+    "qualtrics": {
+        "starter":      (  6_000,  20_000),
+        "professional": ( 30_000,  80_000),
+        "enterprise":   ( 80_000, 200_000),
+        "premier":      (200_000, 500_000),
+    },
+
+    # Advertising / Retargeting (license + self-service; media spend SEPARATE)
+    "criteo": {
+        "starter":      (  6_000,  20_000),
+        "professional": ( 30_000,  80_000),
+        "enterprise":   ( 80_000, 250_000),
+        "premier":      (250_000, 600_000),
+    },
+    "adform": {
+        "starter":      ( 12_000,  40_000),
+        "professional": ( 60_000, 150_000),
+        "enterprise":   (150_000, 400_000),
+        "premier":      (400_000, 800_000),
+    },
+    "outbrain": {
+        "starter":      (  6_000,  20_000),
+        "professional": ( 30_000,  80_000),
+        "enterprise":   ( 80_000, 200_000),
+        "premier":      (200_000, 500_000),
+    },
+    "taboola": {
+        "starter":      (  6_000,  20_000),
+        "professional": ( 30_000,  80_000),
+        "enterprise":   ( 80_000, 200_000),
+        "premier":      (200_000, 500_000),
+    },
+    "teads": {
+        "starter":      (  6_000,  18_000),
+        "professional": ( 20_000,  60_000),
+        "enterprise":   ( 60_000, 150_000),
+        "premier":      (150_000, 350_000),
+    },
+    "pinterest": {
+        "starter":      (  3_000,  15_000),
+        "professional": ( 15_000,  50_000),
+        "enterprise":   ( 50_000, 150_000),
+        "premier":      (150_000, 400_000),
+    },
+    "linkedin insight": {
+        "starter":      (  3_000,  12_000),
+        "professional": ( 12_000,  40_000),
+        "enterprise":   ( 40_000, 120_000),
+        "premier":      (120_000, 300_000),
+    },
+
+    # CDN / Cloud
+    "akamai": {
+        "starter":      ( 20_000,  60_000),
+        "professional": ( 80_000, 200_000),
+        "enterprise":   (200_000, 500_000),
+        "premier":      (500_000, 1_500_000),
+    },
+    "amazon web services": {
+        "starter":      ( 12_000,  60_000),
+        "professional": ( 60_000, 300_000),
+        "enterprise":   (300_000, 1_500_000),
+        "premier":      (1_500_000, 8_000_000),
+    },
+    "baqend": {
+        "starter":      (  3_000,  12_000),
+        "professional": ( 12_000,  40_000),
+        "enterprise":   ( 40_000, 120_000),
+        "premier":      (120_000, 300_000),
+    },
+    "speedkit": {
+        "starter":      (  3_000,  12_000),
+        "professional": ( 12_000,  40_000),
+        "enterprise":   ( 40_000, 120_000),
+        "premier":      (120_000, 300_000),
+    },
+    "speedcurve": {
+        "starter":      (  1_200,   4_800),
+        "professional": (  6_000,  18_000),
+        "enterprise":   ( 18_000,  60_000),
+        "premier":      ( 60_000, 120_000),
+    },
+
+    # CRM / Marketing
+    "salesforce": {
+        "starter":      ( 20_000,  60_000),
+        "professional": ( 80_000, 250_000),
+        "enterprise":   (250_000, 800_000),
+        "premier":      (800_000, 2_500_000),
+    },
+    "genesys": {
+        "starter":      ( 24_000,  80_000),
+        "professional": ( 80_000, 250_000),
+        "enterprise":   (250_000, 800_000),
+        "premier":      (800_000, 2_000_000),
+    },
+
+    # Captcha
+    "hcaptcha": {
+        "starter":      (      0,   2_400),
+        "professional": (  2_400,  12_000),
+        "enterprise":   ( 12_000,  40_000),
+        "premier":      ( 40_000, 100_000),
+    },
+
+    # Lead-Tracking
+    "salesviewer": {
+        "starter":      (  1_200,   3_600),
+        "professional": (  3_600,  12_000),
+        "enterprise":   ( 12_000,  40_000),
+        "premier":      ( 40_000, 100_000),
+    },
+}
+
+
+def _vendor_key(vendor_name: str) -> str | None:
+    """Map a vendor name to a known pricing-table key."""
+    n = (vendor_name or "").lower()
+    for k in _TIER_PRICING:
+        if k in n:
+            return k
+    return None
+
+
+def infer_company_tier(business_profile: dict | None) -> str:
+    """Coarse company-tier from business profile.
+
+    Used as the baseline when vendor-specific signals are weak.
+    """
+    if not business_profile:
+        return "professional"
+    bp = business_profile
+    features = {f.lower() for f in (bp.get("features") or [])}
+    btype = (bp.get("type") or "").lower()
+    # Heavy enterprise-only signals
+    if any(f in features for f in ("multi_country", "konzern", "enterprise",
+                                    "international", "automotive", "banking",
+                                    "luxury", "premium")):
+        return "premier"
+    # Large but maybe single-country
+    if "shop" in features or "konfigurator" in features or btype == "b2c":
+        return "enterprise"
+    return "professional"
+
+
+def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
+    """Infer pricing tier for a single vendor from its cookie footprint.
+
+    Signals (the tier starts at company_tier and is capped at 'premier'):
+      - cookie_count >= 60         → +1 tier
+      - cookie_count >= 30         → recorded as a signal only
+      - premium-feature cookie hit → +1 tier per matching pattern
+      - >= 60% third-party cookies (with >= 10 cookies) → signal only
+      - >= 3 cookies with expiry >= 2 years → signal only
+    """
+    cookies = vendor.get("cookies") or []
+    n_cookies = len(cookies)
+    cookie_names = [(c.get("name") or "").lower() for c in cookies]
+    signals: list[str] = []
+
+    base_tiers = ["starter", "professional", "enterprise", "premier"]
+    # Start at company-tier as baseline
+    idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1
+
+    if n_cookies >= 60:
+        idx = min(len(base_tiers) - 1, idx + 1)
+        signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
+    elif n_cookies >= 30:
+        signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")
+
+    # Premium feature detection
+    vk = _vendor_key(vendor.get("name", ""))
+    for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
+        if vk and vk != expected_key and expected_key not in (vendor.get("name") or "").lower():
+            continue
+        for cn in cookie_names:
+            if re.search(pattern, cn, re.IGNORECASE):  # cn is lowercased
+                idx = min(len(base_tiers) - 1, idx + 1)
+                signals.append(f"Premium-Feature-Cookie: {feature_label}")
+                break
+
+    # Heavy third-party tracking
+    third_party_ratio = sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
+    if third_party_ratio >= 0.6 and n_cookies >= 10:
+        signals.append(f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies — Tracking-Heavy")
+
+    # Long-lived cookies
+    long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
+    if long_lived >= 3:
+        signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")
+
+    return base_tiers[idx], signals
+
+
+def _expiry_years(expiry_str: str) -> float:
+    """Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → 0.25"""
+    s = (expiry_str or "").lower()
+    m = re.search(r"(\d+)\s*(jahr|year)", s)
+    if m: return float(m.group(1))
+    m = re.search(r"(\d+)\s*(monat|month)", s)
+    if m: return float(m.group(1)) / 12.0
+    m = re.search(r"(\d+)\s*(tag|day)", s)
+    if m: return float(m.group(1)) / 365.0
+    return 0.0
+
+
+def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
+    """Return cost estimation for one vendor incl. tier inference + signals."""
+    vk = _vendor_key(vendor.get("name", ""))
+    company_tier = infer_company_tier(business_profile)
+
+    if not vk:
+        return {
+            "vendor": vendor.get("name", ""),
+            "matched_pricing_key": None,
+            "inferred_tier": None,
+            "tier_signals": [],
+            "company_tier_baseline": company_tier,
+            "cost_year_eur_range": (0, 0),
+            "confidence": "none",
+            "note": "Kein Pricing-Eintrag fuer diesen Anbieter — Saving-Schaetzung uebergangen.",
+        }
+
+    tier, signals = infer_vendor_tier(vendor, company_tier)
+    pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
+    confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")
+
+    return {
+        "vendor": vendor.get("name", ""),
+        "matched_pricing_key": vk,
+        "inferred_tier": tier,
+        "tier_signals": signals,
+        "company_tier_baseline": company_tier,
+        "cost_year_eur_range": pricing,
+        "confidence": confidence,
+    }
+
+
+def estimate_total_stack_cost(
+    vendors: Iterable[dict],
+    business_profile: dict | None = None,
+) -> dict:
+    """Aggregate cost estimation over all vendors.
+
+    Returns a dict with:
+      - per_vendor: one estimate per vendor
+      - total_year_eur_range: summed (low, high) range
+      - master_contracts_counted: number of distinct contract keys
+      - disclaimer: caveats for the estimate
+    Vendors with recipient_type 'INTERNAL' that share a pricing key are
+    bundled into ONE master contract (cost counted once per key).
+    """
+    per_vendor: list[dict] = []
+    seen_master_keys: set[tuple[str, str]] = set()
+    total_low = 0
+    total_high = 0
+
+    for v in vendors:
+        est = estimate_vendor_cost(v, business_profile)
+        per_vendor.append(est)
+        if not est["matched_pricing_key"]:
+            continue
+        rtype = (v.get("recipient_type") or "").upper()
+        master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
+        if rtype == "INTERNAL" and master_key in seen_master_keys:
+            # Same Adobe contract serves many "BMW AG — Adobe XYZ" lines —
+            # count cost only ONCE per (key, internal).
+            est["bundled_into_master_contract"] = True
+            continue
+        seen_master_keys.add(master_key)
+        lo, hi = est["cost_year_eur_range"]
+        total_low += lo
+        total_high += hi
+
+    return {
+        "per_vendor": per_vendor,
+        "total_year_eur_range": (total_low, total_high),
+        "master_contracts_counted": len(seen_master_keys),
+        "disclaimer": (
+            "Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
+            "Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
+            "koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
+            "Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
+        ),
+    }
@@ -0,0 +1,727 @@
+"""
+Vendor Redundancy + EU-Alternatives Analyzer.
+
+Input: list of vendors from the CMP capture (e.g. BMW, 90 vendors).
+Output: four structured lists that are rendered in the email and the
+migration modal:
+
+  1. functional_categories : vendor → functional class (analytics,
+     advertising, cdn, captcha, chat, …)
+  2. redundancies          : categories with ≥2 vendors doing the same
+                             job → consolidation potential
+  3. eu_alternatives       : per US vendor, a matching EU replacement
+                             from a curated lookup table (Matomo for
+                             Adobe Analytics, IONOS for AWS, etc.)
+  4. multi_function_tools  : EU tools that cover several categories
+                             (e.g. SAP CX = analytics + CRM + marketing)
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+from collections import defaultdict
+from typing import Iterable
+
+logger = logging.getLogger(__name__)
+
+
+# ─── Categorization ───────────────────────────────────────────────────
+
+# Substring match (lowercase) → category. First match wins.
+_CATEGORY_RULES: list[tuple[str, str]] = [
+    # Web Analytics / Behavior
+    ("adobe analytics",        "web_analytics"),
+    ("adobe target",           "personalisation"),
+    ("adobe campaign",         "marketing_automation"),
+    ("adobe staging library",  "tag_management"),
+    ("adobelaunch",            "tag_management"),
+    ("google analytics",       "web_analytics"),
+    ("matomo",                 "web_analytics"),
+    ("hotjar",                 "web_analytics"),
+    ("content square",         "web_analytics"),
+    ("contentsquare",          "web_analytics"),
+    ("dynatrace",              "monitoring"),
+    ("performance analytics",  "web_analytics"),
+    ("form analytics",         "web_analytics"),
+    ("form campaign analytics","web_analytics"),
+    ("psyma",                  "survey"),
+    ("qualtrics",              "survey"),
+
+    # Tag Management
+    ("google tag manager",     "tag_management"),
+    ("gtm",                    "tag_management"),
+
+    # Advertising / Retargeting
+    ("google ads",             "advertising"),
+    ("google advertising",     "advertising"),
+    ("doubleclick",            "advertising"),
+    ("googleads",              "advertising"),
+    ("meta pixel",             "advertising"),
+    ("meta platforms",         "advertising"),
+    ("facebook",               "advertising"),
+    ("adform",                 "advertising"),
+    ("criteo",                 "advertising"),
+    ("outbrain",               "advertising"),
+    ("taboola",                "advertising"),
+    ("teads",                  "advertising"),
+    ("pinterest",              "advertising"),
+    ("linkedin insight",       "advertising"),
+    ("youtube performance",    "advertising"),
+    ("youtube player",         "external_media"),
+    ("amazon advertising",     "advertising"),
+    ("instagram",              "advertising"),
+    ("dotaki",                 "advertising"),
+
+    # Video / Embeds
+    ("youtube",                "external_media"),
+    ("vimeo",                  "external_media"),
+    ("jw player",              "external_media"),
+    ("jw video",               "external_media"),
+    ("jwplayer",               "external_media"),
+    ("jwconnatix",             "external_media"),
+
+    # Maps / Geo
+    ("google maps",            "maps"),
+    ("google geolocation",     "maps"),
+    ("geolocation",            "maps"),
+
+    # CDN / Infrastructure
+    ("akamai",                 "cdn"),
+    ("amazon web services",    "cloud_infra"),
+    ("aws",                    "cloud_infra"),
+    ("baqend",                 "cdn"),
+    ("speedkit",               "cdn"),
+    ("speedcurve",             "monitoring"),
+    ("salesforce",             "crm"),
+
+    # Chat / Support
+    ("genesys",                "chat"),
+    ("ckm",                    "chat"),
+    ("chat widget",            "chat"),
+
+    # Captcha / Bot-Protection
+    ("hcaptcha",               "captcha"),
+    ("recaptcha",              "captcha"),
+
+    # Sales / Lead-Tracking
+    ("salesviewer",            "lead_tracking"),
+
+    # Marketing/Sales overlay
+    ("nayoki",                 "social_aggregator"),
+
+    # Site-owned functions
+    ("infrastructure",         "site_infra"),
+    ("infrastrukturbereit",    "site_infra"),
+    ("javaserverpages",        "site_infra"),
+    ("single sign-on",         "auth"),
+    ("mybmw account",          "auth"),
+    ("sso",                    "auth"),
+    ("consent",                "consent_management"),
+    ("session",                "site_infra"),
+    ("scroll",                 "site_infra"),
+    ("sticky",                 "site_infra"),
+    ("sidebar",                "site_infra"),
+    ("dealer search",          "site_feature"),
+    ("test drive",             "site_feature"),
+    ("vehicle configurator",   "site_feature"),
+    ("stocklocator",           "site_feature"),
+    ("eshop",                  "site_feature"),
+    ("shop",                   "site_feature"),
+    ("language",               "site_infra"),
+    ("sprach",                 "site_infra"),
+    ("region",                 "site_infra"),
+    ("ip popup",               "site_infra"),
+    ("popup",                  "site_infra"),
+    ("dynatrace",              "monitoring"),
+]
+
+
+def classify_vendor(name: str) -> str:
+    """Map a vendor name to a functional category."""
+    n = (name or "").lower()
+    for needle, cat in _CATEGORY_RULES:
+        if needle in n:
+            return cat
+    return "other"
+
+
+# ─── EU alternatives ─────────────────────────────────────────────────
+
+# Curated list: matching EU replacement(s) per US/non-EU vendor.
+# Sources: Matomo comparison, etracker SoMo study, IONOS packages,
+# Friendly Captcha whitepaper, SAP CX suite, Brevo / CleverReach DE lists.
+_EU_ALTERNATIVES: dict[str, list[dict]] = {
+    "adobe analytics": [
+        {"name": "Matomo (On-Premise)", "vendor": "InnoCraft", "country": "DE-self-hosted",
+         "license": "GPL", "notes": "100% DSGVO, keine 3rd-Country, gleicher Funktionsumfang"},
+        {"name": "etracker Analytics", "vendor": "etracker GmbH", "country": "DE",
+         "license": "Commercial", "notes": "DSGVO-konform aus Hamburg, IP-Anonymisierung"},
+        {"name": "Mapp Intelligence", "vendor": "Mapp Digital", "country": "DE",
+         "license": "Commercial", "notes": "Enterprise-Alternative, Server in DE"},
+    ],
+    "google analytics": [
+        {"name": "Matomo", "vendor": "InnoCraft", "country": "DE-self-hosted",
+         "license": "GPL", "notes": "Direkter Drop-in-Ersatz mit GA-Migrationspfad"},
+        {"name": "Plausible Analytics", "vendor": "Plausible Insights", "country": "EE",
+         "license": "AGPL/Commercial", "notes": "Cookielos, ohne Einwilligung nutzbar"},
+        {"name": "Fathom Analytics EU", "vendor": "Fathom", "country": "DE-Region",
+         "license": "Commercial", "notes": "Cookielos, EU-Hosting"},
+    ],
+    "content square": [
+        {"name": "Mouseflow EU", "vendor": "Mouseflow ApS", "country": "DK",
+         "license": "Commercial", "notes": "Session-Recording + Heatmaps EU-Hosting"},
+        {"name": "Hotjar EU", "vendor": "Hotjar Ltd", "country": "MT",
+         "license": "Commercial", "notes": "EU-DataCenter (Frankfurt), Einwilligung erforderlich"},
+    ],
+    "dynatrace": [
+        {"name": "Dynatrace EU", "vendor": "Dynatrace", "country": "AT",
+         "license": "Commercial", "notes": "Bereits EU (Linz). Cluster auf EU einstellen"},
+    ],
+    "speedcurve": [
+        {"name": "SpeedCurve EU", "vendor": "SpeedCurve", "country": "EU-tenant",
+         "license": "Commercial", "notes": "Region-Tenant explizit konfigurieren"},
+        {"name": "Calibre", "vendor": "Calibre", "country": "AU/EU",
+         "license": "Commercial", "notes": "Performance Monitoring, EU-Region"},
+    ],
+    "akamai": [
+        {"name": "Bunny CDN", "vendor": "BunnyWay d.o.o.", "country": "SI",
+         "license": "Commercial", "notes": "Slowenischer CDN, EU-Backbone"},
+        {"name": "Cloudflare EU-Only", "vendor": "Cloudflare", "country": "Multi",
+         "license": "Commercial", "notes": "EU-Datacenter erzwingbar via 'Geo Steering'"},
+        {"name": "IONOS CDN", "vendor": "IONOS SE", "country": "DE",
+         "license": "Commercial", "notes": "100% DE-Hosting"},
+    ],
+    "amazon web services": [
+        {"name": "IONOS Cloud", "vendor": "IONOS SE", "country": "DE",
+         "license": "Commercial", "notes": "DE-Hosting, BSI C5-zertifiziert"},
+        {"name": "OVHcloud", "vendor": "OVH SAS", "country": "FR",
+         "license": "Commercial", "notes": "FR-Hosting, SecNumCloud-zertifiziert"},
+        {"name": "Hetzner Cloud", "vendor": "Hetzner Online GmbH", "country": "DE",
+         "license": "Commercial", "notes": "DE/FI-Hosting, sehr kostenguenstig"},
+        {"name": "STACKIT", "vendor": "Schwarz IT (Lidl-Gruppe)", "country": "DE",
+         "license": "Commercial", "notes": "Souveraener DE-Cloud, fuer Enterprise"},
+    ],
+    "salesforce": [
+        {"name": "SAP Customer Experience", "vendor": "SAP SE", "country": "DE",
+         "license": "Commercial", "notes": "Vollstaendige CRM-Suite EU-Hosting"},
+        {"name": "weclapp", "vendor": "weclapp SE", "country": "DE",
+         "license": "Commercial", "notes": "Cloud-CRM aus Marburg"},
+    ],
+    "adobe campaign": [
+        {"name": "CleverReach", "vendor": "CleverReach GmbH", "country": "DE",
+         "license": "Commercial", "notes": "E-Mail-Marketing DE-Hosting"},
+        {"name": "Brevo (Sendinblue)", "vendor": "Brevo", "country": "FR",
+         "license": "Commercial", "notes": "Marketing-Automation EU-Hosting"},
+        {"name": "Inxmail", "vendor": "Inxmail GmbH", "country": "DE",
+         "license": "Commercial", "notes": "Enterprise-E-Mail-Marketing aus Freiburg"},
+    ],
+    "google ads": [
+        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
+         "license": "Commercial", "notes": "FR-Hosting, Programmatic + Direct-Sold"},
+        {"name": "Bing Ads (Microsoft Advertising EU)", "vendor": "Microsoft", "country": "Multi",
+         "license": "Commercial", "notes": "EU-Datacenter optional"},
+    ],
+    "google maps": [
+        {"name": "HERE Maps", "vendor": "HERE Technologies", "country": "DE",
+         "license": "Commercial", "notes": "Berliner Anbieter, professionelle Karten + Routing"},
+        {"name": "OpenStreetMap (self-host)", "vendor": "OSM Foundation", "country": "DE-self-host",
+         "license": "ODbL", "notes": "Frei, OSM-Tiles self-hosted oder via Maptiler EU"},
+        {"name": "Maptiler Cloud EU", "vendor": "MapTiler", "country": "CH",
+         "license": "Commercial", "notes": "Schweizer Anbieter, EU-Tiles"},
+    ],
+    "criteo": [  # criteo IS EU but use as example for retargeting alts
+        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
+         "license": "Commercial", "notes": "Retargeting + Display, FR-Hosting"},
+    ],
+    "hcaptcha": [
+        {"name": "Friendly Captcha", "vendor": "Friendly Captcha GmbH", "country": "DE",
+         "license": "Commercial", "notes": "100% DSGVO, ohne Cookie, Hosting in DE"},
+        {"name": "Turnstile (Cloudflare EU-Only)", "vendor": "Cloudflare", "country": "Multi",
+         "license": "Commercial", "notes": "Ohne Cookie, EU-Region erzwingbar"},
+    ],
+    "qualtrics": [
+        {"name": "LamaPoll", "vendor": "Lamano GmbH", "country": "DE",
+         "license": "Commercial", "notes": "DSGVO-Surveys aus Berlin"},
+        {"name": "evasys", "vendor": "evasys GmbH", "country": "DE",
+         "license": "Commercial", "notes": "Enterprise-Survey-Plattform aus Lueneburg"},
+    ],
+    "meta pixel": [
+        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
+         "license": "Commercial", "notes": "EU-Alternative fuer Conversion-Tracking"},
+    ],
+    "facebook": [
+        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
+         "license": "Commercial", "notes": "Programmatic ohne Meta"},
+    ],
+    "linkedin insight": [
+        {"name": "Xing Insights", "vendor": "New Work SE", "country": "DE",
+         "license": "Commercial", "notes": "DE/AT/CH B2B-Targeting aus Hamburg"},
+    ],
+    "outbrain": [
+        {"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
+         "license": "Commercial", "notes": "Native Advertising aus Berlin"},
+    ],
+    "taboola": [
+        {"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
+         "license": "Commercial", "notes": "Native Advertising aus Berlin"},
+    ],
+    "genesys": [
+        {"name": "Userlike", "vendor": "Userlike UG", "country": "DE",
+         "license": "Commercial", "notes": "Live-Chat aus Koeln, BSI-konform"},
+        {"name": "LiveZilla / EasyChat EU", "vendor": "LiveZilla GmbH", "country": "DE",
+         "license": "Commercial", "notes": "DSGVO-Live-Chat"},
+    ],
+    "salesviewer": [
+        {"name": "Leadinfo", "vendor": "Leadinfo BV", "country": "NL",
+         "license": "Commercial", "notes": "B2B-Webvisitor-Tracking EU"},
+        {"name": "Albacross EU", "vendor": "Albacross", "country": "SE",
+         "license": "Commercial", "notes": "EU-Tenant verfuegbar"},
+    ],
+    "youtube": [
+        {"name": "Vimeo Pro EU", "vendor": "Vimeo", "country": "Multi",
+         "license": "Commercial", "notes": "EU-Region waehlbar, weniger Tracking"},
+        {"name": "Self-hosted video (BunnyStream)", "vendor": "BunnyWay", "country": "SI",
+         "license": "Commercial", "notes": "Eigene Player + CDN ohne Drittanbieter"},
+    ],
+    "amazon advertising": [
+        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
+         "license": "Commercial", "notes": "Retail-Media-Alternative FR"},
+    ],
+    "instagram": [
+        {"name": "Pinterest EU + Owned-Channels", "vendor": "Mix", "country": "Multi",
+         "license": "Commercial", "notes": "Owned-Channels (Newsletter via CleverReach)"},
+    ],
+}
+
+
+# ─── Cost assumptions (public list prices, estimates) ─────────────
+#
+# Format: (low_year_eur, high_year_eur, tier_assumption)
+# Tier: 'sme' = <100 employees, 'mid' = 100-1000, 'ent' = >1000.
+# Sources: public list prices plus industry benchmarks (Gartner,
+# Forrester 2025). Actual contract terms can deviate by 30-70%
+# (volume discounts, bundling). The output explicitly labels these
+# figures as an estimated range ('Schaetzbereich').
+
+_COST_LOOKUP: dict[str, tuple[int, int, str]] = {
+    "adobe analytics":      (120_000, 600_000, "ent"),
+    "adobe target":         ( 80_000, 350_000, "ent"),
+    "adobe campaign":       ( 60_000, 250_000, "ent"),
+    "adobe staging library":(      0,       0, "ent"),  # bundled
+    "google analytics":     (      0, 150_000, "ent"),  # GA4 free, GA360 ~150k
+    "matomo":               (  6_000,  30_000, "mid"),  # Cloud/On-Prem
+    "hotjar":               (  3_600,  18_000, "mid"),
+    "content square":       ( 60_000, 300_000, "ent"),
+    "contentsquare":        ( 60_000, 300_000, "ent"),
+    "dynatrace":            ( 50_000, 400_000, "ent"),  # per-host pricing
+    "performance analytics":(  5_000,  40_000, "mid"),
+    "qualtrics":            ( 25_000, 150_000, "ent"),
+
+    # Self-service advertising: NO tool licence, only media spend (tracked separately).
+    # We count 0 here because the savings potential on the licence is 0.
+    # Consolidation would only reduce media spend, which is a separate topic.
+    "google ads":           (      0,       0, "ent"),
+    "google advertising":   (      0,       0, "ent"),
+    "doubleclick":          (      0,       0, "ent"),
+    "meta pixel":           (      0,       0, "ent"),
+    "facebook":             (      0,       0, "ent"),
+    "amazon advertising":   (      0,       0, "ent"),
+    "youtube performance":  (      0,       0, "ent"),
+    "youtube player":       (      0,       0, "ent"),
+    "instagram":            (      0,       0, "ent"),
+    # Real DSP/platform licences: here the customer pays a SaaS fee
+    # ON TOP of the media spend. Range deliberately kept narrow (high is at most 5x low).
+    "adform":               ( 80_000, 300_000, "ent"),
+    "criteo":               ( 50_000, 200_000, "ent"),
+    "outbrain":             ( 30_000, 120_000, "ent"),
+    "taboola":              ( 30_000, 120_000, "ent"),
+    "teads":                ( 25_000, 100_000, "ent"),
+    "pinterest":            ( 15_000,  60_000, "ent"),
+    "linkedin insight":     ( 10_000,  50_000, "ent"),
+
+    "google maps":          (  2_000,  30_000, "mid"),
+    "akamai":               ( 50_000, 500_000, "ent"),
+    "amazon web services":  (100_000, 3_000_000, "ent"),
+    "baqend":               (  6_000,  60_000, "mid"),
+    "speedkit":             (  6_000,  60_000, "mid"),
+    "speedcurve":           (  2_400,  24_000, "mid"),
+
+    "salesforce":           (100_000, 1_500_000, "ent"),  # CRM seats
+    "genesys":              ( 80_000, 800_000, "ent"),  # contact-center seats
+    "ckm":                  ( 15_000, 120_000, "mid"),
+    "hcaptcha":             (      0,  12_000, "sme"),  # free tier OR pro
+
+    "salesviewer":          (  3_600,  18_000, "mid"),
+    "youtube":              (      0,  50_000, "ent"),  # embed is free, production costs vary
+}
+
+
+# ─── EU alternative costs: (low_year_eur, high_year_eur), no tier field ───
+
+_EU_ALT_COSTS: dict[str, tuple[int, int]] = {
+    "Matomo (On-Premise)":          (  3_000,   15_000),
+    "Matomo (Pro / Cloud EU)":      (  6_000,   30_000),
+    "Matomo":                       (  6_000,   30_000),
+    "etracker Analytics":           ( 10_000,   60_000),
+    "Mapp Intelligence":            ( 40_000,  200_000),
+    "Plausible Analytics":          (    240,    6_000),
+    "Fathom Analytics EU":          (    240,    6_000),
+    "Mouseflow EU":                 ( 12_000,   60_000),
+    "Hotjar EU":                    (  3_600,   18_000),
+    "Dynatrace EU":                 ( 50_000,  400_000),  # same price, only the region differs
+    "SpeedCurve EU":                (  2_400,   24_000),
+    "Calibre":                      (  3_600,   30_000),
+    "Bunny CDN":                    (  1_200,   12_000),
+    "Cloudflare EU-Only":           (  6_000,   80_000),
+    "IONOS CDN":                    (  3_000,   30_000),
+    "IONOS Cloud":                  ( 30_000,  600_000),
+    "OVHcloud":                     ( 30_000,  600_000),
+    "Hetzner Cloud":                (  6_000,  120_000),
+    "STACKIT":                      ( 50_000,  800_000),
+    "SAP Customer Experience":      ( 80_000, 1_200_000),
+    "weclapp":                      ( 12_000,   80_000),
+    "CleverReach":                  (  2_400,   24_000),
+    "Brevo (Sendinblue)":           (    600,   24_000),
+    "Inxmail":                      (  8_000,   60_000),
+    "Smart AdServer (Equativ)":     ( 30_000,  300_000),
+    "Bing Ads (Microsoft Advertising EU)": ( 30_000, 3_000_000),
+    "HERE Maps":                    (  1_200,   24_000),
+    "OpenStreetMap (self-host)":    (      0,    6_000),  # server costs only
+    "Maptiler Cloud EU":            (    600,   12_000),
+    "Friendly Captcha":             (    600,    9_600),
+    "Turnstile (Cloudflare EU-Only)": (    0,    6_000),
+    "LamaPoll":                     (  1_200,   24_000),
+    "evasys":                       (  6_000,   60_000),
+    "Xing Insights":                (  6_000,   60_000),
+    "Plista":                       ( 20_000,  150_000),
+    "Userlike":                     (  1_200,   30_000),
+    "LiveZilla / EasyChat EU":      (    600,   12_000),
+    "Leadinfo":                     (  1_200,   12_000),
+    "Albacross EU":                 (  3_600,   24_000),
+    "Vimeo Pro EU":                 (    900,    6_000),
+    "Self-hosted video (BunnyStream)": (   600,   12_000),
+    "Pinterest EU + Owned-Channels": (   600,   24_000),
+}
+
+
+# ─── Known legitimate reasons for duplicates (reported as caveats against consolidation) ─
+
+_DUPLICATION_CAVEATS = {
+    "web_analytics": [
+        "A/B-Vergleich verschiedener Anbieter waehrend Migration",
+        "Marketing nutzt Adobe, Produkt nutzt Matomo — Inhouse-Politik",
+        "Regional split (Adobe fuer DE, GA fuer International)",
+    ],
+    "advertising": [
+        "Brand-Kampagne vs Performance-Kampagne (verschiedene DSPs)",
+        "Saisonal: Black Friday/Super Bowl nutzt mehr Kanaele",
+        "Markenspezifisch: BMW M-Modelle anders targetet als 1er-Serie",
+    ],
+    "cdn": [
+        "Multi-CDN-Strategie fuer Ausfallsicherheit (Akamai + Cloudflare)",
+        "Event-CDN-Spike (Auto-Show, Modell-Launch) braucht Skalierung",
+        "Regionale Latenz-Optimierung (Akamai APAC, AWS US)",
+    ],
+    "marketing_automation": [
+        "Salesforce Marketing Cloud fuer B2C, Adobe Campaign fuer B2B",
+        "Lead-Generierung (Adobe) vs Loyalitaet (Salesforce)",
+    ],
+    "monitoring": [
+        "APM (Dynatrace) misst Backend, RUM (SpeedCurve) misst Frontend",
+    ],
+    "captcha": [
+        "Stufenweise Migration zu cookieless Captcha",
+    ],
+}
+
+
+def _company_tier_bounds(company_tier: str | None) -> tuple[float, float]:
+    """Return the (low, high) fractions of the list-price range to use for
+    a given company tier. Higher tiers use the upper part of the range
+    (0.40-0.85 for 'enterprise', 0.70-1.00 for 'premier').
+    """
+    t = (company_tier or "professional").lower()
+    if t == "premier":   return (0.70, 1.00)
+    if t == "enterprise": return (0.40, 0.85)
+    if t == "professional": return (0.20, 0.60)
+    return (0.05, 0.40)  # 'starter' and any unknown tier
+
+
+def _estimate_savings_for_redundancy(
+    redundancy: dict, vendors: Iterable[dict],
+    company_tier: str = "enterprise",
+) -> dict:
+    """Estimated range per redundancy: current cost plus EU consolidation saving.
+
+    Takes company_tier into account: for a corporation such as BMW the
+    starter end of the range should not be shown. Each bound of the realistic
+    range is lo + (hi - lo) * frac, using the two tier-bound fractions.
+    """
+    low_frac, high_frac = _company_tier_bounds(company_tier)
+    current_low = current_high = 0
+    matched_vendors = []
+    cat_vendors = [v for v in vendors if v.get("name") in redundancy.get("vendors", [])]
+    for v in cat_vendors:
+        name = (v.get("name") or "").lower()
+        for k, (lo, hi, _tier) in _COST_LOOKUP.items():
+            if k in name:
+                # Tier-aware: take the low_frac..high_frac slice of the price range
+                span = hi - lo
+                current_low  += int(lo + span * low_frac)
+                current_high += int(lo + span * high_frac)
+                matched_vendors.append(v.get("name"))
+                break
+
+    # Consolidation: a single EU tool replaces all vendors in the category
+    suggested_eu = None
+    suggested_low = suggested_high = 0
+    # 1. Multi-function tool that covers this category
+    for tool in _MULTI_FUNCTION_TOOLS:
+        if redundancy["category"] in tool["covers"]:
+            suggested_eu = tool["name"]
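+            # Tool names may carry suffixes ("... Suite"), so fall back to a prefix match.
+            cost = _EU_ALT_COSTS.get(tool["name"]) or next(
+                (c for k, c in _EU_ALT_COSTS.items() if tool["name"].startswith(k)), None)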
+            if cost:
+                suggested_low, suggested_high = cost
+            break
+    # 2. Otherwise: EU alternative from the lookup, BUT ONLY FOR VENDORS
+    #    IN THE CURRENT CATEGORY (otherwise Userlike shows up for advertising)
+    if not suggested_eu:
+        for v in cat_vendors:
+            n = (v.get("name") or "").lower()
+            for k, alts in _EU_ALTERNATIVES.items():
+                if k in n and alts:
+                    suggested_eu = alts[0]["name"]
+                    cost = _EU_ALT_COSTS.get(alts[0]["name"])
+                    if cost:
+                        suggested_low, suggested_high = cost
+                    break
+            if suggested_eu:
+                break
+
+    saving_low  = max(0, current_low  - suggested_high)
+    saving_high = max(0, current_high - suggested_low)
+
+    return {
+        "current_estimate_year_eur": [current_low, current_high],
+        "suggested_eu_tool": suggested_eu,
+        "suggested_estimate_year_eur": [suggested_low, suggested_high],
+        "estimated_saving_year_eur": [saving_low, saving_high],
+        "caveats": _DUPLICATION_CAVEATS.get(redundancy["category"], []),
+        "cost_disclaimer": (
+            "Schaetzbereich auf Basis oeffentlicher Listenpreise. Tatsaechliche "
+            "Vertragspreise koennen 30-70% niedriger liegen (Volumen, Bundling, "
+            "Konzern-Konditionen). Bitte mit der jeweiligen Einkaufsabteilung verifizieren."
+        ),
+    }
+
+
+# ─── Multi-function tools (consolidation anchors) ──────────────────
+
+_MULTI_FUNCTION_TOOLS = [
+    {
+        "name": "Matomo (Pro / Cloud EU)",
+        "vendor": "InnoCraft",
+        "country": "DE-self-host / EU",
+        "covers": ["web_analytics", "tag_management", "personalisation"],
+        "notes": "Ersetzt Adobe Analytics + GTM + Adobe Target in einem Tool. "
+                 "100% DSGVO ohne Einwilligung wenn IP anonymisiert.",
+    },
+    {
+        "name": "SAP Customer Experience Suite",
+        "vendor": "SAP SE",
+        "country": "DE",
+        "covers": ["crm", "marketing_automation", "personalisation", "survey"],
+        "notes": "Ersetzt Salesforce + Adobe Campaign + Qualtrics. EU-Hosting, "
+                 "tiefe ERP-Integration.",
+    },
+    {
+        "name": "IONOS Cloud (Compute + CDN + Storage + DNS)",
+        "vendor": "IONOS SE",
+        "country": "DE",
+        "covers": ["cloud_infra", "cdn", "monitoring"],
+        "notes": "Ersetzt AWS + Akamai + zusaetzliches Monitoring in einer "
+                 "DE-Cloud (BSI C5).",
+    },
+    {
+        "name": "Userlike Suite",
+        "vendor": "Userlike UG",
+        "country": "DE",
+        "covers": ["chat", "consent_management"],
+        "notes": "Ersetzt Genesys Chat. Bietet eigenes Consent-Modul.",
+    },
+    {
+        "name": "Smart AdServer (Equativ)",
+        "vendor": "Equativ",
+        "country": "FR",
+        "covers": ["advertising"],
+        "notes": "Ersetzt Mehrfach-DSPs (Adform/Criteo/Outbrain/Taboola/Meta) "
+                 "durch Programmatic+Direct-Sold EU-Stack.",
+    },
+    {
+        "name": "HERE Maps",
+        "vendor": "HERE Technologies",
+        "country": "NL",
+        "covers": ["maps"],
+        "notes": "Niederlaendischer Anbieter mit grossem Standort in Berlin, professionelle Karten + Routing.",
+    },
+    {
+        "name": "Vimeo Pro EU (oder self-hosted BunnyStream)",
+        "vendor": "Vimeo / BunnyWay",
+        "country": "Multi / SI",
+        "covers": ["external_media"],
+        "notes": "Ersetzt YouTube-Embeds + JW Player in einem Player.",
+    },
+    {
+        "name": "LamaPoll",
+        "vendor": "Lamano GmbH",
+        "country": "DE",
+        "covers": ["survey"],
+        "notes": "DSGVO-Surveys aus Berlin. Ersetzt Qualtrics / Psyma.",
+    },
+]
+
+
+# ─── Analysis ────────────────────────────────────────────────────────
+
+def analyze(vendors: Iterable[dict], company_tier: str = "enterprise") -> dict:
+    """Main entry. Returns categorised view + redundancies + EU options.
+
+    `company_tier` (starter|professional|enterprise|premier) controls the
+    cost range so that, for example, a DAX-listed corporation does not get
+    starter prices as the lower bound.
+    """
+    # Materialise once: `vendors` may be a one-shot iterator, yet it is traversed several times.
+    all_vendors_list = vendors = list(vendors)
+    by_cat: dict[str, list[dict]] = defaultdict(list)
+    for v in all_vendors_list:
+        cat = classify_vendor(v.get("name") or "")
+        by_cat[cat].append(v)
+
+    # Redundancies: any category with ≥2 vendors (excl. site-internal cats)
+    skip_redundancy_cats = {"site_infra", "site_feature", "consent_management",
+                            "auth", "other"}
+    redundancies: list[dict] = []
+    for cat, vs in by_cat.items():
+        if cat in skip_redundancy_cats or len(vs) < 2:
+            continue
+        red = {
+            "category": cat,
+            "category_label": _CATEGORY_LABEL.get(cat, cat),
+            "count": len(vs),
+            "vendors": [v.get("name", "") for v in vs],
+            "consolidation_hint": _CONSOLIDATION_HINT.get(cat, ""),
+        }
+        red.update(_estimate_savings_for_redundancy(
+            red, all_vendors_list, company_tier))
+        redundancies.append(red)
+    redundancies.sort(key=lambda r: -(r.get("estimated_saving_year_eur") or [0, 0])[1])
+
+    # EU alternatives lookup
+    eu_alternatives: list[dict] = []
+    seen = set()
+    for v in vendors:
+        name = v.get("name") or ""
+        n_lower = name.lower()
+        for k, alts in _EU_ALTERNATIVES.items():
+            if k in n_lower and k not in seen:
+                eu_alternatives.append({
+                    "current_vendor": name,
+                    "current_recipient_type": v.get("recipient_type", ""),
+                    "matched_key": k,
+                    "alternatives": alts,
+                })
+                seen.add(k)
+                break
+
+    # Multi-function tool recommendations: only if the customer has vendors
+    # across the categories the tool covers
+    present_cats = set(by_cat.keys())
+    multi_function = []
+    for tool in _MULTI_FUNCTION_TOOLS:
+        covered_here = [c for c in tool["covers"] if c in present_cats]
+        if len(covered_here) >= 2:
+            # Collect vendor names rather than summing counts, so duplicates are removed
+            unique_vendors: set[str] = set()
+            for c in covered_here:
+                for v in by_cat[c]:
+                    unique_vendors.add(v.get("name", ""))
+            multi_function.append({
+                **tool,
+                "replaces_categories": covered_here,
+                "potential_replacements": len(unique_vendors),
+            })
+    multi_function.sort(key=lambda t: -t["potential_replacements"])
+
+    total_current_low = sum((r.get("current_estimate_year_eur") or [0, 0])[0] for r in redundancies)
+    total_current_high = sum((r.get("current_estimate_year_eur") or [0, 0])[1] for r in redundancies)
+    total_saving_low = sum((r.get("estimated_saving_year_eur") or [0, 0])[0] for r in redundancies)
+    total_saving_high = sum((r.get("estimated_saving_year_eur") or [0, 0])[1] for r in redundancies)
+
+    # Both bounds use the same denominator (midpoint of the current
+    # estimate); otherwise the upper bound explodes when current_low
+    # is small. Capped at 95%.
+    if total_current_high:
+        mid = (total_current_low + total_current_high) / 2
+        saving_pct = (f"{min(95, int(100 * total_saving_low / mid))}–"
+                      f"{min(95, int(100 * total_saving_high / mid))}%")
+    else:
+        saving_pct = "n/a"
+
+    return {
+        "summary": {
+            "total_vendors": len(all_vendors_list),
+            "distinct_categories": len([c for c in by_cat if c != "other"]),
+            "redundancy_count": len(redundancies),
+            "eu_alternative_count": len(eu_alternatives),
+            "consolidation_potential": sum(r["count"] - 1 for r in redundancies),
+            "estimated_current_year_eur": [total_current_low, total_current_high],
+            "estimated_saving_year_eur": [total_saving_low, total_saving_high],
+            "estimated_saving_pct": saving_pct,
+            "cost_disclaimer": (
+                "Schaetzbereich auf Basis oeffentlicher Listenpreise (Gartner, Forrester 2025). "
+                "Vertragspreise koennen 30-70% niedriger liegen (Volumen-Rabatte, Konzern-Konditionen, "
+                "Bundling). Werte dienen als Diskussionsgrundlage mit dem Einkauf, NICHT als Angebot."
+            ),
+        },
+        "by_category": {cat: [v.get("name", "") for v in vs]
+                        for cat, vs in by_cat.items()},
+        "redundancies": redundancies,
+        "eu_alternatives": eu_alternatives,
+        "multi_function_tools": multi_function,
+    }
+
+
+_CATEGORY_LABEL = {
+    "web_analytics":       "Web-Analytics",
+    "advertising":         "Werbung / Retargeting",
+    "tag_management":      "Tag-Management",
+    "marketing_automation": "Marketing-Automation",
+    "personalisation":     "Personalisierung",
+    "external_media":      "Externe Medien (Video)",
+    "maps":                "Karten / Geo",
+    "cdn":                 "CDN",
+    "cloud_infra":         "Cloud-Infrastruktur",
+    "monitoring":          "Performance-Monitoring",
+    "crm":                 "CRM",
+    "chat":                "Chat / Support",
+    "captcha":             "Bot-Schutz",
+    "lead_tracking":       "Lead-Tracking",
+    "survey":              "Umfragen",
+    "social_aggregator":   "Social-Media-Aggregation",
+    "consent_management":  "Consent-Management",
+    "auth":                "Authentifizierung",
+    "site_infra":          "Eigene Infrastruktur",
+    "site_feature":        "Eigene Features",
+    "other":               "Sonstige",
+}
+
+_CONSOLIDATION_HINT = {
+    "web_analytics":       "Mehrere Analytics-Tools sammeln meist redundante Daten. Ein Tool genuegt — Matomo (DE) ist DSGVO-Standard.",
+    "advertising":         "Werbe-/Retargeting-Pixel sind oft austauschbar. Konzentration auf 2-3 Kanaele senkt Drittland-Risiko.",
+    "external_media":      "Mehrere Video-Embeds nur wenn fachlich noetig. Self-hosted (BunnyStream/Vimeo) reduziert Tracking.",
+    "maps":                "Eine Karten-Loesung reicht. HERE Maps (NL) als EU-Alternative zu Google Maps.",
+    "cdn":                 "Ein CDN+Performance-Stack genuegt. IONOS oder Bunny vereinen mehrere Funktionen.",
+    "marketing_automation": "Marketing-Cloud + separates E-Mail-Tool sind oft Dopplung — SAP CX oder CleverReach allein moeglich.",
+    "chat":                "Ein Chat-System genuegt. Userlike (DE) ersetzt Genesys-Stack.",
+    "monitoring":          "RUM + APM koennen in einem Tool gebuendelt werden (Dynatrace EU oder Sentry-Self-host).",
+    "survey":              "Eine Survey-Plattform genuegt — LamaPoll (DE) oder Mapp.",
+}