breakpilot-compliance/backend-compliance/compliance/services/vendor_cost_estimator.py
Benjamin Admin 662327e8b4
feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features
Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs
  in compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check.
  Plus fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs
  filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered per business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match:
  per-doc_type threshold override (impressum 0.50, dse/cookie 0.60),
  short-field rescue (15-word chunks) for mandatory fields in the Impressum.
  Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the
  CMP reconstruct; the backend prefers it over DOM extraction when it
  is richer (BMW: 1824 vs 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors,
  detection of multiple providers per category, EU-alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie
  count + premium-feature cookies + third-party ratio → starter/professional/
  enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper
  40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for vendors that are clearly
  in the EU (Adform DK, Pinterest IE).
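
A minimal sketch of how such a legal-form lookup could work. The real cookie_link_validator is not part of this file, so the function name infer_country, both tables, and the "pinterest europe" entry are assumptions made for illustration only:

```python
from __future__ import annotations

import re

# Hypothetical vendor lookup table: explicit overrides win over suffixes.
_VENDOR_COUNTRY = {"pinterest europe": "IE"}

# Hypothetical legal-form suffixes, taken from the examples listed above.
_LEGAL_FORM_COUNTRY = [
    (r"\bA/S$", "DK"),
    (r"\bGmbH$", "DE"),
    (r"\bInc\.?$", "US"),
    (r"\bB\.V\.$", "NL"),
]


def infer_country(vendor_name: str) -> str | None:
    """Return an ISO country code for a vendor name, or None if unknown."""
    name = (vendor_name or "").strip()
    lowered = name.lower()
    # 1) Known vendors first (covers names without a telling legal form).
    for key, country in _VENDOR_COUNTRY.items():
        if key in lowered:
            return country
    # 2) Fall back to the legal-form suffix at the end of the name.
    for pattern, country in _LEGAL_FORM_COUNTRY:
        if re.search(pattern, name):
            return country
    return None


print(infer_country("Adform A/S"))
print(infer_country("Pinterest Europe Ltd."))
```

Checking the lookup table before the suffix list is a design choice of this sketch; it lets an explicit entry override a misleading legal form.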

Action recipes + doc anchor locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) a structured instruction with what/why/fix_text/where/example,
  meant for pasting 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching
  paragraph in the existing customer document for each finding.
  Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor for each failed document
  check, plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with
  4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  (record of processing activities) + privacy-policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). The cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video
  placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500s).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- The A3 v2 audit reclassified 24 UI-language MCs as 'process'.
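
The embedding batching fix in the list above can be illustrated with a small sketch. The embedding service client is not shown in this commit, so embed_all, batched, and the fake_embed_batch stand-in are hypothetical names; only the batch size of 32 and the 165-item call come from the message:

```python
from __future__ import annotations

from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed all texts with one service call per batch, preserving order."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors


# Stand-in for the real embedding service; it only records call sizes.
call_sizes: list[int] = []


def fake_embed_batch(chunk: Sequence[str]) -> list[list[float]]:
    call_sizes.append(len(chunk))
    return [[float(len(text))] for text in chunk]


vectors = embed_all([f"mc-{i}" for i in range(165)], fake_embed_batch)
print(call_sizes)  # five full batches of 32 plus one batch of 5
```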

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE       7.5% -> 81-83%
  Impressum 4%   -> 100% (all 6 real MCs fulfilled)
  Cookie    0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated savings 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:30:08 +02:00

408 lines
15 KiB
Python
"""
Vendor cost estimator: derives a pricing tier per vendor from cookie
signals and returns an intensity-based estimate of the yearly cost.

Cookie signals evaluated:
- Number of cookies per vendor (proxy for module depth)
- Premium-feature cookies (e.g. 's_target_qa', '_ab_test' → enterprise add-on)
- Edge/region cookies (multi-region → premier-tier CDN)
- Cookie persistence (multi-year → heavy-tracking license)

Plus business_profile for company-tier inference.

Output per vendor:
- inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
- tier_signals: list of the indicators that led to the tier
- cost_year_eur_range: (low, high) based on tier × vendor pricing
- confidence: 'low' | 'medium' | 'high' ('none' if no pricing entry matches)

This module complements vendor_redundancy.py: the simple low/high
flat rates there are replaced here by dynamic, signal-based values.
"""
from __future__ import annotations

import logging
import re
from typing import Iterable

logger = logging.getLogger(__name__)

# ─── Premium-feature cookies: indicator for enterprise add-ons ──────
#
# If a vendor sets these cookies, the customer is very likely on an
# enterprise plan.
_PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
    # (regex, vendor_key, premium_feature_label)
    (r"^s_target_qa$", "adobe analytics", "Adobe Target Add-on"),
    (r"adobe.*target", "adobe target", "Personalization Enterprise"),
    (r"^aam_uuid", "adobe analytics", "Audience Manager Enterprise"),
    (r"^s_ecid", "adobe analytics", "Experience Cloud ID Service"),
    (r"^_pcid_", "adobe analytics", "People-Based Destinations"),
    (r"^_gat_gtag_UA", "google analytics", "GA360 Multi-Tracker"),
    (r"^_ga_[A-Z0-9]+_[A-Z0-9]+", "google analytics", "GA4 Enterprise Stream"),
    (r"^_uetmsdns", "microsoft advertising", "Custom Conversion Tracking"),
    (r"^_fbp.*test", "meta pixel", "Conversions API Premium"),
    (r"^_pin_unauth_premium", "pinterest", "Pinterest Premium-API"),
    (r"^afm", "adform", "Affinity-Module"),
    (r"^cto_dna", "criteo", "Dynamic Retargeting Premium"),
    # CDN / infra premium
    (r"^aws-alb-[a-z0-9]+", "amazon web services", "ALB + Multi-Region"),
    (r"^aws-waf", "amazon web services", "WAF Enterprise"),
    (r"^cf_clearance", "cloudflare", "Bot-Management Pro"),
    (r"^akm_[a-z]+", "akamai", "Adaptive Media Delivery Enterprise"),
    # Salesforce Customer 360
    (r"^bid_n_", "salesforce", "Marketing Cloud Personalization"),
    (r"^_cs_", "salesforce", "CDP Premium"),
]

# ─── Tier pricing per vendor (yearly, EUR) ──────────────────────────
#
# 4 tiers: starter (SME), professional (mid-market), enterprise (large
# corporation), premier (global brand / heavy user).
_TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
    "adobe analytics": {
        "starter": (10_000, 30_000),
        "professional": (60_000, 150_000),
        "enterprise": (200_000, 500_000),
        "premier": (500_000, 900_000),
    },
    "adobe target": {
        "starter": (8_000, 25_000),
        "professional": (40_000, 100_000),
        "enterprise": (120_000, 300_000),
        "premier": (300_000, 600_000),
    },
    "adobe campaign": {
        "starter": (10_000, 30_000),
        "professional": (40_000, 100_000),
        "enterprise": (120_000, 280_000),
        "premier": (280_000, 500_000),
    },
    "google analytics": {
        "starter": (0, 0),  # GA4 free
        "professional": (0, 0),
        "enterprise": (80_000, 150_000),  # GA360
        "premier": (150_000, 300_000),
    },
    "matomo": {
        "starter": (0, 3_000),  # On-prem free / Cloud Starter
        "professional": (6_000, 20_000),
        "enterprise": (20_000, 80_000),
        "premier": (60_000, 150_000),
    },
    "content square": {
        "starter": (12_000, 40_000),
        "professional": (60_000, 150_000),
        "enterprise": (150_000, 350_000),
        "premier": (350_000, 700_000),
    },
    "contentsquare": {
        "starter": (12_000, 40_000),
        "professional": (60_000, 150_000),
        "enterprise": (150_000, 350_000),
        "premier": (350_000, 700_000),
    },
    "dynatrace": {
        "starter": (5_000, 15_000),
        "professional": (30_000, 80_000),
        "enterprise": (100_000, 300_000),
        "premier": (300_000, 800_000),
    },
    "qualtrics": {
        "starter": (6_000, 20_000),
        "professional": (30_000, 80_000),
        "enterprise": (80_000, 200_000),
        "premier": (200_000, 500_000),
    },
    # Advertising / retargeting (license + self-service; media spend SEPARATE)
    "criteo": {
        "starter": (6_000, 20_000),
        "professional": (30_000, 80_000),
        "enterprise": (80_000, 250_000),
        "premier": (250_000, 600_000),
    },
    "adform": {
        "starter": (12_000, 40_000),
        "professional": (60_000, 150_000),
        "enterprise": (150_000, 400_000),
        "premier": (400_000, 800_000),
    },
    "outbrain": {
        "starter": (6_000, 20_000),
        "professional": (30_000, 80_000),
        "enterprise": (80_000, 200_000),
        "premier": (200_000, 500_000),
    },
    "taboola": {
        "starter": (6_000, 20_000),
        "professional": (30_000, 80_000),
        "enterprise": (80_000, 200_000),
        "premier": (200_000, 500_000),
    },
    "teads": {
        "starter": (6_000, 18_000),
        "professional": (20_000, 60_000),
        "enterprise": (60_000, 150_000),
        "premier": (150_000, 350_000),
    },
    "pinterest": {
        "starter": (3_000, 15_000),
        "professional": (15_000, 50_000),
        "enterprise": (50_000, 150_000),
        "premier": (150_000, 400_000),
    },
    "linkedin insight": {
        "starter": (3_000, 12_000),
        "professional": (12_000, 40_000),
        "enterprise": (40_000, 120_000),
        "premier": (120_000, 300_000),
    },
    # CDN / cloud
    "akamai": {
        "starter": (20_000, 60_000),
        "professional": (80_000, 200_000),
        "enterprise": (200_000, 500_000),
        "premier": (500_000, 1_500_000),
    },
    "amazon web services": {
        "starter": (12_000, 60_000),
        "professional": (60_000, 300_000),
        "enterprise": (300_000, 1_500_000),
        "premier": (1_500_000, 8_000_000),
    },
    "baqend": {
        "starter": (3_000, 12_000),
        "professional": (12_000, 40_000),
        "enterprise": (40_000, 120_000),
        "premier": (120_000, 300_000),
    },
    "speedkit": {
        "starter": (3_000, 12_000),
        "professional": (12_000, 40_000),
        "enterprise": (40_000, 120_000),
        "premier": (120_000, 300_000),
    },
    "speedcurve": {
        "starter": (1_200, 4_800),
        "professional": (6_000, 18_000),
        "enterprise": (18_000, 60_000),
        "premier": (60_000, 120_000),
    },
    # CRM / marketing
    "salesforce": {
        "starter": (20_000, 60_000),
        "professional": (80_000, 250_000),
        "enterprise": (250_000, 800_000),
        "premier": (800_000, 2_500_000),
    },
    "genesys": {
        "starter": (24_000, 80_000),
        "professional": (80_000, 250_000),
        "enterprise": (250_000, 800_000),
        "premier": (800_000, 2_000_000),
    },
    # Captcha
    "hcaptcha": {
        "starter": (0, 2_400),
        "professional": (2_400, 12_000),
        "enterprise": (12_000, 40_000),
        "premier": (40_000, 100_000),
    },
    # Lead tracking
    "salesviewer": {
        "starter": (1_200, 3_600),
        "professional": (3_600, 12_000),
        "enterprise": (12_000, 40_000),
        "premier": (40_000, 100_000),
    },
}


def _vendor_key(vendor_name: str) -> str | None:
    """Map a vendor name to a known pricing-table key."""
    n = (vendor_name or "").lower()
    for k in _TIER_PRICING:
        if k in n:
            return k
    return None


def infer_company_tier(business_profile: dict | None) -> str:
    """Coarse company-tier from business profile.

    Used as the baseline when vendor-specific signals are weak.
    """
    if not business_profile:
        return "professional"
    bp = business_profile
    features = {f.lower() for f in (bp.get("features") or [])}
    btype = (bp.get("type") or "").lower()
    # Heavy enterprise-only signals
    if any(f in features for f in ("multi_country", "konzern", "enterprise",
                                   "international", "automotive", "banking",
                                   "luxury", "premium")):
        return "premier"
    # Large but maybe single-country
    if "shop" in features or "konfigurator" in features or btype == "b2c":
        return "enterprise"
    return "professional"


def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
    """Infer pricing tier for a single vendor from its cookie footprint.

    Starts from the company tier as baseline, then raises it (capped at
    'premier'):
    - cookie_count >= 60 → +1 tier
    - each matching premium-feature pattern → +1 tier

    Recorded as signals only (they raise confidence, not the tier):
    - cookie_count >= 30
    - >= 60% third-party cookies, given at least 10 cookies
    - >= 3 cookies with an expiry of 2 years or more
    """
    cookies = vendor.get("cookies") or []
    n_cookies = len(cookies)
    # Tolerate cookies whose name is missing or None.
    cookie_names = [(c.get("name") or "").lower() for c in cookies]
    name_lower = (vendor.get("name") or "").lower()
    signals: list[str] = []
    base_tiers = ["starter", "professional", "enterprise", "premier"]
    # Start at company-tier as baseline
    idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1
    if n_cookies >= 60:
        idx = min(len(base_tiers) - 1, idx + 1)
        signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
    elif n_cookies >= 30:
        signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")
    # Premium feature detection
    vk = _vendor_key(vendor.get("name", ""))
    for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
        if vk and vk != expected_key and expected_key not in name_lower:
            continue
        for cn in cookie_names:
            # Names are lowercased above, while some patterns contain
            # uppercase letters, so the match must ignore case.
            if re.search(pattern, cn, re.IGNORECASE):
                idx = min(len(base_tiers) - 1, idx + 1)
                signals.append(f"Premium-Feature-Cookie: {feature_label}")
                break
    # Heavy third-party tracking
    third_party_ratio = (
        sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
    )
    if third_party_ratio >= 0.6 and n_cookies >= 10:
        signals.append(
            f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies — Tracking-Heavy"
        )
    # Long-lived cookies
    long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
    if long_lived >= 3:
        signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")
    return base_tiers[idx], signals


def _expiry_years(expiry_str: str) -> float:
    """Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → 0.25"""
    s = (expiry_str or "").lower()
    m = re.search(r"(\d+)\s*(jahr|year)", s)
    if m:
        return float(m.group(1))
    m = re.search(r"(\d+)\s*(monat|month)", s)
    if m:
        return float(m.group(1)) / 12.0
    m = re.search(r"(\d+)\s*(tag|day)", s)
    if m:
        return float(m.group(1)) / 365.0
    return 0.0


def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
    """Return cost estimation for one vendor incl. tier inference + signals."""
    vk = _vendor_key(vendor.get("name", ""))
    company_tier = infer_company_tier(business_profile)
    if not vk:
        return {
            "vendor": vendor.get("name", ""),
            "matched_pricing_key": None,
            "inferred_tier": None,
            "tier_signals": [],
            "company_tier_baseline": company_tier,
            "cost_year_eur_range": (0, 0),
            "confidence": "none",
            "note": "Kein Pricing-Eintrag fuer diesen Anbieter — Saving-Schaetzung uebergangen.",
        }
    tier, signals = infer_vendor_tier(vendor, company_tier)
    pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
    confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")
    return {
        "vendor": vendor.get("name", ""),
        "matched_pricing_key": vk,
        "inferred_tier": tier,
        "tier_signals": signals,
        "company_tier_baseline": company_tier,
        "cost_year_eur_range": pricing,
        "confidence": confidence,
    }


def estimate_total_stack_cost(
    vendors: Iterable[dict],
    business_profile: dict | None = None,
) -> dict:
    """Aggregate cost estimation over all vendors.

    Returns a dict with:
    - per_vendor: one estimate per vendor
    - total_year_eur_range: summed (low, high) range
    - master_contracts_counted: number of distinct (pricing key,
      INTERNAL/EXT) pairs that entered the total
    - disclaimer: user-facing note on the limits of the estimate

    Master-contract dedup: INTERNAL entries, i.e. vendors whose name
    starts with the site owner ('BMW AG — ...'), are bundled into ONE
    master contract per vendor-tool-key (not double-counted).
    """
    per_vendor: list[dict] = []
    seen_master_keys: set[tuple[str, str]] = set()
    total_low = 0
    total_high = 0
    for v in vendors:
        est = estimate_vendor_cost(v, business_profile)
        per_vendor.append(est)
        if not est["matched_pricing_key"]:
            continue
        rtype = (v.get("recipient_type") or "").upper()
        master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
        if rtype == "INTERNAL" and master_key in seen_master_keys:
            # Same Adobe contract serves many "BMW AG — Adobe XYZ" lines,
            # so count the cost only ONCE per (key, internal).
            est["bundled_into_master_contract"] = True
            continue
        seen_master_keys.add(master_key)
        lo, hi = est["cost_year_eur_range"]
        total_low += lo
        total_high += hi
    return {
        "per_vendor": per_vendor,
        "total_year_eur_range": (total_low, total_high),
        "master_contracts_counted": len(seen_master_keys),
        "disclaimer": (
            "Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
            "Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
            "koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
            "Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
        ),
    }
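
The master-contract dedup in estimate_total_stack_cost can be tried in isolation with the standalone sketch below. It is a simplification under stated assumptions: total_with_master_dedup is a hypothetical helper, the estimates are hand-written rather than produced by estimate_vendor_cost, and recipient_type is stored on the estimate itself instead of on the vendor dict.

```python
from __future__ import annotations


def total_with_master_dedup(estimates: list[dict]) -> tuple[int, int]:
    """Sum (low, high) ranges, counting INTERNAL entries once per pricing key."""
    seen_internal: set[str] = set()
    total_low = 0
    total_high = 0
    for est in estimates:
        key = est.get("matched_pricing_key")
        if not key:
            continue  # no pricing entry, nothing to add
        if est.get("recipient_type") == "INTERNAL":
            if key in seen_internal:
                continue  # same master contract, already counted
            seen_internal.add(key)
        low, high = est["cost_year_eur_range"]
        total_low += low
        total_high += high
    return total_low, total_high


# Hand-written sample data (illustrative, not real audit output).
estimates = [
    {"matched_pricing_key": "adobe analytics", "recipient_type": "INTERNAL",
     "cost_year_eur_range": (200_000, 500_000)},
    {"matched_pricing_key": "adobe analytics", "recipient_type": "INTERNAL",
     "cost_year_eur_range": (200_000, 500_000)},
    {"matched_pricing_key": "criteo", "recipient_type": "PROCESSOR",
     "cost_year_eur_range": (80_000, 250_000)},
    {"matched_pricing_key": None, "recipient_type": "PROCESSOR",
     "cost_year_eur_range": (0, 0)},
]

print(total_with_master_dedup(estimates))  # (280000, 750000)
```

The second INTERNAL Adobe Analytics line is skipped, so the Adobe range enters the total once; external entries are never deduplicated, which mirrors the behaviour of the function above.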