feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs
  in compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check.
  Also adds a fits_doc_type audit (v2) and a ui_only audit for DSA/e-commerce
  MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered by business_profile (e.g. FRT is skipped for BMW).
- Embedding match (BGE-M3) as phase 3 after the regex match:
  per-doc_type threshold override (impressum 0.50, dse/cookie 0.60),
  short-field rescue (15-word chunks) for mandatory fields in the Impressum.
  Title + check_question serve as the embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the
  CMP reconstruct; the backend prefers it over the DOM extraction
  when it is richer (BMW: 1824 vs 600 words).
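
The per-doc_type threshold override described above can be sketched as follows.
This is a minimal illustration, not the code from this commit: the names
DOC_TYPE_THRESHOLDS, DEFAULT_THRESHOLD, cosine and embedding_match are
hypothetical, the fallback threshold is an assumption, and plain Python lists
stand in for BGE-M3 vectors.

```python
import math

# Thresholds quoted in the commit message.
DOC_TYPE_THRESHOLDS = {"impressum": 0.50, "dse": 0.60, "cookie": 0.60}
DEFAULT_THRESHOLD = 0.60  # assumed fallback, not stated in the commit


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_match(control_vec, chunk_vecs, doc_type):
    """Return (matched, best_score) for one control against all doc chunks."""
    threshold = DOC_TYPE_THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    best = max((cosine(control_vec, c) for c in chunk_vecs), default=0.0)
    return best >= threshold, best
```

With a best similarity of about 0.55, the same control would count as matched
in an impressum document but not in a dse document.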

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorisation of the CMP vendors,
  detection of multiple providers per category, EU-alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie
  count + premium-feature cookies + third-party share → starter/professional/
  enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 licence cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper
  40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for unambiguously EU vendors
  (Adform DK, Pinterest IE).
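
The legal-form inference above could look roughly like the sketch below. Only
the four suffix mappings and the two vendor examples come from this commit
message; the regexes, the names _LEGAL_FORM_COUNTRY, _VENDOR_COUNTRY and
infer_country, and the "suffix first, lookup table second" precedence are
assumptions made for illustration.

```python
import re

# Legal-form suffix -> ISO country code (mappings from the commit message;
# the regexes themselves are an assumption).
_LEGAL_FORM_COUNTRY = [
    (r"\bA/S$", "DK"),
    (r"\bGmbH$", "DE"),
    (r"\bInc\.?$", "US"),
    (r"\bB\.V\.$", "NL"),
]

# Hypothetical fallback table for names without a telling legal form.
_VENDOR_COUNTRY = {"adform": "DK", "pinterest": "IE"}


def infer_country(vendor_name):
    """Return an ISO country code, or None if nothing can be inferred."""
    name = (vendor_name or "").strip()
    for pattern, country in _LEGAL_FORM_COUNTRY:
        if re.search(pattern, name):
            return country
    lowered = name.lower()
    for known, country in _VENDOR_COUNTRY.items():
        if known in lowered:
            return country
    return None
```

Returning None instead of a guessed country keeps the no_country flag
available for vendors that really are ambiguous.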

Action recipes + doc anchor locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) a structured instruction with what/why/fix_text/where/example,
  intended for 1:1 insertion into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the
  matching paragraph in the existing customer document for each finding.
  Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor for each failed document check
  + a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with
  4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  + privacy-policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, with YouTube/Maps/video
  placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500s).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- The A3 v2 audit reclassified 24 UI-language MCs as 'process'.
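
The embedding-service batching fix in the list above amounts to chunking the
input before each call. A minimal sketch, assuming a hypothetical embed_fn
callable that maps a list of texts to a list of vectors; embed_in_batches is
an illustrative name, not the function touched by this commit.

```python
BATCH_SIZE = 32  # batch size named in the commit message


def embed_in_batches(texts, embed_fn, batch_size=BATCH_SIZE):
    """Call embed_fn on slices of at most batch_size texts, keeping order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors
```

For 165 inputs this issues six calls (five of 32 texts and one of 5) instead
of a single oversized request.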

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run summary v1 (broken) -> v9 (all fixes):
  DSE       7.5% -> 81-83%
  Impressum 4%   -> 100% (all 6 real MCs fulfilled)
  Cookie    0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
@@ -0,0 +1,167 @@
"""
Cookie function classifier: determines the functional purpose of each cookie.

So far we have one category per vendor (analytics/advertising/...).
However, a vendor often sets 10-50 different cookies, and not every cookie
of a marketing platform serves advertising; many handle session management,
language preference, scroll position, etc.

This module classifies each cookie by:
- functional_role : what the cookie technically does (session_id,
                    csrf_token, ab_test, user_id, ad_id, …)
- data_collected  : which data sits behind it (visitor_id,
                    page_view, click, conversion_event, …)
- blocking_impact : what happens when the cookie is blocked
                    (none, no_personalization, no_tracking, site_breaks)

This lets the vendor redundancy analyzer state precisely:
"Adobe Analytics sets 55 cookies, 12 of them for tracking, 8 for A/B tests
and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B test
cookies, so 55 Adobe cookies become 20 Matomo cookies."
"""
from __future__ import annotations

import re
from typing import Iterable

# Pattern → (functional_role, blocking_impact)
# Order matters: the first match wins, so more specific patterns come first.
_PATTERNS: list[tuple[str, str, str]] = [
    # Session / authentication
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    # CSRF must precede the auth pattern: names such as csrf_token or
    # XSRF-TOKEN contain "token" and would otherwise be labelled auth_token.
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),
    # Language setting / region
    (r"lang|locale|culture|region", "preference", "no_personalization"),
    # User preferences (theme, view, bookmark)
    (r"theme|dark|mode|view|sort|filter", "ui_preference", "no_personalization"),
    (r"bookmark|favorite|favorit", "user_data", "no_personalization"),
    # The consent cookie itself
    (r"consent|gdpr|tcf|euconsent", "consent_state", "site_breaks"),
    # Tracking IDs (most analytics)
    (r"^_ga|gid|gat|google_analytic", "tracking_id", "no_tracking"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
    (r"^s_|s\.cc|adobesite|aam", "tracking_id", "no_tracking"),  # Adobe
    (r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
    (r"_uetsid|_uetvid|microsoft", "tracking_id", "no_tracking"),
    # Visitor identification
    (r"visitor|uid|user_id|customer_id", "visitor_id", "no_personalization"),
    # A/B test / personalisation
    (r"ab_test|abtest|variant|experiment|target|target_qa", "ab_test", "no_personalization"),
    (r"personalization|personalisation|adobe_target", "personalisation", "no_personalization"),
    # Advertising / retargeting
    (r"fbp|fbc|fb_id|facebook|meta_pixel|fr$", "ad_pixel", "no_tracking"),
    (r"adform|criteo|outbrain|taboola|tapad|adsrvr", "ad_pixel", "no_tracking"),
    (r"doubleclick|test_cookie|ide|nid|exchange_uid", "ad_pixel", "no_tracking"),
    (r"google_ad|gads|gcl", "ad_pixel", "no_tracking"),
    (r"^li_|linkedin|bcookie|bscookie", "ad_pixel", "no_tracking"),
    (r"pinterest|_pinterest_|_pin_unauth", "ad_pixel", "no_tracking"),
    # Affiliate / conversion
    (r"conversion|orderid|order_id|transaction|purchase", "conversion_event", "no_tracking"),
    (r"campaign|utm|source|medium|term", "campaign_attribution", "no_tracking"),
    # Scroll position / form helper
    (r"scroll|position|form_|form_state", "ui_state", "no_personalization"),
    # Load balancer / sticky sessions
    (r"affinity|sticky|lb_|alb-|aws-alb", "load_balancer", "site_breaks"),
    # Chat / support
    (r"chat|widget|genesys|livechat", "chat_session", "no_personalization"),
    # Captcha
    (r"hcaptcha|recaptcha|cf_|cloudflare", "bot_protection", "site_breaks"),
]

_FUNCTIONAL_LABEL = {
    "session_id": "Sitzungs-ID",
    "auth_token": "Auth-Token",
    "csrf_token": "CSRF-Schutz",
    "preference": "Sprache / Region",
    "ui_preference": "UI-Praeferenz",
    "user_data": "Nutzer-Daten",
    "consent_state": "Consent-Speicher",
    "tracking_id": "Tracking-ID",
    "session_recording": "Session-Recording",
    "visitor_id": "Besucher-ID",
    "ab_test": "A/B-Test",
    "personalisation": "Personalisierung",
    "ad_pixel": "Werbe-Pixel",
    "conversion_event": "Konversions-Tracking",
    "campaign_attribution": "Kampagnen-Attribution",
    "ui_state": "UI-Zustand (ScrollPos etc.)",
    "load_balancer": "Load-Balancer",
    "chat_session": "Chat-Session",
    "bot_protection": "Bot-Schutz",
    "unknown": "Unbekannt",
}

# Which functional_roles overlap functionally. Used by
# vendor_redundancy.analyze() to detect real consolidation opportunities
# instead of merely counting duplicate providers.
OVERLAPPING_ROLES = {
    "tracking_id": "tracking",
    "session_recording": "tracking",
    "ab_test": "personalisation",
    "personalisation": "personalisation",
    "ad_pixel": "advertising",
    "conversion_event": "advertising",
    "campaign_attribution": "advertising",
}


def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }


def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: which functional roles occur how often?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        if not roles and v.get("cookies"):
            # Vendor not annotated yet: classify its cookies on the fly.
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
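
Because the classifier returns on the first matching pattern, the order of the
table decides the outcome for names that match several rows. The standalone
sketch below illustrates this with three rows copied from the table; the names
DEMO_PATTERNS and classify are hypothetical and exist only for this example.

```python
import re

# Trimmed copy of three rows from the pattern table; order is significant.
DEMO_PATTERNS = [
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
]


def classify(name, patterns=DEMO_PATTERNS):
    """First matching pattern wins; unmatched names fall back to 'unknown'."""
    n = (name or "").lower().strip()
    for pattern, role, impact in patterns:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


print(classify("XSRF-TOKEN"))     # ('csrf_token', 'site_breaks')
print(classify("_pk_id.1.abcd"))  # ('tracking_id', 'no_tracking')
# With the rows reversed, the generic "token" alternative matches first:
print(classify("XSRF-TOKEN", patterns=DEMO_PATTERNS[::-1]))  # ('auth_token', 'site_breaks')
```

Listing narrow patterns before broad substring patterns keeps a generic
alternative such as "token" from shadowing the more specific role.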