feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features
CI / nodejs-build (push) Successful in 2m47s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Major update based on the BMW test iterations (v1→v9):

Core Compliance-Check
- Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus fits_doc_type audit (v2) and ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered by business_profile (e.g. FRT skipped for BMW).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the Impressum (legal notice). Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over DOM extraction when it is richer (BMW: 1824 vs. 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU-alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for unambiguously EU vendors (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: one structured instruction per finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...) with what/why/fix_text/where/example, for pasting 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check, plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (Compliance-Check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register (record of processing activities) + privacy-policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). cmp_vendors are persisted in the check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, placeholders for YouTube/Maps/video until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

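The country inference from legal form described above can be pictured roughly as follows. This is a minimal sketch, not the actual cookie_link_validator code: the function name `infer_country`, both table names, the regexes, and the order (suffix first, lookup table second) are assumptions. Only the four suffix/country pairs and the two vendor examples come from the commit message.

```python
import re
from typing import Optional

# Legal-form suffix -> country code (the four pairs named in the commit message).
# The regexes themselves are illustrative assumptions.
LEGAL_FORM_COUNTRY = [
    (r"\bA/S$", "DK"),
    (r"\bGmbH$", "DE"),
    (r"\bInc\.?$", "US"),
    (r"\bB\.V\.?$", "NL"),
]

# Hypothetical lookup table for vendors whose name carries no usable suffix.
VENDOR_COUNTRY = {"adform": "DK", "pinterest": "IE"}


def infer_country(vendor_name: str) -> Optional[str]:
    """Return a country code for a vendor name, or None if nothing matches."""
    name = (vendor_name or "").strip()
    # 1) Try the legal-form suffix at the end of the name.
    for pattern, country in LEGAL_FORM_COUNTRY:
        if re.search(pattern, name, flags=re.IGNORECASE):
            return country
    # 2) Fall back to the vendor lookup table (substring match, assumed).
    lowered = name.lower()
    for vendor, country in VENDOR_COUNTRY.items():
        if vendor in lowered:
            return country
    return None


examples = ["Adform A/S", "Example GmbH", "Acme Inc.", "Example B.V.", "Pinterest"]
inferred = {name: infer_country(name) for name in examples}
```

A vendor that matches neither table yields `None`, which is the case where a no_country flag would still be raised.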
Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb; similar-controls endpoint with _has_embedding_col() cache (no more 500s).
- Control-Library frontend: defensive .map coercer in 2 detail views.
- Embedding-service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master-controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM-extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master-contract dedup (60 BMW-internal entries = 1 Adobe contract).
- The A3 v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE (privacy policy)  7.5% -> 81-83%
  Impressum             4%   -> 100% (all 6 real MCs fulfilled)
  Cookie                0%   -> 79-83% (CMP text routing + embedding)
Plus: 10 consolidation categories, estimated savings 200k-3M / year
Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
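The embedding-service batching fix (batches of 32 instead of 165 texts in one call) amounts to slicing the input list before each request. The sketch below assumes a hypothetical `embed_fn` callable that maps a list of texts to a list of vectors; the real service client is not part of this page, so `embed_in_batches` and `fake_embed` are illustrative names only.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[Vector]],  # hypothetical client call
    batch_size: int = 32,
) -> List[Vector]:
    """Call embed_fn on slices of at most batch_size texts, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors


# Stand-in embedder that records how many texts each call received.
call_sizes: List[int] = []


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    call_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]


result = embed_in_batches([f"text {i}" for i in range(165)], fake_embed)
```

With 165 inputs this makes six calls (five full batches and one of 5), so no single request carries the whole list.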
@@ -0,0 +1,167 @@
"""
Cookie function classifier: determines the concrete function of each cookie.

Today we have one category per vendor (analytics/advertising/...).
However, a vendor often sets 10-50 different cookies. Not every cookie
of a marketing platform does advertising; many handle session management,
language preference, scroll position, etc.

This module classifies per cookie:
  - functional_role : what the cookie technically does (session_id,
                      csrf_token, ab_test, user_id, ad_id, …)
  - data_collected  : which data is behind it (visitor_id,
                      page_view, click, conversion_event, …)
  - blocking_impact : what happens when the cookie is blocked
                      (none, no_personalization, no_tracking, site_breaks)

This lets the vendor redundancy analyzer state precisely:
"Adobe Analytics sets 55 cookies: 12 for tracking, 8 for A/B tests
and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B-test
cookies, so 55 Adobe cookies become 20 Matomo cookies."
"""

from __future__ import annotations

import re
from typing import Iterable

# Pattern → (functional_role, blocking_impact)
# Order matters (first match wins): more specific patterns come first.
_PATTERNS: list[tuple[str, str, str]] = [
    # Session / authentication
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    (r"csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),  # before auth: "csrf_token" contains "token"
    (r"sso|signon|(?<!un)auth|login|token|jwt|bearer", "auth_token", "site_breaks"),  # (?<!un): skip "_pin_unauth"

    # Language / region settings
    (r"lang|locale|culture|region", "preference", "no_personalization"),

    # User preferences (theme, view, bookmark)
    (r"theme|dark|mode|view|sort|filter", "ui_preference", "no_personalization"),
    (r"bookmark|favorite|favorit", "user_data", "no_personalization"),

    # The consent cookie itself
    (r"consent|gdpr|tcf|euconsent", "consent_state", "site_breaks"),

    # Tracking IDs (most analytics)
    (r"^_ga|gid|gat|google_analytic", "tracking_id", "no_tracking"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
    (r"^s_|s\.cc|adobesite|aam", "tracking_id", "no_tracking"),  # Adobe
    (r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
    (r"_uetsid|_uetvid|microsoft", "tracking_id", "no_tracking"),

    # Visitor identification
    (r"visitor|uid|user_id|customer_id", "visitor_id", "no_personalization"),

    # A/B test / personalisation
    (r"ab_test|abtest|variant|experiment|target|target_qa", "ab_test", "no_personalization"),
    (r"personalization|personalisation|adobe_target", "personalisation", "no_personalization"),

    # Advertising / retargeting
    (r"fbp|fbc|fb_id|facebook|meta_pixel|^fr$", "ad_pixel", "no_tracking"),  # "fr" only as the exact name
    (r"adform|criteo|outbrain|taboola|tapad|adsrvr", "ad_pixel", "no_tracking"),
    (r"doubleclick|test_cookie|^ide$|^nid$|exchange_uid", "ad_pixel", "no_tracking"),  # IDE/NID exact; "video" must not match
    (r"google_ad|gads|gcl", "ad_pixel", "no_tracking"),
    (r"^li_|linkedin|bcookie|bscookie", "ad_pixel", "no_tracking"),
    (r"pinterest|_pinterest_|_pin_unauth", "ad_pixel", "no_tracking"),

    # Affiliate / conversion
    (r"conversion|orderid|order_id|transaction|purchase", "conversion_event", "no_tracking"),
    (r"campaign|utm|source|medium|term", "campaign_attribution", "no_tracking"),

    # Scroll position / form helper
    (r"scroll|position|form_|form_state", "ui_state", "no_personalization"),

    # Load balancer / sticky sessions
    (r"affinity|sticky|lb_|alb-|aws-alb", "load_balancer", "site_breaks"),

    # Chat / support
    (r"chat|widget|genesys|livechat", "chat_session", "no_personalization"),

    # Captcha
    (r"hcaptcha|recaptcha|cf_|cloudflare", "bot_protection", "site_breaks"),
]

_FUNCTIONAL_LABEL = {
    "session_id": "Sitzungs-ID",
    "auth_token": "Auth-Token",
    "csrf_token": "CSRF-Schutz",
    "preference": "Sprache / Region",
    "ui_preference": "UI-Praeferenz",
    "user_data": "Nutzer-Daten",
    "consent_state": "Consent-Speicher",
    "tracking_id": "Tracking-ID",
    "session_recording": "Session-Recording",
    "visitor_id": "Besucher-ID",
    "ab_test": "A/B-Test",
    "personalisation": "Personalisierung",
    "ad_pixel": "Werbe-Pixel",
    "conversion_event": "Konversions-Tracking",
    "campaign_attribution": "Kampagnen-Attribution",
    "ui_state": "UI-Zustand (ScrollPos etc.)",
    "load_balancer": "Load-Balancer",
    "chat_session": "Chat-Session",
    "bot_protection": "Bot-Schutz",
    "unknown": "Unbekannt",
}

# Which functional_roles overlap functionally; used by
# vendor_redundancy.analyze() to detect real consolidation opportunities
# instead of merely counting duplicate providers.
OVERLAPPING_ROLES = {
    "tracking_id": "tracking",
    "session_recording": "tracking",
    "ab_test": "personalisation",
    "personalisation": "personalisation",
    "ad_pixel": "advertising",
    "conversion_event": "advertising",
    "campaign_attribution": "advertising",
}


def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }


def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: how often does each functional role occur?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        if not roles and v.get("cookies"):
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
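To illustrate how the lookup and the per-vendor role distribution fit together, here is a self-contained sketch. It copies three rows from the pattern table in the file above. The helper names `classify` and `role_distribution` and the sample vendor records are made up for illustration; in practice the real module would be imported instead.

```python
from __future__ import annotations

import re
from collections import Counter

# Three rows copied from the module's pattern table; first match wins.
PATTERNS = [
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
    (r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
]


def classify(name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for the lower-cased name."""
    n = (name or "").lower().strip()
    for pattern, role, impact in PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


def role_distribution(vendor: dict) -> dict[str, int]:
    """Count how many of a vendor's cookies fall into each functional role."""
    cookies = vendor.get("cookies") or []
    return dict(Counter(classify(c.get("name", ""))[0] for c in cookies))


# Hypothetical input records in the shape the module expects.
vendors = [
    {"name": "Matomo", "cookies": [{"name": "_pk_id.1.abcd"}, {"name": "_pk_ses.1.abcd"}]},
    {"name": "Hotjar", "cookies": [{"name": "_hjSessionUser_42"}, {"name": "PHPSESSID"}]},
]
dist = {v["name"]: role_distribution(v) for v in vendors}
```

Because names are lower-cased before matching, `PHPSESSID` hits the anchored session pattern, while `_hjSessionUser_42` falls through to the Hotjar row.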