feat(audit): cookie compliance audit (3-source comparison) + vendor dedup + block parser

CORE USP: cookie_compliance_audit.py compares 3 sources
* DECLARED in the cookie policy (parse_cookie_table + parse_flat)
* ACTUALLY loaded in the browser (banner_result.phases.after_accept)
* LIBRARY metadata (cookie_library lookup)

Returns 3 lists, each with a compliance verdict:
* compliant (declared AND loaded): green block
* undeclared_in_browser (loaded but NOT declared): RED HIGH block
  → violation of Art. 13(1)(c) GDPR + § 25 TDDDG
* declared_not_loaded (declared but NOT loaded): yellow notice
  → table may be outdated
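
The three verdict lists above amount to a set comparison between declared and loaded cookie names. A minimal sketch, assuming exact name matching; the function name `classify_cookies` and the sample cookie sets are hypothetical, and the library lookup that the real module performs is left out because it is not part of this excerpt:

```python
def classify_cookies(declared: set[str], loaded: set[str]) -> dict[str, list[str]]:
    """Split cookie names into the three verdict lists from the commit message."""
    return {
        # Declared in the policy AND seen in the browser.
        "compliant": sorted(declared & loaded),
        # Seen in the browser but missing from the policy (the HIGH finding).
        "undeclared_in_browser": sorted(loaded - declared),
        # Listed in the policy but never set; the table may be outdated.
        "declared_not_loaded": sorted(declared - loaded),
    }


# Hypothetical sample data for illustration only.
result = classify_cookies(
    declared={"_ga", "s_ecid", "OptanonConsent"},
    loaded={"_ga", "TDID", "OptanonConsent"},
)
```

With this sample, `TDID` would land in `undeclared_in_browser` and `s_ecid` in `declared_not_loaded`.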

parse_cookie_table now also handles the block format (5 lines per cookie,
as in the user's copy-paste from the VW site). Finds 35+ cookies in pasted
text instead of 0.

vendor_normalizer.py: 50+ aliases (Google family, Adobe family,
Trade Desk, AdForm, ...) + garbage filter (URLs, empty strings,
'click to select', 'Mehrere OEMs'). Merges the cookies lists during dedup.
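
vendor_normalizer.py itself is not part of this excerpt, so the following is only a sketch of how alias mapping, garbage filtering and cookie-list merging could fit together. The alias entries, the URL pattern and the names `normalize_vendor` and `dedup_vendors` are assumptions for illustration; only the two garbage strings come from the message above:

```python
from __future__ import annotations

import re

# Hypothetical alias table; the commit message mentions 50+ real entries.
_ALIASES = {
    "google llc": "Google",
    "google analytics": "Google",
    "the trade desk": "The Trade Desk",
}
# Garbage values named in the commit message, lower-cased for comparison.
_GARBAGE = {"", "click to select", "mehrere oems"}
# Assumed heuristic for "looks like a URL".
_URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def normalize_vendor(name: str) -> str | None:
    """Return the canonical vendor name, or None for garbage input."""
    cleaned = name.strip()
    key = cleaned.lower()
    if key in _GARBAGE or _URL_RE.match(cleaned):
        return None
    return _ALIASES.get(key, cleaned)


def dedup_vendors(records: list[dict]) -> list[dict]:
    """Merge records that share a canonical vendor, unioning their cookies."""
    merged: dict[str, dict] = {}
    for rec in records:
        canonical = normalize_vendor(rec.get("name", ""))
        if canonical is None:
            continue  # drop garbage vendors entirely
        entry = merged.setdefault(canonical, {"name": canonical, "cookies": []})
        known = {c["name"] for c in entry["cookies"]}
        for cookie in rec.get("cookies", []):
            if cookie["name"] not in known:
                entry["cookies"].append(cookie)
                known.add(cookie["name"])
    return list(merged.values())
```

Keying the merge on the canonical name means two spellings of the same vendor end up as one record with a combined cookie list.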

_guess_vendor extended: Adobe family (s_ecid/AMCV/demdex/mbox/...),
Trade Desk (TDID/TDCPM/TTDOptOut), AdForm (uid/cid/otsid),
Salesforce LiveAgent, etracker, Akamai, EDAA.
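
The body of _guess_vendor is not shown in this diff. One plausible reading is a first-match prefix lookup over a tuple table; the sketch below assumes exactly that, with a hypothetical function name `guess_vendor` and a small excerpt of the table:

```python
# Excerpt of a prefix table; longer prefixes are listed before shorter ones
# so that "AMCVS" is not shadowed by "AMCV" under first-match semantics.
_VENDOR_PREFIXES = (
    ("AMCVS", "Adobe Experience Cloud"),
    ("AMCV", "Adobe Experience Cloud"),
    ("mbox", "Adobe Target"),
    ("TDID", "The Trade Desk"),
    ("liveagent_", "Salesforce LiveAgent"),
)


def guess_vendor(cookie_name: str) -> str:
    """Return the vendor of the first matching prefix, or "" if none matches."""
    for prefix, vendor in _VENDOR_PREFIXES:
        if cookie_name.startswith(prefix):
            return vendor
    return ""
```

Under this matching rule a very short prefix such as `uid` or `cid` would also match unrelated cookie names that merely start with those letters, so entry order and prefix length matter.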

audit_quality_checks: the vendor-thin threshold now scales with the
word count of the cookie document (3k→10 / 6k→20 / 10k→30 / 15k+→40).
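
The tiers above can be expressed as a step function. A minimal sketch, assuming each tier applies from its word count upward and that the value is the threshold used by the vendor-thin check; the behaviour below 3,000 words is not stated in the message, so the `default` parameter is a placeholder, and the name `vendor_thin_threshold` is hypothetical:

```python
from bisect import bisect_right

# Word-count breakpoints and thresholds from the commit message.
_WORD_BREAKPOINTS = (3_000, 6_000, 10_000, 15_000)
_THRESHOLDS = (10, 20, 30, 40)


def vendor_thin_threshold(word_count: int, default: int = 0) -> int:
    """Map the cookie document's word count to the vendor-thin threshold."""
    # bisect_right counts how many breakpoints are <= word_count.
    idx = bisect_right(_WORD_BREAKPOINTS, word_count)
    return _THRESHOLDS[idx - 1] if idx else default
```

A table-driven lookup keeps the tiers in one place, so adding a further tier only means extending the two tuples.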

VW test fixture: tests/fixtures/cookie_gt/vw_cookie_richtlinie.txt
(36-cookie sample for regression tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-21 23:36:45 +02:00
parent 16fd406c1a
commit 081e4f057a
6 changed files with 678 additions and 22 deletions
@@ -79,10 +79,116 @@ def _parse_persistence(s: str) -> str:
    return ""


_CATEGORY_INDICATORS = (
    "funktionscookie", "tracking cookie", "trackingcookie",
    "marketing", "analytics", "necessary", "notwendig",
    "performance", "session cookie", "persistent cookie",
    "permanent cookie", "permanent/protokoll", "sitzungs-cookie",
)


def parse_block_format(text: str) -> list[dict]:
    """Block format (browser copy from VW/BMW/Mercedes without tab separators).

    Five lines per cookie: name / category / purpose / retention period / type.
    Heuristic: walk over all lines. If a line is NOT itself a
    category/duration/type and the next line contains a category,
    it is a cookie name. Collect the following 4 lines as
    category/purpose/duration/type.
    """
    if not text or len(text) < 100:
        return []
    raw_lines = [ln.strip() for ln in text.splitlines()]
    # Aggressive newline collapse: remove empty lines, but lines that are
    # part of a multi-line purpose may stay separate.
    lines = [ln for ln in raw_lines if ln]
    if len(lines) < 10:
        return []
    # Drop the header row(s) if present
    start = 0
    if lines[0].lower() in ("name des cookies", "cookie name", "name"):
        start = 5 if len(lines) > 5 else 1
    by_vendor: dict[str, dict] = {}
    seen_names: set[str] = set()
    i = start
    while i < len(lines) - 2:
        name_line = lines[i]
        cat_line = lines[i + 1] if i + 1 < len(lines) else ""
        # Verify cat_line is a category indicator (otherwise the
        # block is malformed: skip 1 line and try again).
        if not any(c in cat_line.lower() for c in _CATEGORY_INDICATORS):
            i += 1
            continue
        # Cookie-name validation
        nl = name_line.lower().strip()
        if (not name_line or len(name_line) > 80
                or len(name_line) < 2
                or any(c in nl for c in _CATEGORY_INDICATORS)
                or nl in seen_names
                or nl in ("name des cookies", "kategorie",
                          "verwendungszweck", "speicherdauer",
                          "art des cookies")):
            i += 1
            continue
        # Look ahead for the cookie-type ("Art") line; the loop below
        # scans up to 10 lines (i + 2 .. i + 11).
        purpose_parts: list[str] = []
        persistence = ""
        art = ""
        j = i + 2
        while j < min(i + 12, len(lines)):
            ln = lines[j]
            ll = ln.lower()
            if any(t in ll for t in (
                "permanent/protokoll", "session cookie",
                "persistent cookie", "permanent cookie",
                "sitzungs-cookie", "permanent/ protokoll",
            )):
                art = ln
                # The line just before the type line is the retention period.
                if not persistence and j > i + 2:
                    persistence = lines[j - 1]
                break
            purpose_parts.append(ln)
            j += 1
        # The last collected line is the retention period, not the purpose.
        purpose = " ".join(purpose_parts[:-1]) if len(purpose_parts) > 1 else " ".join(purpose_parts)
        purpose = purpose[:500].strip()
        seen_names.add(nl)
        provider = _guess_vendor(name_line) or "Unbekannter Anbieter (VW-intern)"
        # Marketing cookies = third party
        if "marketing" in cat_line.lower() or "tracking" in cat_line.lower():
            if provider == "Unbekannter Anbieter (VW-intern)":
                provider = "Unbekannter Drittanbieter (Marketing)"
        entry = by_vendor.setdefault(provider, {
            "name": provider, "country": "",
            "purpose": "", "category": _normalize_category(cat_line),
            "opt_out_url": "", "privacy_policy_url": "",
            "persistence": "",
            "cookies": [],
            "source": "block_paste",
        })
        entry["cookies"].append({
            "name": name_line,
            "purpose": purpose[:300],
            "expiry": persistence,
            "is_third_party": "tracking" in cat_line.lower() or "marketing" in cat_line.lower(),
        })
        # Jump past the type line if found, else assume a 5-line block.
        i = j + 1 if art else i + 5
    out = list(by_vendor.values())
    logger.info("parse_block_format: %d vendors / %d cookies",
                len(out), sum(len(v["cookies"]) for v in out))
    return out


def parse_cookie_table(text: str) -> list[dict]:
    """Return vendor records from a copy-pasted cookie table.

    For non-tabular text: return [].
    Tries, in this order:
    1. Tab/pipe/comma-separated (classic table layout)
    2. 5-line block format (VW browser copy)
    3. return []
    """
    if not text or len(text) < 100:
        return []
@@ -98,6 +204,10 @@ def parse_cookie_table(text: str) -> list[dict]:
        if sep:
            sep_counts[sep] = sep_counts.get(sep, 0) + 1
    if not sep_counts or max(sep_counts.values()) < 3:
        # No separator format → try the block format
        block_vendors = parse_block_format(text)
        if block_vendors:
            return block_vendors
        return []
    sep = max(sep_counts, key=sep_counts.get)
@@ -257,22 +367,67 @@ def parse_flat_cookie_text(text: str) -> list[dict]:
_VENDOR_GUESS = (
    # Google family (group everything under "Google"; dedup takes care of the rest)
    ("_ga", "Google"), ("_gid", "Google"), ("_gcl_", "Google"),
    ("ANID", "Google"), ("AID", "Google"), ("FPGCLDC", "Google"),
    ("IDE", "Google DoubleClick"), ("DSID", "Google"),
    ("_fbp", "Meta / Facebook"), ("fr", "Meta / Facebook"),
    ("FPAU", "Google"), ("FLC", "Google"), ("APC", "Google"),
    ("IDE", "Google"), ("DSID", "Google"), ("TAID", "Google"),
    ("NID", "Google"), ("1P_JAR", "Google"),
    # Meta / Facebook
    ("_fbp", "Meta / Facebook"), ("_fbc", "Meta / Facebook"),
    # fr is a Meta cookie, but only if the site does not use the name itself
    # Microsoft / Bing
    ("_pin_unauth", "Pinterest"), ("_uetsid", "Microsoft Bing"),
    ("_uetvid", "Microsoft Bing"), ("MUID", "Microsoft"),
    # Social networks
    ("tt_", "TikTok"), ("li_at", "LinkedIn"),
    # CMP
    ("OptanonConsent", "OneTrust"), ("cookieconsent", "Borlabs / Cookie-CMP"),
    ("CookieConsentPolicy", "Borlabs / Cookie-CMP"),
    # Analytics
    ("eta_", "etracker"), ("matomo", "Matomo"),
    ("_hjid", "Hotjar"), ("_hj", "Hotjar"),
    ("__cf", "Cloudflare"), ("datadome", "DataDome"),
    ("incap_", "Imperva Incapsula"),
    ("ajs_", "Segment"), ("amp_", "Amplitude"),
    # Adobe family
    ("sat_track", "Adobe Experience Cloud"),
    ("AMCV_", "Adobe Experience Cloud"),
    ("AMCV", "Adobe Experience Cloud"),
    ("AMCVS", "Adobe Experience Cloud"),
    ("demdex", "Adobe Experience Cloud"),
    ("dextp", "Adobe Experience Cloud"),
    ("dpm", "Adobe Experience Cloud"),
    ("mbox", "Adobe Target"),
    ("smartSignals", "Adobe Experience Cloud"),
    ("adbCDP", "Adobe Experience Cloud"),
    ("s_cc", "Adobe Analytics"), ("s_sq", "Adobe Analytics"),
    ("s_ecid", "Adobe Analytics"), ("s_vi", "Adobe Analytics"),
    ("s_fid", "Adobe Analytics"), ("s_plt", "Adobe Analytics"),
    ("s_pltp", "Adobe Analytics"), ("s_invisit", "Adobe Analytics"),
    ("s_vnc365", "Adobe Analytics"), ("s_ivc", "Adobe Analytics"),
    ("sc_appvn", "Adobe Analytics"), ("sc_pCmp", "Adobe Analytics"),
    ("sc_prevpage", "Adobe Analytics"), ("sc_prop", "Adobe Analytics"),
    ("sc_v17", "Adobe Analytics"), ("sc_v44", "Adobe Analytics"),
    ("sc_v49", "Adobe Analytics"),
    # The Trade Desk
    ("TDID", "The Trade Desk"), ("TDCPM", "The Trade Desk"),
    ("TTDOptOut", "The Trade Desk"),
    # AdForm
    ("uid", "AdForm"), ("cid", "AdForm"), ("otsid", "AdForm"),
    # everest
    ("everest", "Adobe Advertising Cloud (everest)"),
    # Infra/CDN
    ("__cf", "Cloudflare"), ("datadome", "DataDome"),
    ("incap_", "Imperva Incapsula"), ("awsalb", "AWS Load Balancer"),
    # Salesforce
    ("sfdc-", "Salesforce"), ("X-Salesforce", "Salesforce"),
    ("liveagent_", "Salesforce LiveAgent"),
    # Inbenta
    ("inbenta", "Inbenta"),
    # Other trackers
    ("_pk_", "Matomo / Piwik"),
    ("hmt_", "Akamai mPulse"),
    # EDAA / industry self-regulation
    ("EDAAT", "EDAA / Online Choices"),
    ("Eboptout", "EDAA / Online Choices"),
)