breakpilot-compliance/backend-compliance/compliance/services/cookie_compliance_audit.py
Benjamin Admin 081e4f057a
feat(audit): Cookie compliance audit (three-source comparison) + vendor dedup + block parser
CORE USP: cookie_compliance_audit.py compares three sources
* DECLARED in the cookie policy (parse_cookie_table + parse_flat)
* ACTUALLY loaded in the browser (banner_result.phases.after_accept)
* LIBRARY metadata (cookie_library lookup)

Returns three lists with a compliance verdict:
* compliant (declared AND loaded): green block
* undeclared_in_browser (loaded, NOT declared): RED HIGH block
  → violation of Art. 13(1)(c) GDPR + § 25 TDDDG
* declared_not_loaded (declared, NOT loaded): yellow notice
  → table may be outdated
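
The three lists above reduce to plain set algebra over normalized cookie names. A minimal sketch, assuming both inputs are already sets of normalized names; `compare_sources` and the sample data are hypothetical and only illustrate the idea:

```python
def compare_sources(declared: set[str], loaded: set[str]) -> dict[str, list[str]]:
    """Split cookie names into the three verdict lists via set algebra."""
    return {
        "compliant": sorted(declared & loaded),              # declared AND loaded
        "undeclared_in_browser": sorted(loaded - declared),  # HIGH finding
        "declared_not_loaded": sorted(declared - loaded),    # notice only
    }


# Hypothetical sample data, not taken from a real audit.
declared = {"_ga", "_gid", "amcv"}
loaded = {"_ga", "tdid"}
result = compare_sources(declared, loaded)
```

Sorting each list keeps the report output stable between runs, which makes regression tests against a fixture straightforward.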

parse_cookie_table now also handles the block format (5 lines per cookie,
as in the user's copy from VW). It finds 35+ cookies in copy-pasted text instead of 0.

vendor_normalizer.py: 50+ aliases (Google family, Adobe family,
Trade Desk, AdForm, ...) + garbage filter (URLs, empty strings,
'click to select', 'Mehrere OEMs'). Merges the cookies lists during dedup.
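
The alias mapping, garbage filter, and cookie-list merge could look roughly like this. This is a sketch under assumptions: the alias table, the `name`/`cookies` dict shape, and both function names are hypothetical and are not taken from vendor_normalizer.py.

```python
import re
from typing import Optional

# Hypothetical alias table; the real module is said to hold 50+ entries.
ALIASES = {
    "google analytics": "Google",
    "google ads": "Google",
    "adobe analytics": "Adobe",
    "trade desk": "The Trade Desk",
}
# Garbage values named in the commit message, compared in lowercase.
GARBAGE = {"", "click to select", "mehrere oems"}
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def canonical_vendor(raw: str) -> Optional[str]:
    """Return the canonical vendor name, or None for garbage input."""
    key = raw.strip().lower()
    if key in GARBAGE or URL_RE.match(key):
        return None
    return ALIASES.get(key, raw.strip())


def dedup_vendors(vendors: list[dict]) -> list[dict]:
    """Merge entries that share a canonical name, unioning their cookies."""
    merged: dict[str, dict] = {}
    for v in vendors:
        name = canonical_vendor(v.get("name") or "")
        if name is None:
            continue
        entry = merged.setdefault(name, {"name": name, "cookies": []})
        for c in v.get("cookies") or []:
            if c not in entry["cookies"]:
                entry["cookies"].append(c)
    return list(merged.values())


# Hypothetical input rows.
rows = [
    {"name": "Google Analytics", "cookies": ["_ga"]},
    {"name": "Google Ads", "cookies": ["_gcl_au", "_ga"]},
    {"name": "https://example.com", "cookies": ["x"]},
    {"name": "click to select", "cookies": []},
]
deduped = dedup_vendors(rows)
```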

_guess_vendor extended: Adobe family (s_ecid/AMCV/demdex/mbox/...),
Trade Desk (TDID/TDCPM/TTDOptOut), AdForm (uid/cid/otsid),
Salesforce LiveAgent, etracker, Akamai, EDAA.

audit_quality_checks: the vendor-thin threshold is now dynamic, based on
the cookie document's word count (3k→10 / 6k→20 / 10k→30 / 15k+→40).
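
One way to read the step table is as a lookup over word-count bounds. The sketch below assumes that each figure is a lower bound for the next step (under 6k words → 10, 6k to under 10k → 20, 10k to under 15k → 30, 15k and above → 40); the function name and the exact boundary handling are assumptions, not the code in audit_quality_checks.

```python
from bisect import bisect_right

# Assumed step boundaries (in words) and the threshold for each band.
_WORD_BOUNDS = [6_000, 10_000, 15_000]
_THRESHOLDS = [10, 20, 30, 40]


def vendor_thin_threshold(word_count: int) -> int:
    """Pick the vendor-thin threshold for a cookie document of this length."""
    return _THRESHOLDS[bisect_right(_WORD_BOUNDS, word_count)]
```

A table-driven lookup keeps the bands in one place, so adding a step later only means extending both lists.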

VW test fixture: tests/fixtures/cookie_gt/vw_cookie_richtlinie.txt
(36-cookie sample for regression tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:36:45 +02:00

222 lines
8.9 KiB
Python

"""
Cookie compliance audit: three-source comparison.

THIS is the actual value of the tool:
* A. What is DECLARED in the cookie policy (text parse)
* B. What was ACTUALLY LOADED in the browser (after_accept)
* C. What our LIBRARY knows about the cookie (vendor, category)

This yields four result classes:
1. ✓ declared + loaded + known to the library → compliant
2. ❌ loaded but NOT declared → HIGH violation (Art. 13(1)(c) GDPR)
3. ⚠️ declared but NOT loaded → table outdated (LOW)
4. 🔍 declared + library category differs → needs review
"""
from __future__ import annotations

import logging
import re
from typing import Iterable

from sqlalchemy import text as sa_text
from sqlalchemy.orm import Session

logger = logging.getLogger(__name__)


def _normalize_cookie_name(name: str) -> str:
    """Reduce wildcard cookie names such as 'AMCV_*' or 'pm_sess_NNN' to
    their lowercase prefix so that suffixed variants count as one cookie."""
    if not name:
        return ""
    s = name.strip()
    # AMCV_*, sc_v44, etc.
    s = re.sub(r"[<\[].*?[>\]]", "", s)  # remove <ID>, [...]
    s = s.rstrip("*").rstrip("_")
    s = re.sub(r"_NNN$|_\d+$", "", s)  # drop a trailing _NNN or numeric suffix
    return s.lower()


def _extract_declared_cookies(cookie_doc_text: str | None) -> set[str]:
    """Read cookie names from the cookie policy text.

    Uses parse_cookie_table (block/tab format) first, then
    parse_flat_cookie_text (anchor pattern).
    """
    if not cookie_doc_text:
        return set()
    declared: set[str] = set()
    try:
        from compliance.services.cookies_table_parser import (
            parse_cookie_table, parse_flat_cookie_text,
        )
        for v in parse_cookie_table(cookie_doc_text):
            for c in (v.get("cookies") or []):
                if isinstance(c, dict) and c.get("name"):
                    declared.add(_normalize_cookie_name(c["name"]))
        for v in parse_flat_cookie_text(cookie_doc_text):
            for c in (v.get("cookies") or []):
                if isinstance(c, dict) and c.get("name"):
                    declared.add(_normalize_cookie_name(c["name"]))
    except Exception as e:
        logger.warning("declared-cookie-extract failed: %s", e)
    return {n for n in declared if n}


def _extract_browser_cookies(banner_result: dict | None) -> set[str]:
    """Read cookie names from banner_result.phases.<phase>.cookies.

    All recorded phases (after_accept, before_consent, after_reject) are
    merged into one set.
    """
    out: set[str] = set()
    if not isinstance(banner_result, dict):
        return out
    phases = banner_result.get("phases") or {}
    for ph_name in ("after_accept", "before_consent", "after_reject"):
        ph = phases.get(ph_name) or {}
        if not isinstance(ph, dict):
            continue
        for c in (ph.get("cookies") or []):
            # Entries may be plain names or dicts with a "name" key.
            if isinstance(c, str):
                out.add(_normalize_cookie_name(c))
            elif isinstance(c, dict) and c.get("name"):
                out.add(_normalize_cookie_name(c["name"]))
    return {n for n in out if n}


def _lookup_library(db: Session, names: Iterable[str]) -> dict[str, dict]:
    """Return {normalized_name: {category, vendor}} from cookie_library."""
    nl = [n for n in names if n]
    if not nl:
        return {}
    try:
        rows = db.execute(sa_text(
            "SELECT cookie_name, actual_category, vendor_name "
            "FROM compliance.cookie_library "
            "WHERE LOWER(cookie_name) = ANY(:lc)"
        ), {"lc": nl}).fetchall()
        return {r[0].lower(): {"category": r[1], "vendor": r[2]} for r in rows}
    except Exception as e:
        logger.warning("library lookup failed: %s", e)
        return {}


def audit_cookie_compliance(
    db: Session | None,
    cookie_doc_text: str | None,
    banner_result: dict | None,
) -> dict:
    """Main entry point: return a dict with three sorted lists,
    the library metadata, and counts."""
    declared = _extract_declared_cookies(cookie_doc_text)
    browser = _extract_browser_cookies(banner_result)
    all_names = declared | browser
    library = _lookup_library(db, all_names) if db else {}
    declared_only = declared - browser
    browser_only = browser - declared
    both = declared & browser
    return {
        "declared_count": len(declared),
        "browser_count": len(browser),
        "library_count": len(library),
        "compliant": sorted(both),
        "undeclared_in_browser": sorted(browser_only),
        "declared_not_loaded": sorted(declared_only),
        "library_metadata": library,
        "high_findings": len(browser_only),
        "low_findings": len(declared_only),
    }


def build_cookie_audit_block_html(audit: dict) -> str:
    """Render the three-column comparison block for the email."""
    if not audit:
        return ""
    n_dec = audit.get("declared_count", 0)
    n_brw = audit.get("browser_count", 0)
    n_undecl = len(audit.get("undeclared_in_browser") or [])
    n_dec_only = len(audit.get("declared_not_loaded") or [])
    n_both = len(audit.get("compliant") or [])
    # Red header if anything is undeclared, green otherwise.
    sev_color = "#dc2626" if n_undecl else "#16a34a"
    undecl_html = ""
    if audit.get("undeclared_in_browser"):
        undecl_html = (
            '<div style="margin-top:10px;padding:10px 12px;background:#fee2e2;'
            'border:1px solid #fecaca;border-radius:6px">'
            f'<strong style="color:#991b1b">❌ {n_undecl} cookie'
            f'{"s" if n_undecl != 1 else ""} loaded in the browser '
            'but NOT declared in the cookie policy:</strong>'
            '<div style="font-family:monospace;font-size:10px;color:#7f1d1d;'
            'margin-top:6px;max-height:200px;overflow:auto">'
            + ", ".join(audit["undeclared_in_browser"][:50])
            + (f' ... +{n_undecl - 50} more'
               if n_undecl > 50 else '') +
            '</div>'
            '<div style="font-size:10px;color:#7f1d1d;margin-top:4px;'
            'font-style:italic">Art. 13(1)(c) GDPR + § 25 TDDDG: '
            'the list of recipients must be complete. These cookies '
            'are potentially undisclosed processing activities.</div>'
            '</div>'
        )
    dec_only_html = ""
    if audit.get("declared_not_loaded"):
        dec_only_html = (
            '<div style="margin-top:10px;padding:10px 12px;background:#fef3c7;'
            'border:1px solid #fde68a;border-radius:6px">'
            f'<strong style="color:#92400e">⚠️ {n_dec_only} cookie'
            f'{"s" if n_dec_only != 1 else ""} declared in the policy '
            'but NOT seen in the browser during this audit:</strong>'
            '<div style="font-family:monospace;font-size:10px;color:#78350f;'
            'margin-top:6px;max-height:200px;overflow:auto">'
            + ", ".join(audit["declared_not_loaded"][:50])
            + (f' ... +{n_dec_only - 50} more'
               if n_dec_only > 50 else '') +
            '</div>'
            '<div style="font-size:10px;color:#78350f;margin-top:4px;'
            'font-style:italic">Not a direct violation: the cookies '
            'may only be loaded in certain user journeys / geo regions / '
            'logged-in states. Recommendation: '
            'check whether the cookie policy is outdated.</div>'
            '</div>'
        )
    compliant_html = ""
    if audit.get("compliant"):
        compliant_html = (
            '<div style="margin-top:10px;padding:10px 12px;background:#dcfce7;'
            'border:1px solid #bbf7d0;border-radius:6px">'
            f'<strong style="color:#166534">✓ {n_both} cookie'
            f'{"s" if n_both != 1 else ""} both declared and loaded '
            '(compliant):</strong>'
            '<div style="font-family:monospace;font-size:10px;color:#14532d;'
            'margin-top:6px;max-height:150px;overflow:auto">'
            + ", ".join(audit["compliant"][:50])
            + (f' ... +{n_both - 50} more'
               if n_both > 50 else '') +
            '</div>'
            '</div>'
        )
    return (
        '<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
        'max-width:760px;margin:0 auto 16px;padding:14px 18px;'
        'background:#fff;border:1px solid #cbd5e1;border-radius:8px">'
        f'<div style="font-size:11px;color:{sev_color};text-transform:uppercase;'
        f'letter-spacing:1.2px;margin-bottom:4px;font-weight:600">'
        'Cookie compliance audit: three-source comparison</div>'
        '<h3 style="margin:0 0 6px;font-size:14px;color:#1e293b">'
        f'{n_dec} in policy · {n_brw} in browser · '
        f'{n_both} compliant · {n_undecl} undocumented · '
        f'{n_dec_only} not loaded</h3>'
        '<p style="margin:0 0 8px;font-size:11px;color:#475569;line-height:1.5">'
        'We compare the cookies named in the cookie policy '
        'with what the browser actually loads after acceptance. '
        'Undocumented cookies in the browser are a direct violation '
        'of the GDPR information obligation.'
        '</p>'
        + undecl_html + dec_only_html + compliant_html +
        '</div>'
    )
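
To make the name normalization in this file easier to follow, the sketch below repeats its regex steps in a standalone function and applies them to a few sample names. The function name `normalize_cookie_name` and the sample inputs are illustrative only.

```python
import re


def normalize_cookie_name(name: str) -> str:
    """Standalone copy of the normalization steps, for illustration."""
    if not name:
        return ""
    s = name.strip()
    s = re.sub(r"[<\[].*?[>\]]", "", s)  # drop placeholders like <ID> or [...]
    s = s.rstrip("*").rstrip("_")        # drop trailing wildcard and underscore
    s = re.sub(r"_NNN$|_\d+$", "", s)    # drop a trailing _NNN or _<digits>
    return s.lower()


# Hypothetical sample names.
samples = ["AMCV_*", "pm_sess_NNN", "_ga_<ID>", "session_42", "_ga_GTM-XXX"]
normalized = [normalize_cookie_name(s) for s in samples]
```

Note that a concrete suffix such as `_GTM-XXX` is not stripped by these rules; only bracketed placeholders, trailing wildcards, `_NNN`, and purely numeric suffixes are reduced.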