feat(consent+report): P56-P67 Mercedes-Audit-Cycle (Anti-Audit, Phase G Vendors, Cookie-Behavior-Validator + 5 Mail-Polish-Items) [migration-approved]
CI / detect-changes (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / nodejs-build (push) Successful in 2m19s
CI / test-go (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 15s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 37s
P56 Anti-auditing detection as a constructive compliance finding (audit-API
recommendation instead of an accusation, because Mercedes legitimately blocks bots)
P57 Phase G vendor_details union with cmp_vendors -> 42 vendors visible
P58 Anti-audit detection made more robust (script-domain check + settings-specific)
P59 Cookie-Behavior-Validator (4 layers, 3-tier severity: MEDIUM=category
mismatch / HIGH=purpose mismatch / CRITICAL=both=indication of intent)
+ Open Cookie Database (CC0) as library seed (2264 cookies)
P59b Cookie-Behavior wired into the banner check + mail block (BUGFIX:
open SessionLocal ourselves, db was not in scope in the background task)
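The 3-tier severity rule described for P59 can be read as a small decision function. The following is only an illustrative sketch: the name `severity_for` and its two boolean inputs are hypothetical and are not taken from cookie_behavior_validator.py, whose interface is not part of the visible diff.

```python
from typing import Optional


def severity_for(category_mismatch: bool, purpose_mismatch: bool) -> Optional[str]:
    """Map two mismatch flags to the P59 tiers (hypothetical helper, not the real API)."""
    if category_mismatch and purpose_mismatch:
        return "CRITICAL"  # both mismatches together: treated as an indication of intent
    if purpose_mismatch:
        return "HIGH"      # declared purpose does not match observed behavior
    if category_mismatch:
        return "MEDIUM"    # declared category does not match the library entry
    return None            # cookie behaves as declared, no finding
```

Returning `None` for the clean case keeps "no finding" distinct from the lowest tier; how the actual validator represents that case is an open assumption here.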
Mail polish after the Mercedes review:
P63 Banner footer links are now also detected in wb7-link/role=link (shadow-DOM
walker is label-based instead of only matching <a href>)
P64 Re-access severity: MEDIUM instead of HIGH if a footer "Einstellungen"
(settings) link or a Mercedes-typical equivalent exists; OEM footer detection (wb7-footer)
P65 Text truncation: word boundary instead of character cut (no more "einfa"
break in the immediate-action items)
P66 Management (GF) actions: service purpose vs cookie purpose explained explicitly
(common marketing/management confusion: "Akamai description" != cookie
purpose per DSK-OH 2024)
P67 Stirring finding with a "loss framing" explanation + an old-vs-neutral
wording example, instead of only the EDPB technical term
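The word-boundary truncation mentioned in P65 could look roughly like the sketch below. It is not the code from this commit (the report module is not part of the visible diff); `truncate_at_word` and its parameters are hypothetical names chosen for illustration.

```python
def truncate_at_word(text: str, limit: int, ellipsis: str = "...") -> str:
    """Shorten text to at most `limit` characters without cutting a word in half."""
    if len(text) <= limit:
        return text
    budget = limit - len(ellipsis)
    if budget <= 0:
        return ellipsis[:limit]
    cut = text[:budget]
    # If the first dropped character is not whitespace, the cut landed inside
    # a word: back up to the last space in the kept part. When there is no
    # space at all (one very long word), fall back to the plain character cut.
    if not text[budget].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip() + ellipsis
```

With a limit of 16, `"Das ist einfach zu beheben"` becomes `"Das ist..."` rather than ending in the broken fragment `"einfa"`.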
Compliance-Advisor FAQ (admin agent-core/soul):
+ CNIL/EDPB top fines (Google 100M, Meta 60M, Amazon 35M)
+ German precedents (LG Muenchen Google Fonts, EuGH Planet49, BGH I ZR 7/16)
+ 4 risk paths (fine/warning letter/class action/NOYB) + calculation methodology
Document-Generator templates: AGB-DE (142), Impressum (140), withdrawal form
annex (143), DSR-Process-Dedup (139), Cookie-Library (144).
Architecture: doc_action_mappings.py + banner_dom_walkers.py +
cookie_behavior_validator.py + vendor_detail_extractor.py extracted
to keep agent_doc_check_report.py and banner_text_checker.py within
the 500-LOC caps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -396,6 +396,17 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                           f"mit-geprueft.",
                 ))
                 continue
+            # P24: The DPO (DSB) contact is mandatory content of the privacy
+            # policy (DSE) under Art. 13(1)(b) GDPR. A missing separate DPO
+            # document is NOT an error; the DPO check runs in the DSE anyway.
+            if doc_type == "dsb" and not (entry.get("url") or "").strip():
+                results.append(DocCheckResult(
+                    label=label, url="", doc_type=doc_type,
+                    error="Nicht separat vorhanden — DSB-Kontaktdaten "
+                          "werden in der Datenschutzerklaerung als "
+                          "Pflichtangabe nach Art. 13(1)(b) DSGVO geprueft.",
+                ))
+                continue
             # Empty entry — either from auto-discovery padding (no URL
             # to fetch) or from a fetch that returned nothing. If there
             # was a URL we keep the error so the user knows the fetch
@@ -442,7 +453,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
         if banner_url:
             _update(check_id, "Cookie-Banner wird geprueft...", 82)
             try:
-                async with httpx.AsyncClient(timeout=120.0) as client:
+                async with httpx.AsyncClient(timeout=900.0) as client: # P50: +10min for vendor-detail-phase
                     resp = await client.post(
                         f"{CONSENT_TESTER_URL}/scan",
                         json={"url": banner_url, "timeout_per_phase": 10},
@@ -450,7 +461,9 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                     if resp.status_code == 200:
                         banner_result = resp.json()
             except Exception as e:
-                logger.warning("Banner check failed: %s", e)
+                logger.warning(
+                    "Banner check failed: %s (%s)", e or "<empty>", type(e).__name__
+                )

         # Step 3c: Cross-check banner vs cookie policy (88-90%)
         if banner_result and "cookie" in doc_texts:
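A note on the new warning in the hunk above: it passes `e or "<empty>"`, but exception instances are truthy unless their class defines `__bool__` or `__len__`, so that fallback would normally never be chosen and an exception with an empty message would still log as an empty string. A minimal sketch of a variant that tests the message text instead; `describe_exception` is a hypothetical helper, not part of this commit.

```python
def describe_exception(exc: BaseException) -> str:
    """Return 'message (TypeName)', substituting '<empty>' when str(exc) is empty."""
    message = str(exc) or "<empty>"  # test the text, not the exception object
    return f"{message} ({type(exc).__name__})"


# The exception object itself is truthy even when its message is empty,
# so `exc or "<empty>"` evaluates to exc and would not pick the fallback:
assert bool(TimeoutError()) is True
assert str(TimeoutError()) == ""
```

In the logging call this would correspond to passing `str(e) or "<empty>"` as the first argument.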
@@ -530,12 +543,35 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
             )
             cookie_payloads = []
             cookie_text = ""
+            # P30: aggregate cmp_payloads from ALL doc_entries — sites
+            # like Mercedes load Usercentrics only on the homepage, so
+            # the JSON gets captured during DSE/Impressum discovery, not
+            # in the cookies.html fetch. Dedup by URL since the same
+            # payload is captured on every page load.
+            seen_cmp_urls: set[str] = set()
             for e in doc_entries:
-                if e.get("doc_type") == "cookie":
-                    if e.get("cmp_payloads"):
-                        cookie_payloads.extend(e["cmp_payloads"])
-                    if e.get("text"):
-                        cookie_text = e["text"]
+                for p in (e.get("cmp_payloads") or []):
+                    p_url = p.get("url") or ""
+                    if p_url and p_url in seen_cmp_urls:
+                        continue
+                    seen_cmp_urls.add(p_url)
+                    cookie_payloads.append(p)
+                if e.get("doc_type") == "cookie" and e.get("text"):
+                    cookie_text = e["text"]
+            # P48: also pull cmp_payloads from the Banner-Scan (homepage
+            # 3-phase consent test). Mercedes' Usercentrics-JSON is
+            # captured there even when not in DSI-Discovery of static
+            # legal pages.
+            if banner_result:
+                for p in (banner_result.get("cmp_payloads") or []):
+                    p_url = p.get("url") or ""
+                    if p_url and p_url in seen_cmp_urls:
+                        continue
+                    seen_cmp_urls.add(p_url)
+                    cookie_payloads.append(p)
+            if cookie_payloads:
+                logger.info("P48: %d CMP-payloads available for vendor-extract (after Banner-Scan merge)",
+                            len(cookie_payloads))
             # P17-D: Fallback when the cookie doc was deduped via P15: use the
             # DSE text if it contains cookie terms, so that the LLM vendor
             # extract can still run.
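The two loops in the hunk above apply the same URL-based dedup to two sources (doc_entries and the banner scan). A compact sketch of that pattern as one helper, under the assumption that each payload is a dict with an optional `url` key; `merge_cmp_payloads` is a hypothetical name and does not exist in the commit.

```python
from typing import Iterable, Optional


def merge_cmp_payloads(*sources: Optional[Iterable[dict]]) -> list[dict]:
    """Concatenate payload lists, keeping only the first payload per non-empty URL."""
    seen_urls: set[str] = set()
    merged: list[dict] = []
    for source in sources:
        for payload in source or []:
            url = payload.get("url") or ""
            if url and url in seen_urls:
                continue  # same CMP JSON already captured on another page
            if url:
                seen_urls.add(url)
            merged.append(payload)  # payloads without a URL are always kept
    return merged
```

As in the hunk, payloads without a URL cannot be deduplicated and are therefore all retained; whether that is desirable depends on how often the scanner emits URL-less payloads.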
@@ -570,6 +606,160 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                     category=v.get("category", ""),
                     owner_name=owner_name,
                 )
+            # P57: Phase G vendor_details as an additional vendor source.
+            # If extract_vendors_from_payloads finds fewer vendors than
+            # Phase G's info click-through (e.g. Mercedes settings not
+            # recognized as usercentrics-kind), add the Phase G names as
+            # standalone vendors.
+            if banner_result:
+                vd_list = banner_result.get("vendor_details") or []
+                vd_list = [v for v in vd_list if v.get("name") != "__TDM_OPTOUT__"]
+                existing_names = {(v.get("name") or "").strip().lower()
+                                  for v in cmp_vendors}
+                added = 0
+                for d in vd_list:
+                    n = (d.get("name") or "").strip()
+                    if not n or n.lower() in existing_names:
+                        continue
+                    # Skip generic category-labels (Mercedes categories)
+                    if n.lower() in ("technisch erforderlich", "analyse und statistik",
+                                     "marketing", "alles auswählen",
+                                     "alles auswaehlen"):
+                        continue
+                    from compliance.services.vendor_classifier import classify
+                    cmp_vendors.append({
+                        "name": n,
+                        "country": "",
+                        "purpose": d.get("description", "")[:500],
+                        "category": "",
+                        "opt_out_url": d.get("opt_out_url", ""),
+                        "privacy_policy_url": d.get("privacy_url", ""),
+                        "persistence": d.get("retention", ""),
+                        "cookies": d.get("cookies", []),
+                        "processing_company": d.get("processing_company", ""),
+                        "address": d.get("address", ""),
+                        "purposes": d.get("purposes", []),
+                        "technologies": d.get("technologies", []),
+                        "recipient_type": classify(
+                            vendor_name=n, category="", owner_name=owner_name,
+                        ),
+                    })
+                    existing_names.add(n.lower())
+                    added += 1
+                if added:
+                    logger.info("P57: added %d new vendors from Phase G (total: %d)",
+                                added, len(cmp_vendors))
+
+            # P50: enrich vendors with per-vendor detail-modal-extracts
+            # (description, opt-out URL, privacy URL, cookies). Detail
+            # comes from Phase G Info-button-click-through in /scan.
+            tdm_opt_out_notice = ""
+            if cmp_vendors and banner_result:
+                vendor_details = banner_result.get("vendor_details") or []
+                # P50f: filter out TDM-opt-out sentinel
+                tdm_sentinel = next((v for v in vendor_details
+                                     if v.get("name") == "__TDM_OPTOUT__"), None)
+                if tdm_sentinel:
+                    tdm_opt_out_notice = tdm_sentinel.get("description", "")
+                    logger.info("P50f: TDM opt-out — skipped detail-enrichment for vendors")
+                vendor_details = [v for v in vendor_details
+                                  if v.get("name") != "__TDM_OPTOUT__"]
+                if vendor_details:
+                    details_by_name = {}
+                    for d in vendor_details:
+                        n = (d.get("name") or "").strip().lower()
+                        if n:
+                            details_by_name[n] = d
+                    enriched = 0
+                    for v in cmp_vendors:
+                        key = (v.get("name") or "").strip().lower()
+                        # Substring fallback for fuzzy matches (e.g.
+                        # "Google Analytics" detail-name may differ slightly)
+                        d = details_by_name.get(key)
+                        if not d:
+                            for dn, dv in details_by_name.items():
+                                if key in dn or dn in key:
+                                    d = dv
+                                    break
+                        if not d:
+                            continue
+                        if not v.get("country") and (d.get("processing_company") or d.get("address")):
+                            # Heuristic country extract from address (DE/EU keywords)
+                            addr = d.get("address", "")
+                            if re.search(r"\b(deutschland|germany|berlin|m(?:ue|ü)nchen|hamburg|stuttgart)\b", addr, re.I):
+                                v["country"] = "DE"
+                            elif re.search(r"\bireland|irland|dublin\b", addr, re.I):
+                                v["country"] = "IE"
+                            elif re.search(r"\busa|united states|california|new york|delaware\b", addr, re.I):
+                                v["country"] = "US"
+                        if not v.get("purpose"):
+                            v["purpose"] = d.get("description", "")[:500]
+                        if not v.get("opt_out_url"):
+                            v["opt_out_url"] = d.get("opt_out_url", "")
+                        if not v.get("privacy_policy_url"):
+                            v["privacy_policy_url"] = d.get("privacy_url", "")
+                        if not v.get("cookies"):
+                            v["cookies"] = d.get("cookies", [])
+                        v["purposes"] = d.get("purposes", [])
+                        v["technologies"] = d.get("technologies", [])
+                        if not v.get("persistence"):
+                            v["persistence"] = d.get("retention", "")
+                        v["processing_company"] = d.get("processing_company", "")
+                        v["address"] = d.get("address", "")
+                        enriched += 1
+                    logger.info("P50: enriched %d/%d vendors with detail-modal data",
+                                enriched, len(cmp_vendors))
+            # P59b: Cookie-Behavior-Validator: check all set cookies against
+            # our library and generate 3-tier severity findings.
+            # The background task has no DB dependency injection, so open
+            # SessionLocal ourselves and close it cleanly.
+            cookie_behavior_findings: list[dict] = []
+            if banner_result:
+                cookies_detailed = banner_result.get("cookies_detailed") or []
+                if cookies_detailed:
+                    cb_session = None
+                    try:
+                        from database import SessionLocal
+                        from compliance.services.cookie_behavior_validator import (
+                            validate_cookie_behavior,
+                        )
+                        from urllib.parse import urlparse
+                        fp_domain = ""
+                        if banner_url:
+                            fp_domain = urlparse(banner_url).netloc.replace("www.", "")
+                        cb_session = SessionLocal()
+                        cookie_behavior_findings = validate_cookie_behavior(
+                            cb_session, cookies_detailed,
+                            network_requests=[], # TODO Layer B in P59d
+                            first_party_domain=fp_domain,
+                        )
+                        if cookie_behavior_findings:
+                            sevs = {f["severity"] for f in cookie_behavior_findings}
+                            logger.info(
+                                "P59b: Cookie-Behavior-Check %d findings "
+                                "(severities: %s) ueber %d Cookies",
+                                len(cookie_behavior_findings),
+                                sorted(sevs),
+                                len(cookies_detailed),
+                            )
+                            banner_result["cookie_behavior_findings"] = (
+                                cookie_behavior_findings
+                            )
+                        else:
+                            logger.info(
+                                "P59b: Cookie-Behavior-Check 0 findings "
+                                "ueber %d Cookies (library miss / clean)",
+                                len(cookies_detailed),
+                            )
+                    except Exception as cb_err:
+                        logger.warning("P59b Cookie-Behavior-Check failed: %s", cb_err)
+                    finally:
+                        if cb_session is not None:
+                            try:
+                                cb_session.close()
+                            except Exception:
+                                pass
+
             if cmp_vendors:
                 logger.info("VVT: %d vendors extracted, validating links",
                             len(cmp_vendors))
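One detail in the country heuristic of the hunk above: the DE pattern wraps its alternatives in a group, while the IE and US patterns do not. Because `|` binds more loosely than `\b`, a pattern like `\bireland|irland|dublin\b` matches `irland` anywhere inside a longer word. The sketch below shows a grouped variant; `guess_country` and `_COUNTRY_PATTERNS` are hypothetical names, and the keyword lists are copied from the hunk without any claim that they are complete.

```python
import re

# Each pattern wraps its alternatives in a non-capturing group so that the
# word boundaries apply to every alternative, not just the first and last.
_COUNTRY_PATTERNS = [
    ("DE", re.compile(r"\b(?:deutschland|germany|berlin|m(?:ue|ü)nchen|hamburg|stuttgart)\b", re.I)),
    ("IE", re.compile(r"\b(?:ireland|irland|dublin)\b", re.I)),
    ("US", re.compile(r"\b(?:usa|united states|california|new york|delaware)\b", re.I)),
]


def guess_country(address: str) -> str:
    """Return a two-letter country code for the first matching pattern, else ''."""
    for code, pattern in _COUNTRY_PATTERNS:
        if pattern.search(address):
            return code
    return ""
```

For example, the ungrouped IE pattern matches the made-up street name "Wirlandstrasse", whereas the grouped one does not.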
@@ -1149,10 +1339,15 @@ _DISCOVERY_RULES: list[tuple[str, tuple[str, ...]]] = [
                   "right-of-withdrawal", "ruecktritts", "rücktritts")),
     ("social_media", ("social-media", "soziale-medien", "social_media",
                       "social-media-policy")),
+    # P23: 'terms-and-conditions' can mean general terms and conditions (AGB) OR
+    # terms of use (Nutzungsbedingungen). The discovery function classifies more
+    # precisely later by title + content. Here only a URL hint:
     ("agb", ("/agb", "geschaeftsbedingungen", "geschäftsbedingungen",
-             "terms-and-conditions", "general-terms")),
-    ("nutzungsbedingungen", ("nutzungsbedingung", "terms-of-use",
-                             "nutzungsordnung", "terms-of-service")),
+             "general-terms")),
+    ("nutzungsbedingungen", ("nutzungsbedingung", "nutzungsbedingungen",
+                             "terms-of-use", "terms-and-conditions",
+                             "nutzungsordnung", "terms-of-service",
+                             "allgemeine-nutzungsbedingungen")),
     ("dsb", ("datenschutzbeauftragt", "data-protection-officer",
              "dpo-contact", "/dsb")),
     ("impressum", ("impressum", "imprint", "legal-notice", "site-notice",
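To illustrate the effect of moving "terms-and-conditions" from the "agb" rule to the "nutzungsbedingungen" rule, here is a small sketch. It assumes that rules are tried in order, that the first rule with a needle contained in the lowercased URL wins, and it uses only the two rules from the hunk; the function that actually consumes `_DISCOVERY_RULES` is not shown in this diff, so `url_hint` is a hypothetical stand-in.

```python
# Two rules copied from the new side of the hunk; the real list is longer.
_RULES: list[tuple[str, tuple[str, ...]]] = [
    ("agb", ("/agb", "geschaeftsbedingungen", "geschäftsbedingungen",
             "general-terms")),
    ("nutzungsbedingungen", ("nutzungsbedingung", "nutzungsbedingungen",
                             "terms-of-use", "terms-and-conditions",
                             "nutzungsordnung", "terms-of-service",
                             "allgemeine-nutzungsbedingungen")),
]


def url_hint(url: str) -> str:
    """Return the doc type of the first rule whose needle occurs in the URL, else ''."""
    lowered = url.lower()
    for doc_type, needles in _RULES:
        if any(needle in lowered for needle in needles):
            return doc_type
    return ""
```

Under these assumptions a URL ending in /terms-and-conditions now yields the "nutzungsbedingungen" hint, and the longer needles containing "nutzungsbedingung" are already covered by that shorter substring.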