fix(audit): VW-404-Recovery + P52 LLM-Merge + P51 Banner-UX-Checks

VW-404 fix: submitted_types now counts only doc types with >= 200
characters of real text. A submitted URL that returns a 404 or only
minimal text (VW cookie-richtlinie.html) is treated as 'missing', so
that auto-discovery tries alternative URLs found on the homepage. The
existing entry is updated in place instead of adding a duplicate
entry, and rejected_url is kept for audit transparency.
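
A minimal, standalone sketch of the VW-404 rule described above. The
helper names split_submissions and replace_in_place and the example.com
URLs are hypothetical and only illustrate the idea; the real change
lives inline in _autodiscover_missing and also applies a separate
word-count check before accepting a discovered text.

```python
MIN_USEFUL_CHARS = 200  # threshold named in the commit message


def split_submissions(doc_entries):
    """Return (submitted_types, failed_urls) for a list of doc-entry dicts."""
    submitted_types = set()
    failed_urls = set()
    for entry in doc_entries:
        text = (entry.get("text") or "").strip()
        url = (entry.get("url") or "").strip()
        if len(text) >= MIN_USEFUL_CHARS:
            # Enough real text: the doc type counts as submitted.
            submitted_types.add(entry["doc_type"])
        elif url:
            # A URL was entered but yielded (almost) no text.
            failed_urls.add(url)
    return submitted_types, failed_urls


def replace_in_place(doc_entries, doc_type, found_url, found_text, failed_urls):
    """Update the existing entry for doc_type, or append a new one."""
    existing = next(
        (e for e in doc_entries if e.get("doc_type") == doc_type), None
    )
    entry = existing if existing is not None else {"doc_type": doc_type}
    old_url = (entry.get("url") or "").strip()
    if existing is not None and old_url in failed_urls:
        # Keep the failed URL so the audit can show what was replaced.
        entry["rejected_url"] = old_url
    entry["url"] = found_url
    entry["text"] = found_text
    entry["auto_discovered"] = True
    if existing is None:
        doc_entries.append(entry)
    return entry


# Hypothetical input: one 404-like cookie page, one usable privacy page.
entries = [
    {"doc_type": "cookie",
     "url": "https://example.com/cookie-richtlinie.html",
     "text": "404 Not Found"},
    {"doc_type": "dse",
     "url": "https://example.com/datenschutz",
     "text": "x" * 250},
]
submitted, failed = split_submissions(entries)
updated = replace_in_place(
    entries, "cookie", "https://example.com/cookies", "y" * 300, failed
)
```

In this sketch the cookie entry is rewritten rather than duplicated, so
entries still has two items and the old URL survives as rejected_url.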

P52 LLM cascade merge: vendor_llm_extractor now runs when fewer than 5
vendors were found (not only when there are 0), and its results are
merged WITH the existing cmp_vendors instead of overwriting them.
VW-typical setups (generic CMP + 0 cmp_payloads) thereby also get the
text-based vendor layer.
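
The merge rule above can be sketched roughly as follows. The helper
names should_run_llm_fallback and merge_vendors and the sample vendor
names are hypothetical; the 500-word threshold is taken from the diff,
and the classification step of the real code is left out to keep the
sketch self-contained.

```python
MIN_VENDORS = 5         # below this, the LLM fallback is considered
MIN_COOKIE_WORDS = 500  # skip short texts to save LLM cost


def should_run_llm_fallback(cmp_vendors, cookie_text):
    """True if structured sources were thin and the text is substantial."""
    return (
        len(cmp_vendors) < MIN_VENDORS
        and bool(cookie_text)
        and len(cookie_text.split()) >= MIN_COOKIE_WORDS
    )


def merge_vendors(cmp_vendors, llm_vendors, source="llm_cascade"):
    """Append LLM vendors not already present; return how many were added."""
    seen = {(v.get("name") or "").strip().lower() for v in cmp_vendors}
    added = 0
    for vendor in llm_vendors:
        name = (vendor.get("name") or "").strip()
        if not name or name.lower() in seen:
            continue  # unnamed or duplicate (case-insensitive) vendor
        merged = dict(vendor)  # copy so the caller's dict is not mutated
        merged.setdefault("source", source)
        cmp_vendors.append(merged)
        seen.add(name.lower())
        added += 1
    return added


# Hypothetical data: one structured vendor, three LLM candidates.
cmp_vendors = [{"name": "Google Analytics", "source": "cmp"}]
llm_vendors = [
    {"name": "google analytics "},  # duplicate after normalisation
    {"name": "Adobe Target"},       # new
    {"name": ""},                   # unnamed, skipped
]
added = merge_vendors(cmp_vendors, llm_vendors)
run_on_short_text = should_run_llm_fallback(cmp_vendors, "too short")
run_on_long_text = should_run_llm_fallback(cmp_vendors, "word " * 600)
```

Existing entries keep their original source value, so the audit can
still tell structured vendors apart from those added by the LLM pass.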

P51: banner_consistency_checks extended with two checks:
* check_banner_copyability: scans banner_html for user-select:none /
  oncopy=return false / onselectstart. MEDIUM finding if the banner
  text cannot be copied (Art. 7 (2) GDPR).
* check_consent_history: checks for 'Meine Einwilligungen' / consent
  history ('Consent-Historie') / 'Datenschutz-Cockpit'. MEDIUM if no
  history is visible (Art. 7 (3) GDPR: withdrawing consent must be as
  easy as giving it).
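
The two checks are not part of the diff shown on this page, so the
following is only a guess at their shape, assuming simple regex and
keyword matching. The return format (a dict with check, severity and
evidence) and the exact patterns are assumptions, not the project's
actual API.

```python
import re

# Patterns that typically prevent selecting or copying text.
COPY_BLOCK_PATTERNS = [
    re.compile(r"user-select\s*:\s*none", re.IGNORECASE),
    re.compile(r"oncopy\s*=\s*[\"']?\s*return\s+false", re.IGNORECASE),
    re.compile(r"onselectstart\s*=", re.IGNORECASE),
]

# Phrases that hint at a visible consent history (German site wording).
HISTORY_MARKERS = (
    "meine einwilligungen",
    "consent-historie",
    "datenschutz-cockpit",
)


def check_banner_copyability(banner_html):
    """Return a MEDIUM finding if the banner blocks copying, else None."""
    hits = [p.pattern for p in COPY_BLOCK_PATTERNS if p.search(banner_html)]
    if not hits:
        return None
    return {"check": "banner_copyability", "severity": "MEDIUM",
            "evidence": hits}


def check_consent_history(page_text):
    """Return a MEDIUM finding if no consent history is mentioned."""
    lowered = page_text.lower()
    if any(marker in lowered for marker in HISTORY_MARKERS):
        return None
    return {"check": "consent_history", "severity": "MEDIUM"}


# Hypothetical inputs.
blocked = check_banner_copyability(
    '<div style="user-select:none">Wir nutzen Cookies</div>'
)
copyable = check_banner_copyability("<div>Wir nutzen Cookies</div>")
with_history = check_consent_history("Meine Einwilligungen ansehen")
without_history = check_consent_history("Impressum | Kontakt")
```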

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-21 17:27:55 +02:00
parent 309c10c203
commit 6dc427a754
2 changed files with 157 additions and 18 deletions
@@ -687,24 +687,42 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
cmp_vendors = extract_vendors_from_payloads(
cookie_payloads, owner_name=owner_name,
)
# V3 fallback: no named CMP captured but we have substantive
# cookie text → ask Qwen/OVH to extract vendor list from the text.
# Skip on very short text (likely navigation) to save LLM cost.
if not cmp_vendors and cookie_text and len(cookie_text.split()) >= 500:
# P52: run the LLM fallback not only when there are 0 vendors, but
# also when the structured sources yielded < 5 vendors and the
# cookie text is substantial. This way VW-typical setups
# (generic CMP, 28 cookies but 0 cmp_payloads) still get
# their real vendors from the text.
if (len(cmp_vendors) < 5
and cookie_text and len(cookie_text.split()) >= 500):
from compliance.services.vendor_llm_extractor import (
extract_vendors_via_llm,
)
from compliance.services.vendor_classifier import classify
_update(check_id, "Vendor-Liste per LLM extrahieren...", 94)
cmp_vendors = await extract_vendors_via_llm(cookie_text)
# LLM path doesn't run through extract_vendors_from_payloads,
# so classify here.
for v in cmp_vendors:
llm_vendors = await extract_vendors_via_llm(cookie_text)
# P52: classify the LLM vendors and MERGE them with the existing
# ones instead of overwriting.
existing_names = {(v.get("name") or "").strip().lower()
for v in cmp_vendors}
added_llm = 0
for v in llm_vendors:
nm = (v.get("name") or "").strip()
if not nm or nm.lower() in existing_names:
continue
v["recipient_type"] = classify(
vendor_name=v.get("name", ""),
vendor_name=nm,
category=v.get("category", ""),
owner_name=owner_name,
)
v.setdefault("source", "llm_cascade")
cmp_vendors.append(v)
existing_names.add(nm.lower())
added_llm += 1
if added_llm:
logger.info(
"P52 LLM-Cascade: +%d Vendors (total: %d)",
added_llm, len(cmp_vendors),
)
# P57: Phase G vendor_details as an additional vendor source.
# If extract_vendors_from_payloads finds fewer than
# Phase G's info click-through (e.g. Mercedes settings not
@@ -1543,11 +1561,31 @@ async def _autodiscover_missing(
"""
from urllib.parse import urlparse
# Submitted doc_types (those the user actually entered URL or text for).
# VW fix: count only doc types with a substantial text yield
# as 'submitted'. If the user entered a URL but it returns a
# 404 (VW cookie-richtlinie.html), or the crawler extracts fewer
# than 200 characters (SPA shell), treat it as 'missing'
# so that the discovery pass tries alternative URLs.
_MIN_USEFUL_CHARS = 200
submitted_types = {
e["doc_type"] for e in doc_entries
if e.get("text") or (e.get("url") or "").strip()
if len((e.get("text") or "").strip()) >= _MIN_USEFUL_CHARS
}
# Mark the failed URL submissions so that discovery does not
# try their URLs again (it would be pointless).
failed_urls: set[str] = {
(e.get("url") or "").strip()
for e in doc_entries
if (e.get("url") or "").strip()
and len((e.get("text") or "").strip()) < _MIN_USEFUL_CHARS
}
if failed_urls:
logger.info(
"VW-Fix: %d eingegebene URLs lieferten <%d Zeichen — Discovery "
"soll Alternativen probieren: %s",
len(failed_urls), _MIN_USEFUL_CHARS,
", ".join(list(failed_urls)[:3]),
)
# Map alias types to canonical
submitted_canon = {
"dse" if t in ("datenschutz", "privacy") else t for t in submitted_types
@@ -1657,16 +1695,21 @@ async def _autodiscover_missing(
if canon and canon in missing and canon not in by_type:
by_type[canon] = d
# Append a new entry for every missing canonical type. Auto-discovered
# Append/Update entry for every missing canonical type. Auto-discovered
# ones get the text/URL filled; unmatched ones stay empty so the
# padding step renders them as 'Auf der Website nicht gefunden'.
# VW fix: if an empty entry already exists (URL set, but the fetch
# returned 0/minimal text), update it in place instead of duplicating it.
filled = 0
for dt in missing:
new_entry: dict = {
existing = next((e for e in doc_entries
if e.get("doc_type") == dt), None)
new_entry: dict = existing if existing else {
"doc_type": dt, "url": "", "text": "", "word_count": 0,
"auto_discovered": False, "discovery_attempted": True,
"cmp_payloads": [],
}
new_entry["discovery_attempted"] = True
d = by_type.get(dt)
if d:
full = d.get("full_text") or d.get("text_preview") or ""
@@ -1685,21 +1728,24 @@ async def _autodiscover_missing(
full = cmp_merged
if len(full.split()) >= 100:
new_entry["text"] = full
# Keep the original URL as "rejected_url" so that the audit
# shows 'X was a 404, we found Y'.
if existing and (existing.get("url") or "").strip() in failed_urls:
new_entry["rejected_url"] = existing.get("url")
new_entry["url"] = d.get("url", "")
new_entry["word_count"] = len(full.split())
new_entry["auto_discovered"] = True
# Auto-discovery happens on the HOMEPAGE — any CMP payload
# captured at that level likely belongs to the cookie page
# (CMP widget loaded site-wide). Attach to 'cookie' entry.
if dt == "cookie" and disc_payloads:
new_entry["cmp_payloads"] = disc_payloads
doc_texts[dt] = full
filled += 1
logger.info(
"auto-discovered %s on %s: %s (%d words)",
"auto-discovered %s on %s: %s (%d words)%s",
dt, base, d.get("url", "")[:80], new_entry["word_count"],
" [REPLACED failed URL]" if existing else "",
)
doc_entries.append(new_entry)
if not existing:
doc_entries.append(new_entry)
logger.info(
"auto-discovery: filled %d/%d missing types from %s",