feat(audit): A audit transparency + B table parsing + D HTML tables from DOM
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-python-backend (push) Successful in 45s
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 20s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Three related fixes for the VW finding (6 vendors instead of 100+):

A: audit_quality_checks.py: three systemic caveats that are ALWAYS shown prominently:
* banner_detected=False despite a cookie doc → HIGH 'CMP tool not loaded'
* cookie_doc >= 30k chars but cmp_vendors < 15 → HIGH/MEDIUM 'Vendor list conspicuously short for the doc size'
* submitted URL but 0/minimal text → MEDIUM 'URL could not be loaded'
Red audit-caveat box above the GF (management) one-pager. The GF summary says 'Audit incomplete' instead of wrongly saying 'No critical issues'. gf_one_pager includes audit_quality_findings in top_findings (BEFORE other findings).

B: cookies_table_parser now also runs on crawled cookie-doc text (not only on user paste). If the dsi-discovery response delivers tab/pipe-separated table rows, we parse them deterministically.

D: consent-tester/dsi-discovery now extracts, in addition to the text, the <table> elements from the DOM as list[str] (tab-separated per row, at least 2 cells, at least 3 rows, at most 10 tables per doc). The backend injects these as an 'html_table' cmp_payload and runs them through cookies_table_parser first → 100% deterministic vendor extraction without an LLM.

VW expectation: from the 65k cookie table, 30-50 vendors are now parsed deterministically instead of 6 from the LLM cascade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
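The table extraction described under D can be sketched roughly as follows. This is a minimal illustration and not the code from consent-tester/dsi-discovery, which is not part of this diff: it uses the standard-library `html.parser` instead of a browser DOM, the names `TableExtractor` and `extract_tables` are hypothetical, and it assumes that "list[str]" means one string per table with newline-separated rows. Nested tables and unclosed cells are not handled.

```python
from __future__ import annotations

from html.parser import HTMLParser

# Limits taken from the commit message.
MIN_CELLS = 2    # a row needs at least 2 cells
MIN_ROWS = 3     # a table needs at least 3 usable rows
MAX_TABLES = 10  # keep at most 10 tables per document


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of tab-separated row strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: list[list[str]] = []
        self._rows: list[str] | None = None   # rows of the table being read
        self._cells: list[str] | None = None  # cells of the row being read
        self._buf: list[str] | None = None    # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table" and self._rows is None:
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None and self._cells is not None:
            # Collapse whitespace so tabs/newlines inside a cell cannot break rows.
            self._cells.append(" ".join("".join(self._buf).split()))
            self._buf = None
        elif tag == "tr" and self._cells is not None and self._rows is not None:
            if len(self._cells) >= MIN_CELLS:
                self._rows.append("\t".join(self._cells))
            self._cells = None
        elif tag == "table" and self._rows is not None:
            if len(self._rows) >= MIN_ROWS and len(self.tables) < MAX_TABLES:
                self.tables.append(self._rows)
            self._rows = None
            self._cells = None
            self._buf = None


def extract_tables(html: str) -> list[str]:
    """Return one string per qualifying table (rows joined by newlines)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return ["\n".join(rows) for rows in parser.tables]
```

Rows in this shape (tab-separated cells) are the kind of input that B says cookies_table_parser can parse deterministically.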
@@ -0,0 +1,198 @@
"""
A: Audit transparency / audit quality checks.

If the crawler did not find everything, the mail MUST show this
prominently; otherwise the user assumes "all good" even though the data
has gaps.

Describes 4 quality failures (checks 1-3 are implemented in this module;
check 4 has no implementation here):
  1. banner_detected=False despite an existing cookie doc → CMP tool not loaded
  2. cookie_doc >= 5000 words (~30k chars) but cmp_vendors < 15 → vendor extract incomplete
  3. doc URL submitted but fewer than 200 chars loaded → crawler failure
  4. cmp_vendors > 0 but all from llm_cascade without library match → probably incomplete

These findings ALWAYS land in the GF one-pager (even if there is no other
HIGH finding). They say "the data is incomplete, manual review
recommended".
"""

from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def _word_count(text: str | None) -> int:
    if not text:
        return 0
    return len(text.split())


def check_banner_not_detected(
    banner_result: dict | None,
    cookie_doc_text: str | None,
) -> dict | None:
    """1) Banner not loaded but cookie doc present → CMP tool broken."""
    if not isinstance(banner_result, dict):
        return None
    detected = banner_result.get("banner_detected")
    if detected is None or detected is True:
        return None
    # Only flag when a substantial cookie doc (>= 5000 chars) exists.
    if not cookie_doc_text or len(cookie_doc_text) < 5000:
        return None
    return {
        "severity": "HIGH",
        "code": "audit_banner_not_detected",
        "label": "Audit caveat: the crawler could not load the cookie banner",
        "area": "Cookie banner",
        "owner": "DPO + Marketing/CMP admin",
        "detail": (
            "Our crawler could not analyze this site's CMP tool; "
            "neither the vendor list nor the cookie behavior could be checked. "
            "Possible causes: anti-bot protection (Akamai/Cloudflare/DataDome) "
            "blocks Playwright; the CMP script only loads for certain "
            "geo regions; a new CMP tool that we do not support yet. "
            "Recommendation: have the DPO check the banner manually, or "
            "paste the cookie table directly into the audit tool (copy-paste mode)."
        ),
        "legal_basis": "Art. 5(2) GDPR accountability: the audit "
                       "finding must distinguish transparently between "
                       "'checked & OK' and 'not checkable'.",
    }


def check_vendor_extract_incomplete(
    cookie_doc_text: str | None,
    cmp_vendors: list | None,
) -> dict | None:
    """2) Cookie doc large but few vendors → extract incomplete."""
    wc = _word_count(cookie_doc_text)
    n_vendors = len(cmp_vendors or [])
    # Heuristic: a cookie doc of >= 5000 words (~30k chars) should yield at
    # least 15 vendors. Fewer means the vendor extraction did not process
    # the text completely.
    if wc < 5000 or n_vendors >= 15:
        return None
    # Severity scales with doc size: the larger the doc, the more
    # conspicuous a short vendor list is.
    return {
        "severity": "HIGH" if wc >= 8000 else "MEDIUM",
        "code": "audit_vendor_extract_thin",
        "label": (
            f"Audit caveat: the cookie policy has {wc:,} words, "
            f"but we could only extract {n_vendors} vendor"
            f"{'s' if n_vendors != 1 else ''}"
        ),
        "area": "Vendor list / RoPA",
        "owner": "DPO + Marketing",
        "detail": (
            "For a doc of this size we typically expect 20-50+ "
            "vendors in a cookie policy. The low extracted "
            "count points to a table that our LLM could not parse "
            "completely. Recommendation: reconcile the RoPA table manually "
            "with the DPO / Marketing, or resubmit the cookie table in "
            "copy-paste mode, where we parse columns deterministically."
        ),
        "legal_basis": "Art. 13(1)(e) GDPR: the list of recipients must "
                       "be complete; an incomplete audit must "
                       "not be presented as complete.",
    }


def check_url_fetch_failed(doc_entries: list | None) -> list[dict]:
    """3) URL submitted but zero or minimal text → crawler failure per doc."""
    out: list[dict] = []
    for e in (doc_entries or []):
        if not isinstance(e, dict):
            continue
        url = (e.get("url") or "").strip()
        text = (e.get("text") or "").strip()
        # Skip entries without a URL, with enough text, or found by auto-discovery.
        if not url or len(text) >= 200 or e.get("auto_discovered"):
            continue
        dt = e.get("doc_type", "doc")
        rejected = e.get("rejected_url") or ""
        out.append({
            "severity": "MEDIUM",
            "code": f"audit_url_fetch_failed_{dt}",
            "label": (
                f"Audit caveat: {dt} URL could not be loaded "
                f"({len(text)} characters extracted)"
            ),
            "area": dt,
            "owner": "DPO + Web team",
            "detail": (
                f"The submitted URL {url[:120]} returned fewer than 200 "
                "characters. Possible causes: 404, JS-only rendering, anti-bot, "
                "cookie wall. Auto-discovery tried to find an alternative "
                "on the homepage, without success. Recommendation: "
                "check that the URL is correct or paste the text directly "
                "(copy-paste mode)."
            ),
            "legal_basis": "Art. 5(2) GDPR accountability.",
        })
    return out


def run_all(
    banner_result: dict | None,
    cookie_doc_text: str | None,
    cmp_vendors: list | None,
    doc_entries: list | None,
) -> list[dict]:
    findings: list[dict] = []
    try:
        f1 = check_banner_not_detected(banner_result, cookie_doc_text)
        if f1:
            findings.append(f1)
    except Exception as e:
        logger.warning("audit_banner_not_detected failed: %s", e)
    try:
        f2 = check_vendor_extract_incomplete(cookie_doc_text, cmp_vendors)
        if f2:
            findings.append(f2)
    except Exception as e:
        logger.warning("audit_vendor_extract_thin failed: %s", e)
    try:
        findings.extend(check_url_fetch_failed(doc_entries))
    except Exception as e:
        logger.warning("audit_url_fetch_failed failed: %s", e)
    return findings


def build_audit_quality_block_html(findings: list[dict]) -> str:
    if not findings:
        return ""
    items: list[str] = []
    for f in findings:
        sev = f.get("severity", "MEDIUM")
        # Red for HIGH, amber for everything else.
        sev_color = "#dc2626" if sev == "HIGH" else "#d97706"
        items.append(
            f'<li style="margin-bottom:10px;font-size:11px;line-height:1.5">'
            f'<strong style="color:{sev_color}">[{sev}] {f.get("label","")}</strong>'
            f'<div style="color:#475569;margin-top:3px">{f.get("detail","")}</div>'
            f'<div style="color:#94a3b8;margin-top:2px;font-style:italic">'
            f'{f.get("legal_basis","")}</div>'
            f'</li>'
        )
    return (
        '<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
        'max-width:760px;margin:0 auto 16px;padding:14px 18px;'
        'background:#fee2e2;border:1px solid #fecaca;border-radius:8px">'
        '<div style="font-size:11px;color:#991b1b;text-transform:uppercase;'
        'letter-spacing:1.2px;margin-bottom:4px;font-weight:600">'
        'Audit caveat: data incomplete</div>'
        f'<h3 style="margin:0 0 6px;font-size:14px;color:#1e293b">'
        f'{len(findings)} point'
        f'{"s" if len(findings) != 1 else ""} where the audit itself '
        f'reached its limits</h3>'
        '<p style="margin:0 0 10px;font-size:11px;color:#475569;line-height:1.5">'
        'The following points do NOT concern the compliance of your website, '
        'but the completeness of our review. For these areas '
        'you should not read the audit as "all ok", but re-check manually '
        'or in copy-paste mode.'
        '</p>'
        '<ul style="margin:0 0 0 18px;padding:0">'
        + "".join(items) +
        '</ul></div>'
    )
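check_url_fetch_failed interpolates the submitted URL into `detail`, and build_audit_quality_block_html places `label`, `detail` and `legal_basis` into the HTML as-is. If those values can contain markup, escaping them before rendering may be worth considering. The following is a minimal sketch of that idea, not part of this commit: `render_finding_item` is a hypothetical helper, and the inline styles are shortened for readability.

```python
from __future__ import annotations

from html import escape


def render_finding_item(finding: dict) -> str:
    """Render one finding as an <li>, escaping every text field."""
    sev = str(finding.get("severity", "MEDIUM"))
    # Same color rule as in the diff: red for HIGH, amber otherwise.
    sev_color = "#dc2626" if sev == "HIGH" else "#d97706"
    label = escape(str(finding.get("label", "")))
    detail = escape(str(finding.get("detail", "")))
    legal = escape(str(finding.get("legal_basis", "")))
    return (
        f'<li><strong style="color:{sev_color}">[{escape(sev)}] {label}</strong>'
        f"<div>{detail}</div>"
        f"<div><em>{legal}</em></div></li>"
    )


item = render_finding_item({
    "severity": "MEDIUM",
    "label": "Audit caveat: doc URL could not be loaded",
    "detail": "The submitted URL https://a.example/?q=<script>alert(1)</script> ...",
})
```

With this helper, markup in a submitted URL is rendered as text (`&lt;script&gt;`) instead of being interpreted by the mail client or browser.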