feat(audit): Cookie-Compliance-Audit (3-Quellen-Vergleich) + Vendor-Dedup + Block-Parser
CI / detect-changes (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 55s
CI / iace-gt-coverage (push) Successful in 25s
CI / test-python-backend (push) Successful in 44s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 18s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m43s
CI / detect-changes (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 55s
CI / iace-gt-coverage (push) Successful in 25s
CI / test-python-backend (push) Successful in 44s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 18s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m43s
ZENTRALER USP: cookie_compliance_audit.py vergleicht 3 Quellen * DEKLARIERT in Cookie-Richtlinie (parse_cookie_table + parse_flat) * TATSAECHLICH im Browser geladen (banner_result.phases.after_accept) * LIBRARY-Metadaten (cookie_library lookup) Liefert 3 Listen mit Compliance-Verdict: * compliant (deklariert UND geladen) — gruener Block * undeclared_in_browser (geladen NICHT deklariert) — ROTER HIGH-Block → Art. 13(1)(c) DSGVO + § 25 TDDDG Verstoss * declared_not_loaded (deklariert NICHT geladen) — gelber Hinweis → Tabelle moeglicherweise veraltet parse_cookie_table erweitert um Block-Format (5 Zeilen pro Cookie wie beim User-Copy aus VW). Findet 35+ Cookies aus Copy-Paste statt 0. vendor_normalizer.py: 50+ Aliases (Google-Familie, Adobe-Familie, Trade Desk, AdForm, ...) + Garbage-Filter (URLs, leere Strings, 'click to select', 'Mehrere OEMs'). Mergt cookies-Listen beim Dedup. _guess_vendor erweitert: Adobe-Familie (s_ecid/AMCV/demdex/mbox/...), Trade Desk (TDID/TDCPM/TTDOptOut), AdForm (uid/cid/otsid), Salesforce LiveAgent, etracker, Akamai, EDAA. audit_quality_checks: vendor-thin-Threshold jetzt dynamisch nach Cookie-Doc-Wörter (3k→10 / 6k→20 / 10k→30 / 15k+→40). VW-Test-Fixture: tests/fixtures/cookie_gt/vw_cookie_richtlinie.txt (36-Cookie-Sample fuer Regression-Tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -67,33 +67,48 @@ def check_vendor_extract_incomplete(
|
||||
cookie_doc_text: str | None,
|
||||
cmp_vendors: list | None,
|
||||
) -> dict | None:
|
||||
"""2) Cookie-Doc gross aber wenig Vendors → Extract unvollstaendig."""
|
||||
"""2) Cookie-Doc gross aber wenig Vendors → Extract unvollstaendig.
|
||||
|
||||
Dynamische Schwelle nach Doc-Groesse:
|
||||
* 3k-6k Wörter → mind. 10 Vendors erwartet
|
||||
* 6k-10k Wörter → mind. 20 Vendors
|
||||
* 10k-15k Wörter → mind. 30 Vendors
|
||||
* 15k+ Wörter → mind. 40 Vendors
|
||||
"""
|
||||
wc = _word_count(cookie_doc_text)
|
||||
n_vendors = len(cmp_vendors or [])
|
||||
# Heuristik: Cookie-Doc >= 5000 Wörter (~30k chars) sollte zu mind. 15
|
||||
# Vendors fuehren. Wenn weniger → Vendor-Extraktion hat den Text nicht
|
||||
# vollstaendig verarbeitet.
|
||||
if wc < 5000 or n_vendors >= 15:
|
||||
if wc < 3000:
|
||||
return None
|
||||
# Erwartete Vendor-Anzahl heuristisch nach Doc-Groesse
|
||||
if wc >= 15000:
|
||||
expected = 40
|
||||
elif wc >= 10000:
|
||||
expected = 30
|
||||
elif wc >= 6000:
|
||||
expected = 20
|
||||
else:
|
||||
expected = 10
|
||||
if n_vendors >= expected:
|
||||
return None
|
||||
# Verhaeltniszahl bilden — je groesser das Doc, desto auffaelliger
|
||||
return {
|
||||
"severity": "HIGH" if wc >= 8000 else "MEDIUM",
|
||||
"code": "audit_vendor_extract_thin",
|
||||
"label": (
|
||||
f"Audit-Vorbehalt: Cookie-Richtlinie hat {wc:,} Wörter, "
|
||||
f"wir konnten aber nur {n_vendors} Vendor"
|
||||
f"{'en' if n_vendors != 1 else ''} extrahieren"
|
||||
f"erwartet ~{expected} Vendors, extrahiert nur {n_vendors}"
|
||||
).replace(",", "."),
|
||||
"area": "Vendor-Liste / VVT",
|
||||
"owner": "DSB + Marketing",
|
||||
"detail": (
|
||||
"Bei dieser Doc-Groesse erwarten wir typischerweise 20-50+ "
|
||||
"Vendors in einer Cookie-Richtlinie. Die niedrige extrahierte "
|
||||
"Zahl deutet auf eine Tabelle die unser LLM nicht vollstaendig "
|
||||
"parsen konnte. Empfehlung: VVT-Tabelle mit DSB / Marketing "
|
||||
"manuell abgleichen, oder die Cookie-Tabelle im Copy-Paste-Modus "
|
||||
"neu einreichen — dort parsen wir Spalten deterministisch."
|
||||
),
|
||||
f"Bei einer Cookie-Richtlinie mit {wc:,} Woertern erwarten wir "
|
||||
f"typischerweise {expected}+ unique Vendors. Die extrahierte Zahl "
|
||||
f"({n_vendors}) ist auffaellig niedrig — entweder hat unser "
|
||||
"Parser/LLM die Tabelle nicht vollstaendig erfasst oder "
|
||||
"Vendors wurden zu konservativ erkannt. Empfehlung: Cookie-"
|
||||
"Tabelle im Copy-Paste-Modus einreichen (Frontend-Toggle "
|
||||
"'Text einfuegen' pro Cookie-Doc-Zeile) — dort parsen wir "
|
||||
"Spalten deterministisch."
|
||||
).replace(",", "."),
|
||||
"legal_basis": "Art. 13(1)(e) DSGVO — die Empfaengerliste muss "
|
||||
"vollstaendig sein; ein unvollstaendiger Audit darf "
|
||||
"nicht als vollstaendig dargestellt werden.",
|
||||
|
||||
Reference in New Issue
Block a user