feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1)

When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now
extract structured vendor records, probe their opt-out + privacy URLs,
score each vendor (0-100), and append a 'VVT-Vorschlag' (proposed
record of processing activities) table to the compliance email: one
row per vendor, sorted by compliance score.

consent-tester:
- DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers
- DSIDiscoveryResponse: new cmp_payloads field
- discover_dsi_documents sets cmp_payloads from cmp_capture
- cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning
  list[VendorRecord]
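
A minimal sketch of what an extract_vendors(d) function could look like. The VendorRecord fields and the payload layout (a top-level "vendors" list with "name", "purpose", "country", "optOutUrl", "privacyUrl", "cookies") are assumptions for illustration only; the real ePaaS and OneTrust schemas are not shown in this commit and likely differ.

```python
from dataclasses import dataclass, field


@dataclass
class VendorRecord:
    # Hypothetical field set; the real VendorRecord is not shown in this commit.
    name: str
    purpose: str = ""
    country: str = ""
    opt_out_url: str = ""
    privacy_url: str = ""
    cookies: list[dict] = field(default_factory=list)


def extract_vendors(d: dict) -> list[VendorRecord]:
    """Map one raw CMP payload to vendor records (assumed payload layout)."""
    records = []
    for raw in d.get("vendors", []):
        name = (raw.get("name") or "").strip()
        if not name:
            continue  # skip entries without a usable vendor name
        records.append(VendorRecord(
            name=name,
            purpose=raw.get("purpose", ""),
            country=raw.get("country", ""),
            opt_out_url=raw.get("optOutUrl", ""),
            privacy_url=raw.get("privacyUrl", ""),
            cookies=list(raw.get("cookies", [])),
        ))
    return records


# Made-up sample payload, not real CMP data.
sample = {"vendors": [
    {"name": " Example Analytics ", "purpose": "Statistics", "country": "DE",
     "optOutUrl": "https://example.com/opt-out",
     "cookies": [{"name": "_ex", "expiry": "1 year"}]},
    {"name": ""},
]}
vendors = extract_vendors(sample)
```

Missing keys fall back to empty values, so a later scoring step can treat "absent" and "empty" the same way.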

backend:
- _fetch_text() now returns a (text, cmp_payloads) tuple
- doc_entries store cmp_payloads per doc (mostly the cookie policy)
- _autodiscover_missing forwards homepage payloads to the cookie entry
- New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic
  schemas; dedupes vendors across multiple payloads
- cookie_link_validator.py extended with validate_vendor_urls(vendors)
  and score_vendors(vendors): a 0-100 score per vendor based on name,
  purpose, country, opt-out URL reachable, privacy URL reachable, and
  cookies listed with names + expiry
- agent_doc_check_extras.build_vvt_table_html: renders the table
- Route appends VVT HTML after the provider list, before the
  document-by-document report
- Response JSON gains cmp_vendors for future frontend rendering
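
The dedupe and scoring steps listed above could be sketched roughly as follows. This is not the code from vendor_extractor.py or cookie_link_validator.py: the weights, the dict-based vendor shape, the "opt_out_ok" / "privacy_ok" keys, and the case-insensitive name key used for deduplication are all assumptions. Only the six criteria and the 0-100 range come from the commit message.

```python
# Hypothetical weights per criterion; they sum to 100.
WEIGHTS = {
    "name": 10,
    "purpose": 15,
    "country": 15,
    "opt_out_ok": 25,
    "privacy_ok": 20,
    "cookies": 15,
}


def dedupe_vendors(vendors: list[dict]) -> list[dict]:
    """Keep the first record per vendor name, compared case-insensitively."""
    seen = {}
    for v in vendors:
        key = v.get("name", "").strip().lower()
        if key and key not in seen:
            seen[key] = v
    return list(seen.values())


def score_vendor(v: dict) -> int:
    """Return a 0-100 score from the six criteria in the commit message."""
    cookies = v.get("cookies") or []
    # Cookie points only when every listed cookie has a name and an expiry.
    cookies_ok = bool(cookies) and all(
        c.get("name") and c.get("expiry") for c in cookies
    )
    checks = {
        "name": bool(v.get("name")),
        "purpose": bool(v.get("purpose")),
        "country": bool(v.get("country")),
        "opt_out_ok": bool(v.get("opt_out_ok")),
        "privacy_ok": bool(v.get("privacy_ok")),
        "cookies": cookies_ok,
    }
    return sum(WEIGHTS[k] for k, passed in checks.items() if passed)


# Made-up vendors, as they might look after URL validation.
vendors = dedupe_vendors([
    {"name": "Example Ads", "purpose": "Marketing", "country": "US",
     "opt_out_ok": True, "privacy_ok": True,
     "cookies": [{"name": "_ea", "expiry": "30 days"}]},
    {"name": "example ads"},  # duplicate of the first entry, dropped
    {"name": "Example CDN", "privacy_ok": True},
])
scores = {v["name"]: score_vendor(v) for v in vendors}
```

Keeping the weights in one table makes it easy to rebalance the criteria without touching the scoring logic.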

Example for BMW: ~30 ePaaS providers → table with Name | Kategorie |
Sitz | Cookies | Opt-Out (✓/✗) | Privacy (✓/✗) | Score. Sorted by
score ascending so the least compliant vendors are at the top.
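
For illustration, a table renderer with the column layout and ascending sort described above might look like the sketch below. It is a stand-in, not the actual agent_doc_check_extras.build_vvt_table_html: the function name render_vvt_table, the dict keys, and the choice to show a cookie count in the Cookies column are assumptions.

```python
from html import escape


def render_vvt_table(vendors: list[dict]) -> str:
    """Render one row per vendor, lowest score first (hypothetical helper)."""
    def mark(ok) -> str:
        return "✓" if ok else "✗"

    header = ["Name", "Kategorie", "Sitz", "Cookies", "Opt-Out", "Privacy", "Score"]
    rows = []
    # Ascending sort puts the lowest-scoring vendors at the top.
    for v in sorted(vendors, key=lambda v: v.get("score", 0)):
        cells = [
            v.get("name", ""),
            v.get("category", ""),
            v.get("country", ""),
            str(len(v.get("cookies", []))),  # assumption: show a cookie count
            mark(v.get("opt_out_ok")),
            mark(v.get("privacy_ok")),
            str(v.get("score", 0)),
        ]
        # Escape every cell, since vendor data comes from third-party JSON.
        rows.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>")
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    return "<table>" + head + "".join(rows) + "</table>"


# Made-up vendors with precomputed scores.
table_html = render_vvt_table([
    {"name": "Good & Co", "score": 90, "opt_out_ok": True, "privacy_ok": True},
    {"name": "Bad Vendor", "score": 20},
])
```
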
Benjamin Admin
2026-05-17 09:50:11 +02:00
parent c9c0fb5965
commit ea4dbb223f
8 changed files with 592 additions and 16 deletions
+13 -2
@@ -168,6 +168,10 @@ class DSIDiscoveryResult:
     total_found: int = 0
     languages_detected: list[str] = field(default_factory=list)
     errors: list[str] = field(default_factory=list)
+    # Raw CMP payloads captured during navigation (one per matched JSON).
+    # Schema: [{"kind": str, "data": dict}, ...]
+    # Backend uses these to build vendor records + run per-vendor checks.
+    cmp_payloads: list[dict] = field(default_factory=list)

 def _matches_dsi_keyword(text: str) -> tuple[bool, str]:
     """Check if text contains any DSI keyword. Returns (match, language)."""
@@ -270,6 +274,10 @@ async def discover_dsi_documents(
         logger.info("PDF redirect detected: %s -> %s", url, final_url)
         # Return early: a PDF redirect means no HTML content to scan
         result.total_found = len(result.documents)
+        result.cmp_payloads = [
+            {"kind": kind, "data": data}
+            for kind, data in cmp_capture.payloads
+        ]
         return result

     # Step 1b: Try dismissing cookie consent banners before extraction.
@@ -534,8 +542,11 @@ async def discover_dsi_documents(
     result.languages_detected = list(set(
         d.language for d in result.documents if d.language
     ))
-    logger.info("DSI discovery complete: %d documents found in %s",
-                result.total_found, result.languages_detected)
+    result.cmp_payloads = [
+        {"kind": kind, "data": data} for kind, data in cmp_capture.payloads
+    ]
+    logger.info("DSI discovery complete: %d documents found in %s, %d CMP payloads",
+                result.total_found, result.languages_detected, len(result.cmp_payloads))
     return result

 # Nav elements, not real documents