feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1)

When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now extract structured vendor records, probe their opt-out and privacy URLs, score each vendor (0-100), and append a 'VVT-Vorschlag' (VVT proposal) table to the compliance email: one row per vendor, sorted by compliance score.

consent-tester:
- DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers
- DSIDiscoveryResponse: new cmp_payloads field
- discover_dsi_documents sets cmp_payloads from cmp_capture
- cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning list[VendorRecord]

backend:
- _fetch_text() now returns a (text, cmp_payloads) tuple
- doc_entries store cmp_payloads per doc (mostly for the cookie doc)
- _autodiscover_missing forwards homepage payloads to the cookie entry
- New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic schemas; dedupes vendors across multiple payloads
- cookie_link_validator.py extended with validate_vendor_urls(vendors) and score_vendors(vendors): a 0-100 score per vendor based on name, purpose, country, opt-out reachable, privacy URL reachable, and cookies with names + expiry
- agent_doc_check_extras.build_vvt_table_html: renders the table
- Route appends the VVT HTML after the provider list, before the document-by-document report
- Response JSON gains cmp_vendors for future frontend rendering

Example for BMW: ~30 ePaaS providers → table with columns Name | Kategorie | Sitz | Cookies | Opt-Out (✓/✗) | Privacy (✓/✗) | Score. Rows are sorted by score ascending so the least compliant vendors are at the top.
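
The scoring step described above can be illustrated with a minimal sketch. This is not the code from cookie_link_validator.py: the VendorSketch class, its field names, the WEIGHTS values, and the helper names are all assumptions made for illustration. Only the list of criteria (name, purpose, country, opt-out reachable, privacy URL reachable, cookies with names + expiry) and the ascending sort come from this commit message.

```python
from dataclasses import dataclass, field

# Hypothetical weights (not taken from the commit); they only need to sum to 100.
WEIGHTS = {
    "name": 10,
    "purpose": 15,
    "country": 15,
    "opt_out": 25,
    "privacy": 20,
    "cookies": 15,
}


@dataclass
class VendorSketch:
    """Illustrative stand-in for the real VendorRecord; field names are assumed."""
    name: str = ""
    purpose: str = ""
    country: str = ""
    opt_out_reachable: bool = False
    privacy_reachable: bool = False
    cookies: list[dict] = field(default_factory=list)


def score_vendor(v: VendorSketch) -> int:
    """Return a 0-100 score: each satisfied criterion adds its weight."""
    # Cookies count only if at least one is listed and every one has name + expiry.
    cookies_ok = bool(v.cookies) and all(
        c.get("name") and c.get("expiry") for c in v.cookies
    )
    checks = {
        "name": bool(v.name.strip()),
        "purpose": bool(v.purpose.strip()),
        "country": bool(v.country.strip()),
        "opt_out": v.opt_out_reachable,
        "privacy": v.privacy_reachable,
        "cookies": cookies_ok,
    }
    return sum(WEIGHTS[key] for key, ok in checks.items() if ok)


def worst_first(vendors: list[VendorSketch]) -> list[VendorSketch]:
    """Sort ascending by score so the least compliant vendors come first."""
    return sorted(vendors, key=score_vendor)
```

With additive weights, a vendor that satisfies every criterion scores 100 and one that satisfies none scores 0, which matches the 0-100 range stated above. The real weighting may differ.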
2026-05-17 09:50:11 +02:00
parent c9c0fb5965
commit ea4dbb223f
8 changed files with 592 additions and 16 deletions
@@ -168,6 +168,10 @@ class DSIDiscoveryResult:
    total_found: int = 0
    languages_detected: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
+    # Raw CMP payloads captured during navigation (one per matched JSON).
+    # Schema: [{"kind": str, "data": dict}, ...]
+    # Backend uses these to build vendor records + run per-vendor checks.
+    cmp_payloads: list[dict] = field(default_factory=list)

 def _matches_dsi_keyword(text: str) -> tuple[bool, str]:
    """Check if text contains any DSI keyword. Returns (match, language)."""
@@ -270,6 +274,10 @@ async def discover_dsi_documents(
                logger.info("PDF redirect detected: %s -> %s", url, final_url)
            # Return early — a PDF redirect means no HTML content to scan
            result.total_found = len(result.documents)
+            result.cmp_payloads = [
+                {"kind": kind, "data": data}
+                for kind, data in cmp_capture.payloads
+            ]
            return result

        # Step 1b: Try dismissing cookie consent banners before extraction.
@@ -534,8 +542,11 @@ async def discover_dsi_documents(
    result.languages_detected = list(set(
        d.language for d in result.documents if d.language
    ))
-    logger.info("DSI discovery complete: %d documents found in %s",
-                result.total_found, result.languages_detected)
+    result.cmp_payloads = [
+        {"kind": kind, "data": data} for kind, data in cmp_capture.payloads
+    ]
+    logger.info("DSI discovery complete: %d documents found in %s, %d CMP payloads",
+                result.total_found, result.languages_detected, len(result.cmp_payloads))
    return result

 # Nav elements, not real documents