feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core Compliance-Check
- Sonnet check_type classification: text/process/review for all 1874 MCs in
  compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check. Plus
  fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under
  the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered by business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type
  threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue
  (15-word chunks) for mandatory fields in the Impressum. Title +
  check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP
  reconstruct; the backend prefers it over the DOM extraction when it is
  richer (BMW: 1824 vs 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors,
  detection of multiple providers per category, EU-alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie
  count + premium-feature cookies + third-party share →
  starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license costs
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100%
  band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike,
  Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for unambiguously EU vendors
  (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: per finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...)
  one structured instruction with what/why/fix_text/where/example. Meant for
  1:1 insertion into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching
  paragraph in the existing customer document for each finding. Per-run
  thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check +
  vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (Compliance-Check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4
  categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register +
  privacy-policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). Persists cmp_vendors in the check_payloads
  table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() +
  localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video
  placeholder until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

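The country inference from the legal form described above can be sketched roughly as follows. This is a minimal illustration, not the actual cookie_link_validator code: the function name infer_country, the LEGAL_FORM_COUNTRY table, and the shape of the lookup argument are assumptions, and only the four suffix mappings named in this message are included.

```python
from __future__ import annotations

import re

# Hypothetical suffix table: only the mappings named in the commit message.
# The real table in cookie_link_validator may contain more entries.
LEGAL_FORM_COUNTRY = {
    "A/S": "DK",
    "GmbH": "DE",
    "Inc": "US",
    "B.V.": "NL",
}


def infer_country(vendor_name: str,
                  lookup: dict[str, str] | None = None) -> str | None:
    """Return a country code for a vendor name, or None if nothing matches.

    `lookup` is an assumed vendor lookup table keyed by lower-cased name.
    """
    name = vendor_name.strip()
    # 1) An explicit vendor lookup entry takes precedence.
    if lookup and name.lower() in lookup:
        return lookup[name.lower()]
    # 2) Otherwise look for a legal-form suffix at the end of the name,
    #    preceded by whitespace or a comma so "Zinc" does not match "Inc".
    for suffix, country in LEGAL_FORM_COUNTRY.items():
        pattern = r"(?:^|[\s,])" + re.escape(suffix) + r"\.?$"
        if re.search(pattern, name, flags=re.IGNORECASE):
            return country
    return None
```

With this sketch, a name such as "Adform A/S" resolves through the suffix rule, while a vendor without a legal form in its name (for example "Pinterest") needs an entry in the lookup table.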
Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control-Library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master-Controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (audit DB write permission).
- Cookie text routing bug (cmp_reconstructed > DOM-extraction).
- doc_type-aware MC filter (instead of all-text-MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE        7.5% -> 81-83%
  Impressum  4%   -> 100% (all 6 real MCs fulfilled)
  Cookie     0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
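The embedding batching fix listed under the bug fixes (batches of 32 instead of 165 texts in one call) follows a common chunking pattern, sketched below. The helper name embed_in_batches and the embed_fn callable are hypothetical stand-ins; the real embedding service client and its signature are not shown in this commit view.

```python
from __future__ import annotations

from typing import Callable, Sequence

Vector = list[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 32,
) -> list[Vector]:
    """Embed `texts` in chunks of at most `batch_size`, preserving order.

    `embed_fn` is an assumed callable that maps a batch of texts to one
    vector per text (for example a thin wrapper around the embedding service).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = embed_fn(batch)
        # Guard against a service that silently drops or adds items.
        if len(result) != len(batch):
            raise RuntimeError("embedding service returned a mismatched batch")
        vectors.extend(result)
    return vectors
```

Under these assumptions, 165 input texts lead to six calls (five of 32 and one of 5) instead of a single oversized request.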
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
@@ -159,6 +159,13 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
        from .agent_doc_check_routes import CheckItem, DocCheckResult
        from .agent_doc_check_report import build_html_report

+        # Reset anchor-locator cache per run (avoid cross-run leak)
+        try:
+            from compliance.services.doc_anchor_locator import reset_cache
+            reset_cache()
+        except Exception:
+            pass
+
        # Step 1: Resolve texts (fetch from URL if needed) — 0-30%
        _update(check_id, "Texte werden geladen...", 1)
        doc_texts: dict[str, str] = {}
@@ -234,6 +241,20 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
        # Filter out doc_types that don't apply to this business profile
        skip_types = _get_skip_types(profile)

+        # Derive business_scope hints for the MC filter (O1 — Doc-type Scope-Flag).
+        # MCs that explicitly require a feature (e.g. 'biometric_processing',
+        # 'ai_decision_making', 'child_targeting') get dropped when the
+        # detected profile doesn't declare it.
+        business_scope: set[str] = set()
+        for svc in (getattr(profile, "detected_services", []) or []):
+            business_scope.add(str(svc).lower())
+        if (getattr(profile, "business_type", "") or "").lower() == "b2c":
+            business_scope.add("b2c")
+        if getattr(profile, "has_online_shop", False):
+            business_scope.add("ecommerce")
+        if getattr(profile, "is_regulated_profession", False):
+            business_scope.add("regulated_profession")
+
        # Document checks: 40-80%
        n_entries = max(1, len(doc_entries))
        for i, entry in enumerate(doc_entries):
@@ -268,6 +289,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
            result = await _check_single(
                text, doc_type, label, url,
                entry["word_count"], use_agent_flag,
+                business_scope=business_scope,
            )

            # Apply profile context filter
@@ -421,9 +443,42 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                            len(cmp_vendors))
                cmp_vendors = await validate_vendor_urls(cmp_vendors)
                cmp_vendors = score_vendors(cmp_vendors)
+                # Enrich each vendor with per-cookie functional roles
+                try:
+                    from compliance.services.cookie_function_classifier import (
+                        annotate_vendor_cookies,
+                    )
+                    cmp_vendors = [annotate_vendor_cookies(v) for v in cmp_vendors]
+                except Exception as e:
+                    logger.warning("Cookie function classification skipped: %s", e)
        except Exception as e:
            logger.warning("VVT vendor extraction skipped: %s", e)

+        # Vendor redundancy + EU alternatives + cost/savings (O4)
+        redundancy_report = None
+        try:
+            from compliance.services.vendor_redundancy import analyze as analyze_redundancy
+            from compliance.services.vendor_cost_estimator import infer_company_tier
+            if cmp_vendors:
+                # Derive the company tier from business_profile; it shapes the
+                # cost range so that, e.g., for DAX corporations starter prices
+                # do not drag down the lower bound.
+                bp_dict = {
+                    "type": getattr(profile, "business_type", ""),
+                    "features": list(business_scope),
+                }
+                ctier = infer_company_tier(bp_dict)
+                redundancy_report = analyze_redundancy(cmp_vendors, company_tier=ctier)
+                logger.info(
+                    "Redundanz: %d Kategorien mit Mehrfach-Anbietern, "
+                    "Spar-Schaetzung %s pro Jahr (company_tier=%s)",
+                    redundancy_report["summary"]["redundancy_count"],
+                    redundancy_report["summary"]["estimated_saving_pct"],
+                    ctier,
+                )
+        except Exception as e:
+            logger.warning("Vendor redundancy analysis skipped: %s", e)
+
        summary_html = build_management_summary(results)
        scanned_html = build_scanned_urls_html(doc_entries)
        providers_html = build_provider_list_html(banner_result, vvt_entries)
@@ -468,11 +523,18 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
            if scorecard else ""
        )

-        report_html = build_html_report(results, None)
+        report_html = build_html_report(results, None, doc_texts)
        profile_html = _build_profile_html(profile)
+
+        # O4: vendor redundancy / EU alternatives + cost-savings block,
+        # placed between the VVT and the doc report so that management
+        # sees the savings before going into the detailed review.
+        from .agent_doc_check_redundancy import build_redundancy_html
+        redundancy_html = build_redundancy_html(redundancy_report)
+
        full_html = (
            summary_html + scanned_html + profile_html + scorecard_html
-            + providers_html + vvt_html + report_html
+            + providers_html + vvt_html + redundancy_html + report_html
        )

        # Step 6: Send email — derive site name primarily from entered URL.
@@ -602,6 +664,7 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
                payload = resp.json()
                docs = payload.get("documents", [])
                cmp_payloads = payload.get("cmp_payloads") or []
+                cmp_cookie_text = payload.get("cmp_cookie_text") or ""
                if docs:
                    texts = []
                    for doc in docs:
@@ -609,6 +672,22 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
                        if t and len(t) > 50:
                            texts.append(t)
                    merged = "\n\n".join(texts)
+                    # For cookie/dse/social_media: when the CMP reconstruction
+                    # has more words than the DOM extraction, use it. This
+                    # fixes the BMW case where DOM yields ~600 words of
+                    # navigation but the ePaaS payload reconstructs to ~1800
+                    # words of actual cookie policy.
+                    if (doc_type in short_extract_types
+                            and cmp_cookie_text
+                            and len(cmp_cookie_text.split()) > len(merged.split())):
+                        logger.info(
+                            "Preferring CMP-reconstructed text for %s on %s "
+                            "(%d words CMP vs %d words DOM)",
+                            doc_type, url,
+                            len(cmp_cookie_text.split()),
+                            len(merged.split()),
+                        )
+                        merged = cmp_cookie_text
                    if merged and len(merged.split()) > 100:
                        if len(texts) > 1:
                            logger.info("Merged %d docs from %s (%d words)",
@@ -727,6 +806,7 @@ async def _autodiscover_missing(

    discovered: list[dict] = []
    disc_payloads: list[dict] = []
+    disc_cookie_texts: list[str] = []
    for base in crawl_bases:
        try:
            async with httpx.AsyncClient(timeout=180.0) as client:
@@ -742,8 +822,14 @@ async def _autodiscover_missing(
                body = resp.json()
                discovered.extend(body.get("documents", []) or [])
                disc_payloads.extend(body.get("cmp_payloads") or [])
-                logger.info("auto-discovery on %s: %d docs",
-                            base, len(body.get("documents", []) or []))
+                cmp_text = body.get("cmp_cookie_text") or ""
+                if cmp_text:
+                    disc_cookie_texts.append(cmp_text)
+                logger.info("auto-discovery on %s: %d docs, %d CMP payloads, "
+                            "cmp_cookie_text=%d words", base,
+                            len(body.get("documents", []) or []),
+                            len(body.get("cmp_payloads") or []),
+                            len(cmp_text.split()))
        except Exception as e:
            logger.warning("auto-discovery failed for %s: %s", base, e)

@@ -772,6 +858,19 @@ async def _autodiscover_missing(
        d = by_type.get(dt)
        if d:
            full = d.get("full_text") or d.get("text_preview") or ""
+            # For cookie: prefer the CMP-reconstructed text when it has
+            # more words than the auto-discovered DOM extraction.
+            # BMW homepage CMP yields ~1800 words of authoritative policy;
+            # DOM extraction typically yields ~600 words of site chrome.
+            if dt == "cookie" and disc_cookie_texts:
+                cmp_merged = "\n\n".join(disc_cookie_texts)
+                if len(cmp_merged.split()) > len(full.split()):
+                    logger.info(
+                        "cookie: using CMP-reconstructed text (%d words) "
+                        "instead of DOM (%d words)",
+                        len(cmp_merged.split()), len(full.split()),
+                    )
+                    full = cmp_merged
            if len(full.split()) >= 100:
                new_entry["text"] = full
                new_entry["url"] = d.get("url", "")
@@ -829,6 +928,7 @@ def _classify_discovered_doc(title: str, url: str) -> str | None:
 async def _check_single(
    text: str, doc_type: str, label: str, url: str,
    word_count: int, use_agent: bool,
+    business_scope: set[str] | None = None,
 ):
    """Run regex + MC checks on a single document."""
    from compliance.services.doc_checks.runner import check_document_completeness
@@ -862,6 +962,7 @@ async def _check_single(
        # (top-10 FAILs) so cost stays bounded.
        mc_results = await check_document_with_controls(
            text, doc_type, label, max_controls=0, use_agent=use_agent,
+            business_scope=business_scope,
        )
        if mc_results:
            for mc in mc_results: