fix: 4 bugs in one commit: B22 PDF + B17 walk fallback + company_name + plausibility fallback

(1) B22 cross-domain (fix #59): The Elli test did NOT find the terms and conditions (AGB) on logpay.de even though the URL in doc_entries was correct. Suspected cause: discovery phase A drops or overwrites the original URL when the PDF fetch fails (word_count=0). Fix: _collect_audit_urls() now iterates over state.doc_entries + rejected_url + req.documents, because cross-domain hosting is independent of the text content. Also adds trace logging for future diagnosis. Dedup by (doc_type, host_sld).

(2) B17 audit-walk fail fallback (fix #60): BMW v5 had audit_walk=None with no notice in the mail. Probably a 180s timeout during the OneTrust CMP banner tour. Fix: timeout 180s → 300s. In addition, on failure a notice stub with the error reason is written to state["audit_walk"] together with an HTML block, so the reviewer sees the failure instead of a silent skip.

(3) company_name + origin_domain in the backend (fix #61): The frontend has been sending these two fields since ec03317, but the backend ignored them. Fix: the ComplianceCheckRequest schema is extended with company_name + origin_domain. phase_e_email prioritizes user input over the URL heuristic for site_name. If origin_domain is set and no domain can be derived from doc_entries, the user input is used as domain.

(4) Plausibility LLM fallback model (fix #62): qwen3:30b-a3b frequently returns empty format='json' responses on large DSEs (privacy policies; BMW: 122 FAIL). The circuit breaker kicked in, but the phase remained useless. Fix: default model switched to qwen2.5:7b (4× smaller, more reliable with format=json, sufficient reasoning for PASS/MODIFY/DROP classification). Also introduces Strategy-C: a fallback model (llama3.2:3b) used when the primary stays empty. BATCH_SIZE 4 → 3. ENV switches PLAUSIBILITY_LLM_MODEL + PLAUSIBILITY_FALLBACK_MODEL for tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
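
Item (1) mentions deduplication by (doc_type, host_sld), but the hunks shown on this page do not include _collect_audit_urls(). The following is a minimal sketch of how such a dedup could look, not the committed implementation. The entry shape (dicts with "doc_type" and "url" keys), the helper names host_sld and collect_audit_urls, and the example URLs are assumptions made for illustration. The host_sld helper naively takes the last two hostname labels, which would be wrong for multi-label public suffixes such as .co.uk.

```python
from urllib.parse import urlparse


def host_sld(url: str) -> str:
    """Naive registrable domain: the last two labels of the hostname.

    Assumption: sufficient for hosts like www.example.org; a public-suffix
    list would be needed for suffixes such as .co.uk.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = [part for part in host.split(".") if part]
    return ".".join(labels[-2:])


def collect_audit_urls(*sources: list) -> list:
    """Merge entry lists, keeping the first entry per (doc_type, host_sld)."""
    seen = set()
    merged = []
    for source in sources:
        for entry in source:
            url = entry.get("url") or ""
            if not url:
                continue
            key = (entry.get("doc_type", ""), host_sld(url))
            if key in seen:
                continue
            seen.add(key)
            merged.append(entry)
    return merged


# Hypothetical inputs: a privacy policy on the main site and terms hosted
# on a different domain, which also shows up in a second source list.
doc_entries = [
    {"doc_type": "dse", "url": "https://www.example.com/privacy"},
    {"doc_type": "agb", "url": "https://www.example.org/terms.pdf"},
]
rejected = [{"doc_type": "agb", "url": "https://example.org/terms.pdf"}]
urls = collect_audit_urls(doc_entries, rejected)
```

Keying on the document type together with the registrable domain keeps a cross-domain document in the result even when its fetch produced no text, while the www and non-www variants of the same host collapse into one entry.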
2026-06-08 16:39:33 +02:00
parent ec03317170
commit d6b8bf87c2
5 changed files with 138 additions and 35 deletions
@@ -57,20 +57,46 @@ async def run_b17(state: dict) -> None:
        return

    walk: dict = {}
+    walk_error: str | None = None
    try:
-        async with httpx.AsyncClient(timeout=180.0) as c:
+        async with httpx.AsyncClient(timeout=300.0) as c:
            r = await c.post(
                f"{CONSENT_TESTER_URL}/scan-audit-walk",
                json={"url": homepage, "dwell_s": 4.0, "max_links": 8},
-                timeout=180.0,
+                timeout=300.0,
            )
            if r.status_code == 200:
                walk = r.json()
+            else:
+                walk_error = f"consent-tester HTTP {r.status_code}"
    except Exception as e:
-        logger.warning("B17 audit-walk request failed: %s", e)
-        return
+        walk_error = f"{type(e).__name__}: {str(e)[:120]}"
+        logger.warning("B17 audit-walk request failed: %s", walk_error)

    if not walk or not walk.get("walk_id"):
+        # Fallback stub so the audit report carries a notice
+        # instead of "audit_walk: None". The reviewer sees the failure.
+        state["audit_walk"] = {
+            "walk_id": "",
+            "url": homepage,
+            "video": {},
+            "actions": [],
+            "annotations": [],
+            "error": walk_error or "unknown (no walk_id returned)",
+        }
+        state["audit_walk_html"] = (
+            "<div style='margin:24px 0;padding:16px;border-left:4px solid #f59e0b;"
+            "background:#fef3c7;border-radius:4px;'>"
+            "<h2 style='margin:0 0 8px;color:#92400e;font-size:16px;'>"
+            "⚠️ Audit-Walk konnte nicht aufgezeichnet werden"
+            "</h2>"
+            f"<p style='margin:0;font-size:13px;color:#92400e;'>"
+            f"Site: <code>{homepage}</code> · Ursache: "
+            f"<code>{walk_error or 'unknown'}</code>. Mögliche "
+            "Gründe: komplexes CMP-Banner (lange Tour-Zeit), Anti-Bot-"
+            "Protection, oder consent-tester überlastet.</p>"
+            "</div>"
+        )
        return

    # Stage 5: annotated screenshots per finding. Sends the
@@ -36,7 +36,17 @@ def run_phase_e(state: dict) -> None:
    doc_count = len([r for r in results if not r.error])
    url_company = _company_name_from_url(doc_entries)
    domain = _extract_domain(doc_entries)
-    site_name = url_company or domain or "Unbekannt"
+    # Priority: user input (req.company_name) > URL heuristic > "Unbekannt"
+    req_company = (getattr(req, "company_name", None) or "").strip()
+    req_domain = (getattr(req, "origin_domain", None) or "").strip()
+    site_name = req_company or url_company or domain or "Unbekannt"
+    if req_domain and not domain:
+        # If no domain could be derived from the URLs, use the user input
+        from urllib.parse import urlparse
+        try:
+            domain = urlparse(req_domain).netloc.removeprefix("www.") or req_domain
+        except Exception:
+            domain = req_domain
    _update(check_id, "E-Mail wird versendet...", 98)

    # A1: bundle cookie-evidence slices into a ZIP attachment so the
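
Normalizing a user-supplied origin_domain has two pitfalls that are easy to miss: urlparse() leaves netloc empty when the input has no scheme (a bare "example.com" is parsed as a path), and str.lstrip("www.") removes any leading run of the characters "w" and ".", not the literal prefix. The sketch below illustrates both under stated assumptions. normalize_domain is a hypothetical helper, not a function from this repository, and str.removeprefix requires Python 3.9 or newer.

```python
from urllib.parse import urlparse


def normalize_domain(raw: str) -> str:
    """Return a bare host for input such as 'https://www.example.com/x'."""
    raw = raw.strip()
    if not raw:
        return ""
    # urlparse() only fills netloc when the input has a scheme or starts
    # with "//", so prepend "//" to bare hosts.
    candidate = raw if "//" in raw else f"//{raw}"
    try:
        host = urlparse(candidate).hostname or ""
    except ValueError:
        # Malformed input (e.g. unbalanced IPv6 brackets): keep it as given.
        return raw
    return host.removeprefix("www.")


# lstrip() treats its argument as a set of characters, not as a prefix.
stripped_chars = "web.example.com".lstrip("www.")
stripped_prefix = "web.example.com".removeprefix("www.")
```

With this helper, both "https://www.example.com/path" and "www.example.com" reduce to "example.com", while a host that merely starts with "w", such as "web.example.com", is left intact by removeprefix.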
@@ -28,6 +28,11 @@ class ComplianceCheckRequest(BaseModel):
    # legal form, corporate group, employees, special-category data, third country). Persisted in the
    # snapshot and filters the MC evaluation (P72).
    scan_context: dict | None = None
+    # Company + origin domain entered in the frontend. Takes priority over
+    # LLM extracted_profile inference. If empty: fall back to the heuristic
+    # from URL domains and DSE text.
+    company_name: str | None = None
+    origin_domain: str | None = None


 class ComplianceCheckStartResponse(BaseModel):
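
The plausibility changes from item (4) of the commit message (primary model, Strategy-C fallback, BATCH_SIZE 3) do not appear in the hunks shown here. As a rough sketch of the described behavior, and not the committed code, the fallback could be structured as below. The environment variable names, model defaults, and batch size come from the commit message. The function names, the injected generate callable with signature (model, prompt) -> raw text, and the treatment of an empty JSON container as "empty" are assumptions.

```python
import json
import os
from typing import Callable, Optional, Tuple

# Names and defaults as stated in the commit message.
PRIMARY_MODEL = os.environ.get("PLAUSIBILITY_LLM_MODEL", "qwen2.5:7b")
FALLBACK_MODEL = os.environ.get("PLAUSIBILITY_FALLBACK_MODEL", "llama3.2:3b")
BATCH_SIZE = 3

# Hypothetical signature of the LLM call: (model, prompt) -> raw response text.
Generate = Callable[[str, str], str]


def parse_json_or_none(raw: Optional[str]):
    """Decode JSON; treat empty, invalid, or empty-container output as None."""
    try:
        data = json.loads(raw or "")
    except json.JSONDecodeError:
        return None
    return data or None


def classify_batch(prompt: str, generate: Generate) -> Tuple[Optional[str], object]:
    """Try the primary model, then the fallback if the result is empty."""
    for model in (PRIMARY_MODEL, FALLBACK_MODEL):
        data = parse_json_or_none(generate(model, prompt))
        if data is not None:
            return model, data
    return None, None


def batches(items: list, size: int = BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_generate(model: str, prompt: str) -> str:
    """Stub for illustration: the primary returns nothing, the fallback answers."""
    return "" if model == PRIMARY_MODEL else '{"verdict": "PASS"}'


used_model, result = classify_batch("classify these findings", fake_generate)
```

Injecting the generate callable keeps the fallback logic testable without a running model server; in a real setup it would wrap whatever client the backend uses to call the models.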