fix: 4 bugs in one commit: B22 PDF + B17 walk fallback + company_name + plausibility fallback

(1) B22 cross-domain (fix #59): The Elli test did NOT find the T&Cs (AGB) on logpay.de even though the URL in doc_entries was correct. Suspected cause: Discovery phase A drops or overwrites the original URL when the PDF fetch fails (word_count=0). Fix: _collect_audit_urls() iterates over state.doc_entries + rejected_url + req.documents, because cross-domain hosting is independent of the text content. Also adds trace logging for future diagnosis. Dedup per (doc_type, host_sld).

(2) B17 audit-walk failure fallback (fix #60): BMW v5 had audit_walk=None with no note in the mail. Probably the 180s timeout was hit during the OneTrust CMP banner tour. Fix: timeout 180s → 300s. In addition, on failure a notice stub with the error reason is written to state["audit_walk"] and to the HTML block, so the reviewer sees the failure instead of a silent skip.

(3) company_name + origin_domain in the backend (fix #61): The frontend has been sending these two fields since ec03317, but the backend ignored them. Fix: ComplianceCheckRequest schema extended with company_name + origin_domain. phase_e_email prioritizes user input over the URL heuristic for site_name. If origin_domain is set and no domain can be derived from doc_entries, the user input is used as the domain.

(4) Plausibility LLM fallback model (fix #62): qwen3:30b-a3b frequently returns empty format='json' responses on large privacy policies (DSEs; BMW 122 FAIL). The circuit breaker tripped, but the phase still produced nothing useful. Fix: default model switched to qwen2.5:7b (4× smaller, more reliable with format=json, sufficient reasoning for PASS/MODIFY/DROP classification). Also introduces Strategy-C: a fallback model (llama3.2:3b) is used when the primary response stays empty. BATCH_SIZE 4 → 3. ENV switches PLAUSIBILITY_LLM_MODEL + PLAUSIBILITY_FALLBACK_MODEL for tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
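
The primary/fallback behaviour described in (4) is not part of the hunk shown here, so the following is only a minimal sketch of how such a fallback could look. The model names and ENV switches come from the message above; `classify_with_fallback`, `_parse_json` and the injected `call_model` callable are hypothetical names chosen for illustration and are not taken from the repository.

```python
import json
import os
from typing import Callable, Optional

# Defaults and ENV switch names as stated in the commit message.
PRIMARY_MODEL = os.environ.get("PLAUSIBILITY_LLM_MODEL", "qwen2.5:7b")
FALLBACK_MODEL = os.environ.get("PLAUSIBILITY_FALLBACK_MODEL", "llama3.2:3b")


def _parse_json(raw: str) -> Optional[dict]:
    """Return the parsed object, or None for empty, invalid or non-dict output."""
    if not raw or not raw.strip():
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) and data else None


def classify_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> raw text
    primary: str = PRIMARY_MODEL,
    fallback: str = FALLBACK_MODEL,
) -> tuple[Optional[dict], Optional[str]]:
    """Try the primary model first, then the fallback.

    Returns (result, model_used), or (None, None) if both stay empty.
    """
    for model in (primary, fallback):
        parsed = _parse_json(call_model(model, prompt))
        if parsed is not None:
            return parsed, model
    return None, None
```

Passing the model call in as a parameter keeps the sketch independent of any particular LLM client. Exceptions raised by `call_model` are deliberately not caught here; a real implementation would presumably route them through the existing circuit breaker.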
2026-06-08 16:39:33 +02:00
parent ec03317170
commit d6b8bf87c2
5 changed files with 138 additions and 35 deletions
@@ -87,17 +87,52 @@ def _site_origin_sld(state: dict) -> str:
    return max(counter, key=counter.get)


-def check_cross_domain_docs(state: dict) -> list[dict]:
-    """Emit findings for doc_entries whose URL has a different SLD
-    than the site origin."""
-    primary = _site_origin_sld(state)
-    if not primary:
-        return []
-    findings: list[dict] = []
+def _collect_audit_urls(state: dict) -> list[tuple[str, str]]:
+    """Collect (doc_type, url) from BOTH sources: state.doc_entries
+    (after discovery) AND req.documents (the user's original input).
+    Discovery can lose original URLs (PDF fetch failure, auto-reclassify),
+    but cross-domain hosting is legally independent of the file's text
+    content.
+    """
+    seen: set[tuple[str, str]] = set()
+    out: list[tuple[str, str]] = []
    for e in (state.get("doc_entries") or []):
        url = (e.get("url") or "").strip()
        doc_type = (e.get("doc_type") or "").lower()
-        if not url or "://" not in url:
+        if url and doc_type and (doc_type, url) not in seen:
+            seen.add((doc_type, url))
+            out.append((doc_type, url))
+        # rejected_url is the original URL that discovery rejected
+        rej = (e.get("rejected_url") or "").strip()
+        if rej and doc_type and (doc_type, rej) not in seen:
+            seen.add((doc_type, rej))
+            out.append((doc_type, rej))
+    # Fallback: req.documents, which the user entered explicitly
+    req = state.get("req")
+    if req is not None:
+        for d in getattr(req, "documents", []) or []:
+            url = (getattr(d, "url", "") or "").strip()
+            doc_type = (getattr(d, "doc_type", "") or "").lower()
+            if url and doc_type and (doc_type, url) not in seen:
+                seen.add((doc_type, url))
+                out.append((doc_type, url))
+    return out
+
+
+def check_cross_domain_docs(state: dict) -> list[dict]:
+    """Emit findings for doc-URLs whose host has a different SLD
+    than the site origin."""
+    primary = _site_origin_sld(state)
+    if not primary:
+        logger.info("B22 cross-domain: no primary SLD could be determined")
+        return []
+    findings: list[dict] = []
+    audit_urls = _collect_audit_urls(state)
+    logger.info("B22 cross-domain: primary=%s, checking %d URL(s)",
+                primary, len(audit_urls))
+    emitted_keys: set[tuple[str, str]] = set()
+    for doc_type, url in audit_urls:
+        if "://" not in url:
            continue
        try:
            host = urlparse(url).netloc
@@ -106,6 +141,12 @@ def check_cross_domain_docs(state: dict) -> list[dict]:
            continue
        if not url_sld or url_sld == primary:
            continue
+        # Dedup per (doc_type, host_sld) so that rejected_url + url are
+        # not reported twice
+        e_key = (doc_type, url_sld)
+        if e_key in emitted_keys:
+            continue
+        emitted_keys.add(e_key)
        # Cross-Domain detected
        severity = _SEVERITY_BY_DOC.get(doc_type, "MEDIUM")
        doc_label = {