feat: Browser-Matrix C2 + B11 AI-Retention + Impressum-Specialist-Agent + B1 Mobile Playwright

Task #15 Stage 1.c-e: Browser-Matrix backend integration
- _phase_c2_browser_matrix.py: calls consent-tester /scan-matrix when env BROWSER_MATRIX=true and fills state["browser_matrix"], state["browser_aggregate"] and state["browser_matrix_html"]
- V2 mail block: 🌐 Browser-Matrix table (Profile · Score · Sub-Scores PC/RR/BD · Rating) with worst-of header
- Orchestrator calls run_phase_c2 after run_phase_c

KNOWN: Stage 1.b (consent_scanner browser_profile param) remains deferred (file is in loc-exception, hook patch refused). The Stage 1.a shim runs in consent-tester; all profiles currently run on Chromium, and real engine diversity arrives with 1.b.

Task #17 TH-RETENTION-002 as B11 ai_retention_granularity_check:
- Detects AI-provider context (vertex/openai/anthropic/etc)
- Within a ±800-char window: checks for ≥2 data categories from the standard list (text inputs / IP / device / session / error log / timestamp)
- If there is 1 flat retention period + ≥2 categories but no per-category differentiation → LOW
- Smoke: Elli mock DSE (privacy policy) yields LOW "AI-Speicherdauer pauschal"

Task #18 Specialist-Agents Phase-1 prototype:
- compliance/services/specialist_agents/__init__.py with architecture docs
- impressum_agent.py: 9 mandatory disclosures under § 5 TMG + § 1 DL-InfoV as a pattern registry (name, email, phone, HR, USt-IdNr, authorised representative, supervisory authority, professional details, OS link)
- business_scope-aware (OS link only for ecommerce, supervisory authority only for regulated_profession/financial/insurance)
- Phase 1 is pattern-match-only (no LLM) and demonstrates the interface. Phase 2 replaces the patterns with system prompt + KB.
- Smoke: minimal Impressum correctly triggers 4 findings

Task #7 B1 Playwright mobile verification:
- consent-tester/services/mobile_reachability_scanner.py: real WebKit launch + p.devices['iPhone 15'] preset + de-DE locale + Europe/Berlin timezone
- Footer anchor search via locator("footer >> text=/.../i") for 13 reopen phrases
- Tap-target bounding-box measurement (Apple HIG / WCAG ≥44x44)
- Click behavior: DOM modal snapshot before/after, detects CMP open
- Output: has_anchor, anchor_text, tap_target_px, click_opens_cmp, engine_meta, screenshot_b64 (footer crop when no anchor is found)
- consent-tester/routes_mobile.py POST /scan-mobile-reachability
- Backend _b1_wiring extended: calls the mobile endpoint first, falls back to a static HTTP fetch. Mobile data enriches finding.mobile_playwright and bumps severity on tap-target<44 / click-doesnt-open-CMP.

KNOWN: WebKit system libs have been added to the Dockerfile (Stage 1.a commit) but only take effect after a CI/CD rebuild of consent-tester. Until then B1 falls back cleanly to the static fetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-06 22:20:25 +02:00
parent e1dadc8027
commit 37093ff9e3
11 changed files with 702 additions and 7 deletions
@@ -0,0 +1,116 @@
+"""B11: AI retention granularity check (TH-RETENTION-002).
+
+GDPR (DSGVO) Art. 13(2)(a) + DSK recommendation: each data category
+needs its own specific retention period. A flat statement such as
+"6 months for all data" is not sufficient.
+
+GT pattern Elli:
+  Vertex AI chatbot stores "IT and pseudonymised usage
+  data" for a flat 6 months. No graduation by
+  data category (text inputs / IP / device information /
+  session ID / error logs).
+
+Heuristic:
+  1. Detect AI context (vertex ai / openai / claude / etc.)
+  2. Within a ±800-char window, check:
+     - Is there a retention-period statement? (parse_duration_to_days)
+     - Are ≥2 data categories from the AI standard list named?
+       (text inputs, IP, device information, session, error logs)
+     - If 1 retention period + ≥2 categories but no
+       per-category differentiation → LOW
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+
+from .retention_comparator import parse_duration_to_days
+
+logger = logging.getLogger(__name__)
+
+
+_AI_PROVIDERS = (
+    "vertex ai", "google vertex", "openai", "gpt-3", "gpt-4", "chatgpt",
+    "anthropic", "claude.ai", "claude-3", "mistral ai",
+    "ki-assistent", "ki assistent", "ai assistant",
+)
+
+
+_AI_DATA_CATEGORIES = (
+    "texteingab",   # Texteingaben / Texteingabe
+    "chatverlauf", "chatverläuf",
+    "ip-adress",
+    "geräteinform", "geraeteinform", "device-info",
+    "session-id", "sitzungs-id",
+    "browserversion", "user-agent",
+    "fehlerprotokoll",
+    "zeitstempel",
+)
+
+
+def _per_category_phrases() -> tuple[str, ...]:
+    """Patterns indicating per-category retention is mentioned."""
+    return (
+        "pro datenkategorie",
+        "je datenkategorie",
+        "unterschiedlich je",
+        "abhängig vom datentyp",
+        "abhaengig vom datentyp",
+        "differenziert nach",
+        "pro kategorie",
+    )
+
+
+def check_ai_retention_granularity(state: dict) -> list[dict]:
+    doc_texts = state.get("doc_texts") or {}
+    dse = (doc_texts.get("dse") or "").lower()
+    if not dse:
+        return []
+    findings: list[dict] = []
+    for ai_kw in _AI_PROVIDERS:
+        idx = dse.find(ai_kw)
+        if idx < 0:
+            continue
+        window = dse[max(0, idx - 800): idx + 800]
+        if not window:
+            continue
+        categories_found = [c for c in _AI_DATA_CATEGORIES if c in window]
+        if len(categories_found) < 2:
+            continue
+        # Per-category retention phrase already present? then OK
+        if any(p in window for p in _per_category_phrases()):
+            return []
+        # Retention-claim in window? parse duration
+        m = re.search(
+            r"(\d+(?:[.,]\d+)?\s*(?:tage?|monat\w*|jahre?|"
+            r"day|month|year))", window,
+        )
+        if not m:
+            continue
+        days, kind = parse_duration_to_days(m.group(1))
+        if days is None:
+            continue
+        findings.append({
+            "check_id": "TH-RETENTION-GRANULARITY-001",
+            "severity": "LOW",
+            "severity_reason": "incomplete",
+            "title": (
+                "AI-Speicherdauer pauschal — pro Datenkategorie "
+                "differenzieren empfohlen"
+            ),
+            "norm": "DSGVO Art. 13 Abs. 2 lit. a + DSK-OH AI",
+            "ai_provider": ai_kw,
+            "retention_days": int(days),
+            "categories_detected": categories_found,
+            "action": (
+                f"Für '{ai_kw}'-Kontext separate Speicherdauern je "
+                f"Datenkategorie angeben (Texteingaben / IP / "
+                f"Geräteinformationen / Session). Aktuell pauschal "
+                f"{int(days)} Tage."
+            ),
+        })
+        break  # one per DSE is enough
+    if findings:
+        logger.info("B11 AI-retention-granularity: %d findings", len(findings))
+    return findings
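
Reviewer sketch (not part of the commit): the windowed heuristic above can be exercised in isolation roughly as follows. This is a minimal sketch under stated assumptions: `parse_duration_stub` is a hypothetical stand-in for `retention_comparator.parse_duration_to_days`, whose implementation is not shown in this diff, and the keyword tuples are shortened copies of the ones in the file. Unlike the committed code, the sketch skips a provider (rather than returning early) when a per-category phrase is found.

```python
from __future__ import annotations

import re

# Hypothetical stand-in for retention_comparator.parse_duration_to_days;
# the real helper is not part of this diff and may behave differently.
_UNIT_DAYS = {"tag": 1, "day": 1, "monat": 30, "month": 30,
              "jahr": 365, "year": 365}


def parse_duration_stub(text: str) -> int | None:
    """Convert strings like '6 monate' to an approximate number of days."""
    m = re.match(r"(\d+)\s*([a-z]+)", text.strip().lower())
    if not m:
        return None
    for prefix, factor in _UNIT_DAYS.items():
        if m.group(2).startswith(prefix):
            return int(m.group(1)) * factor
    return None


# Shortened copies of the keyword lists, for illustration only.
AI_KEYWORDS = ("vertex ai", "openai")
CATEGORIES = ("texteingab", "ip-adress", "session-id", "fehlerprotokoll")
PER_CATEGORY = ("pro datenkategorie", "je datenkategorie")
DURATION_RE = re.compile(r"\d+\s*(?:tage?|monat\w*|jahre?)")


def flat_retention_hits(dse: str, radius: int = 800) -> list[dict]:
    """Return one hit per AI keyword whose window shows a flat duration."""
    text = dse.lower()
    hits: list[dict] = []
    for kw in AI_KEYWORDS:
        idx = text.find(kw)
        if idx < 0:
            continue
        # Slice +-radius characters around the first keyword occurrence.
        window = text[max(0, idx - radius): idx + radius]
        cats = [c for c in CATEGORIES if c in window]
        if len(cats) < 2 or any(p in window for p in PER_CATEGORY):
            continue
        m = DURATION_RE.search(window)
        days = parse_duration_stub(m.group(0)) if m else None
        if days is not None:
            hits.append({"provider": kw, "days": days, "categories": cats})
    return hits


MOCK_DSE = (
    "Unser Chatbot nutzt Vertex AI. Verarbeitet werden Texteingaben, "
    "IP-Adresse und Session-ID. Die Daten werden 6 Monate gespeichert."
)
```

With the mock text, `flat_retention_hits(MOCK_DSE)` reports one hit for `vertex ai` with 180 days (under the stub's 30-day month) and three detected categories; appending a sentence containing "je Datenkategorie" suppresses it.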
@@ -44,8 +44,10 @@ def compose_v2(state: dict) -> str:
        state.get("vendor_consistency_html", ""),
        # B5 — AI-Act Art. 50 Transparenzpflicht
        state.get("ai_act_html", ""),
-        # B6/B7/B8 — DPO-cross-doc + Doc-Staleness + CMP-fingerprint
+        # B6/B7/B8/B9/B10 — DPO + Staleness + CMP + MultiEntity + Transfer
        state.get("extra_findings_html", ""),
+        # Browser-Matrix (Stage 1.c)
+        state.get("browser_matrix_html", ""),
        # All legacy build_*_html() wrapped in V2 sections — preserves
        # every information block from the old renderer (Exec Summary,
        # Banner-Screenshot, VVT, Redundancy, Solutions, Diff, etc.)
@@ -0,0 +1,17 @@
+"""Doc-type specialist agents: Phase 1 prototype.
+
+Architecture:
+  - One specialist agent per doc type with a system prompt (domain
+    knowledge) + knowledge base (anonymised patterns/statistics from
+    multi-tenant data)
+  - Each agent returns structured findings → enriched state
+  - A cross-doc router agent checks whether paragraphs are misfiled
+    ("cookie content sits in the AGB instead of the cookie policy")
+
+Phase 1: Impressum agent as prototype (pattern-match-only, no LLM).
+Phase 2: DSE agent + cross-doc router (LLM-backed).
+Phase 3+: Further doc types + continuous learning of the KB.
+
+Privacy: the KB NEVER contains raw tenant data. Anonymisation +
+aggregation are mandatory (NER masking before KB storage).
+"""
@@ -0,0 +1,159 @@
+"""Impressum specialist agent, Phase-1 prototype.
+
+Pattern-match-only (no LLM). Demonstrates the architecture:
+  - Knowledge base with the § 5 TMG/DDG mandatory disclosures
+  - Pattern library for detection
+  - Structured findings with norm + action
+
+Phase 2 will produce the same output, but LLM-backed with a
+domain-specific system prompt + cross-customer KB.
+
+Example KB entries:
+  - HR format DE: HR[BA] <number> <city>
+  - USt-IdNr format DE: DE\\d{9}
+  - List of supervisory authorities (by sector)
+  - DPO address format
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+
+logger = logging.getLogger(__name__)
+
+
+# Mandatory disclosures (Pflichtangaben) under § 5 TMG + § 1 DL-InfoV
+PFLICHTANGABEN = {
+    "name_anbieter": {
+        "label": "Name + Anschrift des Anbieters",
+        "norm": "§ 5 Abs. 1 Nr. 1 TMG",
+        "patterns": [
+            re.compile(r"\b(?:Anbieter|Diensteanbieter|"
+                       r"Verantwortlich(?:er Anbieter)?)\s*[:.\s]",
+                       re.IGNORECASE),
+        ],
+        "severity_if_missing": "HIGH",
+    },
+    "kontakt_email": {
+        "label": "Email-Adresse",
+        "norm": "§ 5 Abs. 1 Nr. 2 TMG",
+        "patterns": [
+            re.compile(r"\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b", re.IGNORECASE),
+        ],
+        "severity_if_missing": "HIGH",
+    },
+    "kontakt_telefon": {
+        "label": "Telefon",
+        "norm": "§ 5 Abs. 1 Nr. 2 TMG",
+        "patterns": [
+            re.compile(r"(?:Tel(?:efon)?|Phone)\.?\s*[:.\s]\s*[\+\d][\d\s/\-()]{5,}",
+                       re.IGNORECASE),
+        ],
+        "severity_if_missing": "MEDIUM",
+    },
+    "handelsregister": {
+        "label": "Handelsregister-Eintrag",
+        "norm": "§ 5 Abs. 1 Nr. 4 TMG",
+        "patterns": [
+            re.compile(r"\bHR[BA]\s+\d", re.IGNORECASE),
+            re.compile(r"Handelsregister", re.IGNORECASE),
+        ],
+        "severity_if_missing": "HIGH",
+    },
+    "ust_id": {
+        "label": "USt-IdNr",
+        "norm": "§ 5 Abs. 1 Nr. 6 TMG",
+        "patterns": [
+            re.compile(r"\b(?:USt-?Id(?:Nr)?\.?|VAT(?:-?Id)?)\s*[:.\s]",
+                       re.IGNORECASE),
+            re.compile(r"\bDE\d{9}\b"),
+        ],
+        "severity_if_missing": "MEDIUM",
+    },
+    "vertretungsberechtigte": {
+        "label": "Vertretungsberechtigte Person",
+        "norm": "§ 5 Abs. 1 Nr. 1 TMG (juristische Personen)",
+        "patterns": [
+            re.compile(r"(?:Geschäftsführer|Vertretungsberechtigt|"
+                       r"vertreten\s+durch)\s*[:.\s]",
+                       re.IGNORECASE),
+        ],
+        "severity_if_missing": "HIGH",
+    },
+    "aufsichtsbehoerde": {
+        "label": "Aufsichtsbehörde (regulierte Branchen)",
+        "norm": "§ 5 Abs. 1 Nr. 3 TMG (Branchen-bedingt)",
+        "patterns": [
+            re.compile(r"Aufsichtsbeh(?:ö|oe)rde\s*[:.\s]", re.IGNORECASE),
+            re.compile(r"\bBAFin\b|\bBNetzA\b|\bLKA\b", re.IGNORECASE),
+        ],
+        "severity_if_missing": "LOW",
+    },
+    "berufsangaben": {
+        "label": "Berufsbezeichnung + Berufsrechtliche Angaben",
+        "norm": "§ 5 Abs. 1 Nr. 5 TMG (Kammerberufe)",
+        "patterns": [
+            re.compile(r"Berufsbezeichnung|Berufsordnung|Kammer",
+                       re.IGNORECASE),
+        ],
+        "severity_if_missing": "LOW",
+    },
+    "odr_link": {
+        "label": "OS-Link auf EU-Plattform",
+        "norm": "Art. 14 EU-VO 524/2013 (B2C-Onlineshops)",
+        "patterns": [
+            re.compile(r"ec\.europa\.eu/consumers/odr", re.IGNORECASE),
+        ],
+        "severity_if_missing": "MEDIUM",
+    },
+}
+
+
+def evaluate(impressum_text: str,
+             business_scope: set[str] | None = None) -> list[dict]:
+    """Run Impressum-Agent against the doc text.
+
+    Returns a list of finding dicts; empty when all Pflichtangaben
+    present. `business_scope` controls which optional checks run
+    (e.g. OS-Link only for B2C ecommerce).
+    """
+    if not impressum_text:
+        return []
+    business_scope = business_scope or set()
+    findings: list[dict] = []
+    for field_id, spec in PFLICHTANGABEN.items():
+        # Skip context-dependent fields when scope doesn't match
+        if field_id == "odr_link" and "ecommerce" not in business_scope:
+            continue
+        if field_id == "aufsichtsbehoerde" and (
+            "regulated_profession" not in business_scope
+            and "financial_services" not in business_scope
+            and "insurance" not in business_scope
+        ):
+            continue
+        if field_id == "berufsangaben" and (
+            "regulated_profession" not in business_scope
+        ):
+            continue
+        found = any(p.search(impressum_text) for p in spec["patterns"])
+        if found:
+            continue
+        findings.append({
+            "check_id": f"IMPRESSUM-AGENT-{field_id.upper()}",
+            "agent": "impressum_agent_v1",
+            "field_id": field_id,
+            "severity": spec["severity_if_missing"],
+            "severity_reason": "missing",
+            "title": f"Pflichtangabe '{spec['label']}' fehlt im Impressum",
+            "norm": spec["norm"],
+            "action": (
+                f"{spec['label']} im Impressum ergänzen "
+                f"(Pflichtangabe nach {spec['norm']})."
+            ),
+        })
+    if findings:
+        logger.info(
+            "impressum_agent: %d findings (kein LLM, KB v1)", len(findings),
+        )
+    return findings
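
Reviewer sketch (not part of the commit): the scope-gated registry lookup in `evaluate` can be reduced to the following pattern. This is an illustrative sketch, assuming a hypothetical `scopes` key per entry in place of the hard-coded `if field_id == ...` branches; the two regexes are copied from `PFLICHTANGABEN`, while `REGISTRY`, `missing_fields` and `SAMPLE` are invented names.

```python
from __future__ import annotations

import re

# Two entries copied from PFLICHTANGABEN; the "scopes" key is a
# hypothetical addition that replaces the per-field if-branches.
REGISTRY = {
    "kontakt_email": {
        "pattern": re.compile(r"\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b",
                              re.IGNORECASE),
        "severity": "HIGH",
        "scopes": None,  # None means: always checked
    },
    "odr_link": {
        "pattern": re.compile(r"ec\.europa\.eu/consumers/odr",
                              re.IGNORECASE),
        "severity": "MEDIUM",
        "scopes": {"ecommerce"},  # only checked for these scopes
    },
}


def missing_fields(text: str,
                   business_scope: set[str] | None = None
                   ) -> list[tuple[str, str]]:
    """Return (field_id, severity) for every applicable field not found."""
    scope = business_scope or set()
    missing: list[tuple[str, str]] = []
    for field_id, spec in REGISTRY.items():
        required = spec["scopes"]
        # Skip scope-dependent fields when no required scope is active.
        if required is not None and not (required & scope):
            continue
        if not spec["pattern"].search(text):
            missing.append((field_id, spec["severity"]))
    return missing


SAMPLE = "Example GmbH, Musterstr. 1, 12345 Berlin. Kontakt: info@example.de"
```

Calling `missing_fields(SAMPLE)` returns an empty list because the OS-link check is skipped without the `ecommerce` scope, while `missing_fields(SAMPLE, {"ecommerce"})` reports the missing `odr_link`. Keeping the scope condition in the registry entry is one possible design; the committed code keeps it in `evaluate` instead.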