breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	3f90e40807	fix(browser-matrix): tracking signal instead of raw cookie count + matrix fast path. Correctness (§ 25 TDDDG): "cookies before consent" is NOT a violation per se: technically necessary cookies, including the consent cookie (which stores the rejection), are permitted under para. 2. Only non-essential TRACKING before consent is a violation. - browser_cross_finding: the finding now depends on violations.before_consent (tracking), not on the raw cookie count; § 25 para. 2 note in the detail. Regression test: cookies without tracking → NO finding. - multi_browser_scanner._extract_dimensions: the score uses tracking violations + the reject_respected verdict instead of the raw count (fallback retained). - BrowserBehaviorView: "cookies before consent" is shown red/⚠ only for tracking, "after reject" is neutral (verdict = reject column); explanatory line added. Speed: in matrix mode (browser_profile set), run_consent_test skips the expensive phases C/D-F/G; only A+B are needed. This prevents the 504 in the multi-engine scan (BMW with 4 engines otherwise ran into the 338s gateway timeout). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-13 00:10:41 +02:00
Benjamin Admin	75d42a834b	fix(consent-tester): playwright install-deps: Firefox/WebKit were missing OS libs. E2E on BMW (macmini, arm64) showed that only Chromium ran; Firefox/WebKit/Mobile Safari failed with "Host system is missing dependencies to run browsers". The manually maintained apt lib list was incomplete for Gecko/WebKit. `playwright install-deps chromium firefox webkit` (as root) installs the complete set of OS dependencies → all engines start. Affects both architectures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 23:51:17 +02:00
Benjamin Admin	9587726936	feat(admin): "Browser-Verhalten" (browser behavior) tab: per-browser matrix + screenshots (Phase 3) - BrowserBehaviorView: loads the stored matrix (GET), otherwise offers "Start browser test" (POST run, live run). Per-browser table (cookies before consent / after reject / reject respected / surface / score), engine detail with banner screenshot + surface findings, mobile badge, "not available" rows for missing browsers (arm64 dev). - Proxies browser-behavior (GET) + browser-behavior/run (POST, long timeout). - page.tsx: "Browser-Verhalten" tab (visible as soon as the snapshot contains a scannable URL). - consent-tester scan_matrix_summary: banner_findings per engine in the summary (text/severity/norm) → surface findings in the tab. - tsc strict clean; Vitest BrowserBehaviorView (2). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 23:15:06 +02:00
Benjamin Admin	c7fde93061	feat(backend): on-demand browser behavior matrix + snapshot persistence (Phase 2) - check_snapshot: update_browser_matrix/load_browser_matrix, migration-free in banner_result.browser_matrix (JSONB jsonb_set, own scanned_at) - snapshot_check_routes: POST /snapshots/{id}/browser-behavior/run runs /scan-matrix LIVE (re-crawl per engine, measurable only live) and persists the result; GET /snapshots/{id}/browser-behavior returns the stored matrix without a re-crawl. Profile set = 4 default engines + Brave/Chrome/Edge. - consent-tester multi_browser_scanner: Semaphore(2) against OOM (7 browsers in parallel blew the 2g mem_limit) - Pydantic model with Optional[List[...]] (not `\| None`) → Py3.9-safe - Tests: _snapshot_scan_url + request defaults (5) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 23:03:28 +02:00
Benjamin Admin	7c0126f2ef	feat(consent-tester): Brave + Chrome/Edge channels in the image (amd64-gated, Phase 1.3) - Dockerfile: Brave apt repo + `playwright install --with-deps chrome msedge`, both behind a TARGETARCH=amd64 gate and best-effort (\|\| echo) → arm64 dev builds (macmini) do NOT break and run with the 4 default engines; Brave/Chrome/Edge are amd64-only opt-in extras (EXTRA_PROFILES). - docker-compose.hetzner.yml: consent-tester on linux/amd64 (instead of arm64 emulation on Orca), a prerequisite for the real browsers being installed at all. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 22:52:49 +02:00
Benjamin Admin	881e9c28de	feat(consent-tester): real /scan-matrix: profile per engine + per-engine summary (Phase 1.2) - _scanner_run passes browser_profile through to run_consent_test (instead of the single-Chromium shim) - new scan_matrix_summary.matrix_scan_dict: ConsentTestResult -> lean matrix dict form (phases for _extract_dimensions + compact `summary`: cookies_before_consent/after_reject, reject_respected heuristic [no violations AND no new tracker], surface, screenshot) - multi_browser_scanner._run_one lifts summary + engine + is_mobile onto the row and discards the full cookie lists (keeps JSONB persistence lean) - consent_scanner: _ctx_base with mobile device emulation (iPhone profile -> real mobile viewport/touch), all 5 new_context calls on **_ctx_base - Tests: test_scan_matrix_summary (6) incl. the _extract_dimensions contract Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 22:46:42 +02:00
Benjamin Admin	c816827720	feat(consent-tester): browser_profile param: real engine selection in the scan (Phase 1.1) run_consent_test now accepts browser_profile (browser_profiles.py): Firefox/Gecko, WebKit/Safari or Blink (Chromium default / Chrome or Edge channel / Brave via executable_path). Backward compatible: None → Chromium as before. Foundation for the real /scan-matrix (Stage 1.b shim), which will pass profiles through next. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 22:21:20 +02:00
Benjamin Admin	11740bd2f9	feat(consent-tester): 4 more edge cases: Consent-or-Pay, Consent Mode, CNAME cloaking, returning user. #4 Consent-or-Pay (EDPB Opinion 08/2024): banner text signature (Pur-Abo/"zustimmen oder bezahlen" + consent context) → MEDIUM finding "legally disputed, review separately". #5 Google Consent Mode v2: page.evaluate (dataLayer consent events / inline gtag('consent')) → MEDIUM "is NOT valid consent". #6 CNAME cloaking: resolve first-party subdomains via socket.gethostbyname_ex and check the CNAME chain against known tracker infrastructure (Eulerian/Adobe/Webtrekk/…) → HIGH "effectively third-party despite first-party appearance". Best-effort, short timeouts. #7 Returning user: the scanner uses fresh browser contexts by design → note in the no-banner finding (a missing banner is not due to remembered consent). Tests + py_compile green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 20:45:20 +02:00
Benjamin Admin	2b928dcb33	fix(consent-tester): edge-case findings also in the no-banner early return. #1/#2 (no-banner affirmative) did not fire because the no-banner path returns early at line 220, before the edge-case block at the end of the function. The logic was extracted into _apply_edge_case_findings and is called on BOTH return paths (early return + end). As a result, #1 now also applies on static pages. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 19:55:42 +02:00
Benjamin Admin	c2422138e6	feat(consent-tester): 3 edge cases: no-banner compliant, geo caveat, non-cookie tracking. #1/#2: if NO banner is detected AND there is no tracking before consent (static page or only technically necessary cookies, §25 para. 2 TDDDG) → affirmative LOW finding "compliant, no banner needed" instead of a silent "banner missing". Incl. geo caveat (a scan from outside the EU may not see geo-targeted banners). #3: detect_non_cookie_tracking detects pixels/fingerprinting by domain signature (Meta, TikTok, LinkedIn, Pinterest, Clarity, FingerprintJS, Hotjar, Reddit, Snapchat) → MEDIUM finding "§25/Art. 5(3) also applies without cookies". '0 cookies' ≠ 'no tracking requiring consent'. Wired into consent_scanner before the return. Tests + py_compile green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 19:49:55 +02:00
Benjamin Admin	d8a9e3049d	feat(consent-tester): detect cookieless opt-out instead of false HIGHs. Cookie-free analytics with a pure opt-out notice (e.g. bayshore.ai: "Privacy-friendly, cookie-free analytics are currently enabled ... Disable") is NOT a consent banner: cookieless = no access to the terminal device → §25 TDDDG requires no consent → opt-out instead of opt-in. The standard opt-in checks (granular categories, accept/reject balance, Impressum in banner) did not apply and produced 3 false HIGHs. is_cookieless_optout() detects the pattern (cookieless signal + opt-out word, NO consent signal); check_banner_text then returns ONE detailed LOW explanatory finding early (does not count as HIGH) and suspends the opt-in checks. The finding is detailed because the case is extremely atypical. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-12 19:27:12 +02:00
Benjamin Admin	08c08fcba2	feat(crawl): completeness: Shadow DOM/hidden links + interaction fixpoint + Wayback CDX orphans. So that the specialist agents work on complete website content: A: _find_dsi_links now pierces Shadow DOM (web components such as Usercentrics/Mercedes) recursively; hidden (display:none) links are captured + flagged as coverage metadata. B: _expand_to_fixpoint expands accordions/tabs/hover menus in a loop until the DOM is stable (instead of 1 pass); extended selectors; coverage telemetry (rounds, expanded elements, DOM growth, shadow/hidden links) → response + backend log. C: legacy_url_cdx.cdx_enumerate lists, via the Wayback CDX API, ALL URLs of the domain that were ever archived → finds orphan/legacy pages that were never in the slug grid (e.g. a /datenschutz page that is no longer linked but still reachable by direct URL). Flows through the existing legacy URL inventory. 
Tests: test_legacy_url_cdx.py (6) + consent-tester/tests/test_dsi_discovery.py (pure helper + real-browser integration). All green, LOC gate green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 12:33:34 +02:00
Benjamin Admin	b16130369a	feat(b17): Stage 4 banner tour + Stage 5 annotated screenshots + V2 default. Stage 4: cookie banner tour before the accept click: - audit_walk_banner_tour.tour_cookie_banner(): opens settings (16 phrase variants), scrolls vertically, activates every [role=tab], expands every [aria-expanded=false] / details / summary + 14 CMP-specific selectors. Max 35 clicks, best-effort. - audit_walk_recorder calls tour_cookie_banner() BEFORE _try_accept_banner, so the reviewer sees the full consent catalog in the video (vendor list, categories, purposes). - Recorder under 500 LOC (412+155 split). Stage 5: annotated screenshots per finding: - finding_annotator.annotate_url(): WebKit headless, JS injection of a red banner label at the top + red outline around the element (selector or text match). - finding_annotator.annotate_findings(): dispatches 3 cases: B1 tap target (anchor marked with "Tap-Target X×Y px"), B16 URL slug drift (404 page with "/<slug> 404"), B13 withdrawal (footer marked "Widerruf-Link fehlt", i.e. withdrawal link missing). - routes_audit_walk.POST /annotate-findings (consent-tester). - _b17_wiring calls annotate-findings after record_audit_walk and stores the annotations in walk.annotations. - audit_walk_zip_builder packs the PNGs to findings/<name>.png in the ZIP, so the reviewer has evidence images in the inbox. Plausibility circuit breaker: - After 6 consecutive empty batches (PLAUSIBILITY_EMPTY_BUDGET=6) the whole phase aborts instead of waiting for 200 calls. Fix for qwen3-down + large DSE sites (BMW: 21min without the breaker, ~3min with it). audit_walk_zip_builder picks up walk.annotations and stores them under findings/<fname>.png in the ZIP attachment. V2 default: - docker-compose.yml backend-compliance.environment.MAIL_RENDER_V2: default 'true'. Without this override the engine still delivers the old legacy mail layout, in which the B-wiring blocks are not visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 20:44:42 +02:00
Benjamin Admin	c7d2038ad9	feat(b17): DSMS CID anchor for the audit-walk video (Stage 3, #7) After recording, the video + walk.json are uploaded to DSMS-IPFS. The returned CIDs are tamper-proof audit anchors: reviewers can still verify the walk video months later and check that it is unchanged. consent-tester: - _upload_to_dsms(): best-effort upload to /api/v1/documents (bearer token, document_type=audit_walk_video\|meta). DSMS being down does not abort the walk; the CID is simply missing from the result. - record_audit_walk(): once video.webm + walk.json have been created, both are uploaded. walk.json is rewritten so that it contains BOTH CIDs self-referentially. - ENV: DSMS_GATEWAY_URL + DSMS_BEARER configurable. backend: - _b17_wiring._publicize_gateway_url(): DSMS internally returns http://dsms-node:8080/ipfs/{cid}. For the audit mail this is replaced, via env DSMS_PUBLIC_GATEWAY (default https://dsms-dev.breakpilot.ai), with an externally reachable URL. - Render block: yellow DSMS anchor notice with video CID + walk.json CID, both as clickable links to the public gateway. Real-world smoke test against Elli: - Video CID: QmbdFwtSymPuWGYYdC6eNZ1eEvVLsTYmoRRxEo5L6BXgwt - walk.json CID: QmWaTqwZq4KVd5wYFVAKB12uZtAosPqoG1X4m1azysXYJi - DSMS upload successful, gateway_url in the response. Tests: 12/12 green (+2 for DSMS anchor render paths incl. internal host → public gateway rewrite). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 17:32:34 +02:00
Benjamin Admin	80c4778017	feat(b17): accordion expansion in the audit walk (Stage 2, #7) After each compliance doc visit, all accordions / <details> / [aria-expanded=false] / trigger patterns are clicked and recorded in the video. - _expand_accordions(): 7 selector patterns, max 25 expansions per page, dedup by inner_text (prevents endless loops in nested structures). Scroll-into-view + click + a 400ms wait ensure that the click result is captured in the video. - _visit_link(): returns a (nav_event, expand_event) tuple. Expansion runs only on HTTP 2xx and without a nav error. - A 1500ms post-expand wait gives the camera time to record the final state. Backend B17 render: the "expand_accordions" action is rendered as "5 Akkordeon/Details-Sektion(en) entfaltet" (5 accordion/details section(s) expanded). For 0: "Keine Akkordeons gefunden" (no accordions found; a neutral note, not an error). Real-world smoke test against Elli: Impressum: 0 accordions (none); Datenschutzerkl: 5 accordions expanded; Nutzungsbeding: 0 accordions. The video size doubles (581 KB → 1.14 MB): the reviewer now sees the full DSE vendor table content in the video. Tests: 10/10 green (+2 for accordion render paths). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 17:23:55 +02:00
Benjamin Admin	cb4b352846	feat(b17): Playwright audit-walk video (Stage 1, #7) Records a complete site walk as a WebKit browser session incl. video. The reviewer can retrace afterwards exactly how the engine arrived at the finding. consent-tester: - services/audit_walk_recorder.py: Playwright record_video_dir, 1280×800 viewport (no iPhone viewport). Goto homepage → banner accept (best-effort: 12 text phrases + 5 CMP fallback selectors) → collect footer links (filtered for compliance relevance) → per link navigate + dwell time → JSON action index with UTC timestamps + SHA-256 of the video as tamper protection. - routes_audit_walk.py: POST /scan-audit-walk; static serving of /audit-walks/{walk_id}/video.webm + walk.json. - main.py: router registered. backend: - _b17_wiring.py: triggers /scan-audit-walk, stores walk metadata in state["audit_walk"]. Render block with an HTML table of all actions (HH:MM:SS + action + detail) + links to video and walk.json. - _orchestrator.py: run_b17 after run_b16, called async. - mail_render_v2/_compose.py: audit_walk_html in the V2 layout. - test_b17_audit_walk.py: 8 tests (render paths + wiring). Stage 2 (accordion expansion) and Stage 3 (DSMS CID anchor) follow separately. Real-world smoke test against Elli: - 581 KB video, SHA-256 verifiable - 3 footer links visited (Impressum, Datenschutzerkl., Nutzungs-) - 6 actions in the JSON index Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 17:20:13 +02:00
Benjamin Admin	ff796fb480	feat: B12 chatbot cookie classification (#19) + cookie-matrix scan + safetykon test. #19 chatbot cookie classification: - chat_providers.json KB with 11 providers (iAdvize, Intercom, Tidio, Drift, Userlike, Zendesk, LivePerson, HubSpot, Vertex AI, OpenAI, Anthropic Claude). Per provider: cookie pattern regex, typical_retention_days, tn_functions vs cp_functions, ai_capable. - chatbot_cookie_classification_check.py with 4 CORRECTED checks: CHAT-COOKIE-CLASS-001 (MED): declared as TN (technically necessary) + vendor purpose mentions targeting/analytics/A/B tests; CHAT-COOKIE-CLASS-002 (MED): provider has tn+cp functions, table names only one side → no consent differentiation; CHAT-COOKIE-PURPOSE-001 (LOW): purpose too generic (Art. 13 GDPR requires a specific purpose); CHAT-COOKIE-RETENTION-001 (HIGH): declared <90d, KB-typical >365d → probably under-declared. NEW vs the previous plan: no "separate banner category Chat/AI" check, since that is not legally required (it conflated purpose transparency with category name). The user's question was justified; the concept was sharpened. - _b12_wiring.py + orchestrator wire + V2 compose slot - Cookie inventory with [Chat]/[Chat+AI] tag per cookie name (KB lookup) - Smoke (3 vendors / 5 cookies): 9 findings correct (3 HIGH RETENTION, 3 MEDIUM CLASS-001, 4 LOW PURPOSE). Cookie-matrix scan (browser comparison against safetykon.de): - consent-tester/services/cookie_behavior_per_browser.py: dedicated, focused scanner. Per browser profile: cookies before / after reject / after accept in separate contexts. Sequential runs instead of parallel (race conditions). - routes_cookie_matrix.py POST /scan-cookie-matrix - Live test safetykon.de: chromium=1, firefox=0, webkit=1, mobile-safari=1 after reject: Firefox sets NO cookie after reject! (The consent-tester rebuild brought playwright install-deps for system libs.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 23:25:20 +02:00
Benjamin Admin	37093ff9e3	feat: Browser-Matrix C2 + B11 AI retention + Impressum specialist agent + B1 mobile Playwright. Task #15 Stage 1.c-e, browser-matrix backend integration: - _phase_c2_browser_matrix.py: calls consent-tester /scan-matrix when env BROWSER_MATRIX=true, fills state["browser_matrix"] + state["browser_aggregate"] + state["browser_matrix_html"] - V2 mail block: 🌐 browser-matrix table (profile · score · sub-scores PC/RR/BD · rating) with worst-of header - Orchestrator calls run_phase_c2 after run_phase_c. KNOWN: Stage 1.b (consent_scanner browser_profile param) remains deferred (file in loc-exception, hook patch refused). The Stage 1.a shim runs in consent-tester: all profiles are currently on Chromium; real engine diversity comes with 1.b. Task #17 TH-RETENTION-002 as B11 ai_retention_granularity_check: - Detects AI provider context (vertex/openai/anthropic/etc) - Within a ±800-char window: checks for ≥2 data categories from the standard list (text inputs/IP/device/session/error log/timestamp) - If there is 1 blanket retention period + ≥2 categories but no per-category differentiation → LOW - Smoke: the Elli mock DSE hits LOW "AI-Speicherdauer pauschal" (blanket AI retention period). Task #18 specialist agents Phase 1 prototype: - compliance/services/specialist_agents/__init__.py with architecture docs - impressum_agent.py: 9 mandatory disclosures under § 5 TMG + § 1 DL-InfoV as a pattern registry (name, email, phone, HR (commercial register), USt-IdNr (VAT ID), authorized representative, supervisory authority, professional details, OS link) - business_scope-aware (OS link only for ecommerce, supervisory authority only for regulated_profession/financial/insurance) - Phase 1 is pattern-match-only (no LLM) and demonstrates the interface. Phase 2 replaces the patterns with system prompt + KB. 
- Smoke: a minimal Impressum triggers 4 findings correctly. Task #7 B1 Playwright mobile verification: - consent-tester/services/mobile_reachability_scanner.py: real WebKit launch + p.devices['iPhone 15'] preset + de-DE locale + Europe/Berlin timezone - Footer anchor search via locator("footer >> text=/.../i") for 13 reopen phrases - Tap-target bounding-box measurement (Apple HIG / WCAG ≥44x44) - Click behavior: DOM modal snapshot before/after, detects CMP open - Output: has_anchor, anchor_text, tap_target_px, click_opens_cmp, engine_meta, screenshot_b64 (footer crop if there is no anchor) - consent-tester/routes_mobile.py POST /scan-mobile-reachability - Backend _b1_wiring extended: calls the mobile endpoint first, falls back to a static HTTP fetch. Mobile data enriches finding.mobile_playwright + severity bump on tap-target<44 / click-doesnt-open-CMP. KNOWN: the WebKit system libs have been added to the Dockerfile (Stage 1.a commit) but only take effect after a CI/CD rebuild of consent-tester. Until then, B1 falls back cleanly to the static fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 22:20:25 +02:00
Benjamin Admin	e1dadc8027	feat: Browser-Matrix Stage 1.a + 2 more GT findings + plausibility LLM hardening. Stage 1.a Browser-Matrix (Task #15), multi-engine scaffolding: - consent-tester/Dockerfile: firefox + webkit + Xvfb deps - playwright install chromium firefox webkit - services/browser_profiles.py: registry with DEFAULT_PROFILES (Chromium-Headed/Firefox-Headed/WebKit-Headed/Mobile-Safari) + EXTRA_PROFILES (Chrome-Channel, Edge, Brave) - services/multi_browser_scanner.py: run_matrix() orchestrates N parallel scans + worst-of aggregation + 3 sub-scores (pre-consent 50%, reject respect 30%, banner design 20%) + hard-fail cap at <60% on a pre-consent/reject violation - routes_matrix.py: POST /scan-matrix endpoint (separate module so that main.py stays under 500 LOC). KNOWN: the Stage 1.a shim runs all profiles on the same Chromium; real engine diversity comes in Stage 1.b (consent_scanner.py param). Coverage gap 3 (Task #17): 2/3 remaining GT gaps closed: - B9 impressum_multi_entity_check (IMPRESSUM-001): detects missing USt-IdNr/HR/GF (VAT ID/commercial register/managing director) per entity in multi-entity Impressum pages (Elli: USt-IdNr only for Elli Mobility, missing for VW Group Charging) - B10 transfer_mechanism_check (TRANSFER-001): for each non-EU vendor in cmp_vendors, checks the DSE for DPF/SCCs/BCRs/consent within a ±400-char window. Finds vendors without a named mechanism. - TH-RETENTION-002 (AI data category differentiation) remains semantically deep and is planned for specialist agents Task #18. Plausibility LLM empty-response hardening (Task #16): - BATCH_SIZE 8 → 4, EXCERPT 4000 → 1500 chars, TIMEOUT 60 → 45s - Single retry with a halved batch when the LLM returns empty content: qwen3:30b-a3b sometimes rejects ≥6-item prompts under format='json'. If the half batch is also empty: log + skip. - The pipeline no longer spends 10min in timeouts. GT coverage jump: 10/13 → 11/13 (85%). 4/4 HIGH ✓, 5/6 MEDIUM ✓, 2/3 LOW ✓. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 21:42:27 +02:00
Benjamin Admin	efeef73f90	feat(audit): overlapping evidence slices for a gapless chain of evidence. Instead of ONE full-page screenshot: the full-page image is cut with PIL into viewport-sized slices, each overlapping the previous one by overlap_px pixels. Every cookie appears in at least one slice, at slice boundaries even in two → dedup by name eliminates the duplicates. Why not scroll-based slicing directly in Playwright? VW's cookie page uses scroll-snap / fixed-position: all viewport shots came back identical (header overlay). The PIL cut on the full-page PNG bypasses the problem entirely. VW smoke test (32 slices): per-slice: [0, 0, 2, 5, 5, 3, 4, 7, 4, 3, 4, 5, ...] 103 raw cookies → 79 unique after dedup, 14 vendor records (Google 9, Adobe family 17, etc.) Each slice has its own timestamp + SHA256 → ZIP attachment for the legal chain of evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 23:38:13 +02:00
Benjamin Admin	1784b43d72	feat(audit): screenshot + Tesseract OCR cookie extraction as vendor source C. Instead of fragile text regex + LLM cascade workarounds: a deterministic pipeline. consent-tester takes a full-page screenshot of the cookie policy (accepts the banner, expands accordions, burns in a timestamp). The backend runs Tesseract OCR (deu, PSM 4) over it, and an anchor-based parser extracts {name, category, purpose, duration, type} per cookie. VW smoke test: - Before (parse_flat): 60 cookies / 16 vendors - Now (Tesseract): 79 cookies / 14 vendor records (~79% GT coverage) Architecture: - consent-tester: page_screenshot.py + /capture-evidence endpoint - backend: cookie_screenshot_ocr.py with Tesseract pipeline - pipeline: after parse_flat as complementary stage C - Dockerfile: tesseract-ocr + German language pack - requirements: pytesseract. NO text correction on cookie names (awsalb stays awsalb). Timestamp in the screenshot = legal evidence of what we actually saw on the site at scan time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 23:22:35 +02:00
Benjamin Admin	75174273f4	diag(cmp): log skipped CMP candidates with top-keys for Phase 0. VW & other unknown CMPs produce the 603-word bug: no named matcher applies, and the generic heuristic filters the response out or size_kb < 5 → cmp_cookie_text stays empty → the backend falls back to the 603-word DOM navigation. New INFO log for every JSON response >=3KB that survives as a CMP candidate but fails the heuristic OR the size threshold. Top-keys + URL + size: on the next VW run it is immediately visible which endpoint needs a named pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 11:51:03 +02:00
Benjamin Admin	e2be51b0aa	feat(audit): P106 MC-Audit-Type + P83 BUILD_SHA in Dockerfiles + P80 v2 full. P106: mc_audit_type.py: central quality topic. Classifies each MC as verifiable / process_internal / doc_internal / ambiguous. Pattern match on check_question + title + fail_criteria (training, DPA concluded, TOMs implemented, DPIA carried out, documenting exceptions, provided free of charge, enabling opt-out internally, …). Internal MCs are NO longer scored as FAIL in the MC evaluation but marked as CHECK (audit_status='check'). In build_scorecard they count as skipped (not failed) so that the score is realistic. build_internal_checks_block_html() renders them as a separate blue block 'Checks that we CANNOT perform from outside' after the MC scorecard. Expected effect: for VW, 95 FAILs → probably 30-40 real verifiable_fails + 50-60 internal_checks. The GF (management) mail becomes drastically more realistic (instead of 'You have 95 violations' → 'You have 35 externally visible issues + 60 internal checks, please clarify with the DPO'). 
P83: BUILD_SHA in the backend/admin/consent-tester Dockerfiles as ARG + ENV. check-rebuild-needed.sh can now compare the deployed vs the local SHA + report REBUILD REQUIRED. P80 v2: check_replay.py now performs a complete replay of all post-fetch quality generators: vendor_normalizer (dedup), audit_quality_checks, cookie_compliance_audit, tcf_vendor_authority, cookie_value_entropy, cookie_network_tracer. Older snapshots now show the current audit state in the replay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 08:57:02 +02:00
Benjamin Admin	8cbb513e2c	feat(audit): Phase 1 quick wins (P81 + P85 + P70 + P83) + TCF DELETE/INSERT fix. P81: tests/fixtures/golden_truth/vw_de.json: GT fixture with must_find_cookies (47 VW cookies) + expected_vendors (Google, Adobe, Trade Desk, ...). Basis for future regression tests. P85: banner_screenshot_block.py + consent_scanner.py + main.py: on banner detection, consent-tester takes a base64 PNG screenshot (< 1.5MB). The backend renders it as <img src="data:..."> directly after the GF 1-pager. Visual evidence 'this is what the banner looked like' for disputes with marketing/DPO. P70: rag_provenance.py: classify_finding_provenance() classifies a finding as 'rag' (norm + source), 'mixed' (norm without source) or 'heuristic' (own interpretation). provenance_badge_html() renders small badges (✓ RAG / NORM / ⚠ HEURISTIK). The module is generic and can be hooked into any finding renderer. P83: scripts/check-rebuild-needed.sh: checks whether the BUILD_SHA deployed in the container matches the local HEAD. On mismatch, exit 1 with a 'REBUILD REQUIRED' note. 
Prevents the 'old code in the container' problem that has caught us several times (frontend tabs visible, backend without the new service). TCF fix: tcf_vendor_authority.py: cookie_library has no UNIQUE index on cookie_name → ON CONFLICT was impossible. Solution: DELETE WHERE source_name='iab_tcf_v2' before the insert. Idempotent. + per-vendor commit so that one failure does not block the next ones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 08:24:46 +02:00
Benjamin Admin	1451873194	fix(audit): parse_flat_cookie_text for VW-style flat tables. The VW cookie doc delivers the table as FLAT text without column separators: 'IDE Tracking Cookies (Marketing) Beschreibung 13 Monate Permanent TAID Tracking Cookies (Marketing) ...' parse_flat_cookie_text matches with a regex: NAME [Tracking\|Session\|Funktional\|...] Cookies ... [13 Monate\|Session\|Permanent]. The backend falls back to parse_flat when parse_cookie_table=[]. This gets us ~30-50 cookies + vendors deterministically out of the 65k VW cookie doc, even when the HTML table DOM extract is empty (which happens when the table is loaded from several append code paths). Bonus: _extract_dom_tables helper in dsi_discovery.py prepared for later hooking into all 7 DiscoveredDSI.append sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:24:14 +02:00
Benjamin Admin	cb5dad1a2f	feat(audit): A audit transparency + B table parsing + D HTML tables from the DOM Three related fixes for the VW finding (6 vendors instead of 100+): A (audit_quality_checks.py): three systemic caveats that are ALWAYS shown prominently: * banner_detected=False despite a cookie doc → HIGH 'CMP tool not loaded' * cookie_doc >= 30k chars but cmp_vendors < 15 → HIGH/MEDIUM 'vendor list conspicuously short for the doc size' * submitted URL but 0/minimal text → MEDIUM 'URL not loadable'. Red audit-caveat box above the management one-pager. The management summary says 'audit incomplete' instead of wrongly 'no critical topics'. gf_one_pager includes audit_quality_findings in top_findings (BEFORE other findings). B: cookies_table_parser now also runs on crawled cookie-doc text (not only on user paste). If the dsi-discovery response delivers tab/pipe-separated table rows, we parse them deterministically. D: consent-tester/dsi-discovery now additionally extracts the <table> elements from the DOM as list[str] (tab-separated per row, at least 2 cells, at least 3 rows, max 10 tables per doc). The backend injects these as an 'html_table' cmp_payload and runs them through cookies_table_parser first → 100% deterministic vendor extraction without an LLM. VW expectation: 30-50 vendors are now parsed deterministically from the 65k cookie table instead of 6 from the LLM cascade. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 20:21:28 +02:00
Benjamin Admin	9c11b5463c	fix(audit): P98 + P100: cookie-table whitespace + customize-button check P98: HTML table cells in the VW cookie policy were concatenated without whitespace ('smartSignals2UiDsmartSignals2sUiDsmartSignals2CPs...'). Cause: el.textContent ignores block-element boundaries. Fix: innerText (whitespace-respecting) instead of textContent. Cookie names are now recognized individually; the VW run should find ~100 cookies instead of 1. P100: banner check for an 'Anpassen'/'Einstellungen' button in the initial banner. VW pattern: only 2 buttons (Nur technisch notwendige / Alle akzeptieren), no granular choice before accepting or rejecting. De facto nudging toward blanket acceptance. HIGH finding under EDPB 5/2020 §82. Pattern: anpassen/einstellungen/cookie-einstellungen/manage cookies/preferences/customize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 15:08:33 +02:00
Benjamin Admin	e1df24cad7	fix(audit): P93+P95: reject wording extended + vendor-centric cookie format accepted P93: 'Cookies verbieten', 'Tracking ablehnen', 'verweigern' etc. now count as an explicit reject mechanism. EDPB 5/2020 does not prescribe a specific word; the BMW false positive 'no reject mechanism' is gone. P95: the cookie_table check now accepts two equivalent formats: (a) a classic table, (b) a vendor detail page with one block per provider (name + address, purpose, aggregated storage duration, list of cookie names, opt-out link). The BMW style with an Adform block conforms to DSK-OH 2024. The false positive 'tabular cookie directory missing' becomes rarer. Hint text in cookie_table reworded: it names both acceptable formats and is less normative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:21:29 +02:00
Benjamin Admin	7938e377b6	feat(audit-tonality): P89/P76/P91: co-pilot instead of robot lawyer User feedback in one session: "We only create panic. Whatever it says, it takes weeks. We are a tool at the side of the CMO/GF/CIO, not an opponent." Memory: feedback_breakpilot_tonalitaet.md (applies to ALL modules + marketing). P89 critical-findings block REMOVED/REBUILT: no more red panic box. - Instead of "🚨 IMMEDIATE ACTION REQUIRED" -> "Summary for management", subtle blue block - Instead of "VIOLATIONS" -> "Topics to discuss with the DPO, marketing and development" - Instead of "fine range 4% of worldwide revenue" up front -> realistic classification (0.1-1%) in a subtle closing note with a confidence hint - "Immediate action" -> "Recommendation" - "Topics 1, 2, 3..." instead of "HIGH" badges (preparation for P87) - Explicit time estimate "4-8 weeks (DPO -> agency -> dev -> approval)" P76 Detect Mercedes secondary buttons (privacy policy + imprint, small, below the 3 main buttons). 
The walker now scans ALL clickable elements in the shadow DOM by label (wb7-link, wb7-link-secondary, wb7-button-text, span[onclick], small a, [role=button], etc.). Avoids the Mercedes imprint false positive from phase 1. P91 VVT table renderer in the new co-pilot tonality. Instead of a "list of violations with fine potential" -> a probability statement: "With a reduction in providers + a switch to European alternatives, a reduced tracking footprint + licence savings are likely. A well-founded assessment requires coordination with the DPO." BMW bugs B1-B4 (P90) deliberately not in this commit: the BMW run captured ePaaS 4x in the consent-tester, but the backend receives 0 cmp_payloads. Wiring bug between consent-tester /dsi-discovery and backend _fetch_text; it needs its own diagnosis session (see task P90). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:24:57 +02:00
Benjamin Admin	57c0f940a2	feat(consent+report): P56-P67 Mercedes audit cycle (anti-audit, Phase G vendors, cookie-behavior validator + 5 mail-polish items) [migration-approved] P56 anti-auditing detection as a constructive compliance finding (audit-API recommendation instead of an accusation, because Mercedes legitimately blocks bots) P57 Phase G vendor_details union with cmp_vendors -> 42 providers visible P58 anti-audit detection more robust (script-domain check + settings-specific) P59 cookie-behavior validator (4 layers, 3-tier severity: MEDIUM=category mismatch / HIGH=purpose mismatch / CRITICAL=both=indication of intent) + Open Cookie Database (CC0) as library seed (2264 cookies) P59b cookie behavior wired into the banner check + mail block (BUGFIX: open SessionLocal ourselves, db was not in scope in the background task) Mail polish after the Mercedes review: P63 also detect banner footer links in wb7-link/role=link (shadow-DOM walker label-based instead of only <a href>) P64 re-access severity: MEDIUM instead of HIGH when a footer "Einstellungen" entry or the Mercedes-typical equivalent exists; OEM footer detection (wb7-footer) P65 text truncation: word boundary instead of a character cut (no more "einfa" break in the immediate actions) P66 management actions: service purpose vs cookie purpose explained explicitly (a common marketing/management confusion: "Akamai description" != cookie purpose per DSK-OH 2024) P67 stirring finding with an explanation of "loss framing" + an old-vs-neutral example, instead of only the EDPB technical term Compliance-Advisor FAQ (admin agent-core/soul): + CNIL/EDPB top fines (Google 100M, Meta 60M, Amazon 35M) + German precedents (LG Muenchen Google Fonts, EuGH Planet49, BGH I ZR 7/16) + 4 risk paths (fine/warning letter/class action/NOYB) + calculation methodology Document-Generator templates: AGB-DE (142), Impressum (140), Widerrufsformular-Anlage (143), DSR-Process-Dedup (139), Cookie-Library (144). Architecture: doc_action_mappings.py + banner_dom_walkers.py + cookie_behavior_validator.py + vendor_detail_extractor.py extracted to stay within the 500-LOC caps in agent_doc_check_report.py and banner_text_checker.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 06:28:25 +02:00
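Item P65 (cut at a word boundary instead of mid-word) can be sketched as follows. The function name `truncate_at_word`, its parameters, and the `...` suffix are assumptions for illustration; the commit does not show the real helper.

```python
def truncate_at_word(text: str, limit: int, ellipsis: str = "...") -> str:
    """Shorten text to at most `limit` characters without splitting a word.

    Hypothetical helper: falls back to a hard cut only when the first
    `limit` characters contain no space at all.
    """
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut lands inside a word, drop the partial word.
    if not text[limit].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip() + ellipsis
```

With a limit of 12, "Banner einfach anpassen" would otherwise be cut to "Banner einfa"; the sketch drops the partial word and keeps only "Banner" plus the suffix.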
Benjamin Admin	6f16507c5f	feat(banner): P19 + P20: per-category click test + frontend drilldown P19 (consent-tester): - dp-cookieconsent (TYPO3, Safetykon pattern) added as a CMP profile: selectors #dp--cookie-statistics/marketing + a.cc-allow save button - New signal provider_details_visible: after a category toggle, Playwright checks whether visible provider/cookie detail elements appear in the banner. For dp-cookieconsent (banner without listing) always False -> HIGH violation "Category shows no provider/cookie details; the user cannot give informed consent (Art. 7(1) GDPR)" - main.py serializes provider_details_visible + cookies_set per category P20 (frontend drilldown): - Backend: check_payloads table gains a 'banner' column (JSON); the full banner_result is persisted (previously in-memory only). ALTER TABLE migration is idempotent. - New endpoint GET /api/compliance/agent/banner/<check_id>: returns quality score, phases, category tests, banner checks, all 46 structured_checks. - Frontend: BannerTab in /sdk/agent/audit/<id> with quality cards, 3-phase cookie table, per-category listing (with P19 signal red/green), banner violations + legal bases, 46-check drilldown filterable by severity. - Tab switcher in page.tsx extended with "Cookie-Banner-Analyse". - Bonus: 2 old route.ts files migrated to Next.js 15 Promise params (build fix). Plus: the critical-findings block uses provider_details_visible as the primary signal instead of only the tracking_services count. Smoke test Safetykon: 4 critical findings in the mail, the banner endpoint returns 46 checks + 3 phases + 2 categories with provider_details_visible=False. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:31:13 +02:00
Benjamin Admin	662327e8b4	feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features Major update based on BMW test iterations (v1→v9): Core compliance check - Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type. - scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered per business_profile (FRT skipped for BMW etc.). - Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the imprint. Title + check_question as embedding input for more context. - Cookie-text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over DOM extraction when it is richer (BMW 1824 vs 600 words). 
Vendor redundancy + EU alternatives + cost saving - vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU-alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...). - vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier). - Self-service advertising (Google/Meta/Pinterest/...) = 0 licence costs (media spend only, tracked separately). DSP platforms keep a narrow range. - Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of list prices, not starter→premier. - Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once. Cookie knowledge DB + functional classification - cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative. - cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact. Country inference from legal form - cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for clearly EU vendors (Adform DK, Pinterest IE). Action recipes + doc-anchor locator - finding_action_recipes: per finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...) a structured instruction with what/why/fix_text/where/example. For 1:1 insertion into customer documents. - doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match. 
- Email rendering integrates recipe + anchor per failed doc check + vendor flag list with a collapsible action list. - Score explanation per vendor row (3/5 subtitle + tooltip). Migration pipeline (compliance check -> customer banner/documents) - migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags. - migration_to_document.py: vendor list -> cookie policy + VVT register + privacy-policy pre-fills. - agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). Persistence of cmp_vendors in the /data/compliance_audits.db check_payloads table. Borlabs-parity cookie-banner features - Consent history in the banner: window.bpShowConsentHistory() + localStorage. - Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video placeholder until consent. - Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB. - Consent-log export (CSV/JSON) via einwilligungen_export_routes. Bug fixes - canonical_control_routes: _jsonish helper for string-typed jsonb, similar-controls endpoint with _has_embedding_col() cache (no more 500). - Control-library frontend: defensive .map coercer in 2 detail views. - Embedding-service batching (batches of 32 instead of 165 in one call). - KeyError 'control_id' in MC result aggregation (defensive .get). - Master-controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open. - Dockerfile: /data pre-chowned to appuser (audit-DB write permission). - Cookie-text routing bug (cmp_reconstructed > DOM extraction). - doc_type-aware MC filter (instead of all text MCs). - Master-contract dedup (60 BMW-internal entries = 1 Adobe contract). - The A3-v2 audit reclassified 24 UI-language MCs as 'process'. 
Tests - test_migration_mappers.py (9 tests) - test_migration_endpoints.py (4 tests) Scripts (one-shot) - classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type) - audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires) BMW run balance v1 (broken) -> v9 (all fixes): DSE 7.5% -> 81-83% Impressum 4% -> 100% (all 6 real MCs fulfilled) Cookie 0% -> 79-83% (CMP text routing + embedding) Plus: 10 consolidation categories, estimated saving 200k-3M / year Plus: action recipes + doc anchors for every fail Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:30:08 +02:00
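The country inference from the legal form mentioned in this commit (A/S=DK, GmbH=DE, Inc=US, B.V.=NL) might look roughly like the sketch below. Only those four mappings come from the commit message; the function name `infer_country`, the letter-boundary matching, and the omission of the vendor lookup table are assumptions.

```python
import re
from typing import Optional

# Legal-form suffixes taken from the commit message; the real table
# in cookie_link_validator is described as longer ("...").
LEGAL_FORM_COUNTRY = (
    ("A/S", "DK"),
    ("GmbH", "DE"),
    ("Inc", "US"),
    ("B.V.", "NL"),
)


def infer_country(vendor_name: str) -> Optional[str]:
    """Guess a vendor's country from the legal form in its name."""
    for form, country in LEGAL_FORM_COUNTRY:
        # Require non-letter boundaries so "Inc" does not hit "Incognito".
        pattern = r"(?<![A-Za-z])" + re.escape(form) + r"(?![A-Za-z])"
        if re.search(pattern, vendor_name):
            return country
    return None
```

A vendor without a recognizable legal form returns `None` here; per the commit, the real code additionally consults a vendor lookup table for such cases.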
Benjamin Admin	189918b043	fix(cmp): stricter heuristic + only replace DOM when CMP is strictly larger Two bugs observed in the BMW test run: 1. The generic JSON heuristic captured /de-de/login/bmw/api/flyout/data (4KB, user login fly-out data) and reconstruct_generic produced 56 words of noise. The CMP-prefer logic then 'replaced' the 185-word imprint DOM extraction with those 56 words because self_wc(185) < 300, even though cmp_wc(56) < self_wc(185). 2. The strict prefilter list was too short. Login/auth/cart endpoints often have category-shaped JSON without being cookie policies. Fixes: - dsi_discovery: replace DOM with CMP only when cmp_wc > self_wc AND one of the existing conditions is met. Tiny captures can no longer silently destroy a bigger DOM extraction. - cmp_extractor: skip non-cookie URLs (/login, /auth, /user, /session, /cart, /checkout, /search, /flyout, /menu, /nav, /translation, /i18n, /locale, /feature-flag). - cmp_extractor: require ≥5KB payload size; real CMP policies are always larger (BMW ePaaS is ~393KB). Tiny matches drop out before reconstruction.	2026-05-17 10:50:19 +02:00
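The two cmp_extractor prefilters from the commit above (URL skip list and minimum payload size) can be sketched as a single predicate. The path fragments and the 5KB threshold come from the commit message; the function name `is_cmp_candidate` and the plain substring matching on the URL path are assumptions.

```python
from urllib.parse import urlparse

# Path fragments listed in the commit message as non-cookie endpoints.
SKIP_FRAGMENTS = (
    "/login", "/auth", "/user", "/session", "/cart", "/checkout",
    "/search", "/flyout", "/menu", "/nav", "/translation", "/i18n",
    "/locale", "/feature-flag",
)
MIN_PAYLOAD_BYTES = 5 * 1024  # "require >=5KB payload size"


def is_cmp_candidate(url: str, payload_size: int) -> bool:
    """Return True if a JSON response is worth testing as a cookie policy."""
    path = urlparse(url).path.lower()
    # Assumption: simple substring match against the URL path only.
    if any(fragment in path for fragment in SKIP_FRAGMENTS):
        return False
    return payload_size >= MIN_PAYLOAD_BYTES
```

Under this sketch the 4KB fly-out response from the bug report is rejected twice over: its path contains `/login` and `/flyout`, and it is below the size threshold.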
Benjamin Admin	ea4dbb223f	feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1) When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now extract structured vendor records, probe their opt-out + privacy URLs, score each vendor (0-100), and append a 'VVT-Vorschlag' table to the compliance email: one row per vendor, sortable by compliance score. consent-tester: - DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers - DSIDiscoveryResponse: new cmp_payloads field - discover_dsi_documents sets cmp_payloads from cmp_capture - cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning list[VendorRecord] backend: - _fetch_text() now returns (text, cmp_payloads) tuple - doc_entries store cmp_payloads per doc (mostly cookie) - _autodiscover_missing forwards homepage payloads to the cookie entry - New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic schemas; dedupes vendors across multiple payloads - cookie_link_validator.py extended with validate_vendor_urls(vendors) and score_vendors(vendors): 0-100 score per vendor based on name, purpose, country, opt-out reachable, privacy URL reachable, cookies with names + expiry - agent_doc_check_extras.build_vvt_table_html: renders the table - Route appends VVT HTML after the provider list, before the document-by-document report - Response JSON gains cmp_vendors for future frontend rendering Example for BMW: ~30 ePaaS providers → table with Name | Kategorie | Sitz | Cookies | Opt-Out (✓/✗) | Privacy (✓/✗) | Score. Sorted by score ascending so the worst-compliant vendors are at the top.	2026-05-17 09:50:11 +02:00
Benjamin Admin	5f2da1de88	feat(consent-tester): Phase E — self-improving CMP library cmp_discovery_log.py: - sqlite log at /data/cmp_discoveries.db: every LLM-discovered CMP pattern recorded with domain, strategy, value, sample text - Auto-promote (user-chosen 'voll automatisch' mode): when LLM returns strategy=url AND extracted text >= 800 words, write a new module /data/auto_cmp/auto_<slug>.py with derived regex matcher + reconstruct - record_discovery() called from dsi_discovery._try_llm_cascade on success cmp_library/_registry.py: - Loads both hand-written modules from services/cmp_library/ AND auto-promoted modules from /data/auto_cmp/ (CMP_AUTO_DIR env) - Auto modules use importlib.util.spec_from_file_location, no package install needed; restart consent-tester to pick up new ones dsi_discovery.py: - _try_llm_cascade now calls record_discovery() on every successful LLM analysis (cached AND fresh) main.py: - GET /cmp-discoveries — admin endpoint listing all logged discoveries - DELETE /cmp-discoveries/{id} — rollback (unlinks auto_*.py) This closes the self-improving loop: first encounter with a new CMP fires the LLM (cost) → discovery is auto-promoted → all future runs against the same vendor pattern hit Phase B (Named CMP) at <50ms with no LLM call.	2026-05-16 23:09:23 +02:00
Benjamin Admin	2400aa6a9e	feat(consent-tester): Phase C+D: LLM cascade fallback (Qwen → OVH) New module consent-tester/services/cmp_llm_fallback.py: - LLMCookieExtractor: single-endpoint adapter (Ollama OR OpenAI-compat) - LLMCascade: tries Qwen (local Mac Mini Ollama) first; falls through to OVH (managed 120B) when Qwen returns no usable strategy - LLMCascade.from_env(): reads OLLAMA_URL/CMP_LLM_MODEL + OVH_LLM_URL/OVH_LLM_KEY/OVH_LLM_MODEL from environment - LLM returns JSON {strategy: url|selector|text, value: ...} - Valkey-backed cache per netloc (cmp:hint:<netloc>, 7-day TTL); the next run against the same domain skips the LLM entirely dsi_discovery.py: - Wired network_log collector (URL/status/content-type/size of every JSON response on the page), passed to the LLM prompt as observation - After Named CMP (Phase B) + Heuristic (Phase A) both fail AND DOM < 300 words: invoke LLMCascade.analyze(...) - _apply_llm_hint executes the LLM's strategy: refetch URL via Playwright request context, query DOM selector, or use text directly - Cache HIT path: apply cached hint, only fall back to LLM if cache is stale docker-compose.yml: - consent-tester gets env vars + cmp-data volume (for Phase E) - All LLM endpoints configurable via env, sensible defaults consent-tester/requirements.txt: - redis>=5.0 (asyncio client, Valkey-compatible) - httpx>=0.27	2026-05-16 23:06:05 +02:00
Benjamin Admin	7e426c31f1	feat(consent-tester): Phase B — named CMP library + plugin architecture cmp_extractor.py refactored to thin coordinator (123 LOC, was 223). Discovers all CMP modules via cmp_library/_registry.py:load_all() at import time. Restart consent-tester to pick up new modules. New cmp_library/ folder: - _registry.py: auto-discovers all modules with MATCHER + reconstruct() - epaas.py: BMW Group ePaaS (extracted from cmp_extractor) - onetrust.py: cdn.cookielaw.org Groups/Cookies schema - cookiebot.py: consent.cookiebot.com Categories schema - usercentrics.py: api.usercentrics.eu services schema - didomi.py: sdk.privacy-center.org notice + vendors + purposes - trustarc.py: consent.trustarc.com categories + vendors Each module: - MATCHER: re.Pattern matching the CMP JSON endpoint URL - reconstruct(d: dict) -> str: builds German Markdown cookie-policy text Phase E (self-improving) will write auto_*.py files into the same folder; _registry already picks those up via pkgutil.iter_modules.	2026-05-16 22:59:48 +02:00
Benjamin Admin	8283483909	feat(consent-tester): Phase A — generic JSON cookie-policy heuristic New module cmp_heuristic.py with: - looks_like_cookie_policy(data): shape-based classifier (top-level keys cookies/categories/providers/vendors/purposes/cookieList/etc. + at least 2 name+description objects, or IAB TCF v2 vendors[]+purposes[]) - reconstruct_generic(data): walks JSON, extracts name + description fields + standalone prologue/dataController/persistence fields, emits flat German Markdown text (max 5000 words, dedup) cmp_extractor.py wired so that AFTER named CMP matchers (epaas, onetrust) fail, every JSON response on the page is tested for the heuristic. If matched, payload is captured as '_heuristic' kind and reconstructed via the generic walker. This is Phase A of the 4-stage cascade (B-D follow). Unknown CMPs that return JSON now work without hand-coding each one. Pre-filter: skips response paths /api/config, /beacon, /track, /analytics, /fonts/, /log/, /heartbeat/, /.well-known/ to avoid spamming the heuristic on every Playwright load.	2026-05-16 22:56:20 +02:00
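The shape-based classifier described in the Phase A commit can be approximated with the sketch below. The key names and the "at least 2 name+description objects" rule come from the commit message; the function name `looks_like_policy_sketch`, the recursive counting, and the treatment of empty lists are assumptions, not the actual looks_like_cookie_policy code.

```python
from typing import Any

# Top-level keys named in the commit message (its list ends with "etc.").
POLICY_KEYS = {
    "cookies", "categories", "providers", "vendors", "purposes", "cookieList",
}


def _count_named(node: Any) -> int:
    """Count dicts anywhere in the tree that carry name AND description."""
    if isinstance(node, dict):
        own = 1 if "name" in node and "description" in node else 0
        return own + sum(_count_named(v) for v in node.values())
    if isinstance(node, list):
        return sum(_count_named(v) for v in node)
    return 0


def looks_like_policy_sketch(data: Any) -> bool:
    """Heuristic: does this JSON payload have the shape of a cookie policy?"""
    if not isinstance(data, dict):
        return False
    vendors, purposes = data.get("vendors"), data.get("purposes")
    # IAB TCF v2-like shape: non-empty vendors[] and purposes[].
    if isinstance(vendors, list) and isinstance(purposes, list) and vendors and purposes:
        return True
    if not POLICY_KEYS & data.keys():
        return False
    return _count_named(data) >= 2
```

A payload with a `name`/`description` pair but no policy-like top-level key, such as user profile data, is rejected by the key check before any counting happens.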
Benjamin Admin	9814b56f2f	fix(cookie-extract): max_documents=1 + faster networkidle bail (Phase 0 fix) Root cause of the recurring 603-word BMW result: - DSI discovery for the cookie-policy URL was hitting 4x networkidle timeouts (60s each = ~240s total). - The backend httpx timeout (180s after the previous fix) gave up before the consent-tester finished, falling through to the raw HTTP fetch, which returned BMW's SSR navigation chrome (603 words) as the 'cookie policy'. Two orthogonal fixes: 1. _fetch_text now passes max_documents=1 for user-specified URLs. We only want self-extraction of THAT page; link-following is unnecessary noise. 2. networkidle wait_until window dropped 60s -> 15s. SPAs like BMW/Daimler never reach networkidle anyway; the 60s wait was pure latency. Falls through to domcontentloaded+5s render-wait, same as before.	2026-05-16 22:53:23 +02:00
Benjamin Admin	938f9a6c51	fix(cmp): tolerate variable URL segments in ePaaS policy pattern BMW ePaaS URLs use 3 segments between /policypage/ and .epaas.json: /epaas/prod/policypage/<tenant>/<config-hash>/<locale>.epaas.json The old pattern only matched 2 segments. Switch to a tolerant pattern that matches any path before .epaas.json (anchored at .epaas.json end).	2026-05-16 20:58:48 +02:00
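The difference between a fixed two-segment pattern and a tolerant one can be shown with a small sketch. Both regexes below are assumptions reconstructed from the commit's description (the commit does not quote the real patterns), and the tenant/hash values in the usage note are placeholders.

```python
import re

# Hypothetical "old" pattern: exactly two path segments after /policypage/.
OLD_PATTERN = re.compile(r"/policypage/[^/]+/[^/]+\.epaas\.json$")

# Hypothetical tolerant pattern: any number of segments, anchored at the
# .epaas.json suffix.
TOLERANT_PATTERN = re.compile(r"/policypage/(?:[^/]+/)*[^/]+\.epaas\.json$")


def is_epaas_policy_url(path: str) -> bool:
    """Return True if a URL path looks like an ePaaS policy JSON endpoint."""
    return TOLERANT_PATTERN.search(path) is not None
```

A three-segment path such as `/epaas/prod/policypage/tenant/abc123/de_DE.epaas.json` fails the two-segment pattern but matches the tolerant one, which is the behavior the fix describes.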
Benjamin Admin	17a93bc694	fix(consent-tester): prefer CMP-JSON over thin DOM extraction Previous threshold (DOM < 300 words) missed the BMW case where Playwright extracted 346 words of pure site navigation. The CMP JSON had 1673 words of real policy content but was discarded. New heuristic: prefer CMP when ANY of: - DOM < 300 words (existing) - CMP text >= 1000 words (authoritative at scale) - CMP text >1.5x longer than DOM	2026-05-16 20:56:11 +02:00
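The three-way preference rule from the commit above can be written as one predicate. The thresholds (300 words, 1000 words, factor 1.5) are taken from the commit message; the function and parameter names are assumptions.

```python
def prefer_cmp_text(dom_words: int, cmp_words: int) -> bool:
    """Decide whether CMP-reconstructed text should replace the DOM text.

    Hypothetical sketch of the rule "prefer CMP when ANY of":
    thin DOM, large CMP text, or CMP clearly longer than the DOM.
    """
    return (
        dom_words < 300
        or cmp_words >= 1000
        or cmp_words > 1.5 * dom_words
    )
```

For the BMW case (346 DOM words, 1673 CMP words) the rule now prefers the CMP text. It also accepts a 50-word CMP capture over a 200-word DOM extraction, which is the gap that commit 189918b043 later closes by additionally requiring the CMP text to be strictly larger.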
Benjamin Admin	1792c6f896	fix(consent-tester): capture CMP JSON to extract dynamically-loaded cookie policies BMW (and other big enterprise sites) do NOT render cookie policies as static HTML. Their widget loads structured data from a JSON endpoint (BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and renders it client-side after consent. Our DOM extraction therefore only captured site navigation (603 words of header/footer chrome), not the actual policy. New module consent-tester/services/cmp_extractor.py: - CMPCapture: response listener that catches policy JSON during navigation - Reconstructors for ePaaS (BMW) + OneTrust placeholder - Returns Cookie-Richtlinie text built from policyPageMetadata + categories + providers (BMW: 1673 words reconstructed vs. 603 noise) dsi_discovery.py: - Attach CMPCapture before page.goto - After self-extraction: if rendered DOM < 300 words AND CMP captured a payload, prefer the CMP-reconstructed text. This bypasses the empty '.cookie-policy' div problem entirely.	2026-05-16 20:50:15 +02:00
Benjamin Admin	e61e9d9e2a	feat(agent): progress_pct + 6 BMW-run improvements Backend (agent_compliance_check_routes.py): - progress_pct (0-100%) in the job state, distributed across all phases (loading 0-30, profile 35-40, checking 40-80, banner 80-92, report 95-100) - Status texts unified ("Loading texts X/N", "Checking X/N") - Company name for the email subject is now derived from the URL (bmw.de -> "BMW", mercedes-benz.de -> "Mercedes-Benz") instead of the unreliable extracted_profile.companyName (which often matched juris.de) - The email report now contains the banner + TCF vendor list (build_provider_list_html) Backend (agent_doc_check_extras.py, new): - build_scanned_urls_html: checked URLs as a table at the top of the report (makes transparent for management which sources were actually fetched) - Cross-domain note for >1 netloc (BMW: bmw.de / bmwgroup.com / bmwgroup.jobs; findability under Art. 12 GDPR) - build_provider_list_html: banner box + TCF vendor table with the columns Name | Category | Purpose | Third country | Legal basis Backend (business_profiler.py): - §34d GewO insurance-intermediary notices no longer count as the "finance" industry (this caused BMW to be wrongly detected as B2B/finance) - New industry "automotive" (Fahrzeug/KFZ/Konfigurator/Modellpalette) - B2B keywords: generic terms such as "unternehmen", "beratung", "consulting" removed (they matched in every corporate text) - B2C fallback: with consumer signals ("widerruf", "kunde", editorial content) the profile leans toward b2c instead of b2b Frontend (ComplianceCheckTab.tsx): - Progress bar with width % and XX% display on the right - reads data.progress_pct from the polling response Consent-Tester (dsi_discovery.py): - Critical fix for cookie-policy extraction: wait_for_function until body.innerText > 500 chars (BMW SPA rendering needed more time) - _extract_text_robust: 3-strategy extraction (selectors -> body cleanup -> P/LI/TD tags) - _extract_text_from_iframes: reads OneTrust/Sourcepoint/Usercentrics iframe contents (some cookie policies live there) Addresses all findings from the BMW ground-truth comparison.	2026-05-16 17:53:14 +02:00
Benjamin Admin	fca67c1f43	fix: accordion close bug + merge multi-page DSIs (BMW fix) 1. _expand_all_interactive(): Only click aria-expanded="false" buttons. Before: clicked ALL accordion buttons including open ones → BMW's pre-expanded accordions got CLOSED, reducing text from 1151 to 361w. 2. _fetch_text() + /extract-text: merge ALL documents found on a page (max_documents=10 instead of 1). BMW splits DSI across 5 sub-pages that the discovery finds as separate documents — now merged. 3. Tab panels: unhide hidden tabpanels instead of clicking tabs (clicking tabs can hide the currently visible panel). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-15 13:32:04 +02:00
Benjamin Admin	5e317d2f0f	fix: text extraction 50k char limit was root cause of all Spiegel FNs ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars. Spiegel DSI has 107,720 chars (13,705 words); only 47% was extracted. DSB, Art. 77, Betroffenenrechte were all in the truncated portion. Fixes: 1. Raise text limit from 50k to 200k chars in API response + discovery 2. click_button(): add iframe fallback for Sourcepoint/Quantcast 3. dsi_helpers: iterate ALL page.frames for consent buttons 4. Profiler: only check impressum (not full text) for regulated professions, and "rechtsanwalt" must be in first 500 chars (company description) 5. GT: save full Spiegel DSI text (13,705 words) as reference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 15:22:38 +02:00
Benjamin Admin	64e3a47b8c	fix(iace): confirmation dialog for ungrouping + undo/regroup - X button replaced with confirmation dialog: "Als eigenen Punkt fuehren" / "Abbrechen" - Dialog explains the action and that it's reversible - Ungrouped items show orange "Zurueck in Block" button - Info bar shows count of ungrouped items + "alle zuruecksetzen" link - No destructive action without user confirmation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 15:19:39 +02:00
Benjamin Admin	b2c1f0ae84	fix(consent): add Sourcepoint iframe handler + banner_detector fallback Root cause: Spiegel DSI text was truncated because the Sourcepoint consent wall was not dismissed; dsi_helpers.py had no Sourcepoint handler. Fixes: 1. Add Sourcepoint iframe click (frame_locator + .sp_choice_type_11) 2. Add banner_detector fallback (reuses 30 CMP selectors from scanner) 3. After banner dismiss, wait and re-navigate if page redirected 4. Add "Zustimmen und weiter" to generic text button list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 10:12:50 +02:00
Benjamin Admin	c702260ec1	fix: 5 regex bugs + text extraction scroll + GT update Root cause: Spiegel DSI text was truncated (lazy-loading); the rights/DSB/complaints sections at the bottom were never extracted. Fixes: 1. Text extraction: scroll to bottom before innerText (dsi_discovery.py) 2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern 3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces) 4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc. 5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media rows from sections found in the DSI text Ground Truth 06-spiegel.md fully rewritten with verified data from live website: 3 L1 False Negatives identified (DSB, Beschwerderecht, Betroffenenrechte all present on website but not in extracted text). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 01:20:55 +02:00
Benjamin Admin	c867478791	feat(tcf-vendors): GVL cache + vendor extraction + VVT mapping Phase 1-2 of the closed quality loop: - GVL cache (consent-tester/services/gvl_cache.py): downloads and caches IAB Global Vendor List with 24h TTL, resolves vendor IDs to names, purposes, policy URLs, retention, country - Vendor extraction (consent_interceptor.py): extract_tcf_vendors() reads __tcfapi after accept phase, resolves via GVL - Scan response: tcf_vendors field added to /scan endpoint - VVT mapper (vendor_vvt_mapper.py): maps TCF vendors to VVT format with purpose labels, Rechtsgrundlage, Drittland detection - Vendor cross-check (banner_cookie_cross_check.py): checks all TCF vendors against DSI text for missing vendors and undocumented transfers - Compliance check integrates Step 3d: TCF vendors vs DSI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 18:18:50 +02:00
Benjamin Admin	fde2f551d7	fix: Add impressum keywords to dsi_discovery.py inline DSI_KEYWORDS The inline DSI_KEYWORDS in dsi_discovery.py was missing 'impressum'. This caused self-extraction to skip impressum pages, returning datenschutz text instead. Added: impressum, anbieterkennzeichnung, imprint, legal notice, site notice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 14:43:47 +02:00


82 Commits