breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	cb4b352846	feat(b17): Playwright audit-walk video (stage 1, #7) Records a complete site walk as a WebKit browser session, including video. A reviewer can later retrace exactly how the engine arrived at a finding. consent-tester: - services/audit_walk_recorder.py: Playwright record_video_dir, iPhone-viewport-free 1280×800. Go to homepage → accept banner (best effort: 12 text phrases + 5 CMP fallback selectors) → collect footer links (filtered to compliance-relevant ones) → per link: navigate + dwell time → JSON action index with UTC timestamps + SHA-256 of the video as tamper protection. - routes_audit_walk.py: POST /scan-audit-walk; static serving of /audit-walks/{walk_id}/video.webm + walk.json. - main.py: router registered. backend: - _b17_wiring.py: triggers /scan-audit-walk, stores walk metadata in state["audit_walk"]. Render block with an HTML table of all actions (HH:MM:SS + action + detail) + links to video and walk.json. - _orchestrator.py: run_b17 after run_b16, called async. - mail_render_v2/_compose.py: audit_walk_html in the V2 layout. - test_b17_audit_walk.py: 8 tests (render paths + wiring). Stage 2 (accordion expansion) and stage 3 (DSMS CID anchor) follow separately. Real-world smoke test against Elli: - 581 KB video, SHA-256 verifiable - 3 footer links visited (Impressum, Datenschutzerkl., Nutzungs-) - 6 actions in the JSON index Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 17:20:13 +02:00
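The tamper protection described in cb4b352846 (a SHA-256 of the recorded video stored next to the JSON action index) could look roughly like the sketch below. The function names and the index keys (`created_utc`, `video_sha256`, `actions`) are assumptions for illustration; the commit does not show the real layout of walk.json.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Stream the file through SHA-256 so large videos are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_walk_index(video: Path, actions: list) -> dict:
    """Hypothetical index layout: UTC creation time, video hash, recorded actions."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "video_sha256": sha256_of_file(video),
        "actions": actions,
    }


def verify_walk(video: Path, index: dict) -> bool:
    """True only if the video on disk still matches the hash stored in the index."""
    return sha256_of_file(video) == index["video_sha256"]
```

A reviewer would call `verify_walk` on the downloaded video and the downloaded index; any edit to the video file changes the digest and makes the check fail.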
Benjamin Admin	ff796fb480	feat: B12 chatbot cookie classification (#19) + cookie-matrix scan + safetykon test #19 Chatbot cookie classification: - chat_providers.json KB with 11 providers (iAdvize, Intercom, Tidio, Drift, Userlike, Zendesk, LivePerson, HubSpot, Vertex AI, OpenAI, Anthropic Claude). Per provider: cookie-pattern regex, typical_retention_days, tn_functions vs cp_functions, ai_capable. - chatbot_cookie_classification_check.py with 4 CORRECTED checks: CHAT-COOKIE-CLASS-001 (MED): TN declared + vendor purpose mentions targeting/analytics/A-B tests CHAT-COOKIE-CLASS-002 (MED): provider has tn+cp functions, table names only one side → no consent differentiation CHAT-COOKIE-PURPOSE-001 (LOW): purpose too generic (Art. 13 GDPR requires specifics) CHAT-COOKIE-RETENTION-001 (HIGH): declared <90d, KB-typical >365d → probably under-declared NEW vs. the previous plan: no "own banner category Chat/AI" check, because it is not legally required (it conflated purpose transparency with the category name). The user's question was justified; concept sharpened. - _b12_wiring.py + orchestrator wire + V2 compose slot - Cookie inventory with [Chat]/[Chat+AI] tag per cookie name (KB lookup) - Smoke (3 vendors / 5 cookies): 9 findings correct (3 HIGH RETENTION, 3 MEDIUM CLASS-001, 4 LOW PURPOSE) Cookie-matrix scan (browser comparison against safetykon.de): - consent-tester/services/cookie_behavior_per_browser.py: dedicated, focused scanner. Per browser profile: cookies before / after reject / after accept in separate contexts. Sequential runs instead of parallel (race conditions). - routes_cookie_matrix.py POST /scan-cookie-matrix - Live test safetykon.de: chromium=1, firefox=0, webkit=1, mobile-safari=1 after reject. Firefox sets NO cookie after reject! (the consent-tester rebuild brought playwright install-deps for system libs) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 23:25:20 +02:00
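The CHAT-COOKIE-RETENTION-001 rule from ff796fb480 (declared retention under 90 days while the knowledge base considers more than 365 days typical) can be sketched as a small pure function. The `Finding` dataclass and the function signature are assumptions; only the check ID, the two thresholds and the HIGH severity come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    check_id: str
    severity: str
    cookie: str


def check_retention(cookie: str, declared_days: int, kb_typical_days: int) -> Optional[Finding]:
    """Flag a cookie whose declared retention is far below what the KB considers typical."""
    # Thresholds as stated in the commit: declared <90d, KB-typical >365d.
    if declared_days < 90 and kb_typical_days > 365:
        return Finding("CHAT-COOKIE-RETENTION-001", "HIGH", cookie)
    return None
```

Keeping the rule free of I/O makes it easy to run against a fixture of declared cookies and a KB lookup table.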
Benjamin Admin	37093ff9e3	feat: Browser matrix C2 + B11 AI retention + Impressum specialist agent + B1 mobile Playwright Task #15 Stage 1.c-e, browser-matrix backend integration: - _phase_c2_browser_matrix.py: calls consent-tester /scan-matrix when env BROWSER_MATRIX=true, fills state["browser_matrix"] + state["browser_aggregate"] + state["browser_matrix_html"] - V2 mail block: 🌐 browser-matrix table (Profile · Score · Sub-Scores PC/RR/BD · Rating) with worst-of header - Orchestrator calls run_phase_c2 after run_phase_c KNOWN: Stage 1.b (consent_scanner browser_profile param) remains deferred (file in loc-exception, hook patch refused). The stage 1.a shim runs in the consent-tester; all profiles are currently on Chromium, real engine diversity comes with 1.b. Task #17 TH-RETENTION-002 as B11 ai_retention_granularity_check: - Detects AI-provider context (vertex/openai/anthropic/etc) - Within a ±800-char window: checks for ≥2 data categories from the standard list (text inputs/IP/device/session/error log/timestamp) - If there is 1 blanket retention period + ≥2 categories but no per-category differentiation → LOW - Smoke: the Elli mock DSE (privacy policy) hits LOW "AI-Speicherdauer pauschal" (blanket AI retention period) Task #18 Specialist agents phase-1 prototype: - compliance/services/specialist_agents/__init__.py with architecture docs - impressum_agent.py: 9 mandatory disclosures under § 5 TMG + § 1 DL-InfoV as a pattern registry (name, email, phone, commercial register (HR), VAT ID (USt-IdNr), authorized representative, supervisory authority, professional details, OS link) - business_scope-aware (OS link only for ecommerce, supervisory authority only for regulated_profession/financial/insurance) - Phase 1 is pattern-match only (no LLM) and demonstrates the interface. Phase 2 replaces the patterns with system prompt + KB. - Smoke: a minimal Impressum correctly triggers 4 findings Task #7 B1 Playwright mobile verification: - consent-tester/services/mobile_reachability_scanner.py: real WebKit launch + p.devices['iPhone 15'] preset + de-DE locale + Europe/Berlin timezone - Footer-anchor search via locator("footer >> text=/.../i") for 13 reopen phrases - Tap-target bounding-box measurement (Apple HIG / WCAG ≥44x44) - Click behavior: DOM modal snapshot before/after, detects CMP open - Output: has_anchor, anchor_text, tap_target_px, click_opens_cmp, engine_meta, screenshot_b64 (footer crop when there is no anchor) - consent-tester/routes_mobile.py POST /scan-mobile-reachability - Backend _b1_wiring extended: calls the mobile endpoint first, falls back to a static HTTP fetch. Mobile data enriches finding.mobile_playwright + severity bump on tap-target<44 / click-doesnt-open-CMP. KNOWN: the WebKit system libs have been added to the Dockerfile (stage 1.a commit) but only take effect after a CI/CD rebuild of the consent-tester. Until then B1 falls back cleanly to the static fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 22:20:25 +02:00
Benjamin Admin	e1dadc8027	feat: Browser matrix stage 1.a + 2 more GT findings + plausibility-LLM hardening Stage 1.a browser matrix (Task #15), multi-engine scaffolding: - consent-tester/Dockerfile: firefox + webkit + Xvfb deps - playwright install chromium firefox webkit - services/browser_profiles.py: registry with DEFAULT_PROFILES (Chromium-Headed/Firefox-Headed/WebKit-Headed/Mobile-Safari) + EXTRA_PROFILES (Chrome-Channel, Edge, Brave) - services/multi_browser_scanner.py: run_matrix() orchestrates N parallel scans + worst-of aggregation + 3 sub-scores (pre-consent 50%, reject respect 30%, banner design 20%) + hard-fail cap to <60% on a pre-consent/reject violation - routes_matrix.py: POST /scan-matrix endpoint (separate module so main.py stays under 500 LOC) KNOWN: the stage 1.a shim runs all profiles on the same Chromium; real engine diversity comes in stage 1.b (consent_scanner.py param) Coverage gap 3 (Task #17): 2 of 3 remaining GT gaps closed: - B9 impressum_multi_entity_check (IMPRESSUM-001): detects a missing VAT ID (USt-IdNr), commercial register entry (HR) or managing director (GF) per entity in multi-entity Impressum pages (Elli: USt-IdNr only for Elli Mobility, missing for VW Group Charging) - B10 transfer_mechanism_check (TRANSFER-001): for each non-EU vendor in cmp_vendors, checks the DSE (privacy policy) for DPF/SCCs/BCRs/consent within a ±400-char window. Finds vendors without a named mechanism. - TH-RETENTION-002 (AI data-category differentiation) remains semantically deep, planned for specialist agents Task #18. Plausibility-LLM empty-response hardening (Task #16): - BATCH_SIZE 8 → 4, EXCERPT 4000 → 1500 chars, TIMEOUT 60 → 45s - Single retry with a halved batch when the LLM returns empty content; qwen3:30b-a3b sometimes rejects ≥6-item prompts under format='json'. If the half batch is also empty: log + skip. - The pipeline no longer spends 10 min in timeouts. GT coverage jump: 10/13 → 11/13 (85%). 4/4 HIGH ✓, 5/6 MEDIUM ✓, 2/3 LOW ✓. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 21:42:27 +02:00
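The scoring scheme in e1dadc8027 (three weighted sub-scores, a hard-fail cap, worst-of aggregation across browser profiles) can be illustrated with a minimal sketch. The weights come from the commit message; the class, the field names and the concrete cap value of 59 are assumptions, since the commit only says the score is capped to below 60%.

```python
from dataclasses import dataclass

# Weights in percent, as stated in the commit message.
WEIGHTS = {"pre_consent": 50, "reject_respect": 30, "banner_design": 20}
HARD_FAIL_CAP = 59.0  # assumption: any value below 60 would satisfy "<60%"


@dataclass
class ProfileResult:
    name: str
    pre_consent: float      # sub-score 0..100
    reject_respect: float   # sub-score 0..100
    banner_design: float    # sub-score 0..100
    hard_fail: bool = False  # True on a pre-consent or reject violation


def profile_score(r: ProfileResult) -> float:
    """Weighted sum of the three sub-scores, capped when a hard-fail violation occurred."""
    score = (
        WEIGHTS["pre_consent"] * r.pre_consent
        + WEIGHTS["reject_respect"] * r.reject_respect
        + WEIGHTS["banner_design"] * r.banner_design
    ) / 100
    return min(score, HARD_FAIL_CAP) if r.hard_fail else score


def worst_of(results: list) -> ProfileResult:
    """Worst-of aggregation: the matrix is only as good as its weakest browser profile."""
    return min(results, key=profile_score)
```

Worst-of aggregation is a deliberately pessimistic choice: a site that behaves well in three engines but sets cookies before consent in a fourth is still reported at the fourth engine's score.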
Benjamin Admin	efeef73f90	feat(audit): overlapping evidence slices for a gapless chain of evidence Instead of ONE full-page screenshot: the full page is cut with PIL into viewport-sized slices, each overlapping the previous one by overlap_px pixels. Every cookie appears in at least one slice, and at slice boundaries even in two → dedup by name eliminates the duplicates. Why not scroll-based slicing directly in Playwright? VW's cookie page uses scroll-snap / fixed-position; all viewport shots came back identical (header overlay). Cutting the full-page PNG with PIL bypasses the problem entirely. VW smoke test (32 slices): per-slice: [0, 0, 2, 5, 5, 3, 4, 7, 4, 3, 4, 5, ...] 103 raw cookies → 79 unique after dedup 14 vendor records (Google 9, Adobe family 17, etc.) Each slice has its own timestamp + SHA256 → ZIP attachment for the legal chain of evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 23:38:13 +02:00
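The overlapping-slice idea in efeef73f90 reduces to computing vertical crop boundaries and then deduplicating the per-slice results by cookie name. The sketch below covers only that arithmetic, with no image library involved; the function names and the `{"name": ...}` record shape are assumptions, and `overlap_px` is the only name taken from the commit.

```python
def slice_bounds(page_height: int, viewport_height: int, overlap_px: int) -> list:
    """Return (top, bottom) pixel ranges that cover the page, each overlapping the previous one."""
    if not 0 <= overlap_px < viewport_height:
        raise ValueError("overlap_px must be >= 0 and smaller than viewport_height")
    step = viewport_height - overlap_px
    bounds = []
    top = 0
    while True:
        bottom = min(top + viewport_height, page_height)
        bounds.append((top, bottom))
        if bottom >= page_height:
            break
        top += step
    return bounds


def dedupe_by_name(per_slice: list) -> list:
    """Keep the first record seen for each cookie name across all slices."""
    seen = {}
    for cookies in per_slice:
        for cookie in cookies:
            seen.setdefault(cookie["name"], cookie)
    return list(seen.values())
```

With Pillow, each pair could then be passed to `Image.crop((0, top, width, bottom))` on the full-page PNG. Because a row that straddles a boundary appears in two slices, the dedup step is what keeps the final count honest.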
Benjamin Admin	1784b43d72	feat(audit): Screenshot + Tesseract OCR cookie extraction as vendor source C Instead of fragile text regexes + LLM-cascade workarounds: a deterministic pipeline. consent-tester takes a full-page screenshot of the cookie policy (accepts the banner, expands accordions, burns in a timestamp). The backend runs Tesseract OCR (deu, PSM 4) over it, and an anchor-based parser extracts {name, category, purpose, duration, type} per cookie. VW smoke test: - Before (parse_flat): 60 cookies / 16 vendors - Now (Tesseract): 79 cookies / 14 vendor records (~79% GT coverage) Architecture: - consent-tester: page_screenshot.py + /capture-evidence endpoint - backend: cookie_screenshot_ocr.py with the Tesseract pipeline - pipeline: after parse_flat as complementary stage C - Dockerfile: tesseract-ocr + German language pack - requirements: pytesseract NO text correction on cookie names (awsalb stays awsalb). The timestamp in the screenshot is legal evidence of what we actually saw on the site at scan time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 23:22:35 +02:00
Benjamin Admin	8cbb513e2c	feat(audit): Phase 1 quick wins (P81 + P85 + P70 + P83) + TCF DELETE/INSERT fix P81: tests/fixtures/golden_truth/vw_de.json: GT fixture with must_find_cookies (47 VW cookies) + expected_vendors (Google, Adobe, Trade Desk, ...). Basis for future regression tests. P85: banner_screenshot_block.py + consent_scanner.py + main.py: consent-tester takes a base64 PNG screenshot (< 1.5MB) on banner detection. The backend renders it as <img src="data:..."> directly after the GF one-pager. Visual proof of 'this is what the banner looked like' for disputes with marketing or the DPO. P70: rag_provenance.py: classify_finding_provenance() classifies a finding as 'rag' (norm + source), 'mixed' (norm without source) or 'heuristic' (own interpretation). provenance_badge_html() renders small badges (✓ RAG / NORM / ⚠ HEURISTIK). The module is generic and can be hooked into any finding renderer. P83: scripts/check-rebuild-needed.sh: checks whether the BUILD_SHA deployed in the container matches the local HEAD. On mismatch, exit 1 with a 'REBUILD REQUIRED' hint. Prevents the 'old code in the container' problem that has caught us several times (frontend tabs visible, backend without the new service). TCF fix: tcf_vendor_authority.py: cookie_library has no UNIQUE index on cookie_name → ON CONFLICT was impossible. Solution: DELETE WHERE source_name='iab_tcf_v2' before the insert. Idempotent. Plus a per-vendor commit so that one failure does not block the following vendors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 08:24:46 +02:00
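The TCF fix in 8cbb513e2c (no UNIQUE index, so ON CONFLICT is unavailable; delete everything from one source, then re-insert with a commit per vendor) can be sketched with the standard-library `sqlite3` module. The table and column names `cookie_library`, `cookie_name` and `source_name` and the source value `iab_tcf_v2` appear in the commit; the function name, the two-column schema and the use of SQLite are assumptions for illustration.

```python
import sqlite3


def refresh_source(conn: sqlite3.Connection, source_name: str, vendors: dict) -> list:
    """Replace all rows of one source; return the names of vendors whose insert failed."""
    # Step 1: remove the previous import of this source, so re-running never duplicates rows.
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM cookie_library WHERE source_name = ?", (source_name,))

    # Step 2: one transaction per vendor, so a failing vendor does not block the rest.
    failed = []
    for vendor, cookie_names in vendors.items():
        try:
            with conn:
                conn.executemany(
                    "INSERT INTO cookie_library (cookie_name, source_name) VALUES (?, ?)",
                    [(name, source_name) for name in cookie_names],
                )
        except sqlite3.Error:
            failed.append(vendor)
    return failed
```

Running the function twice with the same input leaves the table in the same state, which is the idempotency the commit describes; rows from other sources are never touched.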
Benjamin Admin	cb5dad1a2f	feat(audit): A audit transparency + B table parsing + D HTML tables from the DOM Three related fixes for the VW finding (6 vendors instead of 100+): A: audit_quality_checks.py: three systemic caveats that are ALWAYS shown prominently: * banner_detected=False despite a cookie doc → HIGH 'CMP-Tool ungeladen' (CMP tool not loaded) * cookie_doc >= 30k chars but cmp_vendors < 15 → HIGH/MEDIUM 'Vendor-Liste auffaellig kurz fuer Doc-Groesse' (vendor list conspicuously short for the doc size) * submitted URL but 0/mini text → MEDIUM 'URL nicht ladbar' (URL could not be loaded) Red audit-caveat box above the GF one-pager. The GF summary says 'Audit unvollstaendig' (audit incomplete) instead of wrongly saying 'Keine kritischen Themen' (no critical topics). gf_one_pager includes audit_quality_findings in top_findings (BEFORE other findings). B: cookies_table_parser now also runs on crawled cookie-doc text (not only on user paste). When the dsi-discovery response delivers tab/pipe-separated table rows, we parse them deterministically. D: consent-tester/dsi-discovery now also extracts, in addition to the text, the <table> elements from the DOM as list[str] (tab-separated per row, at least 2 cells, at least 3 rows, max 10 tables per doc). The backend injects these as an 'html_table' cmp_payload and runs them through cookies_table_parser first → 100% deterministic vendor extraction without an LLM. VW expectation: from the 65k cookie table, 30-50 vendors are now parsed deterministically instead of 6 from the LLM cascade. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 20:21:28 +02:00
Benjamin Admin	57c0f940a2	feat(consent+report): P56-P67 Mercedes audit cycle (anti-audit, Phase G vendors, cookie-behavior validator + 5 mail-polish items) [migration-approved] P56 Anti-auditing detection as a constructive compliance finding (audit-API recommendation instead of accusation, because Mercedes legitimately blocks bots) P57 Phase G vendor_details union with cmp_vendors -> 42 providers visible P58 Anti-audit detection more robust (script-domain check + settings-specific) P59 Cookie-behavior validator (4 layers, 3-tier severity: MEDIUM=category mismatch / HIGH=purpose mismatch / CRITICAL=both=indication of intent) + Open Cookie Database (CC0) as library seed (2264 cookies) P59b Cookie behavior wired into the banner check + mail block (BUGFIX: open SessionLocal ourselves, db was not in scope in the background task) Mail polish after the Mercedes review: P63 Banner footer links are now also detected in wb7-link/role=link (the Shadow-DOM walker is label-based instead of only <a href>) P64 Re-access severity: MEDIUM instead of HIGH when a footer "Einstellungen" (settings) entry or a Mercedes-typical one exists; OEM footer detection (wb7-footer) P65 Text truncation: word boundary instead of character cut (no more "einfa" break in the immediate actions) P66 GF actions: service purpose vs cookie purpose explained explicitly (common marketing/GF confusion: "Akamai-Beschreibung" != cookie purpose per DSK-OH 2024) P67 Stirring finding with a "loss framing" explanation + Alt-vs-Neutral example, instead of only the EDPB technical term Compliance-Advisor FAQ (admin agent-core/soul): + CNIL/EDPB top fines (Google 100M, Meta 60M, Amazon 35M) + German precedents (LG Muenchen Google Fonts, EuGH Planet49, BGH I ZR 7/16) + 4 risk paths (fine/warning letter/class action/NOYB) + calculation methodology Document-generator templates: AGB-DE (142), Impressum (140), Widerrufsformular-Anlage (143), DSR-Process-Dedup (139), Cookie-Library (144). Architecture: doc_action_mappings.py + banner_dom_walkers.py + cookie_behavior_validator.py + vendor_detail_extractor.py extracted to keep to the 500-LOC caps in agent_doc_check_report.py and banner_text_checker.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 06:28:25 +02:00
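P65 in 57c0f940a2 replaces a plain character cut with truncation at a word boundary. A minimal sketch of that behaviour follows; the function name, the `...` suffix and the fallback for a single over-long word are assumptions, not the repository's implementation.

```python
def truncate_at_word(text: str, limit: int, suffix: str = "...") -> str:
    """Shorten text to at most `limit` characters (plus suffix) without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the character right after the cut is not whitespace, the cut landed mid-word:
    # drop the partial word by going back to the last space.
    if not text[limit].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:  # with no space at all (one long word), keep the hard cut as a fallback
            cut = head
    return cut.rstrip() + suffix
```

For example, cutting "Banner einfach umsetzen" at 12 characters would land inside "einfach"; the function backs up to "Banner" instead of emitting the "einfa" fragment mentioned in the commit.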
Benjamin Admin	6f16507c5f	feat(banner): P19 + P20: per-category click test + frontend drilldown P19 (consent-tester): - dp-cookieconsent (TYPO3, Safetykon pattern) added as a CMP profile: selectors #dp--cookie-statistics/marketing + a.cc-allow save button - New signal provider_details_visible: after a category toggle, Playwright checks whether visible provider/cookie detail elements appear in the banner. For dp-cookieconsent (banner without listing) always False -> HIGH violation "Category shows no provider/cookie details; the user cannot give informed consent (Art. 7(1) GDPR)" - main.py serializes provider_details_visible + cookies_set per category P20 (frontend drilldown): - Backend: the check_payloads table gains a 'banner' column (JSON); the full banner_result is persisted (previously only in memory). The ALTER TABLE migration is idempotent. - New endpoint GET /api/compliance/agent/banner/<check_id>: returns quality score, phases, category tests, banner checks, all 46 structured_checks. - Frontend: BannerTab in /sdk/agent/audit/<id> with quality cards, 3-phase cookie table, per-category listing (with the P19 signal red/green), banner violations + legal bases, 46-check drilldown filterable by severity. - Tab switcher in page.tsx extended with "Cookie-Banner-Analyse". - Bonus: 2 old route.ts files switched to Next.js 15 Promise params (build fix). Plus: the critical-findings block uses provider_details_visible as the primary signal instead of only the tracking_services count. Smoke test Safetykon: 4 critical findings in the mail, the banner endpoint returns 46 checks + 3 phases + 2 categories with provider_details_visible=False. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:31:13 +02:00
Benjamin Admin	ea4dbb223f	feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1) When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now extract structured vendor records, probe their opt-out + privacy URLs, score each vendor (0-100), and append a 'VVT-Vorschlag' table to the compliance email — one row per vendor, sortable by compliance score. consent-tester: - DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers - DSIDiscoveryResponse: new cmp_payloads field - discover_dsi_documents sets cmp_payloads from cmp_capture - cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning list[VendorRecord] backend: - _fetch_text() now returns (text, cmp_payloads) tuple - doc_entries store cmp_payloads per doc (mostly cookie) - _autodiscover_missing forwards homepage payloads to the cookie entry - New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic schemas; dedupes vendors across multiple payloads - cookie_link_validator.py extended with validate_vendor_urls(vendors) and score_vendors(vendors) — 0-100 score per vendor based on name, purpose, country, opt-out reachable, privacy URL reachable, cookies with names + expiry - agent_doc_check_extras.build_vvt_table_html: renders the table - Route appends VVT HTML after the provider list, before the document-by-document report - Response JSON gains cmp_vendors for future frontend rendering Example for BMW: ~30 ePaaS providers → table with Name \| Kategorie \| Sitz \| Cookies \| Opt-Out (✓/✗) \| Privacy (✓/✗) \| Score. Sorted by score ascending so the worst-compliant vendors are at the top.	2026-05-17 09:50:11 +02:00
Benjamin Admin	5f2da1de88	feat(consent-tester): Phase E — self-improving CMP library cmp_discovery_log.py: - sqlite log at /data/cmp_discoveries.db: every LLM-discovered CMP pattern recorded with domain, strategy, value, sample text - Auto-promote (user-chosen 'voll automatisch' mode): when LLM returns strategy=url AND extracted text >= 800 words, write a new module /data/auto_cmp/auto_<slug>.py with derived regex matcher + reconstruct - record_discovery() called from dsi_discovery._try_llm_cascade on success cmp_library/_registry.py: - Loads both hand-written modules from services/cmp_library/ AND auto-promoted modules from /data/auto_cmp/ (CMP_AUTO_DIR env) - Auto modules use importlib.util.spec_from_file_location, no package install needed; restart consent-tester to pick up new ones dsi_discovery.py: - _try_llm_cascade now calls record_discovery() on every successful LLM analysis (cached AND fresh) main.py: - GET /cmp-discoveries — admin endpoint listing all logged discoveries - DELETE /cmp-discoveries/{id} — rollback (unlinks auto_*.py) This closes the self-improving loop: first encounter with a new CMP fires the LLM (cost) → discovery is auto-promoted → all future runs against the same vendor pattern hit Phase B (Named CMP) at <50ms with no LLM call.	2026-05-16 23:09:23 +02:00
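The registry behaviour described in 5f2da1de88 (auto-promoted `auto_*.py` files loaded from a directory with `importlib.util.spec_from_file_location`, without installing a package) might be sketched as follows. The function name and the decision to skip modules that fail to import are assumptions; the commit only names the importlib call and the `auto_` file prefix.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_auto_modules(auto_dir: Path) -> dict:
    """Import every auto_*.py file in auto_dir and return the modules keyed by file stem."""
    modules = {}
    for path in sorted(auto_dir.glob("auto_*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module: ModuleType = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception:
            # Assumption: a broken generated module is skipped rather than crashing startup.
            continue
        modules[path.stem] = module
    return modules
```

Because the modules are only read at load time, a newly promoted file becomes active on the next service start, which matches the restart note in the commit.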
Benjamin Admin	5e317d2f0f	fix: text extraction 50k char limit was root cause of all Spiegel FNs ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars. Spiegel DSI has 107,720 chars (13,705 words), so only 47% was extracted. DSB (data protection officer), Art. 77, and Betroffenenrechte (data subject rights) were all in the truncated portion. Fixes: 1. Raise text limit from 50k to 200k chars in API response + discovery 2. click_button(): add iframe fallback for Sourcepoint/Quantcast 3. dsi_helpers: iterate ALL page.frames for consent buttons 4. Profiler: only check impressum (not full text) for regulated professions, and "rechtsanwalt" must be in first 500 chars (company description) 5. GT: save full Spiegel DSI text (13,705 words) as reference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 15:22:38 +02:00
Benjamin Admin	c867478791	feat(tcf-vendors): GVL cache + vendor extraction + VVT mapping Phase 1-2 of the closed quality loop: - GVL cache (consent-tester/services/gvl_cache.py): downloads and caches IAB Global Vendor List with 24h TTL, resolves vendor IDs to names, purposes, policy URLs, retention, country - Vendor extraction (consent_interceptor.py): extract_tcf_vendors() reads __tcfapi after accept phase, resolves via GVL - Scan response: tcf_vendors field added to /scan endpoint - VVT mapper (vendor_vvt_mapper.py): maps TCF vendors to VVT format with purpose labels, legal basis (Rechtsgrundlage), third-country (Drittland) detection - Vendor cross-check (banner_cookie_cross_check.py): checks all TCF vendors against DSI text for missing vendors and undocumented transfers - Compliance check integrates Step 3d: TCF vendors vs DSI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 18:18:50 +02:00
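The 24h TTL cache for the Global Vendor List mentioned in c867478791 follows a common pattern: refetch only when the cached copy is older than the TTL. The sketch below is generic; the class name, the injected `fetch` callable and the injectable clock are assumptions, and the real gvl_cache.py may persist to disk or handle download errors differently.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache a single fetched value and refresh it once it is older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float = 24 * 60 * 60,  # 24h, as stated in the commit
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._fetched_at = 0.0
        self._has_value = False

    def get(self) -> Any:
        now = self._clock()
        if not self._has_value or now - self._fetched_at >= self._ttl:
            # Only overwrite the cached value after a successful fetch.
            self._value = self._fetch()
            self._fetched_at = now
            self._has_value = True
        return self._value
```

Injecting the clock keeps the expiry logic testable without sleeping; in production the default `time.monotonic` is unaffected by wall-clock adjustments.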
Benjamin Admin	4bfb438c92	feat: 4 banner check upgrades — 30 CMPs, stealth, Shadow DOM, categories 1. 30 CMP selectors (was 10): Added Sourcepoint, Iubenda, Complianz, CookieFirst, HubSpot, Osano, Piwik PRO, Cookie Consent (Insites), Axeptio, Termly, CookieScript, Civic UK, GDPR Cookie Compliance, CookieHub, Ketch, Admiral, Sibbo, Evidon, LiveRamp, Adsimple. Plus improved generic fallback: role=dialog, aria-label, data-* attrs. 2. Playwright stealth mode: playwright-stealth against bot detection. Removes WebDriver flag, simulates plugins, realistic viewport/locale. Launch args: --disable-blink-features=AutomationControlled. 3. Shadow DOM: Recursive JS-based search through shadowRoot elements for consent banners. Fallback click via page.evaluate() when normal Playwright selectors can't penetrate Shadow DOM. 4. Category selection UI: User can choose which cookie categories to test (Notwendig, Statistik, Marketing, Funktional, Praeferenzen). Pill-style checkboxes in BannerCheckTab, forwarded through API chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-09 08:42:30 +02:00
Benjamin Admin	686834cea0	feat: 4 remaining tasks — EU institutions, banner integration, JS-sites, Caritas fixes Build + Deploy / build-admin-compliance (push) Successful in 8s Details Build + Deploy / build-backend-compliance (push) Successful in 8s Details Build + Deploy / build-ai-sdk (push) Failing after 36s Details Build + Deploy / build-developer-portal (push) Successful in 8s Details Build + Deploy / build-tts (push) Successful in 7s Details Build + Deploy / build-document-crawler (push) Successful in 7s Details Build + Deploy / build-dsms-gateway (push) Successful in 8s Details Build + Deploy / build-dsms-node (push) Successful in 8s Details CI / branch-name (push) Has been skipped Details Build + Deploy / trigger-orca (push) Has been skipped Details CI / guardrail-integrity (push) Has been skipped Details CI / loc-budget (push) Failing after 17s Details CI / secret-scan (push) Has been skipped Details CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / nodejs-build (push) Successful in 3m14s Details CI / dep-audit (push) Has been skipped Details CI / sbom-scan (push) Has been skipped Details CI / test-go (push) Failing after 46s Details CI / test-python-backend (push) Successful in 43s Details CI / test-python-document-crawler (push) Successful in 29s Details CI / test-python-dsms-gateway (push) Successful in 30s Details CI / validate-canonical-controls (push) Successful in 16s Details 1. EU Institution Checks (Verordnung 2018/1725): - New doc_type "eu_institution" with 9 L1 + 15 L2 checks - Both German + English patterns (EU institutions are multilingual) - Auto-detection via "2018/1725", "EDSB", "EDPS" keywords - Correct article references (Art. 15 instead of 13, Art. 5 instead of 6) 2. 
Banner Check Integration: - banner_runner.py maps scan results to 36 L1/L2 structured checks - BannerCheckTab shows hierarchical ChecklistView with hints - 3-phase summary (cookies/scripts before/after consent) - /scan endpoint now includes structured_checks in response 3. JS-heavy Website Fixes (dm, Zalando, HWK): - dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback) - try_dismiss_consent_banner before text extraction - PDF redirect detection (dm.de redirects to GCS PDF) 4. Caritas False Positive Fixes: - Phone regex allows parentheses: +49 (0)761 → now matches - "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21 - Church authorities: "Katholisches Datenschutzzentrum" recognized Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-08 01:10:10 +02:00
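The phone-number fix in commit 686834cea0 (a number written as "+49 (0)761 ..." should now match) can be illustrated with a small sketch. The pattern and the helper name `find_phone_numbers` below are assumptions made for illustration; the actual regex used in the repository is not shown in the commit message.

```python
import re

# Hypothetical pattern, NOT the project's actual regex: it only illustrates
# how an optional "(0)" trunk-digit marker can be tolerated after the
# international prefix.
PHONE_RE = re.compile(
    r"(?:\+|00)\d{1,3}"          # international prefix, e.g. +49 or 0049
    r"[\s\-/]*(?:\(0\))?"        # optional "(0)" written in parentheses
    r"[\s\-/]*\d{2,5}"           # area code, e.g. 761
    r"(?:[\s\-/]*\d{2,}){1,4}"   # one to four groups of subscriber digits
)


def find_phone_numbers(text: str) -> list[str]:
    """Return all substrings of `text` that look like international numbers."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]


if __name__ == "__main__":
    print(find_phone_numbers("Tel.: +49 (0)761 200-100"))
    print(find_phone_numbers("Art. 13 DSGVO, Verordnung 2018/1725"))
```

Requiring an international prefix keeps article and regulation numbers from being picked up as phone numbers; a production check would likely also need to handle national formats.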
Benjamin Admin	a349111a01	fix: Raise full_text limit 10K→50K + combine all DSI texts for checks Two fixes: 1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars (IHK Internetangebot has ~50K chars, Beschwerderecht was after 10K cutoff) 2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts for mandatory content checking. Previously only used first 8K chars from one source, missing Verantwortlicher/DSB that were in DSI documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 16:03:56 +02:00
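The two changes in commit a349111a01 (a 50,000-character cap on full_text, and a dse_text built from the Playwright text plus all DSI discovery texts) can be sketched as follows. The function names, the `full_text` dictionary key as used here, and the blank-line separator are assumptions for illustration; the commit message does not show the actual implementation.

```python
FULL_TEXT_LIMIT = 50_000  # per-document cap named in the commit message


def truncate_full_text(text: str, limit: int = FULL_TEXT_LIMIT) -> str:
    """Cut a document's text to at most `limit` characters."""
    return text[:limit]


def combine_dse_text(playwright_text: str, dsi_documents: list[dict]) -> str:
    """Join the Playwright-extracted text with every DSI document's full_text.

    Hypothetical helper: empty or missing texts are skipped, so a failed
    Playwright extraction still yields the DSI discovery texts.
    """
    parts = [playwright_text] if playwright_text else []
    for doc in dsi_documents:
        full_text = doc.get("full_text") or ""
        if full_text:
            parts.append(full_text)
    return "\n\n".join(parts)


if __name__ == "__main__":
    docs = [{"full_text": "Verantwortlicher: ..."}, {"title": "no text"}]
    print(combine_dse_text("Datenschutzerklaerung ...", docs))
```

Combining all sources before running the mandatory-content checks means a required item (for example the Verantwortlicher or the DSB) counts as present if it appears in any one of the discovered documents.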
Benjamin Admin	a3f7fb93f4	fix: Scan quality — raise page limit, use full DSI text for checks Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50 Bug 2: DSI documents checked against text_preview (500 chars) — now uses full_text (10,000 chars) for Art. 13 mandatory field checks Bug 3: DSE text not found when Playwright misses DSE page — now falls back to DSI Discovery full_text as second source Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 23:51:03 +02:00
Benjamin Admin	4e63a6050d	feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie) New service: dsi_discovery.py — finds ALL legal documents on any website: - Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS - Structure-agnostic: accordions, sidebars, footers, inline links, tabs - Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links - Language-agnostic: 26 EU/EEA languages with document-type keywords Document types discovered: - Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO) - AGB / Terms of Service / Nutzungsbedingungen - Widerrufsbelehrung / Right of Withdrawal (§355 BGB) - Cookie-Richtlinie / Cookie Policy - All cross-domain variants (e.g. help.instagram.com from instagram.com) API: POST /dsi-discovery { url, max_documents } Returns: list of documents with title, url, language, type, word_count, text_preview Features: - Expands all accordions, details, tabs, dropdowns before scanning - Follows cross-domain links (same registrable domain) - Re-expands after navigation back to source page - Handles anchor links (#sections) separately from full pages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 21:56:55 +02:00

19 Commits