Nine files had conflict markers from the branch merge; all were resolved in
favor of the feature branch version. Also split agent_scan_routes.py
(534→367 LOC) by extracting the Pydantic models into agent_scan_models.py.
[guardrail-change]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars
(the IHK Internetangebot has ~50K chars; the Beschwerderecht section fell
beyond the 10K cutoff)
2. Backend: dse_text now combines the Playwright HTML with ALL DSI discovery
texts for the mandatory-content check. Previously only the first 8K chars
of a single source were used, missing the Verantwortlicher/DSB details
that were in the DSI documents.
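The combination of sources can be sketched as below; the names
(`build_dse_text`, `FULL_TEXT_LIMIT`, the `full_text` key) are illustrative,
not the actual backend identifiers.

```python
FULL_TEXT_LIMIT = 50_000  # raised from 10_000 in this fix

def build_dse_text(playwright_text: str, dsi_documents: list[dict]) -> str:
    """Combine the Playwright-rendered DSE page with every DSI discovery
    full_text, so the mandatory-content check (Verantwortlicher, DSB, ...)
    sees all sources instead of the first 8K chars of one of them."""
    parts = [playwright_text]
    parts += [doc.get("full_text", "")[:FULL_TEXT_LIMIT] for doc in dsi_documents]
    # Drop empty sources and join into one corpus for the checker.
    return "\n\n".join(p for p in parts if p)
```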
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DSI Dedup (consent-tester):
- Only H1/H2 headings count as documents (not H3/H4 sub-sections)
- Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of
parent document's full text, not separate documents
- Reduces IHK result from 30 to ~11 real documents
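A minimal sketch of the dedup rule, assuming sections arrive as
`(heading_level, title, text)` tuples (the real data model may differ):

```python
def dedupe_documents(sections: list[tuple[int, str, str]]) -> list[dict]:
    """Only H1/H2 headings start a new document; H3/H4 sub-section text is
    folded into the current parent document's full text."""
    docs: list[dict] = []
    for level, title, text in sections:
        if level <= 2 or not docs:
            docs.append({"title": title, "full_text": text})
        else:
            # Sub-section (e.g. Cookies, Betroffenenrechte): append to parent.
            docs[-1]["full_text"] += "\n" + text
    return docs
```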
Backend (agent_scan_routes):
- ScanFinding gets doc_title field linking each finding to its document
- doc_title set when creating DSI findings for document attribution
Frontend (ScanResult.tsx):
- 3 sections: Services table, Document cards, General findings
- Documents: expandable cards with completeness bar (green/yellow/red)
- Findings grouped under their parent document
- Each card shows: title, word count, findings count, % completeness
- Findings without doc_title go to "Allgemeine Findings" section
Email Summary (agent_scan_helpers):
- Findings listed under their parent document
- General findings in separate section
- No more flat mixed list
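The grouping used by both the frontend and the email summary can be sketched
like this (function and field names other than `doc_title` are illustrative):

```python
from collections import defaultdict

def group_findings(findings: list[dict]) -> tuple[dict, list]:
    """Group findings under their parent document via doc_title; findings
    without doc_title go to the general ("Allgemeine Findings") list."""
    by_doc: dict[str, list] = defaultdict(list)
    general: list[dict] = []
    for f in findings:
        if f.get("doc_title"):
            by_doc[f["doc_title"]].append(f)
        else:
            general.append(f)
    return dict(by_doc), general
```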
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fundamental fix: scans now run asynchronously with progress polling.
Backend:
- POST /scan starts background task, returns scan_id immediately
- GET /scan/{scan_id} returns status + progress + result when done
- 7 progress steps shown: website scan, DSI discovery, DSE analysis,
SOLL/IST (target/actual) comparison, corrections, report, email
- In-memory job store (dict with scan_id → status/result)
- No timeout limits on scan duration
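The backend flow above can be sketched with a plain dict and a background
thread (the real endpoint presumably uses FastAPI's background tasks; names
like `start_scan` and `run_step` are illustrative):

```python
import threading
import uuid

jobs: dict[str, dict] = {}  # scan_id -> {"status", "progress", "result"}

STEPS = ["Website scan", "DSI discovery", "DSE analysis",
         "SOLL/IST comparison", "corrections", "report", "email"]

def start_scan(run_step) -> str:
    """POST /scan sketch: register a job, kick off the work in the
    background, and return the scan_id immediately."""
    scan_id = uuid.uuid4().hex
    jobs[scan_id] = {"status": "running", "progress": STEPS[0], "result": None}

    def worker():
        for step in STEPS:
            jobs[scan_id]["progress"] = step  # visible via GET /scan/{scan_id}
            run_step(step)
        jobs[scan_id].update(status="done", result="report")

    threading.Thread(target=worker, daemon=True).start()
    return scan_id
```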
Frontend:
- POST starts scan, receives scan_id
- Polls GET every 5 seconds (max 120 attempts = 10 min)
- Shows live progress message during scan
- Displays result when completed, error when failed
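The polling loop (implemented in the frontend) sketched in Python, with
`get_status` standing in for the HTTP GET:

```python
import time

def poll_scan(get_status, scan_id: str, interval: float = 5.0,
              max_attempts: int = 120):
    """Poll GET /scan/{scan_id} every 5 s, up to 120 attempts (10 min)."""
    for _ in range(max_attempts):
        job = get_status(scan_id)
        if job["status"] == "done":
            return job["result"]
        if job["status"] == "failed":
            raise RuntimeError(job.get("error", "scan failed"))
        time.sleep(interval)
    raise TimeoutError("scan did not finish within 10 minutes")
```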
Proxy:
- POST timeout reduced to 30s (just starts the job)
- GET timeout 10s (just status check)
- No more 504/connection-dropped errors
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1: max_pages was hardcoded to 15 in the backend call; raised to 50.
Bug 2: DSI documents were checked against text_preview (500 chars); now uses
full_text (10,000 chars) for the Art. 13 mandatory-field checks.
Bug 3: DSE text was not found when Playwright missed the DSE page; now falls
back to the DSI Discovery full_text as a second source.
Bug 4: the 120s backend timeout was too short for 50 pages; raised to 300s.
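The Bug 3 fallback can be sketched as follows (`resolve_dse_text` is a
hypothetical name; 10,000 chars is the full_text limit at this point):

```python
def resolve_dse_text(playwright_text, dsi_docs: list[dict]) -> str:
    """Prefer the Playwright-rendered DSE page; when Playwright misses it,
    fall back to the longest DSI Discovery full_text as a second source."""
    if playwright_text:
        return playwright_text[:10_000]
    candidates = [d.get("full_text", "") for d in dsi_docs]
    return max(candidates, default="", key=len)[:10_000]
```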
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NameError: name 're' is not defined at line 146: the import was accidentally
removed when the helper functions were extracted to agent_scan_helpers.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Dockerfile: install Playwright as appuser (not root) so the chromium
binary is accessible at runtime; installing as root caused a 500 error.
2. DSE service matching: text-search fallback when LLM extraction fails.
If "etracker" appears in the DSE text, the service is marked as documented
even when the LLM fails to parse the service list.
3. CMP skip: consent managers in category "cmp" are now skipped (previously
only category "other" with id "cmp").
NOT DEPLOYED: the RAG pipeline is running.
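The text-search fallback (fix 2) can be sketched like this; the function
name and the `aliases` field are illustrative:

```python
def service_documented(dse_text: str, service: dict) -> bool:
    """Fallback when LLM extraction fails: a plain case-insensitive search
    for the service name (or an alias) in the DSE text still counts the
    service as documented."""
    haystack = dse_text.lower()
    names = [service["name"], *service.get("aliases", [])]
    return any(name.lower() in haystack for name in names)
```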
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New /website-scan endpoint in consent-tester service:
- Real browser renders JavaScript (finds dynamic content)
- Clicks navigation menus (discovers hidden sub-pages like IHK DSB page)
- Follows links within DSE to find regional privacy policies
- Collects rendered HTML for each page (after JS execution)
Backend integration:
- agent_scan_routes tries Playwright first, falls back to httpx
- DSE text and HTML extracted from Playwright-rendered pages
- Service detection runs on rendered HTML (catches JS-loaded scripts)
Also fixes:
- GA regex: G-[A-Z0-9]{8,12} prevents CSS class false positives
- etracker added to service registry
- External page scanning blocked (same-domain only)
- CSS/JS/image files excluded from page list
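The tightened GA regex from the list above, wrapped in a small helper for
illustration (`find_ga_ids` is a hypothetical name):

```python
import re

# Measurement IDs have 8-12 chars after "G-"; the length bound and word
# boundaries avoid matching short CSS class names like "g-col".
GA_ID = re.compile(r"\bG-[A-Z0-9]{8,12}\b")

def find_ga_ids(html: str) -> list[str]:
    """Return all Google Analytics measurement IDs found in rendered HTML."""
    return GA_ID.findall(html)
```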
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- dse_parser.py: HTML → structured sections (heading, number, content, parent).
Uses the h1-h4 heading hierarchy with a regex fallback.
- dse_matcher.py: matches detected services against DSE sections.
Exact name → provider → category matching, with an insertion-point suggestion.
- agent_scan_routes: TextReference model in findings (original text,
section, paragraph, correction type, insert_after).
Enables messages like: "Google Analytics not found in DSE, insert after
Section 2.4 Cookies und Tracking"
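The h1-h4 section split with parent attribution can be sketched as below;
this is a regex-only simplification, not the actual dse_parser.py code:

```python
import re

HEADING = re.compile(r"<h([1-4])[^>]*>(.*?)</h\1>", re.I | re.S)

def parse_sections(html: str) -> list[dict]:
    """Each h1-h4 heading opens a section whose content runs to the next
    heading; the parent is the nearest preceding section of a lower level."""
    sections: list[dict] = []
    matches = list(HEADING.finditer(html))
    for i, m in enumerate(matches):
        level = int(m.group(1))
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        parent = next((s["heading"] for s in reversed(sections)
                       if s["level"] < level), None)
        sections.append({
            "level": level,
            "heading": re.sub(r"<[^>]+>", "", m.group(2)).strip(),
            "content": html[m.end():end].strip(),
            "parent": parent,
        })
    return sections
```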
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Email now lists all scanned URLs with a checkmark/cross status.
Frontend shows a collapsible "X Seiten gescannt — Details anzeigen"
("X pages scanned — show details") section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SDK's LLM chat returned empty content due to Qwen's think mode. A direct
Ollama /api/generate call with stream:false returns the full response,
including the think tags, which we strip.
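A sketch of the workaround using only the stdlib; the endpoint URL and model
name are placeholders for whatever the deployment actually uses:

```python
import json
import re
import urllib.request

THINK = re.compile(r"<think>.*?</think>", re.S)

def strip_think(text: str) -> str:
    """Remove Qwen's <think>...</think> block from a model response."""
    return THINK.sub("", text).strip()

def generate(prompt: str, model: str = "qwen3") -> str:
    """Call Ollama's /api/generate directly with stream:false, so the whole
    response arrives as one JSON object, then strip the think block."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # placeholder host
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return strip_think(json.load(resp)["response"])
```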
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>