breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	5e317d2f0f	fix: text extraction 50k char limit was root cause of all Spiegel FNs ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars. Spiegel DSI has 107,720 chars (13,705 words); only 47% was extracted. DSB, Art. 77, Betroffenenrechte were all in the truncated portion. Fixes: 1. Raise text limit from 50k to 200k chars in API response + discovery 2. click_button(): add iframe fallback for Sourcepoint/Quantcast 3. dsi_helpers: iterate ALL page.frames for consent buttons 4. Profiler: only check impressum (not full text) for regulated professions, and "rechtsanwalt" must be in first 500 chars (company description) 5. GT: save full Spiegel DSI text (13,705 words) as reference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 15:22:38 +02:00
Benjamin Admin	c702260ec1	fix: 5 regex bugs + text extraction scroll + GT update Root cause: Spiegel DSI text was truncated (lazy-loading); the rights/DSB/complaints sections at the bottom were never extracted. Fixes: 1. Text extraction: scroll to bottom before innerText (dsi_discovery.py) 2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern 3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces) 4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc. 5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media rows from sections found in the DSI text Ground Truth 06-spiegel.md fully rewritten with verified data from live website: 3 L1 False Negatives identified (DSB, Beschwerderecht, Betroffenenrechte all present on website but not in extracted text). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 01:20:55 +02:00
Benjamin Admin	baca0f6b80	docs: add existing use case context to compiler instruction Three existing approaches (IACE deterministic, Doc-Check LLM-based, gap analysis rule-based) and what the compiler adopts from each. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 12:26:33 +02:00
Benjamin Admin	407a9503e4	fix(profiler): fix B2G false positive + add consulting/manufacturing - Remove generic B2G keywords (behörde, amt, öffentlich) that match in every DSI due to "Aufsichtsbehörde", "Amtsgericht", "veröffentlichen" - Remove "server" from it_services (too generic, appears in every DSI) - Add consulting, manufacturing, media industries - Add B2B fallback for GmbH/AG without B2C signals - Add 10 ground truth files for unified compliance check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 12:20:44 +02:00
Benjamin Admin	1fd7ea6139	docs: Use-Case Compiler instruction for next session Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 12:13:33 +02:00
Benjamin Admin	36c6101b91	Merge feat/zeroclaw-compliance-agent into main Brings all compliance doc-check features: - 162 regex checks + 1874 Master Controls - LLM-agnostic agent with tool calling - Banner check (46 checks, 30 CMPs, stealth, Shadow DOM) - Impressum check (24 checks) - Deep consent verification (DataLayer, GCM, TCF) - CMP E2E tests (39 tests) - HTML email reports, FAQ, persistent history Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 11:44:20 +02:00
Benjamin Admin	445a2f7c7c	docs: instruction for RAG pipeline: document upload backend Full specification: - DB schema (iace_uploaded_documents) - 3 Go endpoints (POST/GET/DELETE) - Async PDF → text → chunks → embed → Qdrant pipeline - Tenant-isolated collections (bp_norms_tenant_{id}) - Multi-collection RAG search - Frontend API contract - Security (tenant isolation, file validation) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-09 08:09:40 +02:00
Benjamin Admin	55e44df256	docs: instruction for RAG pipeline: TRBS + TRGS + ASR ingest ~120 public-domain Technical Rules (official notices under §5 UrhG) from baua.de for the RAG pipeline. Crawling + embedding + Qdrant import. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-09 07:57:40 +02:00
Benjamin Admin	1b5c6bd340	docs: Batch test results for 9 websites + EUIPO analysis Tested BMW, Stadt Koeln, BfDI, Sparkasse, Caritas, TUEV Sued, Spiegel, ETO Gruppe, EUIPO. Key findings: - Stadt Koeln + ETO Gruppe best (95% correctness) - BMW, Sparkasse, Spiegel genuinely deficient (verified) - EUIPO uses EU Regulation 2018/1725, not GDPR, so it needs a separate checklist - ~0-2 false positives per website after LLM verification 7 regex fixes emerged from batch testing (soft hyphens, word insertions, numbered headings, German section names, etc.) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-08 00:41:28 +02:00
Benjamin Admin	313ee5073b	plan: Banner-Check upgrade to L1/L2 with expert hints Detailed plan for upgrading the 22 existing Playwright-based banner checks to the same quality level as the document checks: - 6 L1 + 30 L2 hierarchical checks - Expert hints with EuGH/CNIL/DSK/EDPB references - 3-phase evidence (before consent, after reject, after accept) - Dark pattern detection (button size, color, click asymmetry) - Estimated 3-4h implementation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-07 17:48:11 +02:00
Benjamin Admin	fa4fd87102	fix: 7 regex bugs from IHK Konstanz ground truth analysis Fixes based on manual verification of all 30 failed checks: 1. Cookie table: recognize "folgende cookies" + column headers as text 2. Cookie names: add JSESSIONID, cookieinfo, et_id, BT_* patterns 3. Essential justified: match "sitzung zuordnen", "betrieb der website" 4. Social bookmarks: recognize as 2-click alternative 5. DSFA plural: "kanaelen" now matches alongside "kanal" 6. Section splitter: skip-headings no longer lose subsequent text (Risikoabwaegung section was cut from DSFA, losing risk scores) 7. Cookie legal basis: accept Art. 6(1)(f) in cookie context Reduces false positives from 7 to ~1-2 for IHK Konstanz test case. Ground truth table: zeroclaw/docs/ground-truth-ihk-konstanz.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-07 14:51:09 +02:00
Benjamin Admin	e19d9ca532	docs: Master Controls spec for document checker — 80-100 specific check criteria Detailed requirements for the pipeline session: - Binary yes/no check_question per control - Concrete pass_criteria + fail_criteria (not 'check completeness') - correction_template from our Template Generator - 8 document types: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept - ~80-100 total controls (not 25K generic ones) - Examples for DSI, Cookie, Impressum with exact field expectations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-07 07:53:36 +02:00
Benjamin Admin	13c5880f51	fix: Restrict sub-section detection to genuinely separate document types Only Cookie and Widerruf sections are checked as separate documents. Social Media, DSFA, Betroffenenrechte, Dienste von Drittanbietern are part of the parent DSI and no longer generate false findings. Added PLAN-rag-document-check.md for Phase 2: - RAG-based checks with document-type-specific Controls - DSFA checklist (Art. 35 + Landes-Listen) - AVV checklist (Art. 28) - Reference detection (sub-doc → parent doc) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 11:02:36 +02:00
Benjamin Admin	4f92e5056c	docs: Complete agent architecture reference for reuse in other agents Full documentation of the ZeroClaw compliance agent architecture: - System overview diagram (Frontend → Backend → LLM → Playwright) - Detailed request flow for Website-Scan mode (7 steps) - All 5 components: Frontend, Backend, Consent-Tester, Ollama, Soul Files - 20 banner checks across 3 files - LLM call patterns (/api/generate + /api/chat + think-mode stripping) - Blueprint for creating new agents (5 steps: Soul, Route, Page, Proxy, Docker) - Timeouts, environment variables, file reference with LOC counts Designed as reusable blueprint for Sales, HR, Finance, or other agents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 22:26:56 +02:00
Benjamin Admin	0837680e03	docs: Add EUIPO Unblu Chat findings (3 new, total 10 findings) Finding 8: Unblu chat consent links to third-party DSE (unblu.com) instead of EUIPO's own privacy policy (Art. 13 DSGVO) Finding 9: Cookie consent delegated to third-party terms without own legal basis (§25 TDDDG) Finding 10: Click-outside-dialog = accept — accidental click counts as consent (Planet49, Art. 7(1) DSGVO) New planned agent checks: - Drittanbieter-DSE-Check: detect consent linking to external DSE - Modal-Dismiss-Check: Playwright test if backdrop click = consent - Dark-Pattern-Sprache: detect "muessen/erforderlich" for non-essential cookies Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 07:48:35 +02:00
Benjamin Admin	7ebd25c59c	docs: Add EUIPO registration as compliance agent reference test case Real-world case from EU authority (EUIPO) with 7 findings: - Grammatically broken consent text (bad DE translation) - Coupling prohibition violation (login = consent, Art. 7(4) DSGVO) - No reject button, no granularity, no active opt-in - Broken link layout (DSE/ToS links appear after submit button) - Includes correction suggestion and planned agent check implementations - Pattern: WSO2 Identity Server default templates (systemic issue) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 07:28:32 +02:00
Benjamin Admin	e318215cc5	refactor: split agent_analyze_routes (420→309 LOC) + agent docs + migration - Extracted website compliance checks + helpers to website_compliance_checks.py - Created agent documentation (zeroclaw/docs/compliance-agent.md) - DB migration 086 executed (compliance_agent_scans table) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 08:22:52 +02:00
Benjamin Admin	891fc5bea0	docs: add keyword-based checker problem to migration instruction mandatory_content_checker.py keywords break with alternative formulations. Solution: LLM-based check per mandatory field (9 calls, parallelizable). For other session to implement alongside Dict→Control migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 18:18:45 +02:00
Benjamin Admin	2c9cea74e3	docs: instruction for hardcoded knowledge → Control Library migration 6 files with hardcoded legal knowledge identified. Review deadline 2026-07-01. legal_basis_validator.py marked with warning log on every use. Instruction file for other session to execute migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 16:33:48 +02:00
Benjamin Admin	e35db90232	feat: Phase 5 — DB persistence for scan results + Phase 10 in plan - Migration 086: compliance_agent_scans table (findings, services, corrections) - agent_history_routes.py: POST /scans (save), GET /scans (list), GET /scans/{id} - Scan results survive page reloads and can be reviewed later - Phase 10 (Playwright website scanner) added to product roadmap Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 15:17:51 +02:00
Benjamin Admin	5c5054f740	feat: Phase 3 — registry 82 services, mandatory checker, SDK flow step - website_scanner.py: imports from master service_registry.py (82 services) - agent_scan_routes.py: mandatory content checks (documents + DSE sections) - steps-betrieb.ts: Compliance Agent step added to SDK Flow (seq 5000) - PLAN: Phase 9 (Authenticated Testing) added to product roadmap Mandatory checks know what MUST be there: - Documents: Impressum, DSE, AGB, Widerrufsbelehrung - DSE content: 9 Art. 13 DSGVO fields (DSB, Speicherdauer, etc.) - Impressum content: 5 §5 TMG fields (GF, HRB, USt-ID, etc.) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 15:04:44 +02:00
Benjamin Admin	0266dfd011	docs: Compliance Agent product roadmap — 8 phases, PoC to production P0: UCCA score calibration + control relevance filter P1: Headless browser consent test (before/after cookie banner) + 80+ services P2: Scan acceleration, DB persistence, PDF export P3: Recurring scans, multi-website comparison Investor demo scenario included. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 11:32:27 +02:00
Benjamin Admin	d0dc284cd5	docs: add Phase 5 (Payment/Marketing checks) + Phase 6 (auto-corrections) - Payment: Stripe, PayPal, Klarna (Art. 22 credit check!), Adyen, Mollie - Marketing: GA, Meta Pixel, TikTok, Hotjar, Clarity, newsletter providers - Each service: DSE mention check, consent check, third-country check - Pre-launch mode: agent generates ready-to-insert DSE text blocks via Qwen - Correction types: missing service, wrong legal basis, outdated entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:26:29 +02:00
Benjamin Admin	24fb1e14e0	docs: add Phase 4b: target/actual (SOLL/IST) service-provider comparison (DSE vs. website) Automated comparison: services mentioned in privacy policy vs. actually embedded on website. Three categories: undocumented (Art. 13 violation), outdated (cleanup), correctly documented (check third country only). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:20:12 +02:00
Benjamin Admin	6aa753146f	docs: extend plan with third-party service detection + third-country registry 80+ services: CDN (Cloudflare, Akamai), Fonts (Google Fonts LG München), Tracking (GA, Meta Pixel, Matomo), Captcha, Maps, Video, Payment. Static registry with country, EU adequacy, consent requirement, legal ref. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:18:43 +02:00
Benjamin Admin	acd2d5f944	docs: add Phase 4 (Website-Scan) to Control Relevance Filter plan Multi-page crawl: scan 5-10 strategic pages (start, footer links) for chatbot widgets, AI text mentions, and tracking services. Feed results into relevance filter to reduce false positives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:11:19 +02:00
Benjamin Admin	2a6f526c88	docs: plan for Control Relevance Filter (3-stage: rules, LLM, follow-up) Addresses false-positive controls like C_TRANSPARENCY being recommended when no AI usage is evident. Plan for separate implementation session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 14:32:25 +02:00
Benjamin Admin	0c0dd4e3a6	feat: ZeroClaw compliance agent — document analysis + role assignment + email Add autonomous compliance agent that fetches web documents (cookie banners, privacy policies), classifies them via Qwen/Ollama, assesses DSGVO compliance, assigns to the responsible role, and sends notification emails. Components: - ZeroClaw SOP (6-step workflow: fetch, classify, assess, summarize, assign, notify) - Backend: /api/compliance/agent/analyze (combined endpoint) - Backend: /api/compliance/agent/notify (standalone email) - Frontend: /sdk/agent page (Manager UI with URL input + results) - Helper scripts + E2E test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 23:28:21 +02:00
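
Several of the fixes in this log are small pattern changes. As an illustration of the USt-IdNr fix described in c702260ec1 (accepting the "umsatzsteuer-id" label and numbers written with spaces, such as "DE 212 442 423"), a minimal sketch could look like the following. The pattern `UST_ID_RE` and the function `find_ust_id` are assumptions made for this example and are not taken from the repository.

```python
import re
from typing import Optional

# Hypothetical pattern, not the repository's actual regex:
# a label variant ("USt-IdNr.", "Umsatzsteuer-ID", ...), up to 20
# non-alphanumeric separator characters, then "DE" plus nine digits
# that may each be preceded by a single whitespace character.
UST_ID_RE = re.compile(
    r"(?:ust\.?-?\s*id(?:nr)?\.?|umsatzsteuer-?\s*id(?:entifikationsnummer)?)"
    r"[^A-Za-z0-9]{0,20}"
    r"(DE(?:\s?\d){9})",
    re.IGNORECASE,
)


def find_ust_id(text: str) -> Optional[str]:
    """Return the first VAT ID found, normalized without spaces, or None."""
    match = UST_ID_RE.search(text)
    if match is None:
        return None
    # Strip the optional spaces so "DE 212 442 423" and "DE212442423" compare equal.
    return re.sub(r"\s+", "", match.group(1)).upper()


if __name__ == "__main__":
    print(find_ust_id("Umsatzsteuer-ID: DE 212 442 423"))  # DE212442423
```

Normalizing the match before returning it keeps downstream comparisons independent of how the website formats the number.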

28 Commits
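
The root cause named in 5e317d2f0f was a silent cap on the extracted text. One way to make such a cap visible is to return the original length and a truncation flag alongside the text. The sketch below assumes this approach; `cap_text`, `CappedText`, and `MAX_TEXT_CHARS` are hypothetical names, and only the 200,000 and 50,000 character limits and the 107,720 character document length come from the commit message.

```python
from dataclasses import dataclass

# The 200k limit is taken from the commit message; the constant name is hypothetical.
MAX_TEXT_CHARS = 200_000


@dataclass
class CappedText:
    text: str             # text after applying the limit
    original_length: int  # length before the limit was applied
    truncated: bool       # True if any characters were dropped


def cap_text(full_text: str, limit: int = MAX_TEXT_CHARS) -> CappedText:
    """Apply a character limit, but report whether content was cut off."""
    return CappedText(
        text=full_text[:limit],
        original_length=len(full_text),
        truncated=len(full_text) > limit,
    )


if __name__ == "__main__":
    doc = "x" * 107_720  # stand-in for a document of the length quoted above
    old = cap_text(doc, limit=50_000)
    new = cap_text(doc)
    print(old.truncated, len(old.text))  # True 50000
    print(new.truncated, len(new.text))  # False 107720
```

A caller can log or surface `truncated` so that missing sections are attributed to the limit rather than to the website.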