breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	686834cea0	feat: 4 remaining tasks (EU institutions, banner integration, JS-sites, Caritas fixes) 1. EU Institution Checks (Regulation (EU) 2018/1725): - New doc_type "eu_institution" with 9 L1 + 15 L2 checks - Both German + English patterns (EU institutions are multilingual) - Auto-detection via "2018/1725", "EDSB", "EDPS" keywords - Correct article references (Art. 15 instead of 13, Art. 5 instead of 6) 2. Banner Check Integration: - banner_runner.py maps scan results to 36 L1/L2 structured checks - BannerCheckTab shows hierarchical ChecklistView with hints - 3-phase summary (cookies/scripts before/after consent) - /scan endpoint now includes structured_checks in response 3. JS-heavy Website Fixes (dm, Zalando, HWK): - dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback) - try_dismiss_consent_banner before text extraction - PDF redirect detection (dm.de redirects to GCS PDF) 4. Caritas False Positive Fixes: - Phone regex allows parentheses: +49 (0)761 → now matches - "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21 - Church authorities: "Katholisches Datenschutzzentrum" recognized Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-08 01:10:10 +02:00
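The phone-number fix in commit 686834cea0 (item 4) can be illustrated with a small regex. This is only a sketch: the pattern and the names `PHONE_RE` and `find_phone_numbers` are hypothetical and are not taken from the repository.

```python
import re

# Hypothetical pattern: "+" country code, an optional "(0)" trunk digit,
# then digit groups separated by spaces, slashes or hyphens.
PHONE_RE = re.compile(
    r"\+\d{1,3}[\s/-]?(?:\(0\)[\s/-]?)?\d{2,5}(?:[\s/-]?\d+)*"
)


def find_phone_numbers(text: str) -> list[str]:
    """Return all phone-like substrings, with or without a '(0)' group."""
    return [match.group(0) for match in PHONE_RE.finditer(text)]
```

Making the `(0)` group optional keeps numbers written without parentheses matching as before.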
Benjamin Admin	608fb7faf5	fix: DSI self-extraction + banner L1/L2 check definitions 1. DSI Discovery fix for direct-URL use case (e.g. example.com/datenschutz): - Self-extraction: if the URL itself is a DSE page, extract its text directly from the page body (main/article/content element) - Remove "datenschutz" from NOISE_TITLES, since it is a legitimate doc title - Fixes safetykon.de/datenschutz returning 0 documents 2. Banner check definitions (36 checks: 6 L1 + 30 L2): - consent-tester/checks/banner_checks.py with expert-level hints - EDPB 3/2022, CNIL rulings, CJEU C-673/17, §25 TDDDG references - check_key maps to existing consent_scanner check codes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-07 20:53:13 +02:00
Benjamin Admin	a349111a01	fix: Raise full_text limit 10K→50K + combine all DSI texts for checks Two fixes: 1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars (IHK Internetangebot has ~50K chars, Beschwerderecht was after 10K cutoff) 2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts for mandatory content checking. Previously only used first 8K chars from one source, missing Verantwortlicher/DSB that were in DSI documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 16:03:56 +02:00
Benjamin Admin	e494cf62bb	fix: Increase page load timeouts — IHK site needs >30s for networkidle - Initial page.goto timeout: 30s → 60s (IHK loads many JS resources) - Per-page navigation timeout: 20s → 45s (heavy JS sites) - Reduced extra wait from 3s+1s back to 2s+0.5s (goto timeout handles slow loads) - Playwright scanner page timeout: 20s → 45s Root cause: IHK website has heavy JavaScript that takes >30s to reach 'networkidle' state, causing DSI discovery to fail immediately. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 13:10:59 +02:00
Benjamin Admin	d547e63663	fix: DSI dedup prefers 'Datenschutzinformation*' titles + better JS content extraction Bug 1 fix: When merging documents with identical word_count, prefer titles starting with 'Datenschutzinformation' over generic section headings like 'Zweck und Rechtsgrundlage'. This restores the main 'Datenschutzinformationen zum Internetangebot' document. Bug 2 fix: After navigating to a document page, wait 3s (was 2s) for JS content loading, then try 10+ content selectors before falling back to body text (with nav/header/footer removed). Handles IHK-style JS navigation where content loads after page.goto() completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 12:26:42 +02:00
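The title preference described in commit d547e63663 (Bug 1) might look roughly like the following. The `Doc` dataclass, `PREFERRED_PREFIX`, and `merge_by_word_count` are illustrative names, not the repository's actual code.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    url: str
    word_count: int


# Assumption: the preference is a case-insensitive prefix match.
PREFERRED_PREFIX = "datenschutzinformation"


def _is_preferred(doc: Doc) -> bool:
    return doc.title.strip().lower().startswith(PREFERRED_PREFIX)


def merge_by_word_count(docs: list[Doc]) -> list[Doc]:
    """Keep one document per word_count, preferring 'Datenschutzinformation*' titles."""
    kept: dict[int, Doc] = {}
    for doc in docs:
        current = kept.get(doc.word_count)
        if current is None:
            kept[doc.word_count] = doc
        elif _is_preferred(doc) and not _is_preferred(current):
            # Same page under a generic section heading: swap in the main title.
            kept[doc.word_count] = doc
    return list(kept.values())
```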
Benjamin Admin	6c5e086356	fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives Dedup fixes: - Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely - Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic) - Documents with < 50 words filtered (navigation snippets) - Documents with identical word_count merged (same page, different title) - URL-only titles filtered False positive fixes (dsi_document_checker.py): - 'Kontaktdaten des Verantwortlichen' pattern for controller check - 'Zweck und Rechtsgrundlage' combined heading pattern - 'Welche Daten werden verarbeitet' question-style headings - 'Betroffenenrechte' as standalone heading - 'Welche Rechte hat der Betroffene' question pattern - 'Daten werden geloescht' retention pattern - 'Auftragsverarbeiter' as recipient indicator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 11:41:07 +02:00
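The filter rules listed in commit 6c5e086356 can be sketched as a single predicate. The function name, its signature, and the URL-title heuristic are assumptions made for illustration; only the noise titles and the 50-word threshold come from the commit message.

```python
from urllib.parse import urldefrag

# Noise titles as listed in this commit; 'datenschutz' was later
# removed from the list again in 608fb7faf5.
NOISE_TITLES = {"drucken", "nach oben", "datenschutz"}
MIN_WORDS = 50


def is_candidate(title: str, url: str, word_count: int, source_url: str) -> bool:
    """Return True if a discovered link should count as its own document."""
    base, fragment = urldefrag(url)
    if fragment and base == urldefrag(source_url)[0]:
        return False  # anchor link on the same page
    cleaned = title.strip().lower()
    if cleaned in NOISE_TITLES:
        return False
    if cleaned.startswith(("http://", "https://", "www.")):
        return False  # URL-only title (hypothetical heuristic)
    return word_count >= MIN_WORDS  # drop navigation snippets
```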
Benjamin Admin	7c7513525e	feat: Document-centric scan results + DSI deduplication DSI Dedup (consent-tester): - Only H1/H2 headings count as documents (not H3/H4 sub-sections) - Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of parent document's full text, not separate documents - Reduces IHK result from 30 to ~11 real documents Backend (agent_scan_routes): - ScanFinding gets doc_title field linking each finding to its document - doc_title set when creating DSI findings for document attribution Frontend (ScanResult.tsx): - 3 sections: Services table, Document cards, General findings - Documents: expandable cards with completeness bar (green/yellow/red) - Findings grouped under their parent document - Each card shows: title, word count, findings count, % completeness - Findings without doc_title go to "Allgemeine Findings" section Email Summary (agent_scan_helpers): - Findings listed under their parent document - General findings in separate section - No more flat mixed list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 09:56:29 +02:00
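The grouping of findings under their parent document, as described in commit 7c7513525e, could be sketched like this. Findings are modelled as plain dicts here, which is an assumption; the actual `ScanFinding` type is not shown in the log.

```python
from collections import defaultdict

# Section label for findings without a doc_title, as named in the commit.
GENERAL_SECTION = "Allgemeine Findings"


def group_findings(findings: list[dict]) -> dict[str, list[dict]]:
    """Group findings by doc_title; untitled findings go to the general section."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for finding in findings:
        key = finding.get("doc_title") or GENERAL_SECTION
        grouped[key].append(finding)
    return dict(grouped)
```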
Benjamin Admin	f3e44cf59f	fix: restore all missing consent-tester service modules banner_detector.py, script_analyzer.py, category_tester.py, authenticated_scanner.py were only on the feature branch — needed for consent-tester to start. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:14:26 +02:00
Benjamin Admin	3fade26d89	fix: restore consent-tester requirements.txt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:06:50 +02:00
Benjamin Admin	797ed667a2	fix: restore consent-tester Dockerfile (was lost from main) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:05:19 +02:00
Benjamin Admin	a3f7fb93f4	fix: Scan quality — raise page limit, use full DSI text for checks Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50 Bug 2: DSI documents checked against text_preview (500 chars) — now uses full_text (10,000 chars) for Art. 13 mandatory field checks Bug 3: DSE text not found when Playwright misses DSE page — now falls back to DSI Discovery full_text as second source Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 23:51:03 +02:00
Benjamin Admin	a846bd8910	fix: Exhaustive crawl — no arbitrary page/document limits Both scanners now search until done, not until a counter runs out: playwright_scanner.py: - Default max_pages raised from 15 to 50 - Added 3-minute timeout as safety net - Recursive link discovery on EVERY visited page (not just DSE pages) - Stops when: all links visited OR max_pages OR timeout dsi_discovery.py: - Default max_documents raised from 30 to 100 - Added 5-minute timeout as safety net - Recursive: on each visited page, searches for MORE DSI links - Processes ALL discovered links exhaustively - Stops when: no more pending links OR max_documents OR timeout The scanners now behave like a real user: they follow every relevant link they find, and on each new page they look for more links. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 22:21:57 +02:00
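The three stop conditions from commit a846bd8910 (no pending links, max_pages, timeout) can be shown with a minimal breadth-first loop. The `crawl` function and its `get_links` callback are hypothetical stand-ins for the browser-driven scanners; only the defaults of 50 pages and 3 minutes come from the commit message.

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
    timeout_s: float = 180.0,
) -> list[str]:
    """Visit pages until no links remain, max_pages is hit, or time runs out."""
    deadline = time.monotonic() + timeout_s
    pending = deque([start_url])
    seen = {start_url}
    visited: list[str] = []
    while pending and len(visited) < max_pages and time.monotonic() < deadline:
        url = pending.popleft()
        visited.append(url)
        # Look for further links on every visited page, not only on DSE pages.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited
```

The counter and the deadline act only as safety nets; on a small site the loop ends because `pending` runs empty.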
Benjamin Admin	4e63a6050d	feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie) New service: dsi_discovery.py — finds ALL legal documents on any website: - Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS - Structure-agnostic: accordions, sidebars, footers, inline links, tabs - Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links - Language-agnostic: 26 EU/EEA languages with document-type keywords Document types discovered: - Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO) - AGB / Terms of Service / Nutzungsbedingungen - Widerrufsbelehrung / Right of Withdrawal (§355 BGB) - Cookie-Richtlinie / Cookie Policy - All cross-domain variants (e.g. help.instagram.com from instagram.com) API: POST /dsi-discovery { url, max_documents } Returns: list of documents with title, url, language, type, word_count, text_preview Features: - Expands all accordions, details, tabs, dropdowns before scanning - Follows cross-domain links (same registrable domain) - Re-expands after navigation back to source page - Handles anchor links (#sections) separately from full pages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 21:56:55 +02:00
Benjamin Admin	b997b4a475	feat: 9 new banner checks (12-20), total 20 compliance checks Check 12: Click count — reject requires more clicks than accept (CNIL 150M EUR) Check 13: Color contrast — reject button invisible (same bg as banner) Check 14: Google Consent Mode — analytics_storage 'granted' as default Check 15: Pre-consent cookies — tracking cookies set before any interaction Check 16: Registration coupling — login button = consent (Art. 7(4) DSGVO) Check 17: Language mismatch — banner vs page language (all 26 EU languages) Check 18: Consent cookie expiry — >13 months violates CNIL guidelines Check 19: Nudging — reject button below fold / requires scrolling Check 20: Emotional language (Stirring) — "volle Funktionalitaet" etc. Language detection covers: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, GA, HR, HU, IS, IT, LT, LV, MT, NL, NO, PL, PT, RO, SK, SL, SV New file: banner_advanced_checks.py (396 LOC) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 08:39:00 +02:00
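Check 18 from commit b997b4a475 (consent cookie expiry) can be sketched as a simple lifetime comparison. The function name, the epoch-seconds input, and the conversion of 13 months to 396 days are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 13 months approximated as 396 days.
MAX_LIFETIME = timedelta(days=396)


def consent_cookie_too_long(expires_epoch: float, now: datetime) -> bool:
    """Return True if the cookie outlives the 13-month limit, measured from now."""
    if expires_epoch < 0:
        return False  # session cookie, no fixed expiry
    expires = datetime.fromtimestamp(expires_epoch, tz=timezone.utc)
    return expires - now > MAX_LIFETIME
```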
Benjamin Admin	5d138f265b	feat: 3 new banner legal checks (11 total) + extract banner_text_checker New checks (from EUIPO reference case): - Check 9: Third-party DSE link — detects when consent dialog links to external domain's privacy policy instead of own DSE (Art. 13 DSGVO) - Check 10: Dark-pattern language — detects "muessen/erforderlich" for non-essential cookies suggesting false technical necessity (EDPB Rn. 70) - Check 11: Non-modal dismiss = consent — detects when clicking outside dialog closes it (possibly treating as consent, Planet49 violation) Refactor: extracted _check_banner_text (375 LOC) from consent_scanner.py into services/banner_text_checker.py to keep both files under 500 LOC. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 08:02:46 +02:00
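Check 9 from commit 5d138f265b (third-party DSE link) comes down to comparing the host of the privacy link with the host of the scanned site. The sketch below uses a naive host comparison under a hypothetical function name; a production check would more likely compare registrable domains using a public suffix list.

```python
from urllib.parse import urlparse


def is_third_party_link(page_url: str, link_url: str) -> bool:
    """Return True if link_url points to a different site than page_url."""
    page_host = (urlparse(page_url).hostname or "").lower().removeprefix("www.")
    link_host = (urlparse(link_url).hostname or "").lower().removeprefix("www.")
    if not link_host:
        return False  # relative link, stays on the same site
    if link_host == page_host:
        return False
    # Treat subdomains of the scanned host as first-party.
    return not link_host.endswith("." + page_host)
```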

15 Commits