breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	6c5e086356	fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives	2026-05-05 11:41:07 +02:00
    Dedup fixes:
    - Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
    - Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic)
    - Documents with < 50 words filtered (navigation snippets)
    - Documents with identical word_count merged (same page, different title)
    - URL-only titles filtered
    False positive fixes (dsi_document_checker.py):
    - 'Kontaktdaten des Verantwortlichen' pattern for controller check
    - 'Zweck und Rechtsgrundlage' combined heading pattern
    - 'Welche Daten werden verarbeitet' question-style headings
    - 'Betroffenenrechte' as standalone heading
    - 'Welche Rechte hat der Betroffene' question pattern
    - 'Daten werden geloescht' retention pattern
    - 'Auftragsverarbeiter' as recipient indicator
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin	7c7513525e	feat: Document-centric scan results + DSI deduplication	2026-05-05 09:56:29 +02:00
    DSI Dedup (consent-tester):
    - Only H1/H2 headings count as documents (not H3/H4 sub-sections)
    - Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of parent document's full text, not separate documents
    - Reduces IHK result from 30 to ~11 real documents
    Backend (agent_scan_routes):
    - ScanFinding gets doc_title field linking each finding to its document
    - doc_title set when creating DSI findings for document attribution
    Frontend (ScanResult.tsx):
    - 3 sections: Services table, Document cards, General findings
    - Documents: expandable cards with completeness bar (green/yellow/red)
    - Findings grouped under their parent document
    - Each card shows: title, word count, findings count, % completeness
    - Findings without doc_title go to "Allgemeine Findings" section
    Email Summary (agent_scan_helpers):
    - Findings listed under their parent document
    - General findings in separate section
    - No more flat mixed list
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin	a846bd8910	fix: Exhaustive crawl — no arbitrary page/document limits	2026-05-04 22:21:57 +02:00
    Both scanners now search until done, not until a counter runs out:
    playwright_scanner.py:
    - Default max_pages raised from 15 to 50
    - Added 3-minute timeout as safety net
    - Recursive link discovery on EVERY visited page (not just DSE pages)
    - Stops when: all links visited OR max_pages OR timeout
    dsi_discovery.py:
    - Default max_documents raised from 30 to 100
    - Added 5-minute timeout as safety net
    - Recursive: on each visited page, searches for MORE DSI links
    - Processes ALL discovered links exhaustively
    - Stops when: no more pending links OR max_documents OR timeout
    The scanners now behave like a real user: they follow every relevant link they find, and on each new page they look for more links.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin	4e63a6050d	feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)	2026-05-04 21:56:55 +02:00
    New service: dsi_discovery.py — finds ALL legal documents on any website:
    - Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
    - Structure-agnostic: accordions, sidebars, footers, inline links, tabs
    - Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
    - Language-agnostic: 26 EU/EEA languages with document-type keywords
    Document types discovered:
    - Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
    - AGB / Terms of Service / Nutzungsbedingungen
    - Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
    - Cookie-Richtlinie / Cookie Policy
    - All cross-domain variants (e.g. help.instagram.com from instagram.com)
    API: POST /dsi-discovery { url, max_documents }
    Returns: list of documents with title, url, language, type, word_count, text_preview
    Features:
    - Expands all accordions, details, tabs, dropdowns before scanning
    - Follows cross-domain links (same registrable domain)
    - Re-expands after navigation back to source page
    - Handles anchor links (#sections) separately from full pages
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
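
The three stop conditions named in commit a846bd8910 (all links visited, max_pages reached, timeout hit) can be illustrated with a small crawl loop. This is only a sketch, not the code in playwright_scanner.py: the function name `crawl`, the `fetch_links` callback, and the returned stop reason are hypothetical, and the browser is replaced by an injected callable so the loop can run without a network. The defaults of 50 pages and 180 seconds come from the commit message.

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns links found on a page
    max_pages: int = 50,
    timeout_s: float = 180.0,
) -> tuple[list[str], str]:
    """Visit pages breadth-first; return visited URLs and the stop reason."""
    deadline = time.monotonic() + timeout_s
    pending = deque([start_url])
    seen = {start_url}
    visited: list[str] = []

    while pending:
        if len(visited) >= max_pages:
            return visited, "max_pages"
        if time.monotonic() >= deadline:
            return visited, "timeout"
        url = pending.popleft()
        visited.append(url)
        # Look for further links on every visited page, not only the start page.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    # The queue ran dry: every discovered link has been visited.
    return visited, "exhausted"


# Usage with an in-memory link graph standing in for a real site.
site = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
pages, reason = crawl("a", lambda url: site.get(url, []))
```

Keeping the counter and the timeout as early exits inside the loop means the normal way to finish is an empty queue, which matches the "search until done" intent of the commit; the limits only act as a safety net.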

4 Commits
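
The dedup rules listed in commit 6c5e086356 could be applied in a single filtering pass along the following lines. This is a minimal sketch under stated assumptions, not the repository's implementation: the `Doc` dataclass, the `dedup` function, the way a "URL-only title" is detected, and the choice to keep the first document of each word_count group are all hypothetical. The noise titles and the 50-word threshold are taken from the commit message.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag

NOISE_TITLES = {"drucken", "nach oben", "datenschutz"}  # from the commit message
MIN_WORDS = 50  # from the commit message


@dataclass
class Doc:  # hypothetical record; field names follow the API description
    title: str
    url: str
    word_count: int


def dedup(docs: list[Doc], page_url: str) -> list[Doc]:
    """Filter noise and merge duplicates among discovered documents."""
    page_base = urldefrag(page_url).url
    kept: dict[int, Doc] = {}
    for doc in docs:
        base, fragment = urldefrag(doc.url)
        title = doc.title.strip().lower()
        if fragment and base == page_base:
            continue  # anchor link into the page being scanned
        if title in NOISE_TITLES:
            continue  # generic navigation label (exact match only)
        if title.startswith(("http://", "https://", "www.")):
            continue  # assumption: a title that is just a URL
        if doc.word_count < MIN_WORDS:
            continue  # navigation snippet
        # Assumption: identical word_count means same page; keep the first seen.
        kept.setdefault(doc.word_count, doc)
    return list(kept.values())


docs = [
    Doc("Cookies", "https://example.com/privacy#cookies", 400),
    Doc("Drucken", "https://example.com/print", 300),
    Doc("https://example.com/agb", "https://example.com/agb", 900),
    Doc("Menu", "https://example.com/menu", 12),
    Doc("Datenschutzerklaerung", "https://example.com/dse", 1200),
    Doc("Privacy Policy", "https://example.com/dse?lang=de", 1200),
    Doc("AGB", "https://example.com/agb", 900),
]
result = dedup(docs, "https://example.com/privacy")
```

Matching noise titles exactly rather than by substring keeps a real document such as "Datenschutzerklaerung" from being dropped by the 'Datenschutz' rule. Merging on word_count alone can also collapse two unrelated documents that happen to have the same length, so a real implementation might compare a content hash as well.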