breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	b22351fc6e	fix: Exhaustive crawl — no arbitrary page/document limits CI / branch-name (push) Has been skipped Details CI / guardrail-integrity (push) Has been skipped Details CI / loc-budget (push) Failing after 14s Details CI / secret-scan (push) Has been skipped Details CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / nodejs-build (push) Successful in 2m37s Details CI / dep-audit (push) Has been skipped Details CI / sbom-scan (push) Has been skipped Details CI / test-go (push) Successful in 38s Details CI / test-python-backend (push) Successful in 36s Details CI / test-python-document-crawler (push) Successful in 24s Details CI / test-python-dsms-gateway (push) Successful in 21s Details CI / validate-canonical-controls (push) Successful in 15s Details Both scanners now search until done, not until a counter runs out: playwright_scanner.py: - Default max_pages raised from 15 to 50 - Added 3-minute timeout as safety net - Recursive link discovery on EVERY visited page (not just DSE pages) - Stops when: all links visited OR max_pages OR timeout dsi_discovery.py: - Default max_documents raised from 30 to 100 - Added 5-minute timeout as safety net - Recursive: on each visited page, searches for MORE DSI links - Processes ALL discovered links exhaustively - Stops when: no more pending links OR max_documents OR timeout The scanners now behave like a real user: they follow every relevant link they find, and on each new page they look for more links. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 22:22:00 +02:00
Benjamin Admin	298c95731a	feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie) CI / branch-name (push) Has been skipped Details CI / guardrail-integrity (push) Has been skipped Details CI / loc-budget (push) Failing after 22s Details CI / secret-scan (push) Has been skipped Details CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / nodejs-build (push) Successful in 2m35s Details CI / dep-audit (push) Has been skipped Details CI / sbom-scan (push) Has been skipped Details CI / test-go (push) Successful in 52s Details CI / test-python-backend (push) Successful in 42s Details CI / test-python-document-crawler (push) Successful in 29s Details CI / test-python-dsms-gateway (push) Successful in 21s Details CI / validate-canonical-controls (push) Successful in 14s Details New service: dsi_discovery.py — finds ALL legal documents on any website: - Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS - Structure-agnostic: accordions, sidebars, footers, inline links, tabs - Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links - Language-agnostic: 26 EU/EEA languages with document-type keywords Document types discovered: - Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO) - AGB / Terms of Service / Nutzungsbedingungen - Widerrufsbelehrung / Right of Withdrawal (§355 BGB) - Cookie-Richtlinie / Cookie Policy - All cross-domain variants (e.g. help.instagram.com from instagram.com) API: POST /dsi-discovery { url, max_documents } Returns: list of documents with title, url, language, type, word_count, text_preview Features: - Expands all accordions, details, tabs, dropdowns before scanning - Follows cross-domain links (same registrable domain) - Re-expands after navigation back to source page - Handles anchor links (#sections) separately from full pages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 21:57:37 +02:00

2 Commits