Commit Graph

4 Commits

Author SHA1 Message Date
Benjamin Admin 686834cea0 feat: 4 remaining tasks — EU institutions, banner integration, JS-sites, Caritas fixes
Build + Deploy / build-ai-sdk (push) Failing after 36s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 7s
Build + Deploy / build-admin-compliance (push) Successful in 8s
Build + Deploy / build-backend-compliance (push) Successful in 8s
CI / nodejs-build (push) Successful in 3m14s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 46s
CI / test-python-backend (push) Successful in 43s
CI / test-python-document-crawler (push) Successful in 29s
CI / test-python-dsms-gateway (push) Successful in 30s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
1. EU Institution Checks (Verordnung 2018/1725):
   - New doc_type "eu_institution" with 9 L1 + 15 L2 checks
   - Both German + English patterns (EU institutions are multilingual)
   - Auto-detection via "2018/1725", "EDSB", "EDPS" keywords
   - Correct article references (Art. 15 instead of 13, Art. 5 instead of 6)

2. Banner Check Integration:
   - banner_runner.py maps scan results to 36 L1/L2 structured checks
   - BannerCheckTab shows hierarchical ChecklistView with hints
   - 3-phase summary (cookies/scripts before/after consent)
   - /scan endpoint now includes structured_checks in response

3. JS-heavy Website Fixes (dm, Zalando, HWK):
   - dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback)
   - try_dismiss_consent_banner before text extraction
   - PDF redirect detection (dm.de redirects to GCS PDF)

4. Caritas False Positive Fixes:
   - Phone regex allows parentheses: +49 (0)761 → now matches
   - "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21
   - Church authorities: "Katholisches Datenschutzzentrum" recognized

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-08 01:10:10 +02:00
Benjamin Admin a349111a01 fix: Raise full_text limit 10K→50K + combine all DSI texts for checks
Two fixes:
1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars
   (IHK Internetangebot has ~50K chars, Beschwerderecht was after 10K cutoff)
2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts
   for mandatory content checking. Previously only used first 8K chars from
   one source, missing Verantwortlicher/DSB that were in DSI documents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 16:03:56 +02:00
Benjamin Admin a3f7fb93f4 fix: Scan quality — raise page limit, use full DSI text for checks
Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50
Bug 2: DSI documents checked against text_preview (500 chars) — now uses
       full_text (10,000 chars) for Art. 13 mandatory field checks
Bug 3: DSE text not found when Playwright misses DSE page — now falls
       back to DSI Discovery full_text as second source
Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:51:03 +02:00
Benjamin Admin 4e63a6050d feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)
New service: dsi_discovery.py — finds ALL legal documents on any website:
- Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
- Structure-agnostic: accordions, sidebars, footers, inline links, tabs
- Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
- Language-agnostic: 26 EU/EEA languages with document-type keywords

Document types discovered:
- Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
- AGB / Terms of Service / Nutzungsbedingungen
- Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
- Cookie-Richtlinie / Cookie Policy
- All cross-domain variants (e.g. help.instagram.com from instagram.com)

API: POST /dsi-discovery { url, max_documents }
Returns: list of documents with title, url, language, type, word_count, text_preview

Features:
- Expands all accordions, details, tabs, dropdowns before scanning
- Follows cross-domain links (same registrable domain)
- Re-expands after navigation back to source page
- Handles anchor links (#sections) separately from full pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 21:56:55 +02:00