Benjamin Admin
686834cea0
feat: 4 remaining tasks — EU institutions, banner integration, JS-sites, Caritas fixes
Build + Deploy / build-ai-sdk (push) Failing after 36s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 7s
Build + Deploy / build-admin-compliance (push) Successful in 8s
Build + Deploy / build-backend-compliance (push) Successful in 8s
CI / nodejs-build (push) Successful in 3m14s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 46s
CI / test-python-backend (push) Successful in 43s
CI / test-python-document-crawler (push) Successful in 29s
CI / test-python-dsms-gateway (push) Successful in 30s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
1. EU Institution Checks (Regulation (EU) 2018/1725):
- New doc_type "eu_institution" with 9 L1 + 15 L2 checks
- Both German + English patterns (EU institutions are multilingual)
- Auto-detection via "2018/1725", "EDSB", "EDPS" keywords
- Correct article references (Art. 15 instead of 13, Art. 5 instead of 6)
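The keyword-based auto-detection could look roughly like the sketch below. Only the three keywords come from the commit message; the function name `detect_doc_type`, the fallback label `"privacy_policy"`, and the whole-token, case-sensitive matching are assumptions for illustration, not the actual implementation.

```python
import re

# Keywords as listed in the commit message.
EU_INSTITUTION_KEYWORDS = ("2018/1725", "EDSB", "EDPS")

# Match each keyword only as a whole token, so "12018/1725" or "EDPSX"
# do not trigger detection. Matching is case-sensitive (assumption).
_EU_PATTERN = re.compile(
    "|".join(rf"(?<!\w){re.escape(k)}(?!\w)" for k in EU_INSTITUTION_KEYWORDS)
)


def detect_doc_type(text: str, default: str = "privacy_policy") -> str:
    """Return "eu_institution" if any keyword occurs, else `default`.

    `default` is a hypothetical label; the real set of doc types is not
    given in the commit message.
    """
    return "eu_institution" if _EU_PATTERN.search(text) else default
```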
2. Banner Check Integration:
- banner_runner.py maps scan results to 36 L1/L2 structured checks
- BannerCheckTab shows hierarchical ChecklistView with hints
- 3-phase summary (cookies/scripts before/after consent)
- /scan endpoint now includes structured_checks in response
3. JS-heavy Website Fixes (dm, Zalando, HWK):
- dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback)
- try_dismiss_consent_banner before text extraction
- PDF redirect detection (dm.de redirects to GCS PDF)
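A minimal sketch of the networkidle→domcontentloaded fallback, assuming `page` behaves like a Playwright `Page` (that is, it exposes `goto(url, wait_until=..., timeout=...)`). The signature, the default timeout, and the choice to fall back on any exception are assumptions; the real `goto_resilient` in dsi_helpers.py may differ.

```python
from typing import Any


def goto_resilient(page: Any, url: str, timeout_ms: int = 30_000) -> str:
    """Navigate with wait_until="networkidle", falling back to
    "domcontentloaded" if the first attempt fails.

    Returns the name of the wait strategy that succeeded.
    """
    try:
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        return "networkidle"
    except Exception:
        # Assumption: any failure triggers the fallback. JS-heavy sites
        # often keep background requests open and never become idle,
        # which surfaces as a timeout on the first attempt.
        page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
        return "domcontentloaded"
```

Returning the strategy name makes it easy to log how often the fallback is actually needed per site.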
4. Caritas False Positive Fixes:
- Phone regex allows parentheses: +49 (0)761 → now matches
- "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21
- Church authorities: "Katholisches Datenschutzzentrum" recognized
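To illustrate the phone fix, here is one way a regex can accept the optional "(0)" trunk prefix. The pattern and the helper name `find_phone` are illustrative assumptions, not the regex used in the codebase.

```python
import re
from typing import Optional

# International number: "+", country code, optional "(0)", then digit
# groups separated by at most one space, hyphen, or slash.
PHONE_RE = re.compile(
    r"\+\d{1,3}[\s\-/]?(?:\(0\)[\s\-/]?)?\d{2,5}(?:[\s\-/]?\d+)*"
)


def find_phone(text: str) -> Optional[str]:
    """Return the first phone-like match in `text`, or None."""
    match = PHONE_RE.search(text)
    return match.group(0) if match else None
```

Without the `(?:\(0\)[\s\-/]?)?` group, a number written as "+49 (0)761 ..." would stop matching at the opening parenthesis, which is the false positive described above.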
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-08 01:10:10 +02:00
Benjamin Admin
a349111a01
fix: Raise full_text limit 10K→50K + combine all DSI texts for checks
Two fixes:
1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars
(IHK Internetangebot has ~50K chars; the Beschwerderecht (right to lodge a
complaint) section came after the 10K cutoff)
2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts
for mandatory content checking. Previously it used only the first 8K chars
from a single source, so the Verantwortlicher (controller) and DSB (data
protection officer) details found in the DSI documents were missed.
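The two fixes can be sketched as follows. The 50,000-char limit comes from the commit message; the function names, the blank-line separator, and the de-duplication of identical texts are assumptions made for this example.

```python
FULL_TEXT_LIMIT = 50_000  # raised from 10,000 per the commit message


def truncate_full_text(text: str, limit: int = FULL_TEXT_LIMIT) -> str:
    """Cap a document's text at `limit` characters."""
    return text[:limit]


def combine_dse_text(playwright_text: str, dsi_texts: list[str]) -> str:
    """Join the Playwright page text with all DSI discovery texts.

    Empty parts are skipped and exact duplicates are dropped
    (assumption), so the same page fetched twice is only checked once.
    """
    seen: set[str] = set()
    parts: list[str] = []
    for raw in [playwright_text, *dsi_texts]:
        part = raw.strip()
        if part and part not in seen:
            seen.add(part)
            parts.append(part)
    return "\n\n".join(parts)
```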
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 16:03:56 +02:00
Benjamin Admin
a3f7fb93f4
fix: Scan quality — raise page limit, use full DSI text for checks
Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50
Bug 2: DSI documents checked against text_preview (500 chars) — now uses
full_text (10,000 chars) for Art. 13 mandatory field checks
Bug 3: DSE text not found when Playwright misses DSE page — now falls
back to DSI Discovery full_text as second source
Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s
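Bug 2 is easy to reproduce in miniature: a mandatory-field check run against a 500-char preview misses anything that appears later in the document. The keyword table and the function `missing_fields` below are hypothetical stand-ins for the real Art. 13 checks.

```python
# Hypothetical, heavily simplified keyword table for two mandatory fields.
MANDATORY_KEYWORDS = {
    "controller": ("Verantwortlicher", "controller"),
    "complaint_right": ("Beschwerderecht", "right to lodge a complaint"),
}


def missing_fields(text: str) -> list[str]:
    """Return the mandatory fields for which no keyword occurs in `text`."""
    lowered = text.lower()
    return [
        field
        for field, keywords in MANDATORY_KEYWORDS.items()
        if not any(keyword.lower() in lowered for keyword in keywords)
    ]


# A document whose complaint-right section starts well past 500 chars.
document = (
    "Verantwortlicher: Example GmbH. "
    + "lorem " * 200
    + "Beschwerderecht bei der Aufsichtsbehörde"
)
preview_result = missing_fields(document[:500])  # check on text_preview
full_result = missing_fields(document)           # check on full_text
```

Here `preview_result` reports the complaint right as missing while `full_result` is empty, which is the kind of false finding the switch to full_text removes.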
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:51:03 +02:00
Benjamin Admin
4e63a6050d
feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)
New service: dsi_discovery.py — finds ALL legal documents on any website:
- Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
- Structure-agnostic: accordions, sidebars, footers, inline links, tabs
- Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
- Language-agnostic: 26 EU/EEA languages with document-type keywords
Document types discovered:
- Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
- AGB / Terms of Service / Nutzungsbedingungen
- Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
- Cookie-Richtlinie / Cookie Policy
- All cross-domain variants (e.g. help.instagram.com from instagram.com)
API: POST /dsi-discovery { url, max_documents }
Returns: list of documents with title, url, language, type, word_count, text_preview
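A request to this endpoint might be built as in the sketch below, using only the standard library. The path and the two body fields come from the line above; the base URL, the default of 20 documents, and the helper name `build_discovery_request` are assumptions.

```python
import json
from urllib.request import Request


def build_discovery_request(
    base_url: str, url: str, max_documents: int = 20
) -> Request:
    """Build (but do not send) a POST /dsi-discovery request.

    `base_url` is wherever the service is deployed (assumption), and the
    default for `max_documents` is hypothetical.
    """
    body = json.dumps({"url": url, "max_documents": max_documents})
    return Request(
        f"{base_url.rstrip('/')}/dsi-discovery",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; each returned document is expected to carry the fields listed above (title, url, language, type, word_count, text_preview).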
Features:
- Expands all accordions, details, tabs, dropdowns before scanning
- Follows cross-domain links (same registrable domain)
- Re-expands after navigation back to source page
- Handles anchor links (#sections) separately from full pages
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 21:56:55 +02:00