Commit Graph

8 Commits

Author SHA1 Message Date
Benjamin Admin f3e44cf59f fix: restore all missing consent-tester service modules
banner_detector.py, script_analyzer.py, category_tester.py, authenticated_scanner.py
were only on the feature branch — needed for consent-tester to start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:14:26 +02:00
Benjamin Admin 3fade26d89 fix: restore consent-tester requirements.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:06:50 +02:00
Benjamin Admin 797ed667a2 fix: restore consent-tester Dockerfile (was lost from main)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:05:19 +02:00
Benjamin Admin a3f7fb93f4 fix: Scan quality — raise page limit, use full DSI text for checks
Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50
Bug 2: DSI documents checked against text_preview (500 chars) — now uses
       full_text (10,000 chars) for Art. 13 mandatory field checks
Bug 3: DSE text not found when Playwright misses DSE page — now falls
       back to DSI Discovery full_text as second source
Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s
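
The source selection described in Bugs 2 and 3 might look roughly like the sketch below. It is an illustration only, not the project's code: the helper name `pick_dse_text` and the shape of the document dicts are assumptions; only the field names `text_preview` and `full_text` come from the commit message.

```python
from typing import Optional


def pick_dse_text(playwright_text: Optional[str], dsi_documents: list[dict]) -> Optional[str]:
    """Return the text to run the Art. 13 checks against (hypothetical helper)."""
    # First source: text captured by the Playwright scan, if it found the DSE page.
    if playwright_text and playwright_text.strip():
        return playwright_text
    # Second source: full_text of a DSI Discovery document.
    # The short text_preview is deliberately never used for the checks.
    for doc in dsi_documents:
        full_text = doc.get("full_text") or ""
        if full_text.strip():
            return full_text
    # Neither source produced any text.
    return None
```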

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:51:03 +02:00
Benjamin Admin a846bd8910 fix: Exhaustive crawl — no arbitrary page/document limits
Both scanners now search until done, not until a counter runs out:

playwright_scanner.py:
- Default max_pages raised from 15 to 50
- Added 3-minute timeout as safety net
- Recursive link discovery on EVERY visited page (not just DSE pages)
- Stops when: all links visited OR max_pages OR timeout

dsi_discovery.py:
- Default max_documents raised from 30 to 100
- Added 5-minute timeout as safety net
- Recursive: on each visited page, searches for MORE DSI links
- Processes ALL discovered links exhaustively
- Stops when: no more pending links OR max_documents OR timeout

The scanners now behave like a real user: they follow every relevant
link they find, and on each new page they look for more links.
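
The three stop conditions listed above (no pending links, page limit, timeout) can be sketched as a breadth-first loop. This is a minimal sketch under stated assumptions: `crawl` and the `fetch_links` callback are hypothetical names, and the real scanners drive a browser rather than a plain callback. The defaults mirror the playwright_scanner.py values named in the message (50 pages, 3 minutes).

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 50,
          timeout_s: float = 180.0) -> list[str]:
    """Visit pages until the frontier is empty, max_pages is hit, or time runs out."""
    deadline = time.monotonic() + timeout_s
    pending = deque([start_url])
    seen = {start_url}
    visited: list[str] = []
    while pending and len(visited) < max_pages and time.monotonic() < deadline:
        url = pending.popleft()
        visited.append(url)
        # Link discovery runs on every visited page, not only on DSE pages.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited
```

The `seen` set is what keeps recursive discovery from looping when pages link back to each other.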

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:21:57 +02:00
Benjamin Admin 4e63a6050d feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)
New service: dsi_discovery.py — finds ALL legal documents on any website:
- Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
- Structure-agnostic: accordions, sidebars, footers, inline links, tabs
- Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
- Language-agnostic: 26 EU/EEA languages with document-type keywords

Document types discovered:
- Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
- AGB / Terms of Service / Nutzungsbedingungen
- Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
- Cookie-Richtlinie / Cookie Policy
- All cross-domain variants (e.g. help.instagram.com from instagram.com)

API: POST /dsi-discovery { url, max_documents }
Returns: list of documents with title, url, language, type, word_count, text_preview
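
A request to this endpoint could be built as in the sketch below. Several details are assumptions rather than facts from the commit: the base URL, the JSON request body, and the default of 30 for `max_documents` (inferred from a later commit that raises it "from 30 to 100"). The request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumption: host and port of the consent-tester service.
BASE_URL = "http://localhost:8000"


def build_discovery_request(url: str, max_documents: int = 30) -> urllib.request.Request:
    """Build (but do not send) a POST /dsi-discovery request."""
    # Assumption: the endpoint accepts a JSON body with these two fields.
    payload = json.dumps({"url": url, "max_documents": max_documents}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/dsi-discovery",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_discovery_request("https://example.com")
# To send it against a running service: urllib.request.urlopen(req)
```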

Features:
- Expands all accordions, details, tabs, dropdowns before scanning
- Follows cross-domain links (same registrable domain)
- Re-expands after navigation back to source page
- Handles anchor links (#sections) separately from full pages
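
One way to separate anchor links from full pages and to apply the "same registrable domain" rule is sketched below. The function names are hypothetical, and the domain comparison is a deliberately naive stand-in: it compares the last two host labels, which is wrong for suffixes such as co.uk. A production check would need the Public Suffix List.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def naive_registrable_domain(host: str) -> str:
    # Assumption: the last two labels approximate the registrable domain.
    return ".".join(host.lower().split(".")[-2:])


def classify_link(source_url: str, href: str) -> str:
    """Return 'anchor', 'page', or 'external' for a link found on source_url."""
    target = urljoin(source_url, href)
    target_base, fragment = urldefrag(target)
    source_base, _ = urldefrag(source_url)
    # A fragment pointing into the current page is an in-page section.
    if fragment and target_base == source_base:
        return "anchor"
    src_host = urlsplit(source_url).hostname or ""
    dst_host = urlsplit(target).hostname or ""
    # Cross-domain links are followed only within the same registrable domain.
    if naive_registrable_domain(src_host) == naive_registrable_domain(dst_host):
        return "page"
    return "external"
```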

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 21:56:55 +02:00
Benjamin Admin b997b4a475 feat: 9 new banner checks (12-20), total 20 compliance checks
Check 12: Click count (reject requires more clicks than accept; cf. CNIL fine of 150M EUR)
Check 13: Color contrast (reject button invisible, same background as banner)
Check 14: Google Consent Mode (analytics_storage defaults to 'granted')
Check 15: Pre-consent cookies (tracking cookies set before any interaction)
Check 16: Registration coupling (login button doubles as consent, Art. 7(4) DSGVO)
Check 17: Language mismatch (banner vs page language, 26 EU/EEA languages)
Check 18: Consent cookie expiry (more than 13 months violates CNIL guidelines)
Check 19: Nudging (reject button below the fold, requires scrolling)
Check 20: Emotional language ("Stirring"; e.g. "volle Funktionalitaet", German for "full functionality")
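
Two of these checks reduce to simple comparisons, sketched below. This is not the code in banner_advanced_checks.py: the function names are hypothetical, and treating 13 months as 396 days is an assumption made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 13 months approximated as 13 * 30.44 days, rounded to 396.
MAX_CONSENT_LIFETIME = timedelta(days=396)


def clicks_imbalanced(accept_clicks: int, reject_clicks: int) -> bool:
    # Check 12 idea: rejecting takes more clicks than accepting.
    return reject_clicks > accept_clicks


def expiry_too_long(set_at: datetime, expires_at: datetime) -> bool:
    # Check 18 idea: the consent cookie lives longer than roughly 13 months.
    return expires_at - set_at > MAX_CONSENT_LIFETIME


now = datetime(2026, 5, 4, tzinfo=timezone.utc)
two_year_cookie = expiry_too_long(now, now + timedelta(days=730))
```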

Language detection covers: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, GA,
HR, HU, IS, IT, LT, LV, MT, NL, NO, PL, PT, RO, SK, SL, SV

New file: banner_advanced_checks.py (396 LOC)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 08:39:00 +02:00
Benjamin Admin 5d138f265b feat: 3 new banner legal checks (11 total) + extract banner_text_checker
New checks (from EUIPO reference case):
- Check 9: Third-party DSE link — detects when consent dialog links to
  external domain's privacy policy instead of own DSE (Art. 13 DSGVO)
- Check 10: Dark-pattern language — detects "muessen/erforderlich" for
  non-essential cookies suggesting false technical necessity (EDPB Rn. 70)
- Check 11: Non-modal dismiss = consent: detects when clicking outside the
  dialog closes it (possibly counted as consent, a Planet49 violation)
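
The ideas behind Checks 9 and 10 can be sketched as a host comparison and a keyword search. Everything named here is hypothetical: the function names, the rule that subdomains count as the site's own, and the spelling variant "müssen" added next to the two terms from the commit message. The real keyword list is likely longer.

```python
import re
from urllib.parse import urlsplit

# The terms named in the commit message, plus the umlaut spelling (assumption).
NECESSITY_TERMS = re.compile(r"\b(muessen|müssen|erforderlich)\b", re.IGNORECASE)


def links_to_foreign_policy(page_url: str, policy_url: str) -> bool:
    """Check 9 idea: the privacy link in the dialog points to another site."""
    page_host = (urlsplit(page_url).hostname or "").lower().removeprefix("www.")
    policy_host = (urlsplit(policy_url).hostname or "").lower().removeprefix("www.")
    # Assumption: subdomains of the page host count as the site's own.
    own = policy_host == page_host or policy_host.endswith("." + page_host)
    return not own


def claims_false_necessity(non_essential_text: str) -> bool:
    """Check 10 idea: necessity wording used for non-essential cookies."""
    return NECESSITY_TERMS.search(non_essential_text) is not None
```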

Refactor: extracted _check_banner_text (375 LOC) from consent_scanner.py
into services/banner_text_checker.py to keep both files under 500 LOC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 08:02:46 +02:00