Commit Graph

10 Commits

Author SHA1 Message Date
Benjamin Admin a6618af5ed feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)
New service: dsi_discovery.py — finds ALL legal documents on any website:
- Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
- Structure-agnostic: accordions, sidebars, footers, inline links, tabs
- Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
- Language-agnostic: 26 EU/EEA languages with document-type keywords

Document types discovered:
- Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
- AGB / Terms of Service / Nutzungsbedingungen
- Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
- Cookie-Richtlinie / Cookie Policy
- All cross-domain variants (e.g. help.instagram.com from instagram.com)

API: POST /dsi-discovery { url, max_documents }
Returns: list of documents with title, url, language, type, word_count, text_preview

Features:
- Expands all accordions, details, tabs, dropdowns before scanning
- Follows cross-domain links (same registrable domain)
- Re-expands after navigation back to source page
- Handles anchor links (#sections) separately from full pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 21:56:29 +02:00
Benjamin Admin 9df2a001bb feat: 9 new banner checks (12-20), total 20 compliance checks
Check 12: Click count — reject requires more clicks than accept (CNIL 150M EUR)
Check 13: Color contrast — reject button invisible (same bg as banner)
Check 14: Google Consent Mode — analytics_storage 'granted' as default
Check 15: Pre-consent cookies — tracking cookies set before any interaction
Check 16: Registration coupling — login button = consent (Art. 7(4) DSGVO)
Check 17: Language mismatch — banner vs page language (all 26 EU languages)
Check 18: Consent cookie expiry — >13 months violates CNIL guidelines
Check 19: Nudging — reject button below fold / requires scrolling
Check 20: Emotional language (Stirring) — "volle Funktionalitaet" etc.

Language detection covers: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, GA,
HR, HU, IS, IT, LT, LV, MT, NL, NO, PL, PT, RO, SK, SL, SV

New file: banner_advanced_checks.py (396 LOC)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 08:38:46 +02:00
Benjamin Admin c47450fe58 feat: 3 new banner legal checks (11 total) + extract banner_text_checker
New checks (from EUIPO reference case):
- Check 9: Third-party DSE link — detects when consent dialog links to
  external domain's privacy policy instead of own DSE (Art. 13 DSGVO)
- Check 10: Dark-pattern language — detects "muessen/erforderlich" for
  non-essential cookies suggesting false technical necessity (EDPB Rn. 70)
- Check 11: Non-modal dismiss = consent — detects when clicking outside
  dialog closes it (possibly treating as consent, Planet49 violation)

Refactor: extracted _check_banner_text (375 LOC) from consent_scanner.py
into services/banner_text_checker.py to keep both files under 500 LOC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 08:01:54 +02:00
Benjamin Admin 0a6ec9235e feat: 8 cookie banner legal checks (Playwright)
1. Impressum link accessible from banner (§5 TMG, LG Rostock)
2. DSE link in banner (Art. 13 DSGVO, informierte Einwilligung)
3. Wrong wording: "Zustimmung zur DSE" — DSE is Art. 13 obligation,
   not consent. Correct: "zur Kenntnis genommen"
4. Reject button visible (§25 TDDDG, no hidden reject)
5. Pre-ticked checkboxes detected (EuGH C-673/17 Planet49)
6. Dark Pattern: button size comparison — accept vs reject area
   ratio >2.5x or font size ratio >1.5x = dark pattern
7. Cookie Wall detection (Phase B — site blocked after reject)
8. Re-access to settings (Art. 7(3) — revocation as easy as consent)

All checks run via Playwright on the actual rendered banner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 11:55:54 +02:00
Benjamin Admin 6864849115 feat: Phase 11 — granular cookie category testing
Tests each consent category in isolation:
- Phase D: Only "Statistics" enabled → checks if only analytics loads
- Phase E: Only "Marketing" enabled → checks if only ads load
- Phase F: Only "Functional" enabled → checks no tracking loads

CMP-specific category selectors for Cookiebot, OneTrust, Usercentrics,
Didomi. Generic fallback via toggle/checkbox keyword detection.

SERVICE_CATEGORY_MAP maps 35+ services to expected categories.
Violations: "Facebook Pixel loads with only Statistics enabled" = miscategorization.

Frontend: category test results shown below Phase A-C with
per-category violation cards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 21:15:23 +02:00
Benjamin Admin 58957a4aaa fix: Playwright user permission + etracker DSE matching + CMP skip
1. Dockerfile: install Playwright AS appuser (not root) so chromium
   binary is accessible at runtime. Was causing 500 error.
2. DSE service matching: text-search fallback when LLM extraction fails.
   If "etracker" appears in DSE text, mark as documented even without
   LLM parsing the service list.
3. CMP skip: consent managers in category "cmp" skipped (not just "other"
   with id "cmp").

NOT DEPLOYED — RAG pipeline is running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 19:36:46 +02:00
Benjamin Admin cedc5de15d feat: Phase 10 — Playwright website scanner replaces httpx
New /website-scan endpoint in consent-tester service:
- Real browser renders JavaScript (finds dynamic content)
- Clicks navigation menus (discovers hidden sub-pages like IHK DSB page)
- Follows links within DSE to find regional privacy policies
- Collects rendered HTML for each page (after JS execution)

Backend integration:
- agent_scan_routes tries Playwright first, falls back to httpx
- DSE text and HTML extracted from Playwright-rendered pages
- Service detection runs on rendered HTML (catches JS-loaded scripts)

Also fixes:
- GA regex: G-[A-Z0-9]{8,12} prevents CSS class false positives
- etracker added to service registry
- External page scanning blocked (same-domain only)
- CSS/JS/image files excluded from page list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 19:16:50 +02:00
Benjamin Admin 5eeef3a9c3 fix: 4 bugs from IHK scan — false positives + missing etracker
1. GA regex: G-\w{5,} matched CSS classes (g-7031048). Now requires
   G-[A-Z0-9]{8,12} (uppercase after G-, 8-12 chars = real GA4 ID)
2. External page scanning: DSE-internal links now SAME DOMAIN only.
   Previously followed links to etracker.com, google.de/policies etc.
   and detected services on THOSE sites as IHK services.
3. Added etracker to service registry (DE, ePrivacy-certified)
4. CSS/JS/image files excluded from page scanning
5. Navigation-pattern links for deeper DSE sub-pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 19:08:07 +02:00
Benjamin Admin 4bf92f42b8 feat: Phase 9 — Authenticated Testing + Legal Basis Validator (lit. mapping)
Phase 9: Playwright login + 5 post-login checks:
- §312k BGB: Kündigungsbutton (2 Klicks)
- Art. 17 DSGVO: Konto löschen
- Art. 20 DSGVO: Daten exportieren
- Art. 7(3): Einwilligungen widerrufen
- Art. 15: Profildaten einsehen
Auto-detects login form selectors. Credentials destroyed after test.

Legal Basis Validator: Checks 7 common lit-mapping mistakes:
- Cookie tracking on lit. f instead of lit. a (Planet49)
- Analytics on lit. b (contract overextension)
- Klarna without Art. 22 reference
- Session recording without consent
Integrated into website scan pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 16:08:41 +02:00
Benjamin Admin d105842bf2 feat: consent-tester microservice — Playwright 3-phase cookie test
New independent service (port 8094) with headless Chromium:
- Phase A: What loads BEFORE any consent interaction
- Phase B: What loads AFTER rejecting consent (CRITICAL if tracking persists)
- Phase C: What loads AFTER accepting (check against cookie policy)
- 10 CMP-specific selectors (Didomi, OneTrust, Cookiebot, Usercentrics, etc.)
- Generic fallback via button text matching
- 18 tracking service patterns for script classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 12:14:41 +02:00