Commit Graph

6 Commits

Author SHA1 Message Date
Benjamin Admin c9c0fb5965 feat(cookie-check): enhanced patterns + active opt-out link validator
cookie_checks.py:
- cookie_names_listed: now also matches CMP placeholder notation
  (BMW: 'Adfpc###', 'CT###') and 'Diese Datenverarbeitung verwendet die
  folgenden Cookies oder ähnliche Technologien' as list-shape signal.
  Cryptic vendor names like 'audience', 'adformfrpid' are accepted via
  the surrounding markup, not by hard-coding each one.
- cookie_providers_named: new pattern 'Gesetzt von: <Firma>' (BMW/ePaaS
  per-cookie vendor naming) + recognition of full legal-form names
  (Adform A/S, BMW AG, Adobe Systems Software Ireland Limited).
- cookie_duration_values: now matches 'Ablauf: 1 Jahr' / 'Speicherdauer:
  30 Tage' (BMW format) in addition to the legacy '<n> <unit>'.

New L1 + L2 checks for controller in cookie-policy:
- cookie_controller (L1): the cookie policy must name Verantwortlich(er)
- cookie_controller_address (L2): PLZ + Ort or address keywords
- cookie_controller_contact_or_link (L2): email/phone OR link back to
  Datenschutzerklärung (the practical equivalent — BMW does this)

New L2 checks (parented under opt_out):
- cookie_optout_links: detects per-provider opt-out URLs in the text
- cookie_privacy_policy_links: per-provider privacy-policy URLs

New service: cookie_link_validator.py
- extract_links(text): pulls all https?://… URLs that follow 'Opt-Out
  Link:' / 'Link zur Privacy Policy:' (deduped)
- validate_links(links): probes every URL concurrently (HEAD first, GET
  fallback for 405/403). 10 parallel, 8s per request, 60s batch cap.
  Returns reachable=True/False + status + final_url.
- build_check_items(): renders 2 CheckItems (opt-out + privacy-policy),
  each pass if ALL links 2xx/3xx, fail with up-to-5 broken-link examples.

Hook in _check_single: doc_type=='cookie' triggers the validator after
regex+MC checks. Recomputes correctness with the new L2 items.

This addresses two concrete BMW observations:
1. BMW's per-cookie structure (Name + Zweck + Ablauf, Gesetzt von: …,
   Opt-Out Link: …) now recognised → 'Konkrete Cookie-Namen aufgelistet'
   and 'Konkrete Speicherdauern' should pass.
2. Defective opt-out URLs surface as compliance findings rather than
   silently passing — Art. 7(3) DSGVO requires a working withdrawal
   path per provider.
2026-05-17 09:38:32 +02:00
Benjamin Admin 51d91d20ed fix: 6 false positives from Stadt Koeln + Caritas verification
Build + Deploy / build-admin-compliance (push) Successful in 9s
Build + Deploy / build-backend-compliance (push) Successful in 8s
Build + Deploy / build-ai-sdk (push) Successful in 40s
Build + Deploy / build-developer-portal (push) Successful in 7s
Build + Deploy / build-tts (push) Successful in 8s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m11s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 29s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 17s
Build + Deploy / trigger-orca (push) Successful in 2m23s
- Phone regex allows parentheses: +49 (0)761 now matches
- "Recht auf Widerspruch" (3 words) + §23 KDG recognized
- Church authorities: "Katholisches Datenschutzzentrum", KdoeR
- "Artikel 6 Absatz 1 Buchstabe a" (unabbreviated) now matches
- "PHP Session ID" (with spaces) alongside "PHPSESSID"

6 FP eliminated across Caritas (KDG) and Stadt Koeln (verbose forms).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-08 01:31:36 +02:00
Benjamin Admin e50f3dfbee feat: All 138 hints rewritten as expert-level legal guidance
Build + Deploy / build-admin-compliance (push) Successful in 9s
Build + Deploy / build-backend-compliance (push) Successful in 10s
Build + Deploy / build-ai-sdk (push) Successful in 9s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 8s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m22s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 49s
CI / test-python-backend (push) Successful in 43s
CI / test-python-document-crawler (push) Successful in 32s
CI / test-python-dsms-gateway (push) Successful in 26s
CI / validate-canonical-controls (push) Successful in 18s
Build + Deploy / trigger-orca (push) Successful in 2m10s
Every hint now reads like a mini-consultation from a data protection
lawyer — with specific legal references, court rulings, and common
mistakes. Examples:

- EuGH C-210/16 (Fanpage), C-298/17 (Kontaktpflicht), C-311/18 (Schrems II)
- BGH I ZR 228/03 (ladungsfaehige Anschrift), XI ZR 388/10 (AGB)
- EDSA Guidelines 2/2019 (lit. b misuse), WP 248 Rev.01 (DSFA)
- DSK-Orientierungshilfe, CNIL-Leitlinien, SDM, BSI-IT-Grundschutz
- §25 TDDDG, §38 BDSG, §309 BGB, §312k BGB, Art. 246a EGBGB

This is the core value proposition: no lawyer can deliver this level
of specific, actionable compliance feedback in 60 seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 17:13:37 +02:00
Benjamin Admin fa4fd87102 fix: 7 regex bugs from IHK Konstanz ground truth analysis
Build + Deploy / build-admin-compliance (push) Successful in 9s
Build + Deploy / build-backend-compliance (push) Successful in 8s
Build + Deploy / build-ai-sdk (push) Successful in 42s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 7s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m57s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 49s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 28s
CI / test-python-dsms-gateway (push) Successful in 23s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m24s
Fixes based on manual verification of all 30 failed checks:
1. Cookie table: recognize "folgende cookies" + column headers as text
2. Cookie names: add JSESSIONID, cookieinfo, et_id, BT_* patterns
3. Essential justified: match "sitzung zuordnen", "betrieb der website"
4. Social bookmarks: recognize as 2-click alternative
5. DSFA plural: "kanaelen" now matches alongside "kanal"
6. Section splitter: skip-headings no longer lose subsequent text
   (Risikoabwaegung section was cut from DSFA, losing risk scores)
7. Cookie legal basis: accept Art. 6(1)(f) in cookie context

Reduces false positives from 7 to ~1-2 for IHK Konstanz test case.
Ground truth table: zeroclaw/docs/ground-truth-ihk-konstanz.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 14:51:09 +02:00
Benjamin Admin 293c58d0dd feat: Add actionable hints to all 138 compliance checks
Build + Deploy / build-admin-compliance (push) Successful in 1m40s
Build + Deploy / build-backend-compliance (push) Successful in 7s
Build + Deploy / build-ai-sdk (push) Successful in 35s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 7s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m50s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 40s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 23s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
Each check now has a "hint" field explaining what is missing and
what the customer should do to fix it. Hints are shown in the
frontend below failed checks in red text.

Examples:
- "Bei Verarbeitung auf Basis von Art. 6(1)(f) muss dokumentiert
  werden, warum Ihr berechtigtes Interesse die Rechte der
  Betroffenen ueberwiegt."
- "Die ladungsfaehige Anschrift fehlt. Erforderlich: Strasse,
  Hausnummer, PLZ und Ort."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 14:05:01 +02:00
Benjamin Admin b363c28539 feat: Add 76 Level-2 regex checks for document correctness verification
Split dsi_document_checker.py (466 LOC) into doc_checks/ package (9 files).
Two-pass L1→L2 logic: L1 checks "Is it mentioned?", L2 checks "Is it correct?"
(e.g. controller has full address, specific Art. 6 lit., concrete time periods).

138 total checks (62 L1 + 76 L2) across 7 doc types:
- DSE Art. 13: 31, Impressum §5 TMG: 16, Cookie §25 TDDDG: 15
- Widerruf §355: 15, AGB §305ff: 21, Social Media Art. 26: 20, DSFA Art. 35: 18

Frontend: hierarchical L1→L2 display with dual progress bars
(green=completeness, blue=correctness).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 12:37:03 +02:00