Commit Graph

184 Commits

Author SHA1 Message Date
Benjamin Admin 51d91d20ed fix: 6 false positives from Stadt Koeln + Caritas verification
- Phone regex allows parentheses: +49 (0)761 now matches
- "Recht auf Widerspruch" (3 words) + §23 KDG recognized
- Church authorities: "Katholisches Datenschutzzentrum", KdoeR
- "Artikel 6 Absatz 1 Buchstabe a" (unabbreviated) now matches
- "PHP Session ID" (with spaces) alongside "PHPSESSID"

6 FP eliminated across Caritas (KDG) and Stadt Koeln (verbose forms).
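
The first and last bullets above can be illustrated with a minimal sketch. The pattern names and the exact character classes are assumptions for illustration; the repository's real patterns are not shown in this log.

```python
import re

# Hypothetical phone pattern: the optional "(0)" group is the part the commit describes.
PHONE_RE = re.compile(
    r"\+49[\s\-/]*"              # country code
    r"(?:\(0\)[\s\-/]*)?"        # optional trunk prefix in parentheses, e.g. "(0)"
    r"\d{2,5}"                   # area code
    r"(?:[\s\-/]*\d{2,}){1,4}"   # one to four subscriber number blocks
)

# Hypothetical cookie-name pattern accepting "PHPSESSID" and "PHP Session ID".
PHP_SESSION_RE = re.compile(r"php\s*sess(?:ion)?\s*id", re.IGNORECASE)


def has_phone(text: str) -> bool:
    """Return True if the text contains a German number in +49 notation."""
    return PHONE_RE.search(text) is not None


def mentions_php_session(text: str) -> bool:
    """Return True if the PHP session cookie is named in either spelling."""
    return PHP_SESSION_RE.search(text) is not None
```

With these assumptions, `has_phone("Tel.: +49 (0)761 200-0")` is true, while a bare "Art. 49 DSGVO" is not treated as a phone number.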

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-08 01:31:36 +02:00
Benjamin Admin 686834cea0 feat: 4 remaining tasks — EU institutions, banner integration, JS-sites, Caritas fixes
1. EU Institution Checks (Verordnung 2018/1725):
   - New doc_type "eu_institution" with 9 L1 + 15 L2 checks
   - Both German + English patterns (EU institutions are multilingual)
   - Auto-detection via "2018/1725", "EDSB", "EDPS" keywords
   - Correct article references (Art. 15 instead of 13, Art. 5 instead of 6)

2. Banner Check Integration:
   - banner_runner.py maps scan results to 36 L1/L2 structured checks
   - BannerCheckTab shows hierarchical ChecklistView with hints
   - 3-phase summary (cookies/scripts before/after consent)
   - /scan endpoint now includes structured_checks in response

3. JS-heavy Website Fixes (dm, Zalando, HWK):
   - dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback)
   - try_dismiss_consent_banner before text extraction
   - PDF redirect detection (dm.de redirects to GCS PDF)

4. Caritas False Positive Fixes:
   - Phone regex allows parentheses: +49 (0)761 → now matches
   - "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21
   - Church authorities: "Katholisches Datenschutzzentrum" recognized
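
The navigation fallback from item 3 could look roughly like the sketch below. The real signature of `goto_resilient` in dsi_helpers.py is not shown here, so the parameters are assumptions; `page` is expected to behave like a Playwright page, whose `goto` accepts `wait_until` and `timeout` keyword arguments.

```python
def goto_resilient(page, url: str, timeout_ms: int = 15000):
    """Try the strict 'networkidle' wait first, then fall back to 'domcontentloaded'.

    JS-heavy sites may never become network-idle, so the second attempt
    accepts an earlier load state instead of giving up.
    """
    last_error = None
    for wait_until in ("networkidle", "domcontentloaded"):
        try:
            return page.goto(url, wait_until=wait_until, timeout=timeout_ms)
        except Exception as exc:  # with Playwright this is typically its TimeoutError
            last_error = exc
    # Both strategies failed: surface the most recent error to the caller.
    raise last_error
```

Catching the broad `Exception` keeps the sketch free of a Playwright import; production code would more likely catch the library's timeout error specifically.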

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-08 01:10:10 +02:00
Benjamin Admin 3efc491ec5 fix: 5 false positives from etogruppe.com ground truth
1. Soft hyphens (­/\xad) stripped before regex matching —
   fixes "Daten­übertrag­barkeit" not matching
2. Art. 15/17/20: allow adjectives between "Recht auf" and keyword
   ("Recht auf unentgeltliche Auskunft" now matches)
3. DSB contact: regex spans up to 300 chars across newlines
   (DSB section with company address between heading and email)
4. Löschkonzept: added "Fortfall", "Entfall", "Beendigung" as
   deletion trigger words alongside "Ablauf"/"Wegfall"

Reduces etogruppe FPs from 5 to ~1.
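
Fixes 1 and 2 can be sketched as follows. The function and pattern names are hypothetical, and the limit of two intervening words is an assumption chosen for illustration.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible hyphenation hint that breaks naive regex matching


def normalize(text: str) -> str:
    """Strip soft hyphens and lowercase the text before any pattern runs."""
    return text.replace(SOFT_HYPHEN, "").lower()


# Allow up to two optional words (e.g. adjectives) between "recht auf" and the keyword.
RIGHT_OF_ACCESS_RE = re.compile(r"recht\s+auf\s+(?:\w+\s+){0,2}auskunft")


def mentions_right_of_access(text: str) -> bool:
    """Return True if the Art. 15 right of access is named, with or without adjectives."""
    return RIGHT_OF_ACCESS_RE.search(normalize(text)) is not None
```

After normalization, "Daten\u00adübertrag\u00adbarkeit" becomes "datenübertragbarkeit", so a plain keyword pattern matches again.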

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 23:51:04 +02:00
Benjamin Admin e50f3dfbee feat: All 138 hints rewritten as expert-level legal guidance
Every hint now reads like a mini-consultation from a data protection
lawyer — with specific legal references, court rulings, and common
mistakes. Examples:

- EuGH C-210/16 (Fanpage), C-298/17 (Kontaktpflicht), C-311/18 (Schrems II)
- BGH I ZR 228/03 (ladungsfaehige Anschrift), XI ZR 388/10 (AGB)
- EDSA Guidelines 2/2019 (lit. b misuse), WP 248 Rev.01 (DSFA)
- DSK-Orientierungshilfe, CNIL-Leitlinien, SDM, BSI-IT-Grundschutz
- §25 TDDDG, §38 BDSG, §309 BGB, §312k BGB, Art. 246a EGBGB

This is the core value proposition: no lawyer can deliver this level
of specific, actionable compliance feedback in 60 seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 17:13:37 +02:00
Benjamin Admin a2f8366171 improve: Drittlandtransfer hint mentions Privacy Shield invalidity
Hint now explicitly warns that EU-US Privacy Shield is invalid since
Schrems II (July 2020) and recommends DPF or SCC as replacements.
This is the kind of specific, actionable feedback that makes the tool
valuable — catching outdated legal references no human would spot
in under a minute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 17:01:56 +02:00
Benjamin Admin f51671737a fix: Correct Ollama model name + strict blank-line heading detection
1. LLM model: qwen3:32b → qwen3.5:35b-a3b (actual model on Mac Mini)
2. Section splitter: headings MUST be preceded by a blank line.
   This prevents cookie table entries ("Funktionale Cookies",
   "Session Cookies") from splitting the cookie section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 15:53:53 +02:00
Benjamin Admin 4f29e5ff3c feat: LLM verification for regex FAILs + section-split hardening
Path to 100% correctness: Regex finds 80%, LLM catches the rest.

1. LLM verification (llm_verify.py):
   - Every regex FAIL is re-checked by Qwen (qwen3:32b)
   - Binary YES/NO question with evidence extraction
   - Overturned checks marked with [LLM] prefix in matched_text
   - Graceful fallback if LLM unavailable

2. Section splitter hardening:
   - Short lines (<16 chars) only treated as headings if preceded
     by blank line — prevents table column headers ("Funktion",
     "Speicherdauer") from splitting cookie sections
   - Fixes IHK cookie section: 288 words → full section

3. DSFA documentation patterns expanded:
   - Recognizes "4.) Ergebnis:" numbered result sections
   - Matches risk assessment conclusions
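
The blank-line rule from the section splitter hardening can be sketched as below. This is not the project's splitter: the function name, the 40-character heading limit, and the punctuation test are assumptions. Only the rule "a short line counts as a heading only when preceded by a blank line" comes from the commit message.

```python
def split_sections(text: str, max_heading_len: int = 40) -> list[tuple[str, str]]:
    """Split text into (heading, body) pairs.

    A short line is treated as a heading only if the previous line was blank
    (or it is the first line). Short table cells inside a section therefore
    stay in the body instead of starting a new section.
    """
    sections: list[tuple[str, str]] = []
    title, body = "", []
    prev_blank = True  # the start of the document counts as "after a blank line"
    for line in text.splitlines():
        stripped = line.strip()
        looks_like_heading = (
            bool(stripped)
            and len(stripped) <= max_heading_len
            and stripped[-1] not in ".:,;"
        )
        if looks_like_heading and prev_blank:
            if title or body:
                sections.append((title, "\n".join(body)))
            title, body = stripped, []
        elif stripped:
            body.append(stripped)
        prev_blank = not stripped
    if title or body:
        sections.append((title, "\n".join(body)))
    return sections
```

In a cookie section, rows such as "Funktionale Cookies" or "Session Cookies" that directly follow other text remain part of the section body under this rule.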

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 15:34:07 +02:00
Benjamin Admin fa4fd87102 fix: 7 regex bugs from IHK Konstanz ground truth analysis
Fixes based on manual verification of all 30 failed checks:
1. Cookie table: recognize "folgende cookies" + column headers as text
2. Cookie names: add JSESSIONID, cookieinfo, et_id, BT_* patterns
3. Essential justified: match "sitzung zuordnen", "betrieb der website"
4. Social bookmarks: recognize as 2-click alternative
5. DSFA plural: "kanaelen" now matches alongside "kanal"
6. Section splitter: skip-headings no longer lose subsequent text
   (Risikoabwaegung section was cut from DSFA, losing risk scores)
7. Cookie legal basis: accept Art. 6(1)(f) in cookie context

Reduces false positives from 7 to ~1-2 for IHK Konstanz test case.
Ground truth table: zeroclaw/docs/ground-truth-ihk-konstanz.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 14:51:09 +02:00
Benjamin Admin 293c58d0dd feat: Add actionable hints to all 138 compliance checks
Each check now has a "hint" field explaining what is missing and
what the customer should do to fix it. Hints are shown in the
frontend below failed checks in red text.

Examples:
- "Bei Verarbeitung auf Basis von Art. 6(1)(f) muss dokumentiert
  werden, warum Ihr berechtigtes Interesse die Rechte der
  Betroffenen ueberwiegt."
- "Die ladungsfaehige Anschrift fehlt. Erforderlich: Strasse,
  Hausnummer, PLZ und Ort."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 14:05:01 +02:00
Benjamin Admin 870953f579 fix: PLZ regex matches lowercase text and D-78467 format
Patterns ran on text.lower() but searched [A-Z] — changed to [a-z].
Also accept D-12345 prefix (common German format).
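
A minimal reproduction of the bug and the fix, with hypothetical pattern names; the actual patterns in the repository may differ in detail.

```python
import re

# Buggy shape (for comparison): an uppercase class can never match lowercased input.
OLD_PLZ_RE = re.compile(r"\b\d{5}\s+[A-Z][a-z]+")

# Fixed shape: lowercase classes plus an optional "d-" country prefix.
PLZ_RE = re.compile(r"\b(?:d-)?\d{5}\s+[a-zäöüß][a-zäöüß\-]+")


def has_postal_code(text: str) -> bool:
    """Match a 5-digit German postal code followed by a place name."""
    return PLZ_RE.search(text.lower()) is not None
```

Because the text is lowercased before matching, every character class in the pattern has to be lowercase as well.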

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 13:28:00 +02:00
Benjamin Admin b363c28539 feat: Add 76 Level-2 regex checks for document correctness verification
Split dsi_document_checker.py (466 LOC) into doc_checks/ package (9 files).
Two-pass L1→L2 logic: L1 checks "Is it mentioned?", L2 checks "Is it correct?"
(e.g. controller has full address, specific Art. 6 lit., concrete time periods).
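
The two-pass idea can be sketched like this. The `Check` dataclass, the `run_checks` function, and the example patterns are illustrative assumptions, not the contents of the doc_checks/ package.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Check:
    id: str
    pattern: str                      # regex applied to lowercased text
    children: list["Check"] = field(default_factory=list)  # L2 checks under an L1 check


def run_checks(text: str, checks: list[Check]) -> dict[str, bool]:
    """L1 asks "is it mentioned?"; L2 asks "is it correct?" and needs L1 to pass."""
    lowered = text.lower()
    results: dict[str, bool] = {}
    for l1 in checks:
        l1_passed = re.search(l1.pattern, lowered) is not None
        results[l1.id] = l1_passed
        for l2 in l1.children:
            # An L2 check cannot pass if the topic is not mentioned at all.
            results[l2.id] = l1_passed and re.search(l2.pattern, lowered) is not None
    return results


# Hypothetical example: retention is mentioned (L1) with a concrete period (L2).
RETENTION = Check(
    id="retention",
    pattern=r"speicherdauer|gespeichert",
    children=[Check(id="retention.period", pattern=r"\d+\s+(?:tage|monate|jahre)")],
)
```

A text such as "Ihre Daten werden gespeichert, solange dies erforderlich ist." passes the L1 check but fails the L2 check, which is what the dual progress bars distinguish.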

138 total checks (62 L1 + 76 L2) across 7 doc types:
- DSE Art. 13: 31, Impressum §5 TMG: 16, Cookie §25 TDDDG: 15
- Widerruf §355: 15, AGB §305ff: 21, Social Media Art. 26: 20, DSFA Art. 35: 18

Frontend: hierarchical L1→L2 display with dual progress bars
(green=completeness, blue=correctness).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 12:37:03 +02:00
Benjamin Admin 3c12e06faf feat: Fix DSFA dedup + expand all checklists to 56 total checks
Fixes:
- 'Risikoabwaegung' is sub-section of DSFA → added to SKIP_HEADINGS
- 'Social Media' standalone heading → recognized as social_media DSE
- Removed 'risikobew' from DSFA pattern (was too broad)

Expanded checklists:
- Widerruf: 4→7 checks (+Empfaenger, kein Grund, §312k Button)
- AGB: 4→9 checks (+Zahlung, Lieferung, Gewaehrleistung, Kuendigung, Datenschutz)
- Social Media: +1 (Social Bookmarks)
- DSFA: +1 (LFDI Richtlinie)

Total: 47→56 Regex-Checks across 7 document types:
DSI=9, Cookie=5, Social Media=10, DSFA=8, Impressum=6, Widerruf=7, AGB=9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 11:55:29 +02:00
Benjamin Admin 4642abba23 feat: Expand Social Media (10 checks) + DSFA (8 checks) checklists
Art. 26 Joint Controller (10 checks, was 7):
+ List of the platforms used
+ Legal basis (Art. 6)
+ Note on social bookmarks vs. plugins
Improved: broader patterns for joint parties, contact point, data types

DSFA Art. 35 (8 checks, was 5):
+ Threshold analysis / trigger check
+ Consideration of the state authority guideline (LFDI)
+ Documentation of the results
Improved: IHK-specific patterns (Kanäle, systematische Beobachtung,
geringer Umfang, sensitive Daten)

Total: 40 → 47 Regex-Checks across all document types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 11:17:16 +02:00
Benjamin Admin 3853a0838a feat: Art. 26 Joint Controller + DSFA checklists for Social Media sections
New checklists:
- JOINT_CONTROLLER_CHECKLIST (Art. 26 DSGVO, 7 checks):
  Joint parties, arrangement, contact point, processing split,
  data categories, third-country transfer (USA), rights
- DSFA_CHECKLIST (Art. 35 DSGVO, 5 checks):
  Description, necessity, risk assessment, measures, DSB involvement

Section detection: 'Datenschutzerklaerung fuer Social Media' → social_media,
'Datenschutzfolgeabschaetzung/Risikoanalyse' → dsfa

classify_document_type: DSFA and social_media detected before generic DSE

Frontend: DOC_TYPES dropdown + ChecklistView labels updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 10:49:32 +02:00
Benjamin Admin 45446aef16 fix: 8 quality + UX improvements
1. Cookie 'Zwecke' false positive: added 'um...zu', 'dienen', 'helfen',
   'ermöglichen' patterns — catches purpose descriptions without 'Zweck'
2. Kurzhinweis: added empty all_checks for short documents (<200 words)
3. Bezeichnungsfeld: placeholder shows 'Version / Stand' for typed docs,
   'Dokumentname' for 'Sonstiges'
4. DocCheckTab state persistence: entries + results survive navigation
5. DocCheck history: saves each check with date, doc count, findings
6. History display: 'Letzte Pruefungen' section at bottom of tab
7. ChecklistView: shows 'X von Y Pruefpunkten bestanden' per document
8. Results persist in localStorage across page navigation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 09:37:47 +02:00
Benjamin Admin a680276c86 fix: Filter controls by test_procedure content — eliminates governance false positives
Only use controls whose test_procedure mentions document-type-specific terms:
- DSI: test_procedure must contain 'datenschutzerkl' or 'art. 13/14'
- Cookie: must contain 'cookie', 'einwilligung', 'consent'
- Impressum: must contain 'impressum'

This filters out internal governance controls (Datenmodelle, Infrastruktur)
that are irrelevant for public document checks.
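
The filter could be sketched as follows. The term lists come from the bullets above; reading 'art. 13/14' as the two terms "art. 13" and "art. 14", the dict-shaped controls, and the function name are assumptions.

```python
# Terms a control's test_procedure must mention to be used for a document type.
REQUIRED_TERMS = {
    "dsi": ("datenschutzerkl", "art. 13", "art. 14"),
    "cookie": ("cookie", "einwilligung", "consent"),
    "impressum": ("impressum",),
}


def filter_controls(controls: list[dict], doc_type: str) -> list[dict]:
    """Keep only controls whose test_procedure mentions a doc-type-specific term."""
    terms = REQUIRED_TERMS.get(doc_type, ())
    return [
        control
        for control in controls
        if any(term in (control.get("test_procedure") or "").lower() for term in terms)
    ]
```

An unknown document type yields an empty list in this sketch, which errs on the side of running no control rather than an irrelevant one.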

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:42:35 +02:00
Benjamin Admin fa45b5793c feat: Control Library check via SQL (canonical_controls) instead of Qdrant
Complete rewrite of rag_document_checker.py:
- Queries canonical_controls table (294K controls, 10K data_protection)
- Filters by category + title keywords per document type
- Uses test_procedure field as actual check instructions
- Regex pre-check extracts key terms from procedure → fast match
- LLM fallback only for regex misses (saves tokens)
- /no_think prefix for direct JSON output
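
The "regex first, LLM only on a miss" flow might be sketched as below. The function name, the result shape, and the `ask_llm` callable (assumed to return already-parsed JSON) are hypothetical; how key terms are extracted from `test_procedure` is not shown in this log, so they are passed in directly.

```python
import re
from typing import Callable


def check_control(
    document_text: str,
    key_terms: list[str],
    ask_llm: Callable[[str], dict],
) -> dict:
    """Return a verdict for one control, spending LLM tokens only on regex misses."""
    lowered = document_text.lower()
    for term in key_terms:
        if re.search(re.escape(term.lower()), lowered):
            return {"fulfilled": True, "source": "regex", "evidence": term}
    # Regex miss: ask the model, with thinking mode disabled via the prompt prefix.
    prompt = "/no_think Does this document satisfy the control? Reply as JSON.\n\n" + document_text
    answer = ask_llm(prompt)
    return {
        "fulfilled": bool(answer.get("fulfilled")),
        "source": "llm",
        "evidence": answer.get("evidence", ""),
    }
```

The `source` field is an addition for illustration; it makes it easy to see how often the slower path was needed.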

SQL approach advantages:
- Structured data with test_procedure, pass_criteria, fail_criteria
- Category filtering (data_protection, compliance, governance)
- No Qdrant API key issues
- Controls are actual check criteria, not general legal texts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:26:56 +02:00
Benjamin Admin 6da36d87c2 fix: Robust JSON parsing for LLM responses — handles unquoted keys, fallback extraction
LLM returns {fulfilled: true} instead of {"fulfilled": true}.
Now fixes unquoted keys, True→true, and falls back to text-based
boolean extraction when JSON parsing fails entirely.
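
A tolerant parser along these lines might look like the sketch below. The function name and the exact repair rules are assumptions; the three steps (quote bare keys, normalize Python literals, text fallback) follow the description above.

```python
import json
import re


def parse_llm_json(raw: str) -> dict:
    """Parse near-JSON from an LLM reply, repairing common deviations."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    candidate = match.group(0) if match else raw
    # Quote bare keys: {fulfilled: true} -> {"fulfilled": true}
    candidate = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', candidate)
    # Python-style literals -> JSON literals
    candidate = re.sub(r"\bTrue\b", "true", candidate)
    candidate = re.sub(r"\bFalse\b", "false", candidate)
    candidate = re.sub(r"\bNone\b", "null", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Last resort: derive the boolean from the plain text of the reply.
        lowered = raw.lower()
        return {"fulfilled": "true" in lowered and "false" not in lowered}
```

The regex repairs are deliberately simple and can corrupt string values that happen to contain `, word:` or a capitalized `True`; in that case parsing fails and the text fallback takes over.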

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:18:52 +02:00
Benjamin Admin e50c4d659e fix: Disable Qwen thinking mode for RAG checks (/no_think prefix)
Qwen 3.5 uses all tokens for thinking, leaving response empty.
Using /no_think prefix to get direct JSON output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:12:51 +02:00
Benjamin Admin 9f16e6d535 fix: Read Qwen response from 'thinking' field when 'response' is empty
Qwen 3.5 with latest Ollama returns structured thinking in separate
'thinking' field, leaving 'response' empty. Now checks both fields.
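
Reading both fields can be sketched in a few lines. The helper name is hypothetical; the two field names are the ones named in this commit message.

```python
def extract_answer(payload: dict) -> str:
    """Return the model's text, preferring 'response' and falling back to 'thinking'."""
    for key in ("response", "thinking"):
        value = (payload.get(key) or "").strip()
        if value:
            return value
    return ""
```

Checking `response` first keeps the behavior unchanged for models that do not emit a separate thinking field.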

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:07:09 +02:00
Benjamin Admin f4374cfe8d feat: Semantic Qdrant search — embed query via bge-m3, vector search in local Qdrant
Replaces scroll+filter approach with proper semantic search:
1. Embed query via bp-core-embedding-service (bge-m3, 1024 dim)
2. Vector search in Qdrant (bp_compliance_datenschutz + bp_compliance_gesetze)
3. Sort by cosine similarity score
4. No API key needed — local Qdrant on Mac Mini

Falls back gracefully: SDK first, then semantic Qdrant, then empty.
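
Step 3 amounts to ranking by cosine similarity. In the setup described above Qdrant would compute the scores server-side; the in-memory stand-in below only illustrates the ranking, and the point structure (`vector`, `payload`) is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], points: list[dict], top_k: int = 3) -> list[tuple[float, dict]]:
    """Score every point against the query and return the best top_k, highest first."""
    scored = [(cosine(query_vec, p["vector"]), p["payload"]) for p in points]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]
```

Real bge-m3 embeddings have 1024 dimensions; the function is dimension-agnostic, so small vectors are enough to try it out.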

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:46:06 +02:00
Benjamin Admin 7b8440191e fix: Better error logging + increase LLM timeout to 120s for RAG check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:33:58 +02:00
Benjamin Admin 510f513811 fix: Qdrant search uses chunk_text + section/category filter
Payload structure: chunk_text (not text), section (Article 13),
category, regulation_id. Scrolls 100 points per collection,
filters client-side against regulation keywords.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:28:32 +02:00
Benjamin Admin b50c4ec940 fix: RAG checker falls back to local Qdrant when Go SDK returns 401
Go SDK points to external Qdrant (qdrant-dev.breakpilot.ai) with expired API key.
Fallback: search directly in local Qdrant (bp-core-qdrant:6333) which has
all collections: bp_compliance_datenschutz, bp_compliance_gesetze, atomic_controls_dedup.

Search strategy:
1. Try Go SDK RAG endpoint (preferred, has embedding-based search)
2. Fallback: Qdrant scroll with text-based regulation filter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:23:52 +02:00
Benjamin Admin 090da0f71b feat: RAG-based document verification against 144K Control Library
New module: rag_document_checker.py
- Searches RAG (Qdrant) for controls relevant to document type
- Filters by regulation (DSGVO Art.13, TDDDG §25, BGB §355 etc.)
- LLM (Qwen 3.5:35b) verifies each control against document text
- Returns fulfilled/missing with evidence text + severity
- Supports: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept

Integration in doc-check endpoint:
- Regex checklist runs first (fast, deterministic)
- RAG checks run after (semantic, catches what regex misses)
- Both results combined in single response

LLM prompt returns JSON: {fulfilled, evidence, issue, severity}
Think-tags stripped, JSON extracted from response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 13:19:15 +02:00
Benjamin Admin 4c68caac4e feat: Multi-URL Document Check with full checklist visibility
New "Dokumenten-Pruefung" tab in Compliance Agent:
- User adds multiple URLs with document type (DSI, AGB, Impressum, Cookie, Widerruf)
- Each document loaded via Playwright, accordions expanded, text extracted
- Checked against type-specific legal checklist
- Optional: Cookie banner check via checkbox

Checklist UX (solves "100% looks like nothing was checked"):
- All checks shown per document: green checkmark + matched text excerpt
- Red X for missing fields with legal reference
- Builds user trust: "9 Punkte geprueft, alle bestanden"
- Expandable per document with completeness bar

New checklists:
- Impressum: §5 TMG (6 fields: name, address, contact, register, VAT, representative)
- Cookie-Richtlinie: §25 TDDDG (5 fields: types, purposes, retention, third-party, opt-out)

Backend:
- POST /agent/doc-check — async with polling (same pattern as /scan)
- DocCheckResult includes checks[] with passed/failed + matched_text
- dsi_document_checker returns all_checks in SCORE finding
- Email report shows per-document checklist

Files: agent_doc_check_routes.py (280 LOC), DocCheckTab.tsx (248 LOC),
ChecklistView.tsx (130 LOC), dsi_document_checker.py (+70 LOC)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 10:08:40 +02:00
Benjamin Admin 8fb2061e9b fix: Eliminate GA false positive + handle short DSI documents
Service detection:
- Only search script tags + src/href attributes for service patterns
- Prevents false positives from DSE text mentioning services
  (e.g. IHK DSE describes etracker, 'google analytics' in text)
- Technical patterns (with regex chars) still checked in full HTML
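
Restricting the search to script content and `src`/`href` attributes can be sketched with the standard library's `html.parser`. The class and function names and the plain-substring patterns are assumptions; the project's own detection may use a different parser and also checks technical patterns against the full HTML.

```python
from html.parser import HTMLParser


class _SourceCollector(HTMLParser):
    """Collect inline script code plus every src/href attribute value."""

    def __init__(self):
        super().__init__()
        self.haystack: list[str] = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.haystack.append(value)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.haystack.append(data)


def detect_services(html: str, patterns: dict[str, str]) -> list[str]:
    """Return names of services whose pattern occurs in scripts or src/href values."""
    collector = _SourceCollector()
    collector.feed(html)
    haystack = "\n".join(collector.haystack).lower()
    return sorted(name for name, needle in patterns.items() if needle in haystack)
```

Visible text is never added to the haystack, so a privacy policy that merely names a service no longer counts as evidence that the service is embedded.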

Short documents:
- Documents with < 200 words flagged as 'Kurzhinweis' instead of
  'MANGELHAFT' — too short for Art. 13 completeness check
- Prevents 96-word navigation pages from showing 8 missing fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:21:37 +02:00
Benjamin Admin 8d6959e8b2 fix: Expand Art. 13 patterns for generic matching across all websites
Complaint (Art. 13(2)(d)):
+ 'recht auf beschwerde', 'art. 77', 'beschwerde...wenden/einlegen',
  'zuständige behörde' — IHK uses 'Recht auf Beschwerde gem. Art. 77'

Legal basis (Art. 13(1)(c)):
+ 'gemäß Art.', '§ X IHKG/BDSG/LDSG/BBiG/TDDDG', 'einwilligung gem',
  'verarbeitung auf grundlage' — catches statutory references

Third country (Art. 13(1)(f)):
+ 'Übermittlung ausserhalb', 'EWR/EEA', 'Data Privacy Framework'

Retention (Art. 13(2)(a)):
+ 'Dauer der Speicherung', 'Aufbewahrungsdauer/-pflicht/-zeit',
  'gesetzliche Aufbewahrung' — common German DSE headings

All patterns are generic, not IHK-specific.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 17:45:02 +02:00
Benjamin Admin e3ae35891f fix: 0% completeness bug — SCORE finding was not generated at 100%
Root cause: When all 9 Art. 13 checks passed (100%), no SCORE finding
was created (line: 'if pct < 100'). The backend then defaulted to
completeness=0 because it looked for the SCORE finding to extract the %.

Fix: Always generate SCORE finding, even at 100%. Added 'OK' severity
for fully compliant documents.

This was the cause of 8 documents showing '0% MANGELHAFT' despite
containing all required information.
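
The bug and its fix can be reproduced with a small sketch. The function names, the finding shape, and the two-level severity are assumptions; only the unconditional SCORE finding, the 'OK' severity, and the default of 0 come from the description above.

```python
def build_score_finding(passed: int, total: int) -> dict:
    """Always emit a SCORE finding; the bug was an `if pct < 100` guard around this."""
    pct = round(100 * passed / total) if total else 0
    severity = "OK" if pct == 100 else "MANGELHAFT"
    return {"type": "SCORE", "pct": pct, "severity": severity}


def completeness_from(findings: list[dict]) -> int:
    """Read completeness from the SCORE finding, as the backend is described to do."""
    for finding in findings:
        if finding.get("type") == "SCORE":
            return finding["pct"]
    return 0  # the default that produced the misleading "0%" before the fix
```

With the finding always present, a document passing 9 of 9 checks reports 100 instead of falling through to the default.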

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 15:34:04 +02:00
Benjamin Admin 6c5e086356 fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives
Dedup fixes:
- Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
- Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic)
- Documents with < 50 words filtered (navigation snippets)
- Documents with identical word_count merged (same page, different title)
- URL-only titles filtered
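
The dedup rules above could be combined roughly as follows. The document shape (`url`, `title`, `word_count`), the function name, and the choice to keep the first of several same-length documents are assumptions for illustration.

```python
from urllib.parse import urldefrag

NOISE_TITLES = {"drucken", "nach oben", "datenschutz"}


def dedup_documents(page_url: str, docs: list[dict], min_words: int = 50) -> list[dict]:
    """Drop anchors, noise titles, tiny snippets, and same-length duplicates."""
    kept = []
    seen_word_counts = set()
    for doc in docs:
        base_url, fragment = urldefrag(doc["url"])
        title = doc["title"].strip().lower()
        if fragment and base_url == page_url:
            continue  # anchor link into the page that is already being checked
        if title in NOISE_TITLES or title.startswith(("http://", "https://")):
            continue  # generic or URL-only title
        if doc["word_count"] < min_words:
            continue  # navigation snippet
        if doc["word_count"] in seen_word_counts:
            continue  # same page reached under a different title
        seen_word_counts.add(doc["word_count"])
        kept.append(doc)
    return kept
```

Using an identical word count as the duplicate signal is a heuristic: two genuinely different documents of the same length would be merged as well.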

False positive fixes (dsi_document_checker.py):
- 'Kontaktdaten des Verantwortlichen' pattern for controller check
- 'Zweck und Rechtsgrundlage' combined heading pattern
- 'Welche Daten werden verarbeitet' question-style headings
- 'Betroffenenrechte' as standalone heading
- 'Welche Rechte hat der Betroffene' question pattern
- 'Daten werden geloescht' retention pattern
- 'Auftragsverarbeiter' as recipient indicator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:41:07 +02:00
Benjamin Admin f967480cd9 fix: Add missing service_registry.py to main
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:34:00 +02:00
Benjamin Admin a18ef16378 fix: Add missing service modules required by agent_scan_routes
These files existed on the feature branch but were never cherry-picked
to main, causing ModuleNotFoundError on import:
- dse_parser.py — parses DSE HTML into structured sections
- dse_matcher.py — matches detected services against DSE sections
- mandatory_content_checker.py — checks Art. 13 DSGVO mandatory fields
- legal_basis_validator.py — validates legal basis (lit. a-f)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:22:30 +02:00
Benjamin Admin 48146cddaf feat: DSI document discovery + completeness check in agent scan workflow
Agent scan now automatically:
1. Discovers all legal documents via consent-tester /dsi-discovery endpoint
2. Classifies each as DSE/AGB/Widerruf/Cookie/Impressum
3. Checks completeness against type-specific checklists:
   - DSE: 9 Art. 13 DSGVO mandatory fields (controller, DPO, purposes,
     legal basis, recipients, third-country, retention, rights, complaint)
   - AGB: §305ff BGB (scope, contract formation, liability, jurisdiction)
   - Widerruf: §355 BGB (right info, 14-day deadline, form, consequences)
4. Adds findings per document to scan results
5. Shows discovered documents with completeness % in email summary
6. Returns discovered_documents list in API response

New files:
- dsi_document_checker.py (229 LOC) — checklists + classifier
- agent_scan_helpers.py (109 LOC) — extracted summary builder + corrections

Refactor: agent_scan_routes.py 537→448 LOC (under 500 budget)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:10:13 +02:00
Benjamin Admin 53f6f30cf0 feat: DSI document discovery + completeness check in agent scan workflow
Agent scan now automatically:
1. Discovers all legal documents via consent-tester /dsi-discovery endpoint
2. Classifies each as DSE/AGB/Widerruf/Cookie/Impressum
3. Checks completeness against type-specific checklists:
   - DSE: 9 Art. 13 DSGVO mandatory fields (controller, DPO, purposes,
     legal basis, recipients, third-country, retention, rights, complaint)
   - AGB: §305ff BGB (scope, contract formation, liability, jurisdiction)
   - Widerruf: §355 BGB (right info, 14-day deadline, form, consequences)
4. Adds findings per document to scan results
5. Shows discovered documents with completeness % in email summary
6. Returns discovered_documents list in API response

New files:
- dsi_document_checker.py (229 LOC) — checklists + classifier
- agent_scan_helpers.py (109 LOC) — extracted summary builder + corrections

Refactor: agent_scan_routes.py 537→448 LOC (under 500 budget)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:09:45 +02:00
Benjamin Admin d3c8811fdb feat: IAB TCF 2.2 — TC String encoder + purpose mapping + UI
- TCFEncoderService: generates base64url-encoded TC Strings per IAB spec
  with 12 purposes, vendor consent bitfield, CMP metadata
- Category-to-purpose mapping (necessary→none, statistics→1,7,8,9,10,
  marketing→1,2,3,4,5,6,7,12, functional→1,11)
- tcf_routes: 5 endpoints (purposes, features, mapping, encode, encode-categories)
- banner_consent_service: auto-generates TC String when tcf_enabled=true
- TCFSettings.tsx: enable/disable toggle, purpose grid with category mapping,
  TC String test generator, CMP registration info
- New "TCF/IAB" tab in cookie-banner page (7 tabs total)
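
The category-to-purpose mapping listed above can be expressed as a plain lookup. The sketch below only resolves consented categories to purpose IDs; it does not encode a TC String, and the function name is hypothetical.

```python
# Purpose IDs per banner category, as listed in the commit message.
CATEGORY_TO_PURPOSES = {
    "necessary": [],
    "statistics": [1, 7, 8, 9, 10],
    "marketing": [1, 2, 3, 4, 5, 6, 7, 12],
    "functional": [1, 11],
}


def purposes_for_categories(categories):
    """Union of purpose IDs for the consented categories, sorted ascending."""
    purposes = set()
    for category in categories:
        # Unknown categories contribute no purposes (assumption).
        purposes.update(CATEGORY_TO_PURPOSES.get(category, []))
    return sorted(purposes)
```

Taking the union means a purpose shared by several categories (such as purpose 1) is granted as soon as any one of them is accepted.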

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 07:01:37 +02:00
Benjamin Admin eb4ea8bc42 feat: EmailDeliveryService + professional DSR email templates
- EmailDeliveryService: load template → find published version →
  render {{variables}} → send via SMTP → audit log. Fallback to
  inline HTML when no published template exists.
- Migration 117: Professional HTML/text content for all 5 DSR
  templates (receipt, completion, rejection, identity, extension)
  with branded styling and correct DSGVO article references
- DSRArt11Service now uses EmailDeliveryService with dsr_rejection
  template instead of hardcoded HTML
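
The render-with-fallback step could be sketched as follows. The function names and the choice to leave unknown placeholders untouched are assumptions; loading, SMTP delivery, and audit logging are omitted.

```python
import re

# Matches {{name}} and {{ name }} placeholders.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace {{variables}}; unknown placeholders are left as-is (assumption)."""
    def substitute(match):
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)

    return _PLACEHOLDER.sub(substitute, template)


def build_body(published_template, fallback_html, variables):
    """Use the published template when one exists, otherwise the inline HTML."""
    source = published_template if published_template is not None else fallback_html
    return render(source, variables)
```

Leaving unresolved placeholders visible makes a missing variable easy to spot in a test send; raising an error instead would be the stricter alternative.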

[migration-approved]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:38:32 +02:00
Benjamin Admin 060f351da7 feat: Art. 11 DSGVO — reject DSR when data subject not identifiable
- New DSRArt11Service: handles rejection with proper legal basis,
  automated email notification to requester explaining Art. 11
- POST /dsr/{id}/reject-art11 endpoint
- ActionButtons.tsx: "Nicht identifizierbar (Art. 11)" button
  shown when identity is not yet verified
- Also fixes: DSR export type-cast rollback handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:30:18 +02:00
Benjamin Admin c55d0ab12a fix: DSR export type-cast bug + session rollback on partial failures
- tenant_id kept as string (PostgreSQL handles UUID cast)
- Einwilligungen query uses CAST(:tid AS VARCHAR) for compatibility
- Each data source query wrapped with rollback on failure to prevent
  cascading "transaction aborted" errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:15:25 +02:00
Benjamin Admin 02468c94c0 feat: DSR User Data Export — Art. 15 PDF + Art. 20 JSON/CSV
- DSRExportService: aggregates all CMP data about a user from
  Banner Consents, Einwilligungen, Audit Trail, DSR History
- GET /dsr/{id}/export-user-data?format=json|csv|pdf endpoint
- PDF: A4 reportlab with 4 sections (Consents, Einwilligungen,
  Audit-Trail, DSR-Anfragen) + cover page
- CSV: BOM-encoded for Excel with flattened data rows
- JSON: structured export with all data categories
- ActionButtons.tsx: PDF/JSON/CSV export buttons now functional
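
For the CSV variant, a BOM-prefixed export can be produced with the standard library alone. This is a sketch assuming the rows are already flattened to dicts; `export_csv` is a hypothetical name.

```python
import csv
import io


def export_csv(rows):
    """Serialize flat dict rows to UTF-8 CSV bytes with a leading BOM."""
    # Union of all keys so rows with differing columns still fit one header.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    # "utf-8-sig" prepends the BOM that Excel uses to detect UTF-8.
    return buffer.getvalue().encode("utf-8-sig")
```

Without the BOM, Excel tends to open UTF-8 files in a legacy code page, which garbles umlauts in German consent data.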

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:42:03 +02:00
Benjamin Admin 630fffc0cc feat: Academy integration — training gap detection after document approval (F7)
- Migration 115: compliance_role_training_mapping table (org roles → training codes)
- TrainingLinkService: queries training_modules/matrix/assignments to find gaps
  per person and role. Gracefully degrades when Go training tables don't exist yet.
- document_review_routes: 2 new endpoints (training-requirements, training-gaps)
- _notify_approval() now checks training gaps and sends emails to persons
  with outstanding modules, linking to /sdk/training/learner

[migration-approved]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:03:25 +02:00
Benjamin Admin 965af3a34c feat: A/B Testing + Compliance Report PDF (F5 + F8)
F5: A/B Testing for Consent Rate
- Migration 116: banner_variants table + variant tracking in audit log
- BannerABService: deterministic sticky bucketing via device hash,
  chi-squared significance testing, variant CRUD
- banner_ab_routes: 6 endpoints (CRUD + stats + assign)
- ABTestPanel.tsx: variant creation, traffic sliders, opt-in comparison
  chart with winner/significance badges
- New "A/B-Test" tab in cookie-banner page
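
Deterministic sticky bucketing and a chi-squared test can be sketched in a few lines. The SHA-256 hash, the 100-bucket split, and both function names are assumptions for illustration; the commit does not say how BannerABService implements them.

```python
import hashlib


def assign_variant(device_id, variants):
    """Map a device to a variant; the same device always gets the same one.

    variants: list of (name, traffic_percent) pairs summing to 100.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    upper = 0
    for name, percent in variants:
        upper += percent
        if bucket < upper:
            return name
    return variants[-1][0]  # guard if percentages sum to less than 100


def chi_squared_2x2(opt_in_a, total_a, opt_in_b, total_b):
    """Pearson chi-squared statistic for opt-in vs. opt-out across two variants."""
    a, b = opt_in_a, total_a - opt_in_a
    c, d = opt_in_b, total_b - opt_in_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

With one degree of freedom, a statistic above 3.841 corresponds to significance at the 5% level. Hashing the device identifier rather than storing an assignment keeps the bucketing stateless.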

F8: Compliance Report PDF
- CompliancePDFGenerator: reportlab-based A4 PDF covering all modules
  (Company Profile, TOM, VVT, DSFA, Risks, Vendors, Incidents,
  Reviews, Consents, Roles)
- compliance_report_routes: GET /compliance/report/pdf
- "Compliance-Report herunterladen" button on SDK dashboard

[migration-approved]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 21:42:50 +02:00
Benjamin Admin c3fcfe88ee feat: Vendor-level consent + Consent analytics (F4 + F6)
F4: Granular Vendor-Level Consent
- Migration 113: vendor_consents JSONB on banner_consents + audit_log
- ConsentCreate schema + BannerConsentDB model extended
- banner_consent_service stores vendor_consents alongside categories
- Audit trail includes vendor-level decisions + user_agent

F6: Consent Rate Analytics
- Migration 114: user_agent on audit_log + time-series index
- BannerAnalyticsService: time series, category breakdown, device stats
- banner_analytics_routes: 4 endpoints (overview, time-series, categories, devices)
- AnalyticsDashboard.tsx: KPIs, bar chart, category bars, device breakdown
- New "Analytik" tab in cookie-banner page

[migration-approved]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 20:58:06 +02:00
Benjamin Admin fe6764df9a fix: ensure JSONB array fields are always arrays in control API
Backend: _ensure_list() converts null/string/malformed JSONB to []
for requirements, test_procedure, evidence, open_anchors, tags.

Frontend: defensive Array.isArray() check on ControlDetail.tsx.

Fixes: TypeError: A.requirements.map is not a function
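
The backend normalization could look roughly like this. The field list comes from the message above; `normalize_control` and the exact body of `_ensure_list` are assumptions.

```python
# Fields named in the commit message as JSONB arrays.
LIST_FIELDS = ("requirements", "test_procedure", "evidence", "open_anchors", "tags")


def _ensure_list(value):
    """Return value if it is already a list, otherwise an empty list."""
    return value if isinstance(value, list) else []


def normalize_control(row):
    """Copy a control row and force every JSONB array field to a list."""
    normalized = dict(row)
    for field in LIST_FIELDS:
        normalized[field] = _ensure_list(row.get(field))
    return normalized
```

With this in place the frontend can call `.map` on these fields safely; the `Array.isArray()` check there remains useful as a second line of defense.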

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 21:18:10 +02:00
Benjamin Admin 29f9a8fea3 feat: Cookie banner vendors per category + {{COOKIE_TABLE}} generator
- CookieBannerOverlay: shows vendors per category with expandable tables
  (Verarbeiter, Cookies, Dauer, Land) for full transparency
- Demo vendors: 4 necessary, 3 statistics, 3 marketing, 3 functional
- cookie_table_generator.py: renders {{COOKIE_TABLE}} Markdown tables
  from vendor configs (DB) or service registry (fallback)
- SERVICE_COOKIES: 16 known vendor-to-cookie mappings with provider + country

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 20:06:57 +02:00
Benjamin Admin db697924ed feat: Cookie banner vendors per category + {{COOKIE_TABLE}} generator
- CookieBannerOverlay: shows vendors per category with expandable tables
  (Verarbeiter, Cookies, Dauer, Land) for full transparency
- Demo vendors: 4 necessary, 3 statistics, 3 marketing, 3 functional
- cookie_table_generator.py: renders {{COOKIE_TABLE}} Markdown tables
  from vendor configs (DB) or service registry (fallback)
- SERVICE_COOKIES: 16 known vendor-to-cookie mappings with provider + country

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 20:06:44 +02:00
Benjamin Admin a1f5d883cc feat: Cookie-Banner ↔ Backend Integration (DSR, Retention, Consent Proof)
Phase 1: Vendor sync from service registry (82+ services → banner vendors)
Phase 2: Category-based retention (marketing=90d, statistics=790d, instead of a hardcoded 365d)
Phase 3: DSR ↔ Banner email linking (link-email, by-email, Art.17 erasure, Art.15/20 export)
Phase 4: Consent sync (Banner → Einwilligungen bridge)
Phase 6: Consent proof (SHA256 config hash + config_version in audit log, Art. 7(1) DSGVO)
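
Phases 2 and 6 can be illustrated with a short sketch. The retention values for marketing and statistics come from the list above; the 365-day default for other categories, the canonical-JSON serialization, and the function names are assumptions.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = {"marketing": 90, "statistics": 790}  # from the commit message
DEFAULT_RETENTION_DAYS = 365  # assumption: fallback for unlisted categories


def consent_expires_at(category, given_at):
    """Expiry timestamp for a consent, based on its category."""
    days = RETENTION_DAYS.get(category, DEFAULT_RETENTION_DAYS)
    return given_at + timedelta(days=days)


def banner_config_hash(config):
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


given = datetime(2026, 1, 1, tzinfo=timezone.utc)
expiry = consent_expires_at("marketing", given)  # 90 days later
```

Storing the hash next to each consent record shows which banner configuration the user actually saw, which is the point of the Art. 7(1) proof.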

New files:
- banner_dsr_service.py — email linking + DSR integration
- vendor_banner_sync.py — service registry → vendor configs
- migration 106 — linked_email, banner_config_hash, consent_version columns

Tests: 20+ new backend tests + 2 Playwright E2E test suites (API + UI)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:52:04 +02:00
Benjamin Admin 17c67b4f25 feat: Cookie-Banner ↔ Backend Integration (DSR, Retention, Consent Proof)
Phase 1: Vendor sync from service registry (82+ services → banner vendors)
Phase 2: Category-based retention (marketing=90d, statistics=790d, instead of a hardcoded 365d)
Phase 3: DSR ↔ Banner email linking (link-email, by-email, Art.17 erasure, Art.15/20 export)
Phase 4: Consent sync (Banner → Einwilligungen bridge)
Phase 6: Consent proof (SHA256 config hash + config_version in audit log, Art. 7(1) DSGVO)

New files:
- banner_dsr_service.py — email linking + DSR integration
- vendor_banner_sync.py — service registry → vendor configs
- migration 106 — linked_email, banner_config_hash, consent_version columns

Tests: 20+ new backend tests + 2 Playwright E2E test suites (API + UI)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:41:22 +02:00
Benjamin Admin c5b22e0c99 fix: derive intake flags from DETECTED SERVICES, not from text content
Fundamental architecture fix: data processing happens through APIs/scripts/
cookies — NOT through visible page text. A news site about healthcare does
NOT process health data.

Before: Qwen reads website text → guesses "health_data: true" (WRONG)
After: Google Analytics detected → tracking: true (CORRECT, deterministic)

New flow: detect services from HTML → map service categories to flags →
feed flags into UCCA assessment. No LLM needed for flag extraction.

SERVICE_TO_FLAGS maps categories: tracking→tracking, marketing→marketing+
third_party_sharing, payment→payment_data, heatmap→profiling, etc.
SPECIFIC_SERVICE_FLAGS for Klarna (Art.22), Stripe (US transfer), etc.
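
The deterministic mapping described above might be sketched like this. The category mappings follow the message; the flag names for Klarna and Stripe and the `derive_flags` helper are hypothetical, since the commit only names the concerns (Art. 22, US transfer).

```python
SERVICE_TO_FLAGS = {
    "tracking": {"tracking"},
    "marketing": {"marketing", "third_party_sharing"},
    "payment": {"payment_data"},
    "heatmap": {"profiling"},
}

# Flag names below are assumptions; the commit only names the concerns.
SPECIFIC_SERVICE_FLAGS = {
    "klarna": {"automated_decision_making"},  # Art. 22
    "stripe": {"third_country_transfer"},     # US transfer
}


def derive_flags(detected_services):
    """Map (service_name, category) pairs to intake flags, without any LLM."""
    flags = set()
    for name, category in detected_services:
        flags |= SERVICE_TO_FLAGS.get(category, set())
        flags |= SPECIFIC_SERVICE_FLAGS.get(name.lower(), set())
    return {flag: True for flag in sorted(flags)}
```

Because flags are set only by detected services, a news page that merely mentions healthcare yields no `health_data` flag, which is the false positive this commit targets.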

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:37:51 +02:00
Benjamin Admin 0f3ec9061e fix: false positive findings + restore docs-src + §312k ecommerce filter
1. Intake prompt: "BETREIBER verarbeitet" (the operator processes) instead of "Text erwaehnt" (the text mentions).
   IHK reports on health data → false. Previously: true.
2. §312k check: only for e-commerce/subscription websites (Warenkorb [shopping cart], shop, PayPal, etc.).
   IHK has no contracts → no cancellation button needed.
3. docs-src/ restored from commit 9824304

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:26:59 +02:00
Benjamin Admin e318215cc5 refactor: split agent_analyze_routes (420→309 LOC) + agent docs + migration
- Extracted website compliance checks + helpers to website_compliance_checks.py
- Created agent documentation (zeroclaw/docs/compliance-agent.md)
- DB migration 086 executed (compliance_agent_scans table)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:22:52 +02:00