Benjamin Admin
b363c28539
feat: Add 76 Level-2 regex checks for document correctness verification
...
Split dsi_document_checker.py (466 LOC) into doc_checks/ package (9 files).
Two-pass L1→L2 logic: L1 checks "Is it mentioned?", L2 checks "Is it correct?"
(e.g. controller has full address, specific Art. 6 lit., concrete time periods).
138 total checks (62 L1 + 76 L2) across 7 doc types:
- DSE Art. 13: 31, Impressum §5 TMG: 16, Cookie §25 TDDDG: 15
- Widerruf §355: 15, AGB §305ff: 21, Social Media Art. 26: 20, DSFA Art. 35: 18
Frontend: hierarchical L1→L2 display with dual progress bars
(green=completeness, blue=correctness).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-07 12:37:03 +02:00
Benjamin Admin
3c12e06faf
feat: Fix DSFA dedup + expand all checklists to 56 total checks
...
Fixes:
- 'Risikoabwaegung' is sub-section of DSFA → added to SKIP_HEADINGS
- 'Social Media' standalone heading → recognized as social_media DSE
- Removed 'risikobew' from DSFA pattern (was too broad)
Expanded checklists:
- Widerruf: 4→7 checks (+Empfaenger, kein Grund, §312k Button)
- AGB: 4→9 checks (+Zahlung, Lieferung, Gewaehrleistung, Kuendigung, Datenschutz)
- Social Media: +1 (Social Bookmarks)
- DSFA: +1 (LFDI Richtlinie)
Total: 47→56 Regex-Checks across 7 document types:
DSI=9, Cookie=5, Social Media=10, DSFA=8, Impressum=6, Widerruf=7, AGB=9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-07 11:55:29 +02:00
Benjamin Admin
4642abba23
feat: Expand Social Media (10 checks) + DSFA (8 checks) checklists
...
Art. 26 Joint Controller (10 checks, was 7):
+ Auflistung der genutzten Plattformen
+ Rechtsgrundlage (Art. 6)
+ Social Bookmarks vs. Plugins Hinweis
Improved: broader patterns for joint parties, contact point, data types
DSFA Art. 35 (8 checks, was 5):
+ Schwellwertanalyse / Auslösepruefung
+ Beruecksichtigung Landesbehörden-Richtlinie (LFDI)
+ Dokumentation der Ergebnisse
Improved: IHK-specific patterns (Kanäle, systematische Beobachtung,
geringer Umfang, sensitive Daten)
Total: 40 → 47 Regex-Checks across all document types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-07 11:17:16 +02:00
Benjamin Admin
3853a0838a
feat: Art. 26 Joint Controller + DSFA checklists for Social Media sections
...
New checklists:
- JOINT_CONTROLLER_CHECKLIST (Art. 26 DSGVO, 7 checks):
Joint parties, arrangement, contact point, processing split,
data categories, third-country transfer (USA), rights
- DSFA_CHECKLIST (Art. 35 DSGVO, 5 checks):
Description, necessity, risk assessment, measures, DSB involvement
Section detection: 'Datenschutzerklaerung fuer Social Media' → social_media,
'Datenschutzfolgeabschaetzung/Risikoanalyse' → dsfa
classify_document_type: DSFA and social_media detected before generic DSE
Frontend: DOC_TYPES dropdown + ChecklistView labels updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-07 10:49:32 +02:00
Benjamin Admin
45446aef16
fix: 8 quality + UX improvements
...
1. Cookie 'Zwecke' false positive: added 'um...zu', 'dienen', 'helfen',
'ermöglichen' patterns — catches purpose descriptions without 'Zweck'
2. Kurzhinweis: added empty all_checks for short documents (<200 words)
3. Bezeichnungsfeld: placeholder shows 'Version / Stand' for typed docs,
'Dokumentname' for 'Sonstiges'
4. DocCheckTab state persistence: entries + results survive navigation
5. DocCheck history: saves each check with date, doc count, findings
6. History display: 'Letzte Pruefungen' section at bottom of tab
7. ChecklistView: shows 'X von Y Pruefpunkten bestanden' per document
8. Results persist in localStorage across page navigation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-07 09:37:47 +02:00
Benjamin Admin
4c68caac4e
feat: Multi-URL Document Check with full checklist visibility
...
New "Dokumenten-Pruefung" tab in Compliance Agent:
- User adds multiple URLs with document type (DSI, AGB, Impressum, Cookie, Widerruf)
- Each document loaded via Playwright, accordions expanded, text extracted
- Checked against type-specific legal checklist
- Optional: Cookie banner check via checkbox
Checklisten-UX (solves "100% looks like nothing was checked"):
- All checks shown per document: green checkmark + matched text excerpt
- Red X for missing fields with legal reference
- Builds user trust: "9 Punkte geprueft, alle bestanden"
- Expandable per document with completeness bar
New checklists:
- Impressum: §5 TMG (6 fields: name, address, contact, register, VAT, representative)
- Cookie-Richtlinie: §25 TDDDG (5 fields: types, purposes, retention, third-party, opt-out)
Backend:
- POST /agent/doc-check — async with polling (same pattern as /scan)
- DocCheckResult includes checks[] with passed/failed + matched_text
- dsi_document_checker returns all_checks in SCORE finding
- Email report shows per-document checklist
Files: agent_doc_check_routes.py (280 LOC), DocCheckTab.tsx (248 LOC),
ChecklistView.tsx (130 LOC), dsi_document_checker.py (+70 LOC)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-06 10:08:40 +02:00
Benjamin Admin
8fb2061e9b
fix: Eliminate GA false positive + handle short DSI documents
...
Service detection:
- Only search script tags + src/href attributes for service patterns
- Prevents false positives from DSE text mentioning services
(e.g. IHK DSE describes etracker, 'google analytics' in text)
- Technical patterns (with regex chars) still checked in full HTML
Short documents:
- Documents with < 200 words flagged as 'Kurzhinweis' instead of
'MANGELHAFT' — too short for Art. 13 completeness check
- Prevents 96-word navigation pages from showing 8 missing fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-05 18:21:37 +02:00
Benjamin Admin
8d6959e8b2
fix: Expand Art. 13 patterns for generic matching across all websites
...
Complaint (Art. 13(2)(d)):
+ 'recht auf beschwerde', 'art. 77', 'beschwerde...wenden/einlegen',
'zuständige behörde' — IHK uses 'Recht auf Beschwerde gem. Art. 77'
Legal basis (Art. 13(1)(c)):
+ 'gemäß Art.', '§ X IHKG/BDSG/LDSG/BBiG/TDDDG', 'einwilligung gem',
'verarbeitung auf grundlage' — catches statutory references
Third country (Art. 13(1)(f)):
+ 'Übermittlung ausserhalb', 'EWR/EEA', 'Data Privacy Framework'
Retention (Art. 13(2)(a)):
+ 'Dauer der Speicherung', 'Aufbewahrungsdauer/-pflicht/-zeit',
'gesetzliche Aufbewahrung' — common German DSE headings
All patterns are generic, not IHK-specific.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-05 17:45:02 +02:00
Benjamin Admin
e3ae35891f
fix: 0% completeness bug — SCORE finding was not generated at 100%
...
Root cause: When all 9 Art. 13 checks passed (100%), no SCORE finding
was created (line: 'if pct < 100'). The backend then defaulted to
completeness=0 because it looked for the SCORE finding to extract the %.
Fix: Always generate SCORE finding, even at 100%. Added 'OK' severity
for fully compliant documents.
This was the cause of 8 documents showing '0% MANGELHAFT' despite
containing all required information.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-05 15:34:04 +02:00
Benjamin Admin
6c5e086356
fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives
...
Dedup fixes:
- Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
- Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic)
- Documents with < 50 words filtered (navigation snippets)
- Documents with identical word_count merged (same page, different title)
- URL-only titles filtered
False positive fixes (dsi_document_checker.py):
- 'Kontaktdaten des Verantwortlichen' pattern for controller check
- 'Zweck und Rechtsgrundlage' combined heading pattern
- 'Welche Daten werden verarbeitet' question-style headings
- 'Betroffenenrechte' as standalone heading
- 'Welche Rechte hat der Betroffene' question pattern
- 'Daten werden geloescht' retention pattern
- 'Auftragsverarbeiter' as recipient indicator
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-05 11:41:07 +02:00
Benjamin Admin
48146cddaf
feat: DSI document discovery + completeness check in agent scan workflow
...
Agent scan now automatically:
1. Discovers all legal documents via consent-tester /dsi-discovery endpoint
2. Classifies each as DSE/AGB/Widerruf/Cookie/Impressum
3. Checks completeness against type-specific checklists:
- DSE: 9 Art. 13 DSGVO mandatory fields (controller, DPO, purposes,
legal basis, recipients, third-country, retention, rights, complaint)
- AGB: §305ff BGB (scope, contract formation, liability, jurisdiction)
- Widerruf: §355 BGB (right info, 14-day deadline, form, consequences)
4. Adds findings per document to scan results
5. Shows discovered documents with completeness % in email summary
6. Returns discovered_documents list in API response
New files:
- dsi_document_checker.py (229 LOC) — checklists + classifier
- agent_scan_helpers.py (109 LOC) — extracted summary builder + corrections
Refactor: agent_scan_routes.py 537→448 LOC (under 500 budget)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-05-04 22:10:13 +02:00