Benjamin Admin | 6c5e086356
fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives
Dedup fixes:
- Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
- Noise titles filtered: 'drucken' (print), 'nach oben' (back to top), 'Datenschutz' (privacy; too generic to identify a document)
- Documents with < 50 words filtered (navigation snippets)
- Documents with identical word_count merged (treated as the same page under a different title)
- URL-only titles filtered
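
The dedup rules above could be sketched roughly as follows. This is an illustration only, not the code from the commit: the `Doc` class, the `dedupe` function, and the constant names are hypothetical, and "merged" is simplified to "keep the first document with a given word count".

```python
from dataclasses import dataclass
from urllib.parse import urldefrag

# Hypothetical constants mirroring the rules listed in the commit message.
NOISE_TITLES = {"drucken", "nach oben", "datenschutz"}
MIN_WORDS = 50


@dataclass
class Doc:
    """Hypothetical container for one discovered document."""
    url: str
    title: str
    text: str

    @property
    def word_count(self) -> int:
        return len(self.text.split())


def dedupe(docs: list[Doc], page_url: str) -> list[Doc]:
    """Apply the five dedup rules in order and return the surviving documents."""
    page_base, _ = urldefrag(page_url)
    kept: list[Doc] = []
    seen_counts: set[int] = set()
    for doc in docs:
        base, fragment = urldefrag(doc.url)
        if fragment and base == page_base:
            continue  # anchor link pointing into the page being scanned
        title = doc.title.strip().lower()
        if title in NOISE_TITLES:
            continue  # generic navigation title
        if title.startswith(("http://", "https://")):
            continue  # title is just a URL
        if doc.word_count < MIN_WORDS:
            continue  # too short to be a real document
        if doc.word_count in seen_counts:
            continue  # same word count: assumed to be the same page
        seen_counts.add(doc.word_count)
        kept.append(doc)
    return kept
```

Matching on word count alone is a cheap heuristic; two unrelated documents of equal length would also be merged, so a content hash may be a safer key if that turns out to matter.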
False positive fixes (dsi_document_checker.py):
- 'Kontaktdaten des Verantwortlichen' pattern for controller check
- 'Zweck und Rechtsgrundlage' combined heading pattern
- 'Welche Daten werden verarbeitet' question-style headings
- 'Betroffenenrechte' as standalone heading
- 'Welche Rechte hat der Betroffene' question pattern
- 'Daten werden geloescht' retention pattern
- 'Auftragsverarbeiter' as recipient indicator
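
A minimal sketch of how such heading patterns might be matched, assuming a simple case-insensitive regex lookup per checklist field. The field keys, the `FIELD_PATTERNS` table, and `found_fields` are hypothetical names, not taken from dsi_document_checker.py. The 'Welche Daten werden verarbeitet' pattern is left out because the commit message does not say which check it satisfies, and accepting both 'geloescht' and 'gelöscht' is an assumption.

```python
import re

# Hypothetical mapping from checklist field to heading patterns (lowercase).
FIELD_PATTERNS: dict[str, list[str]] = {
    "controller": [r"kontaktdaten des verantwortlichen"],
    # A combined heading counts for both purposes and legal basis.
    "purposes": [r"zweck und rechtsgrundlage"],
    "legal_basis": [r"zweck und rechtsgrundlage"],
    "rights": [r"betroffenenrechte", r"welche rechte hat der betroffene"],
    # Assumption: match both the transliterated and the umlaut spelling.
    "retention": [r"daten werden gel(?:oe|ö)scht"],
    "recipients": [r"auftragsverarbeiter"],
}


def found_fields(text: str) -> set[str]:
    """Return the checklist fields for which at least one pattern matches."""
    lowered = text.lower()
    return {
        field
        for field, patterns in FIELD_PATTERNS.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    }
```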
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:41:07 +02:00

Benjamin Admin | 48146cddaf
feat: DSI document discovery + completeness check in agent scan workflow
Agent scan now automatically:
1. Discovers all legal documents via consent-tester /dsi-discovery endpoint
2. Classifies each as DSE (privacy policy) / AGB (terms and conditions) / Widerruf (withdrawal notice) / Cookie / Impressum (legal notice)
3. Checks completeness against type-specific checklists:
- DSE: 9 Art. 13 DSGVO mandatory fields (controller, DPO, purposes,
legal basis, recipients, third-country, retention, rights, complaint)
- AGB: §305ff BGB (scope, contract formation, liability, jurisdiction)
- Widerruf: §355 BGB (information on the right of withdrawal, 14-day deadline, form, consequences)
4. Adds findings per document to scan results
5. Shows discovered documents with completeness % in email summary
6. Returns discovered_documents list in API response
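
The completeness step (item 3) could look roughly like the sketch below. It assumes each checklist is a flat list of field keys and that the percentage is simply the share of fields found. The key names, `CHECKLISTS`, and `completeness` are hypothetical; the commit only lists the checklist items in prose and names no checklist for Cookie or Impressum.

```python
from typing import Optional

# Hypothetical field keys derived from the checklist items named above.
CHECKLISTS: dict[str, list[str]] = {
    "DSE": [
        "controller", "dpo", "purposes", "legal_basis", "recipients",
        "third_country", "retention", "rights", "complaint",
    ],
    "AGB": ["scope", "contract_formation", "liability", "jurisdiction"],
    "Widerruf": ["right_info", "deadline_14_days", "form", "consequences"],
}


def completeness(doc_type: str, found: set[str]) -> Optional[tuple[int, list[str]]]:
    """Return (percent complete, missing fields), or None without a checklist."""
    required = CHECKLISTS.get(doc_type)
    if required is None:
        return None  # e.g. Cookie or Impressum: no checklist in this sketch
    missing = [field for field in required if field not in found]
    percent = round(100 * (len(required) - len(missing)) / len(required))
    return percent, missing
```

The missing-field list is what would feed the per-document findings, while the percentage is the figure shown in the email summary.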
New files:
- dsi_document_checker.py (229 LOC) — checklists + classifier
- agent_scan_helpers.py (109 LOC) — extracted summary builder + corrections
Refactor: agent_scan_routes.py 537→448 LOC (under 500 budget)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:10:13 +02:00