be9cfdc2d496ecd9533054a90e5a6bdff7354906
154 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
be9cfdc2d4 |
feat(compliance-check): skip Widerruf for B2B, limit MCs, fix industry
Build + Deploy / build-admin-compliance (push) Successful in 2m1s
Build + Deploy / build-tts (push) Successful in 2m48s
Build + Deploy / build-document-crawler (push) Successful in 52s
Build + Deploy / build-dsms-node (push) Successful in 13s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
Build + Deploy / build-backend-compliance (push) Successful in 4m20s
Build + Deploy / build-ai-sdk (push) Successful in 53s
Build + Deploy / build-developer-portal (push) Successful in 2m6s
Build + Deploy / build-dsms-gateway (push) Successful in 11s
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m45s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m17s
- Skip Widerrufsbelehrung check entirely for B2B/B2G businesses
- Limit MC checks to top 20 per doc_type (by severity) to reduce noise (e.g. 75 impressum MCs → 20, avoiding 55 irrelevant FAILs)
- Add consulting/manufacturing industry keywords (arbeitssicherheit, brandschutz, werkzeugbau, etc.)
- Lower industry detection threshold from 2 to 1 keyword hit
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
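The "top 20 per doc_type (by severity)" limit from the commit above can be pictured roughly as follows. This is a minimal sketch, not the repository's code: the `limit_controls` helper, the `SEVERITY_RANK` scale and the dictionary layout of a control are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical severity scale; the commit does not show the real one.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def limit_controls(controls, per_doc_type=20):
    """Keep at most `per_doc_type` controls per doc_type, most severe first."""
    grouped = defaultdict(list)
    for control in controls:
        grouped[control["doc_type"]].append(control)

    limited = []
    for items in grouped.values():
        # sorted() is stable, so equally severe controls keep their input order.
        ranked = sorted(
            items,
            key=lambda c: SEVERITY_RANK.get(c["severity"], len(SEVERITY_RANK)),
        )
        limited.extend(ranked[:per_doc_type])
    return limited


# 75 made-up impressum controls, alternating between "high" and "low".
controls = [
    {"id": i, "doc_type": "impressum", "severity": "high" if i % 2 == 0 else "low"}
    for i in range(75)
]
top = limit_controls(controls)
```

With this input, `top` holds 20 controls, all of them "high", so the remaining 55 never produce a FAIL row in the report.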
|
|
b42e1cd091 |
feat(cmp): timezone→geo_country mapping + timezone parameter
Build + Deploy / build-admin-compliance (push) Successful in 2m10s
Build + Deploy / build-backend-compliance (push) Successful in 5m20s
Build + Deploy / build-ai-sdk (push) Successful in 57s
Build + Deploy / build-developer-portal (push) Successful in 1m15s
Build + Deploy / build-tts (push) Successful in 2m3s
Build + Deploy / build-document-crawler (push) Successful in 53s
Build + Deploy / build-dsms-gateway (push) Successful in 38s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m40s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 48s
CI / test-python-backend (push) Successful in 44s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 25s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m32s
Add _resolve_geo_from_timezone() with 35-country IANA timezone map. Accept timezone field in ConsentCreate schema and pass through to service. Populate geo_country automatically from browser timezone. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
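A timezone-to-country lookup of the kind this commit describes might look like the sketch below. The five map entries and the function name `resolve_geo` are illustrative assumptions; the commit mentions a 35-country map and a `_resolve_geo_from_timezone()` helper whose contents are not shown here.

```python
from typing import Optional

# Illustrative subset only; the real map is said to cover 35 countries.
TIMEZONE_TO_COUNTRY = {
    "Europe/Berlin": "DE",
    "Europe/Vienna": "AT",
    "Europe/Zurich": "CH",
    "Europe/Paris": "FR",
    "America/New_York": "US",
}


def resolve_geo(timezone: Optional[str]) -> Optional[str]:
    """Map an IANA timezone name to an ISO country code, or None if unknown."""
    if not timezone:
        return None
    return TIMEZONE_TO_COUNTRY.get(timezone.strip())
```

Returning `None` for unknown or missing timezones keeps the field optional, which fits a value that comes from the browser and cannot be trusted to be present.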
|
|
4a7e09bbb0 |
fix(impressum): regex [A-Z] never matches on lowercased text
Build + Deploy / build-admin-compliance (push) Successful in 12s
Build + Deploy / build-backend-compliance (push) Successful in 14s
Build + Deploy / build-ai-sdk (push) Successful in 20s
Build + Deploy / build-developer-portal (push) Successful in 13s
Build + Deploy / build-tts (push) Successful in 12s
Build + Deploy / build-document-crawler (push) Successful in 14s
Build + Deploy / build-dsms-gateway (push) Successful in 13s
Build + Deploy / build-dsms-node (push) Successful in 18s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m39s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 46s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
All patterns matched against text_lower but used [A-Z] character class. Changed to [a-zA-Z] so patterns like "geschäftsführung: dr. oliver" are found. Also added "Pflicht"/"Detail" labels to the two progress bars to clarify what 100% vs 8% means. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
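The bug class fixed here is easy to reproduce in isolation. The two patterns below are simplified stand-ins written for this example, not the patterns from the repository.

```python
import re

text = "Geschäftsführung: Dr. Oliver Beispiel"
text_lower = text.lower()

# [A-Z] can never match, because text_lower contains no uppercase letters.
broken = re.compile(r"geschäftsführung:\s*[A-Z]")
# Accepting both cases makes the pattern work on lowercased input.
fixed = re.compile(r"geschäftsführung:\s*[a-zA-Z]")

broken_hit = broken.search(text_lower) is not None
fixed_hit = fixed.search(text_lower) is not None
```

Here `broken_hit` is `False` and `fixed_hit` is `True`. Matching the original text with `re.IGNORECASE` would be another way to avoid the mismatch between the character class and the lowercased input.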
|
|
74f00bbb0f |
feat(compliance-check): split shared URLs into sections per doc_type
Build + Deploy / build-admin-compliance (push) Successful in 2m4s
Build + Deploy / build-backend-compliance (push) Successful in 3m39s
Build + Deploy / build-ai-sdk (push) Successful in 50s
Build + Deploy / build-developer-portal (push) Successful in 1m12s
Build + Deploy / build-tts (push) Successful in 2m16s
Build + Deploy / build-document-crawler (push) Successful in 1m9s
Build + Deploy / build-dsms-gateway (push) Successful in 35s
Build + Deploy / build-dsms-node (push) Successful in 32s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m37s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 43s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m16s
When the same URL is used for multiple document types (e.g. /datenschutz for DSI + Cookie + DSB), the section splitter now:
- Detects duplicate URLs and fetches text only once
- Splits text at classified headings (Cookie, Google Analytics, etc.)
- Assigns matching sections to each doc_type
- DSI always keeps the full text
Extracted to section_splitter.py (170 LOC) to keep routes under 500.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
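The splitting behaviour described in this commit could be sketched as below. Everything here is an assumption for illustration: the keyword table, the rule that a section starts after a blank line, and the function names are not taken from section_splitter.py.

```python
import re

# Hypothetical heading keywords per doc_type.
HEADING_KEYWORDS = {
    "cookie": ("cookie", "google analytics"),
    "dsb": ("datenschutzbeauftragte",),
}


def group_by_url(urls_by_doc_type):
    """Group doc_types that share a URL so each URL is fetched only once."""
    grouped = {}
    for doc_type, url in urls_by_doc_type.items():
        grouped.setdefault(url, []).append(doc_type)
    return grouped


def split_sections(text):
    """Return {doc_type: section text}; "dsi" always keeps the full text."""
    sections = {"dsi": text}
    # Assumption: sections are separated by blank lines and begin with a heading.
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        heading = block.splitlines()[0].lower()
        for doc_type, keywords in HEADING_KEYWORDS.items():
            if any(keyword in heading for keyword in keywords):
                joined = sections.get(doc_type, "") + "\n\n" + block
                sections[doc_type] = joined.strip()
    return sections
```

For example, a page with the headings "Datenschutzerklärung", "Cookies" and "Datenschutzbeauftragter" would yield the full text for `dsi` and one matching block each for `cookie` and `dsb`.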
|
|
407a9503e4 |
fix(profiler): fix B2G false positive + add consulting/manufacturing
Build + Deploy / build-admin-compliance (push) Successful in 2m27s
Build + Deploy / build-backend-compliance (push) Successful in 3m40s
Build + Deploy / build-ai-sdk (push) Successful in 1m0s
Build + Deploy / build-developer-portal (push) Successful in 1m16s
Build + Deploy / build-tts (push) Successful in 1m54s
Build + Deploy / build-document-crawler (push) Successful in 1m2s
Build + Deploy / build-dsms-gateway (push) Successful in 31s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m44s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 49s
CI / test-python-backend (push) Successful in 36s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m23s
- Remove generic B2G keywords (behörde, amt, öffentlich) that match in every DSI due to "Aufsichtsbehörde", "Amtsgericht", "veröffentlichen"
- Remove "server" from it_services (too generic, appears in every DSI)
- Add consulting, manufacturing, media industries
- Add B2B fallback for GmbH/AG without B2C signals
- Add 10 ground truth files for unified compliance check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
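Why these three keywords fire on almost any privacy notice can be shown with plain substring matching. The sample text is invented for this example. The whole-word variant is shown only for contrast; the commit itself removed the keywords rather than changing the matching strategy.

```python
import re

dsi_text = (
    "Zuständige Aufsichtsbehörde ist der Landesbeauftragte. "
    "Registergericht: Amtsgericht Freiburg. "
    "Wir veröffentlichen keine personenbezogenen Daten."
).lower()

GENERIC_B2G_KEYWORDS = ["behörde", "amt", "öffentlich"]

# Substring matching: every keyword hides inside a longer, unrelated word.
substring_hits = [k for k in GENERIC_B2G_KEYWORDS if k in dsi_text]

# Whole-word matching: none of the keywords occurs as a word of its own.
whole_word_hits = [
    k for k in GENERIC_B2G_KEYWORDS
    if re.search(r"\b" + re.escape(k) + r"\b", dsi_text)
]
```

`substring_hits` contains all three keywords while `whole_word_hits` is empty, which is the false-positive pattern the commit message describes.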
|
|
ce77cde309 |
fix(compliance-check): batch LLM verification + increase poll timeout
Build + Deploy / build-admin-compliance (push) Successful in 1m52s
Build + Deploy / build-backend-compliance (push) Successful in 18s
Build + Deploy / build-ai-sdk (push) Successful in 11s
Build + Deploy / build-developer-portal (push) Successful in 11s
Build + Deploy / build-tts (push) Successful in 12s
Build + Deploy / build-document-crawler (push) Successful in 14s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m35s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 42s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / trigger-orca (push) Successful in 2m24s
- LLM verify now sends ALL failed checks in one batched call instead of one Ollama call per check (80+ calls → 1 per document)
- Increase frontend poll timeout from 6 min to 15 min
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
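Batching in this sense means building one prompt that lists every failed check and parsing one structured answer. The sketch below assumes a JSON reply format and a caller-supplied `call_llm` function; both are hypothetical, as are the prompt wording and the shape of a check.

```python
import json


def build_batch_prompt(document_text, failed_checks):
    """Put every failed check into one prompt instead of one prompt per check."""
    numbered = "\n".join(
        f'{i}. [{check["id"]}] {check["question"]}'
        for i, check in enumerate(failed_checks, start=1)
    )
    return (
        "Re-check the following failed checks against the document.\n"
        'Answer with a JSON object that maps each check id to "PASS" or "FAIL".\n\n'
        f"Checks:\n{numbered}\n\nDocument:\n{document_text}"
    )


def verify_failed_checks(document_text, failed_checks, call_llm):
    """Return {check_id: verdict}; `call_llm` is any callable taking a prompt."""
    if not failed_checks:
        return {}
    raw = call_llm(build_batch_prompt(document_text, failed_checks))
    try:
        verdicts = json.loads(raw)
    except json.JSONDecodeError:
        verdicts = {}
    if not isinstance(verdicts, dict):
        verdicts = {}
    # Anything the model did not answer keeps its deterministic FAIL.
    return {c["id"]: verdicts.get(c["id"], "FAIL") for c in failed_checks}
```

Passing the LLM call in as a parameter keeps the sketch independent of any particular client library and makes the single-call behaviour easy to check with a stub.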
|
|
b6ad958b69 |
feat(compliance-check): integrate banner cross-check + extract to module
Build + Deploy / build-admin-compliance (push) Successful in 1m57s
Build + Deploy / build-backend-compliance (push) Successful in 3m20s
Build + Deploy / build-ai-sdk (push) Successful in 48s
Build + Deploy / build-developer-portal (push) Successful in 1m6s
Build + Deploy / build-tts (push) Successful in 1m43s
Build + Deploy / build-document-crawler (push) Successful in 44s
Build + Deploy / build-dsms-gateway (push) Successful in 31s
Build + Deploy / build-dsms-node (push) Successful in 18s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m40s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 47s
CI / test-python-backend (push) Successful in 38s
CI / test-python-document-crawler (push) Successful in 28s
CI / test-python-dsms-gateway (push) Successful in 20s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m26s
Add automatic banner check (Step 3b) and banner-vs-cookie cross-check (Step 3c) to unified compliance check. Extract cross-check logic to banner_cookie_cross_check.py to keep routes under 500 LOC. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
66d30568e2 |
feat(dsms): Stage 1 - gap analysis report is archived in DSMS
Build + Deploy / build-admin-compliance (push) Successful in 1m41s
Build + Deploy / build-backend-compliance (push) Successful in 14s
Build + Deploy / build-ai-sdk (push) Successful in 41s
Build + Deploy / build-developer-portal (push) Successful in 10s
Build + Deploy / build-tts (push) Successful in 10s
Build + Deploy / build-document-crawler (push) Successful in 10s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 14s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m31s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 48s
CI / test-python-backend (push) Failing after 1s
CI / test-python-document-crawler (push) Successful in 32s
CI / test-python-dsms-gateway (push) Successful in 25s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m23s
- Go DSMS Client (internal/dsms/client.go): Archive() + Verify()
- Python DSMS Client (compliance/services/dsms_client.py): archive_to_dsms() + verify_dsms()
- Gap analysis AnalyzeProject() archives the report JSON to DSMS
- Response contains dsms_cid when archiving succeeds
- Frontend: green "Revisionssicher archiviert" (audit-proof archived) badge with CID in GapDashboard
- DSMS proxy route (/api/sdk/v1/dsms/[...path]) for verify queries
Stage 2 (Evidence Upload → DSMS) and Stage 3 (Version Chains) follow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
397de741c1 |
feat(cmp): Phase 2 — script blocking + cookie tracking
Migration 108: scripts_blocked, scripts_released, cookies_set JSONB columns. Backend models/schema/service/serializer/routes extended. Admin detail modal shows released scripts and set cookies with categories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
051890c370 |
feat(cmp): restore vendor-agnostic fields + module wiring
Build + Deploy / build-admin-compliance (push) Successful in 2m0s
Build + Deploy / build-backend-compliance (push) Successful in 14s
Build + Deploy / build-ai-sdk (push) Successful in 10s
Build + Deploy / build-developer-portal (push) Successful in 14s
Build + Deploy / build-tts (push) Successful in 11s
Build + Deploy / build-document-crawler (push) Successful in 11s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 13s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m55s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 30s
CI / test-python-dsms-gateway (push) Successful in 26s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m17s
Re-add 13 vendor-agnostic columns to banner models/serializers/service (consent_method, banner_version, device_type, browser, os, etc.) that were lost when another session overwrote the code. Keep vendor_consents dict from the other session. Add list_consents method back to BannerConsentService. Wire CookieBanner, Loeschfristen and UseCases into Document Generator contextBridge (CMP_NAME, analytics tools, retention months, feature flags). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
0d0e705117 |
feat: Unified Compliance-Check — 8 document types in one form
New 3-tab structure: Website-Scan, Compliance-Check, Banner-Check.
Compliance-Check Tab (replaces Dokumenten-Pruefung + Impressum-Check):
- 8 document rows: DSI, Impressum, Social Media, Cookie, AGB, Nutzungsbedingungen, Widerruf, DSB-Kontakt
- Each row: URL input + "Text laden" + file upload + manual text
- "Text laden" extracts via consent-tester, shows in editable textarea
- User verifies/corrects text before checking
- Empty fields = "not present" → own finding
Business Profiler (business_profiler.py):
- Detects B2B/B2C/B2G from all documents together
- Recognizes regulated professions, online shops, editorial content
- Context-aware: INFO checks become PASS/FAIL based on profile
Backend: /compliance-check + /extract-text endpoints
Frontend: ComplianceCheckTab.tsx + DocumentRow.tsx
API proxies: compliance-check/route.ts + extract-text/route.ts
Also: Impressum regex fixes (Telefon, AG, Geschaeftsfuehrung) and INFO severity for context-dependent checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
0c25832b5c |
fix: Context-aware Impressum checks + 3 regex fixes
3 Regex fixes:
- Telefon: matches '0761 / 48 98 09 01' format (spaces around /)
- Registergericht: matches 'AG Freiburg' (not just 'Amtsgericht')
- Vertretung: matches 'Geschaeftsfuehrung:' (not just 'Geschaeftsfuehrer:')
6 checks changed from FAIL to INFO severity:
- V.i.S.d.P.: only relevant if website has editorial content
- Streitbeilegung: only relevant for B2C online shops
- Berufsrecht: only relevant for regulated professions
- Stammkapital: legally required but rarely enforced
- Aufsichtsbehoerde: only for licensed activities
- Berufshaftpflicht: only for mandatory insurance
INFO checks don't count towards completeness percentage. They appear as hints, not findings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
36c6101b91 |
Merge feat/zeroclaw-compliance-agent into main
Brings all compliance doc-check features:
- 162 regex checks + 1874 Master Controls
- LLM-agnostic agent with tool calling
- Banner check (46 checks, 30 CMPs, stealth, Shadow DOM)
- Impressum check (24 checks)
- Deep consent verification (DataLayer, GCM, TCF)
- CMP E2E tests (39 tests)
- HTML email reports, FAQ, persistent history
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
289ec5f396 |
feat(cmp): vendor-agnostic consent data model — 13 new fields
Build + Deploy / build-admin-compliance (push) Successful in 2m28s
Build + Deploy / build-backend-compliance (push) Successful in 3m48s
Build + Deploy / build-ai-sdk (push) Failing after 45s
Build + Deploy / build-developer-portal (push) Successful in 1m28s
Build + Deploy / build-tts (push) Successful in 1m48s
Build + Deploy / build-document-crawler (push) Successful in 48s
Build + Deploy / build-dsms-gateway (push) Successful in 34s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 24s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m1s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 49s
CI / test-python-backend (push) Successful in 45s
CI / test-python-document-crawler (push) Successful in 31s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 18s
Extend banner consent records with consent_method, banner_version, banner_config_hash, geo, page_url, referrer, device info, session_id and consent_scope for full Art. 7 DSGVO proof with any tracking vendor. Migration 107, backward-compatible (all fields nullable). Admin detail modal shows tracking context, device info and technical data. Fix pre-existing str|None → Optional[str] for Python 3.9 compat. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
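The Python 3.9 fix mentioned at the end of this commit concerns the `str | None` union syntax (PEP 604), which raises a `TypeError` when an annotation is evaluated on interpreters older than 3.10. A small sketch of the compatible spelling follows; the `clean_referrer` helper is invented for the example and is not part of the commit.

```python
from typing import Optional


def clean_referrer(referrer: Optional[str] = None) -> Optional[str]:
    """Return a trimmed referrer, or None when it is missing or blank."""
    if referrer is None:
        return None
    trimmed = referrer.strip()
    return trimmed or None
```

`Optional[str]` means the same as `str | None` and works on both older and newer interpreters, which matters for schema libraries that evaluate annotations at runtime.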
|
|
58f370f4ff |
feat: LLM-agnostic Compliance Agent with tool calling
New agent architecture for intelligent MC evaluation:
agent_tools.py (367 LOC):
- 5 tools in OpenAI function-calling format
- query_controls: async DB query for MCs by doc_type
- evaluate_controls_batch: deterministic keyword matching
- search_document: text search with context
- get_document_stats: word count, sections, language
- submit_results: finalize check results
compliance_agent.py (398 LOC):
- ComplianceAgent class with agent loop
- 3 LLM providers: Ollama, OpenAI-compatible (OVH), Anthropic
- Tool call dispatch + result collection
- System prompt for systematic compliance analysis
- run_compliance_check() convenience function
Hybrid mode:
- COMPLIANCE_USE_AGENT=false (default): deterministic regex
- COMPLIANCE_USE_AGENT=true: LLM agent with tool calling
- Agent fallback to regex if LLM unavailable
Works with Qwen 35B (Ollama), Qwen 120B (OVH vLLM), Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
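The hybrid mode and the tool format named in this commit can be sketched together. The tool's parameter schema, the `agent_enabled` and `run_check` helpers, and the idea of passing both checkers in as callables are assumptions for illustration; only the `COMPLIANCE_USE_AGENT` variable, its default, and the tool name come from the commit message.

```python
import os

# One tool definition in the OpenAI function-calling layout. The parameter
# schema is a guess; the commit lists tool names only.
QUERY_CONTROLS_TOOL = {
    "type": "function",
    "function": {
        "name": "query_controls",
        "description": "Fetch the Master Controls for one document type.",
        "parameters": {
            "type": "object",
            "properties": {"doc_type": {"type": "string"}},
            "required": ["doc_type"],
        },
    },
}


def agent_enabled(env=None):
    """Read the feature flag; anything other than "true" means regex mode."""
    env = os.environ if env is None else env
    return env.get("COMPLIANCE_USE_AGENT", "false").strip().lower() == "true"


def run_check(text, regex_check, agent_check, env=None):
    """Use the agent when enabled, and fall back to regex if it fails."""
    if agent_enabled(env):
        try:
            return agent_check(text)
        except Exception:
            # LLM unreachable or misbehaving: fall through to the regex path.
            pass
    return regex_check(text)
```

Keeping the deterministic path as the default and as the fallback means a missing or broken LLM endpoint degrades the result quality instead of failing the whole check.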
|
|
bdbc30e47b |
feat(cmp): unified consent view — Website-Besucher + Login-Nutzer tabs
Merges two separate consent views into one unified page at /sdk/einwilligungen:
- Tab "Website-Besucher": device-based banner consents with site selector
- Tab "Login-Nutzer": user-based DSGVO consents (existing, unchanged)
Backend:
- New endpoint GET /admin/consents for paginated banner consent records
- Fix: categories JSON string parsing (was iterating chars instead of array)
CMP Dashboard:
- Dynamic site selector replacing hardcoded "preview-test-site"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
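The "iterating chars instead of array" bug arises when a JSON-encoded list is still a string at the point where the code loops over it. A minimal sketch of a defensive parser follows; the `parse_categories` name and the fallback to an empty list are assumptions, not the repository's fix.

```python
import json


def parse_categories(raw):
    """Accept a list or a JSON-encoded list and always return a list."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    return list(raw) if isinstance(raw, (list, tuple)) else []


stored = '["necessary", "statistics"]'
# Looping over the raw string yields single characters, not categories.
as_characters = list(stored)[:3]
as_categories = parse_categories(stored)
```

`as_characters` starts with `'['` and `'"'`, whereas `as_categories` is the intended two-element list.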
|
|
9cbbc6ee2f |
feat: LLM interpretation layer for failed MC checks
Deterministic pass/fail stays unchanged. After keyword checking, ONE batched LLM call enriches the top 10 severity FAILs with context-specific recommendations based on the actual document.
Example: If document uses Google Analytics but lacks transfer mechanism → LLM generates: "Sie nutzen Google Analytics (USA). Ergaenzen Sie einen Verweis auf das EU-US Data Privacy Framework und pruefen Sie die DPF-Zertifizierung unter dataprivacyframework.gov."
- Pass/fail: deterministic (keyword matching, reproducible)
- Hint enrichment: LLM (contextual, one call for all fails)
- Temperature 0.3 for consistency
- Graceful fallback if Ollama unavailable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
5ea83e9b33 |
feat: Deterministic MC checking — ALL controls, no LLM, reproducible
Replaced LLM-based MC verification with deterministic keyword matching:
- Extracts keywords from pass_criteria/fail_criteria
- Matches against document text via regex (case-insensitive)
- PASS if >= 60% of criteria keywords found AND no fail_criteria triggered
- Same text + same MCs = same result every time
Checks ALL MCs for the doc_type (max_controls=0):
- DSE: all 571 controls checked in <1 second
- Impressum: all 75 controls
- Cookie: all 381 controls
No LLM calls needed: purely deterministic keyword matching.
Bigram extraction for compound terms (e.g. "standardvertragsklauseln").
Stop word filtering for German legal text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
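The 60% rule from this commit can be sketched as follows. Several details are assumptions made for illustration: the stop-word list, the minimum keyword length, and in particular the rule that a fail criterion counts as triggered only when all of its keywords appear. Bigram extraction is left out.

```python
import re

# Tiny illustrative stop-word list; the real one is not shown in the commit.
STOP_WORDS = {"der", "die", "das", "und", "oder", "eine", "wird", "für", "von", "mit"}


def extract_keywords(criteria):
    """Collect lowercase keywords from criteria, dropping short and stop words."""
    keywords = set()
    for criterion in criteria:
        for word in re.findall(r"\w+", criterion.lower()):
            if len(word) > 3 and word not in STOP_WORDS:
                keywords.add(word)
    return keywords


def evaluate_control(text, pass_criteria, fail_criteria, threshold=0.6):
    """Return "PASS" or "FAIL" using keyword matching only (no LLM)."""
    lowered = text.lower()

    def present(keyword):
        return re.search(r"\b" + re.escape(keyword) + r"\b", lowered) is not None

    keywords = extract_keywords(pass_criteria)
    ratio = sum(present(k) for k in keywords) / len(keywords) if keywords else 0.0

    # Assumption: a fail criterion triggers when all of its keywords are found.
    fail_triggered = any(
        kws and all(present(k) for k in kws)
        for kws in (extract_keywords([c]) for c in fail_criteria)
    )
    return "PASS" if ratio >= threshold and not fail_triggered else "FAIL"
```

Because the function depends only on its inputs, the same text and the same criteria always give the same verdict, which is the reproducibility property the commit message emphasises.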
|
|
26b222d53d |
feat: Integrate 1.874 Master Controls into document checking
Rewrote rag_document_checker.py to use the doc_check_controls table instead of generic canonical_controls. Each MC has:
- check_question: binary YES/NO for LLM
- pass_criteria: JSONB list of concrete requirements
- fail_criteria: JSONB list of common mistakes
Flow: Regex checks (fast) → LLM verify FAILs → MC deep check (15 per doc)
MC results appear as additional L2 checks in the report.
Coverage: 571 DSE, 381 Cookie, 309 Loeschkonzept, 153 Widerruf, 147 DSFA, 125 AVV, 113 AGB, 75 Impressum = 1.874 total.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
82951785ec |
feat: Impressum checks expanded from 16 to 24 (GAP analysis)
8 new checks: Reglementierte Berufe, Grundkapital, Aufsichtsbehoerde, Berufshaftpflicht, rechtswidrige Disclaimer, Kammer, Berufsbezeichnung, berufsrechtliche Regelungen. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
51d91d20ed |
fix: 6 false positives from Stadt Koeln + Caritas verification
Build + Deploy / build-admin-compliance (push) Successful in 9s
Build + Deploy / build-backend-compliance (push) Successful in 8s
Build + Deploy / build-ai-sdk (push) Successful in 40s
Build + Deploy / build-developer-portal (push) Successful in 7s
Build + Deploy / build-tts (push) Successful in 8s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m11s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 29s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 17s
Build + Deploy / trigger-orca (push) Successful in 2m23s
- Phone regex allows parentheses: +49 (0)761 now matches
- "Recht auf Widerspruch" (3 words) + §23 KDG recognized
- Church authorities: "Katholisches Datenschutzzentrum", KdoeR
- "Artikel 6 Absatz 1 Buchstabe a" (unabbreviated) now matches
- "PHP Session ID" (with spaces) alongside "PHPSESSID"
6 FP eliminated across Caritas (KDG) and Stadt Koeln (verbose forms).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
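The phone-number fix can be illustrated with two simplified patterns. Both are written for this example and are not the expressions used in the repository; they only show how adding parentheses to the allowed separator set changes the outcome.

```python
import re

# Separators allowed between digits: whitespace, slash, hyphen.
STRICT_PHONE = re.compile(r"(?:\+49|0)[\s\d/\-]{6,}\d")
# Same pattern, but parentheses are accepted as well.
LENIENT_PHONE = re.compile(r"(?:\+49|0)[\s\d/()\-]{6,}\d")

with_parens = "Telefon: +49 (0)761 48 98 09 01"
with_slash = "Telefon: 0761 / 48 98 09 01"

strict_on_parens = STRICT_PHONE.search(with_parens)
lenient_on_parens = LENIENT_PHONE.search(with_parens)
lenient_on_slash = LENIENT_PHONE.search(with_slash)
```

The strict pattern finds nothing in the first string because the run of allowed characters is cut short at the opening parenthesis, while the lenient pattern matches both formats.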
|
|
686834cea0 |
feat: 4 remaining tasks — EU institutions, banner integration, JS-sites, Caritas fixes
Build + Deploy / build-admin-compliance (push) Successful in 8s
Build + Deploy / build-backend-compliance (push) Successful in 8s
Build + Deploy / build-ai-sdk (push) Failing after 36s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 7s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m14s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 46s
CI / test-python-backend (push) Successful in 43s
CI / test-python-document-crawler (push) Successful in 29s
CI / test-python-dsms-gateway (push) Successful in 30s
CI / validate-canonical-controls (push) Successful in 16s
1. EU Institution Checks (Verordnung 2018/1725):
- New doc_type "eu_institution" with 9 L1 + 15 L2 checks
- Both German + English patterns (EU institutions are multilingual)
- Auto-detection via "2018/1725", "EDSB", "EDPS" keywords
- Correct article references (Art. 15 instead of 13, Art. 5 instead of 6)
2. Banner Check Integration:
- banner_runner.py maps scan results to 36 L1/L2 structured checks
- BannerCheckTab shows hierarchical ChecklistView with hints
- 3-phase summary (cookies/scripts before/after consent)
- /scan endpoint now includes structured_checks in response
3. JS-heavy Website Fixes (dm, Zalando, HWK):
- dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback)
- try_dismiss_consent_banner before text extraction
- PDF redirect detection (dm.de redirects to GCS PDF)
4. Caritas False Positive Fixes:
- Phone regex allows parentheses: +49 (0)761 → now matches
- "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21
- Church authorities: "Katholisches Datenschutzzentrum" recognized
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
3efc491ec5 |
fix: 5 false positives from etogruppe.com ground truth
Build + Deploy / build-admin-compliance (push) Successful in 2m22s
Build + Deploy / build-backend-compliance (push) Successful in 3m21s
Build + Deploy / build-ai-sdk (push) Successful in 53s
Build + Deploy / build-developer-portal (push) Successful in 1m16s
Build + Deploy / build-tts (push) Successful in 1m38s
Build + Deploy / build-document-crawler (push) Successful in 41s
Build + Deploy / build-dsms-gateway (push) Successful in 26s
Build + Deploy / build-dsms-node (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 20s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m18s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 59s
CI / test-python-backend (push) Successful in 47s
CI / test-python-document-crawler (push) Successful in 32s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / trigger-orca (push) Successful in 3m23s
1. Soft hyphens (U+00AD, \xad) stripped before regex matching;
fixes "Datenübertragbarkeit" not matching
2. Art. 15/17/20: allow adjectives between "Recht auf" and keyword
("Recht auf unentgeltliche Auskunft" now matches)
3. DSB contact: regex spans up to 300 chars across newlines
(DSB section with company address between heading and email)
4. Löschkonzept: added "Fortfall", "Entfall", "Beendigung" as
deletion trigger words alongside "Ablauf"/"Wegfall"
Reduces etogruppe FPs from 5 to ~1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
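Two of the fixes in this commit, stripping soft hyphens and tolerating adjectives after "Recht auf", can be reproduced with the sketch below. The patterns and the limit of two intervening words are assumptions for illustration, not the repository's expressions.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible hyphenation hint that websites embed in long words


def strip_soft_hyphens(text):
    """Remove soft hyphens so compound words match as one token."""
    return text.replace(SOFT_HYPHEN, "")


portability = re.compile(r"datenübertragbarkeit", re.IGNORECASE)
raw = "Recht auf Daten\u00adübertrag\u00adbarkeit"

hit_before = portability.search(raw) is not None
hit_after = portability.search(strip_soft_hyphens(raw)) is not None

# Allow up to two words (e.g. adjectives) between "Recht auf" and the keyword.
access_right = re.compile(r"recht auf (?:\w+ ){0,2}auskunft", re.IGNORECASE)
hit_adjective = access_right.search("Recht auf unentgeltliche Auskunft") is not None
```

`hit_before` is `False` and `hit_after` is `True`; the text looks identical to a reader in both cases, which is what makes this class of false positive hard to spot.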
|
|
e50f3dfbee |
feat: All 138 hints rewritten as expert-level legal guidance
Build + Deploy / build-admin-compliance (push) Successful in 9s
Build + Deploy / build-backend-compliance (push) Successful in 10s
Build + Deploy / build-ai-sdk (push) Successful in 9s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 8s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m22s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 49s
CI / test-python-backend (push) Successful in 43s
CI / test-python-document-crawler (push) Successful in 32s
CI / test-python-dsms-gateway (push) Successful in 26s
CI / validate-canonical-controls (push) Successful in 18s
Build + Deploy / trigger-orca (push) Successful in 2m10s
Every hint now reads like a mini-consultation from a data protection lawyer, with specific legal references, court rulings, and common mistakes.
Examples:
- EuGH C-210/16 (Fanpage), C-298/17 (Kontaktpflicht), C-311/18 (Schrems II)
- BGH I ZR 228/03 (ladungsfaehige Anschrift), XI ZR 388/10 (AGB)
- EDSA Guidelines 2/2019 (lit. b misuse), WP 248 Rev.01 (DSFA)
- DSK-Orientierungshilfe, CNIL-Leitlinien, SDM, BSI-IT-Grundschutz
- §25 TDDDG, §38 BDSG, §309 BGB, §312k BGB, Art. 246a EGBGB
This is the core value proposition: no lawyer can deliver this level of specific, actionable compliance feedback in 60 seconds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
a2f8366171 |
improve: Drittlandtransfer hint mentions Privacy Shield invalidity
Build + Deploy / build-admin-compliance (push) Successful in 2m23s
Build + Deploy / build-backend-compliance (push) Successful in 3m32s
Build + Deploy / build-ai-sdk (push) Successful in 57s
Build + Deploy / build-developer-portal (push) Successful in 1m22s
Build + Deploy / build-tts (push) Successful in 1m35s
Build + Deploy / build-document-crawler (push) Successful in 39s
Build + Deploy / build-dsms-gateway (push) Successful in 26s
Build + Deploy / build-dsms-node (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 19s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m22s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 50s
CI / test-python-backend (push) Successful in 45s
CI / test-python-document-crawler (push) Successful in 33s
CI / test-python-dsms-gateway (push) Successful in 26s
CI / validate-canonical-controls (push) Successful in 19s
Build + Deploy / trigger-orca (push) Successful in 3m16s
The hint now explicitly warns that the EU-US Privacy Shield has been invalid since Schrems II (July 2020) and recommends DPF or SCC as replacements.
This is the kind of specific, actionable feedback that makes the tool valuable: it catches outdated legal references that no human would spot in under a minute.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
f51671737a |
fix: Correct Ollama model name + strict blank-line heading detection
Build + Deploy / build-admin-compliance (push) Failing after 48s
Build + Deploy / build-backend-compliance (push) Successful in 9s
Build + Deploy / build-ai-sdk (push) Successful in 8s
Build + Deploy / build-developer-portal (push) Successful in 9s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 7s
Build + Deploy / build-dsms-node (push) Successful in 7s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Failing after 2m3s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 45s
CI / test-python-backend (push) Successful in 40s
CI / test-python-document-crawler (push) Successful in 34s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 15s
1. LLM model: qwen3:32b → qwen3.5:35b-a3b (actual model on Mac Mini)
2. Section splitter: headings MUST be preceded by a blank line.
This prevents cookie table entries ("Funktionale Cookies",
"Session Cookies") from splitting the cookie section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
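The blank-line rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual splitter: the function name `split_sections`, the `max_heading_len` threshold, and the "no trailing punctuation" heuristic are all assumptions made for the example.

```python
def split_sections(text, max_heading_len=60):
    """Split text into (heading, body) pairs.

    Hypothetical heuristic: a line is a heading candidate if it is short
    and does not end in sentence punctuation. A candidate is only accepted
    as a heading when the previous line is blank (or it is the first line).
    """
    sections = []
    heading, body = "", []
    prev_blank = True  # the start of the document counts as "after a blank line"
    for raw in text.splitlines():
        line = raw.strip()
        is_candidate = (
            bool(line)
            and len(line) <= max_heading_len
            and not line.endswith((".", ":", ";", ","))
        )
        if is_candidate and prev_blank:
            if heading or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line, []
        else:
            body.append(line)
        prev_blank = not line
    sections.append((heading, "\n".join(body).strip()))
    return sections


sample = (
    "Cookies\n"
    "Wir verwenden folgende Cookies.\n"
    "Funktionale Cookies\n"
    "Session Cookies\n"
    "\n"
    "Kontakt\n"
    "Schreiben Sie uns."
)
sections = split_sections(sample)
```

With this sample, the table entries "Funktionale Cookies" and "Session Cookies" stay inside the cookie section because no blank line precedes them, while "Kontakt" opens a new section.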
||
|
|
4f29e5ff3c |
feat: LLM verification for regex FAILs + section-split hardening
Build + Deploy / build-admin-compliance (push) Successful in 1m49s
Build + Deploy / build-backend-compliance (push) Successful in 9s
Build + Deploy / build-ai-sdk (push) Successful in 8s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 9s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 7s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m55s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 45s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 26s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m13s
Path to 100% correctness: Regex finds 80%, LLM catches the rest.
1. LLM verification (llm_verify.py):
- Every regex FAIL is re-checked by Qwen (qwen3:32b)
- Binary YES/NO question with evidence extraction
- Overturned checks marked with [LLM] prefix in matched_text
- Graceful fallback if LLM unavailable
2. Section splitter hardening:
- Short lines (<16 chars) only treated as headings if preceded
by blank line — prevents table column headers ("Funktion",
"Speicherdauer") from splitting cookie sections
- Fixes IHK cookie section: 288 words → full section
3. DSFA documentation patterns expanded:
- Recognizes "4.) Ergebnis:" numbered result sections
- Matches risk assessment conclusions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
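The re-check flow from point 1 might look roughly like the sketch below. The `Check` dataclass, the `verify_regex_fails` function, and the injected `ask_llm` callable are hypothetical names for illustration; the real `llm_verify.py` is not shown in this log and may differ. The LLM call is passed in as a plain callable, so no model client is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    check_id: str
    question: str        # binary YES/NO question for the verifier
    passed: bool         # verdict from the regex pass
    matched_text: str = ""


# Hypothetical verifier signature: (question, document) -> (is_yes, evidence).
Verifier = Callable[[str, str], Tuple[bool, str]]


def verify_regex_fails(checks: List[Check], document: str, ask_llm: Verifier) -> List[Check]:
    """Re-check every regex FAIL with the verifier; leave PASS results untouched."""
    for check in checks:
        if check.passed:
            continue
        try:
            is_yes, evidence = ask_llm(check.question, document)
        except Exception:
            # Graceful fallback: if the LLM is unavailable, keep the regex verdict.
            continue
        if is_yes:
            check.passed = True
            check.matched_text = f"[LLM] {evidence}"
    return checks
```

Injecting the verifier keeps the regex pass deterministic and makes the fallback path easy to test with a stub that raises.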
||
|
|
fa4fd87102 |
fix: 7 regex bugs from IHK Konstanz ground truth analysis
Build + Deploy / build-admin-compliance (push) Successful in 9s
Build + Deploy / build-backend-compliance (push) Successful in 8s
Build + Deploy / build-ai-sdk (push) Successful in 42s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 7s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m57s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 49s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 28s
CI / test-python-dsms-gateway (push) Successful in 23s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m24s
Fixes based on manual verification of all 30 failed checks:
1. Cookie table: recognize "folgende cookies" + column headers as text
2. Cookie names: add JSESSIONID, cookieinfo, et_id, BT_* patterns
3. Essential justified: match "sitzung zuordnen", "betrieb der website"
4. Social bookmarks: recognize as 2-click alternative
5. DSFA plural: "kanaelen" now matches alongside "kanal"
6. Section splitter: skip-headings no longer lose subsequent text
(Risikoabwaegung section was cut from DSFA, losing risk scores)
7. Cookie legal basis: accept Art. 6(1)(f) in cookie context
Reduces false positives from 7 to ~1-2 for IHK Konstanz test case.
Ground truth table: zeroclaw/docs/ground-truth-ihk-konstanz.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
293c58d0dd |
feat: Add actionable hints to all 138 compliance checks
Build + Deploy / build-admin-compliance (push) Successful in 1m40s
Build + Deploy / build-backend-compliance (push) Successful in 7s
Build + Deploy / build-ai-sdk (push) Successful in 35s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 8s
Build + Deploy / build-dsms-gateway (push) Successful in 7s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m50s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 40s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 23s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
Each check now has a "hint" field explaining what is missing and what the customer should do to fix it. Hints are shown in the frontend below failed checks in red text.
Examples:
- "Bei Verarbeitung auf Basis von Art. 6(1)(f) muss dokumentiert werden, warum Ihr berechtigtes Interesse die Rechte der Betroffenen ueberwiegt."
- "Die ladungsfaehige Anschrift fehlt. Erforderlich: Strasse, Hausnummer, PLZ und Ort."
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
870953f579 |
fix: PLZ regex matches lowercase text and D-78467 format
Patterns ran on text.lower() but searched [A-Z] — changed to [a-z]. Also accept D-12345 prefix (common German format). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
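The bug class here is easy to reproduce: a pattern with an uppercase character class can never match text that was lowercased first. The sketch below is an assumption-laden illustration; `PLZ_RE` and `has_plz` are hypothetical names, and the project's real pattern is not shown in this log.

```python
import re

# Hypothetical pattern, written for lowercased input: optional "d-" prefix,
# five digits, whitespace, then a place name starting with a lowercase letter.
PLZ_RE = re.compile(r"\b(?:d-)?\d{5}\s+[a-zäöüß][a-zäöüß .\-]*")


def has_plz(text: str) -> bool:
    """Return True if the text contains something shaped like 'PLZ Ort'."""
    return PLZ_RE.search(text.lower()) is not None


# The original failure mode: [A-Z] cannot match after text.lower().
buggy_match = re.search(r"\d{5}\s+[A-Z]", "78467 Konstanz".lower())
```

Matching with lowercase classes keeps the pattern aligned with the normalized text; compiling with `re.IGNORECASE` on the raw text would be an alternative design.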
||
|
|
b363c28539 |
feat: Add 76 Level-2 regex checks for document correctness verification
Split dsi_document_checker.py (466 LOC) into doc_checks/ package (9 files).
Two-pass L1→L2 logic: L1 checks "Is it mentioned?", L2 checks "Is it correct?" (e.g. controller has full address, specific Art. 6 lit., concrete time periods).
138 total checks (62 L1 + 76 L2) across 7 doc types:
- DSE Art. 13: 31, Impressum §5 TMG: 16, Cookie §25 TDDDG: 15
- Widerruf §355: 15, AGB §305ff: 21, Social Media Art. 26: 20, DSFA Art. 35: 18
Frontend: hierarchical L1→L2 display with dual progress bars (green=completeness, blue=correctness).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
3c12e06faf |
feat: Fix DSFA dedup + expand all checklists to 56 total checks
Fixes:
- 'Risikoabwaegung' is sub-section of DSFA → added to SKIP_HEADINGS
- 'Social Media' standalone heading → recognized as social_media DSE
- Removed 'risikobew' from DSFA pattern (was too broad)
Expanded checklists:
- Widerruf: 4→7 checks (+Empfaenger, kein Grund, §312k Button)
- AGB: 4→9 checks (+Zahlung, Lieferung, Gewaehrleistung, Kuendigung, Datenschutz)
- Social Media: +1 (Social Bookmarks)
- DSFA: +1 (LFDI Richtlinie)
Total: 47→56 Regex-Checks across 7 document types:
DSI=9, Cookie=5, Social Media=10, DSFA=8, Impressum=6, Widerruf=7, AGB=9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
4642abba23 |
feat: Expand Social Media (10 checks) + DSFA (8 checks) checklists
Art. 26 Joint Controller (10 checks, was 7):
+ Auflistung der genutzten Plattformen
+ Rechtsgrundlage (Art. 6)
+ Social Bookmarks vs. Plugins Hinweis
Improved: broader patterns for joint parties, contact point, data types
DSFA Art. 35 (8 checks, was 5):
+ Schwellwertanalyse / Auslösepruefung
+ Beruecksichtigung Landesbehörden-Richtlinie (LFDI)
+ Dokumentation der Ergebnisse
Improved: IHK-specific patterns (Kanäle, systematische Beobachtung, geringer Umfang, sensitive Daten)
Total: 40 → 47 Regex-Checks across all document types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
3853a0838a |
feat: Art. 26 Joint Controller + DSFA checklists for Social Media sections
New checklists:
- JOINT_CONTROLLER_CHECKLIST (Art. 26 DSGVO, 7 checks): Joint parties, arrangement, contact point, processing split, data categories, third-country transfer (USA), rights
- DSFA_CHECKLIST (Art. 35 DSGVO, 5 checks): Description, necessity, risk assessment, measures, DSB involvement
Section detection: 'Datenschutzerklaerung fuer Social Media' → social_media, 'Datenschutzfolgeabschaetzung/Risikoanalyse' → dsfa
classify_document_type: DSFA and social_media detected before generic DSE
Frontend: DOC_TYPES dropdown + ChecklistView labels updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
45446aef16 |
fix: 8 quality + UX improvements
1. Cookie 'Zwecke' false positive: added 'um...zu', 'dienen', 'helfen', 'ermöglichen' patterns (catches purpose descriptions without 'Zweck')
2. Kurzhinweis: added empty all_checks for short documents (<200 words)
3. Bezeichnungsfeld: placeholder shows 'Version / Stand' for typed docs, 'Dokumentname' for 'Sonstiges'
4. DocCheckTab state persistence: entries + results survive navigation
5. DocCheck history: saves each check with date, doc count, findings
6. History display: 'Letzte Pruefungen' section at bottom of tab
7. ChecklistView: shows 'X von Y Pruefpunkten bestanden' per document
8. Results persist in localStorage across page navigation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
a680276c86 |
fix: Filter controls by test_procedure content — eliminates governance false positives
Only use controls whose test_procedure mentions document-type-specific terms:
- DSI: test_procedure must contain 'datenschutzerkl' or 'art. 13/14'
- Cookie: must contain 'cookie', 'einwilligung', 'consent'
- Impressum: must contain 'impressum'
This filters out internal governance controls (Datenmodelle, Infrastruktur) that are irrelevant for public document checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
fa45b5793c |
feat: Control Library check via SQL (canonical_controls) instead of Qdrant
Complete rewrite of rag_document_checker.py:
- Queries canonical_controls table (294K controls, 10K data_protection)
- Filters by category + title keywords per document type
- Uses test_procedure field as actual check instructions
- Regex pre-check extracts key terms from procedure → fast match
- LLM fallback only for regex misses (saves tokens)
- /no_think prefix for direct JSON output
SQL approach advantages:
- Structured data with test_procedure, pass_criteria, fail_criteria
- Category filtering (data_protection, compliance, governance)
- No Qdrant API key issues
- Controls are actual check criteria, not general legal texts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
6da36d87c2 |
fix: Robust JSON parsing for LLM responses — handles unquoted keys, fallback extraction
LLM returns {fulfilled: true} instead of {"fulfilled": true}.
Now fixes unquoted keys, True→true, and falls back to text-based
boolean extraction when JSON parsing fails entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
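The three recovery steps named in this commit (quote bare keys, map Python literals to JSON, fall back to text extraction) could be sketched as below. The function name `parse_llm_json` and the exact regexes are assumptions for illustration, not the project's code. The key-quoting regex is deliberately naive and can misfire on commas followed by `word:` inside string values.

```python
import json
import re
from typing import Optional


def parse_llm_json(raw: str) -> Optional[dict]:
    """Best-effort parse of an LLM reply that should contain a JSON object."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    candidate = match.group(0) if match else raw

    # Step 1: the reply may already be valid JSON.
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass

    # Step 2: quote bare keys ({fulfilled: true} -> {"fulfilled": true})
    # and map Python literals to their JSON spellings.
    fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', candidate)
    fixed = re.sub(r"\bTrue\b", "true", fixed)
    fixed = re.sub(r"\bFalse\b", "false", fixed)
    fixed = re.sub(r"\bNone\b", "null", fixed)
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        pass

    # Step 3: text-based fallback, recovering only the boolean verdict.
    verdict = re.search(r"fulfilled\W*(true|false)", raw, re.IGNORECASE)
    if verdict:
        return {"fulfilled": verdict.group(1).lower() == "true"}
    return None  # nothing recoverable; the caller decides how to treat this
```

Returning `None` instead of a default verdict is a design choice in this sketch: it keeps "the model said no" distinguishable from "the reply could not be parsed".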
||
|
|
e50c4d659e |
fix: Disable Qwen thinking mode for RAG checks (/no_think prefix)
Qwen 3.5 uses all tokens for thinking, leaving response empty. Using /no_think prefix to get direct JSON output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
9f16e6d535 |
fix: Read Qwen response from 'thinking' field when 'response' is empty
Qwen 3.5 with latest Ollama returns structured thinking in separate 'thinking' field, leaving 'response' empty. Now checks both fields. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
f4374cfe8d |
feat: Semantic Qdrant search — embed query via bge-m3, vector search in local Qdrant
Replaces scroll+filter approach with proper semantic search:
1. Embed query via bp-core-embedding-service (bge-m3, 1024 dim)
2. Vector search in Qdrant (bp_compliance_datenschutz + bp_compliance_gesetze)
3. Sort by cosine similarity score
4. No API key needed: local Qdrant on Mac Mini
Falls back gracefully: SDK first, then semantic Qdrant, then empty.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
7b8440191e |
fix: Better error logging + increase LLM timeout to 120s for RAG check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
510f513811 |
fix: Qdrant search uses chunk_text + section/category filter
Payload structure: chunk_text (not text), section (Article 13), category, regulation_id. Scrolls 100 points per collection, filters client-side against regulation keywords. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
b50c4ec940 |
fix: RAG checker falls back to local Qdrant when Go SDK returns 401
Go SDK points to external Qdrant (qdrant-dev.breakpilot.ai) with expired API key.
Fallback: search directly in local Qdrant (bp-core-qdrant:6333), which has all collections: bp_compliance_datenschutz, bp_compliance_gesetze, atomic_controls_dedup.
Search strategy:
1. Try Go SDK RAG endpoint (preferred, has embedding-based search)
2. Fallback: Qdrant scroll with text-based regulation filter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
090da0f71b |
feat: RAG-based document verification against 144K Control Library
New module: rag_document_checker.py
- Searches RAG (Qdrant) for controls relevant to document type
- Filters by regulation (DSGVO Art.13, TDDDG §25, BGB §355 etc.)
- LLM (Qwen 3.5:35b) verifies each control against document text
- Returns fulfilled/missing with evidence text + severity
- Supports: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept
Integration in doc-check endpoint:
- Regex checklist runs first (fast, deterministic)
- RAG checks run after (semantic, catches what regex misses)
- Both results combined in single response
LLM prompt returns JSON: {fulfilled, evidence, issue, severity}
Think-tags stripped, JSON extracted from response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
4c68caac4e |
feat: Multi-URL Document Check with full checklist visibility
New "Dokumenten-Pruefung" tab in Compliance Agent:
- User adds multiple URLs with document type (DSI, AGB, Impressum, Cookie, Widerruf)
- Each document loaded via Playwright, accordions expanded, text extracted
- Checked against type-specific legal checklist
- Optional: Cookie banner check via checkbox
Checklist UX (solves "100% looks like nothing was checked"):
- All checks shown per document: green checkmark + matched text excerpt
- Red X for missing fields with legal reference
- Builds user trust: "9 Punkte geprueft, alle bestanden"
- Expandable per document with completeness bar
New checklists:
- Impressum: §5 TMG (6 fields: name, address, contact, register, VAT, representative)
- Cookie-Richtlinie: §25 TDDDG (5 fields: types, purposes, retention, third-party, opt-out)
Backend:
- POST /agent/doc-check: async with polling (same pattern as /scan)
- DocCheckResult includes checks[] with passed/failed + matched_text
- dsi_document_checker returns all_checks in SCORE finding
- Email report shows per-document checklist
Files: agent_doc_check_routes.py (280 LOC), DocCheckTab.tsx (248 LOC), ChecklistView.tsx (130 LOC), dsi_document_checker.py (+70 LOC)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
8fb2061e9b |
fix: Eliminate GA false positive + handle short DSI documents
Service detection:
- Only search script tags + src/href attributes for service patterns
- Prevents false positives from DSE text mentioning services (e.g. IHK DSE describes etracker, 'google analytics' in text)
- Technical patterns (with regex chars) still checked in full HTML
Short documents:
- Documents with < 200 words flagged as 'Kurzhinweis' instead of 'MANGELHAFT' (too short for Art. 13 completeness check)
- Prevents 96-word navigation pages from showing 8 missing fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
8d6959e8b2 |
fix: Expand Art. 13 patterns for generic matching across all websites
Complaint (Art. 13(2)(d)):
+ 'recht auf beschwerde', 'art. 77', 'beschwerde...wenden/einlegen', 'zuständige behörde' (IHK uses 'Recht auf Beschwerde gem. Art. 77')
Legal basis (Art. 13(1)(c)):
+ 'gemäß Art.', '§ X IHKG/BDSG/LDSG/BBiG/TDDDG', 'einwilligung gem', 'verarbeitung auf grundlage' (catches statutory references)
Third country (Art. 13(1)(f)):
+ 'Übermittlung ausserhalb', 'EWR/EEA', 'Data Privacy Framework'
Retention (Art. 13(2)(a)):
+ 'Dauer der Speicherung', 'Aufbewahrungsdauer/-pflicht/-zeit', 'gesetzliche Aufbewahrung' (common German DSE headings)
All patterns are generic, not IHK-specific.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
e3ae35891f |
fix: 0% completeness bug — SCORE finding was not generated at 100%
Root cause: When all 9 Art. 13 checks passed (100%), no SCORE finding was created (line: 'if pct < 100'). The backend then defaulted to completeness=0 because it looked for the SCORE finding to extract the %.
Fix: Always generate SCORE finding, even at 100%. Added 'OK' severity for fully compliant documents.
This was the cause of 8 documents showing '0% MANGELHAFT' despite containing all required information.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
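The interaction between the conditional SCORE finding and the backend default can be reproduced in a few lines. This is a simplified reconstruction under assumptions: `build_findings`, `extract_completeness`, the finding dict shape, and the `"WARNING"` severity are hypothetical; only the `pct < 100` condition, the default of 0, and the `'OK'` severity come from the commit message.

```python
def build_findings(results, always_score=True):
    """Turn {check_name: passed} into findings plus a SCORE summary."""
    findings = [
        {"type": "MISSING", "field": name}
        for name, ok in results.items()
        if not ok
    ]
    total = len(results)
    pct = round(100 * sum(results.values()) / total) if total else 0
    # Old behaviour: `if pct < 100` only. New behaviour: always emit SCORE.
    if always_score or pct < 100:
        findings.append({
            "type": "SCORE",
            "completeness": pct,
            # "WARNING" is a placeholder severity for this sketch.
            "severity": "OK" if pct == 100 else "WARNING",
        })
    return findings


def extract_completeness(findings):
    """Backend side: read the percentage from the SCORE finding."""
    for finding in findings:
        if finding["type"] == "SCORE":
            return finding["completeness"]
    return 0  # the default that made fully compliant documents show 0%


all_passed = {f"check_{i}": True for i in range(9)}
old_pct = extract_completeness(build_findings(all_passed, always_score=False))
new_pct = extract_completeness(build_findings(all_passed, always_score=True))
```

Under the old condition a fully passing document produces no SCORE finding, so the reader falls through to the default of 0; always emitting the finding removes that ambiguity.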
||
|
|
6c5e086356 |
fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives
Dedup fixes:
- Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
- Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic)
- Documents with < 50 words filtered (navigation snippets)
- Documents with identical word_count merged (same page, different title)
- URL-only titles filtered
False positive fixes (dsi_document_checker.py):
- 'Kontaktdaten des Verantwortlichen' pattern for controller check
- 'Zweck und Rechtsgrundlage' combined heading pattern
- 'Welche Daten werden verarbeitet' question-style headings
- 'Betroffenenrechte' as standalone heading
- 'Welche Rechte hat der Betroffene' question pattern
- 'Daten werden geloescht' retention pattern
- 'Auftragsverarbeiter' as recipient indicator
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|