Qwen 3.5 with latest Ollama returns structured thinking in separate
'thinking' field, leaving 'response' empty. Now checks both fields.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces scroll+filter approach with proper semantic search:
1. Embed query via bp-core-embedding-service (bge-m3, 1024 dim)
2. Vector search in Qdrant (bp_compliance_datenschutz + bp_compliance_gesetze)
3. Sort by cosine similarity score
4. No API key needed — local Qdrant on Mac Mini
Falls back gracefully: SDK first, then semantic Qdrant, then empty.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Go SDK points to external Qdrant (qdrant-dev.breakpilot.ai) with expired API key.
Fallback: search directly in local Qdrant (bp-core-qdrant:6333) which has
all collections: bp_compliance_datenschutz, bp_compliance_gesetze, atomic_controls_dedup.
Search strategy:
1. Try Go SDK RAG endpoint (preferred, has embedding-based search)
2. Fallback: Qdrant scroll with text-based regulation filter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New module: rag_document_checker.py
- Searches RAG (Qdrant) for controls relevant to document type
- Filters by regulation (DSGVO Art.13, TDDDG §25, BGB §355 etc.)
- LLM (Qwen 3.5:35b) verifies each control against document text
- Returns fulfilled/missing with evidence text + severity
- Supports: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept
Integration in doc-check endpoint:
- Regex checklist runs first (fast, deterministic)
- RAG checks run after (semantic, catches what regex misses)
- Both results combined in single response
LLM prompt returns JSON: {fulfilled, evidence, issue, severity}
Think-tags stripped, JSON extracted from response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only Cookie and Widerruf sections are checked as separate documents.
Social Media, DSFA, Betroffenenrechte, Dienste von Drittanbietern are
part of the parent DSI and no longer generate false findings.
Added PLAN-rag-document-check.md for Phase 2:
- RAG-based checks with document-type-specific Controls
- DSFA checklist (Art. 35 + Landes-Listen)
- AVV checklist (Art. 28)
- Reference detection (sub-doc → parent doc)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a single URL contains multiple document sections (e.g. IHK DSI page
with Cookies, Social Media, Dienste von Drittanbietern), the system now:
1. Extracts full page text (main document check as before)
2. Splits text at heading boundaries (short uppercase lines)
3. Classifies each section: Cookie→cookie checklist, Social Media→DSI etc.
4. Runs type-specific checklist per section
5. Returns all results: main doc + sub-sections
Section type detection via SECTION_TYPE_MAP patterns:
- 'Cookie*' → §25 TDDDG checklist
- 'Dienste von Drittanbietern' → DSI checklist
- 'Social Media' → DSI checklist (Art. 26 joint controllership)
- 'Widerrufsrecht' → §355 BGB checklist
- 'Impressum' → §5 TMG checklist
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New "Dokumenten-Pruefung" tab in Compliance Agent:
- User adds multiple URLs with document type (DSI, AGB, Impressum, Cookie, Widerruf)
- Each document loaded via Playwright, accordions expanded, text extracted
- Checked against type-specific legal checklist
- Optional: Cookie banner check via checkbox
Checklisten-UX (solves "100% looks like nothing was checked"):
- All checks shown per document: green checkmark + matched text excerpt
- Red X for missing fields with legal reference
- Builds user trust: "9 Punkte geprueft, alle bestanden"
- Expandable per document with completeness bar
New checklists:
- Impressum: §5 TMG (6 fields: name, address, contact, register, VAT, representative)
- Cookie-Richtlinie: §25 TDDDG (5 fields: types, purposes, retention, third-party, opt-out)
Backend:
- POST /agent/doc-check — async with polling (same pattern as /scan)
- DocCheckResult includes checks[] with passed/failed + matched_text
- dsi_document_checker returns all_checks in SCORE finding
- Email report shows per-document checklist
Files: agent_doc_check_routes.py (280 LOC), DocCheckTab.tsx (248 LOC),
ChecklistView.tsx (130 LOC), dsi_document_checker.py (+70 LOC)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Service detection:
- Only search script tags + src/href attributes for service patterns
- Prevents false positives from DSE text mentioning services
(e.g. IHK DSE describes etracker, 'google analytics' in text)
- Technical patterns (with regex chars) still checked in full HTML
Short documents:
- Documents with < 200 words flagged as 'Kurzhinweis' instead of
'MANGELHAFT' — too short for Art. 13 completeness check
- Prevents 96-word navigation pages from showing 8 missing fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars
(IHK Internetangebot has ~50K chars, Beschwerderecht was after 10K cutoff)
2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts
for mandatory content checking. Previously only used first 8K chars from
one source, missing Verantwortlicher/DSB that were in DSI documents.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: When all 9 Art. 13 checks passed (100%), no SCORE finding
was created (line: 'if pct < 100'). The backend then defaulted to
completeness=0 because it looked for the SCORE finding to extract the %.
Fix: Always generate SCORE finding, even at 100%. Added 'OK' severity
for fully compliant documents.
This was the cause of 8 documents showing '0% MANGELHAFT' despite
containing all required information.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DSI Dedup (consent-tester):
- Only H1/H2 headings count as documents (not H3/H4 sub-sections)
- Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of
parent document's full text, not separate documents
- Reduces IHK result from 30 to ~11 real documents
Backend (agent_scan_routes):
- ScanFinding gets doc_title field linking each finding to its document
- doc_title set when creating DSI findings for document attribution
Frontend (ScanResult.tsx):
- 3 sections: Services table, Document cards, General findings
- Documents: expandable cards with completeness bar (green/yellow/red)
- Findings grouped under their parent document
- Each card shows: title, word count, findings count, % completeness
- Findings without doc_title go to "Allgemeine Findings" section
Email Summary (agent_scan_helpers):
- Findings listed under their parent document
- General findings in separate section
- No more flat mixed list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fundamental fix: scans now run asynchronously with progress polling.
Backend:
- POST /scan starts background task, returns scan_id immediately
- GET /scan/{scan_id} returns status + progress + result when done
- 7 progress steps shown: Website scan, DSI discovery, DSE analysis,
SOLL/IST comparison, corrections, report, email
- In-memory job store (dict with scan_id → status/result)
- No timeout limits on scan duration
Frontend:
- POST starts scan, receives scan_id
- Polls GET every 5 seconds (max 120 attempts = 10 min)
- Shows live progress message during scan
- Displays result when completed, error when failed
Proxy:
- POST timeout reduced to 30s (just starts the job)
- GET timeout 10s (just status check)
- No more 504/connection-dropped errors
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50
Bug 2: DSI documents checked against text_preview (500 chars) — now uses
full_text (10,000 chars) for Art. 13 mandatory field checks
Bug 3: DSE text not found when Playwright misses DSE page — now falls
back to DSI Discovery full_text as second source
Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These files existed on the feature branch but were never cherry-picked
to main, causing ModuleNotFoundError on import:
- dse_parser.py — parses DSE HTML into structured sections
- dse_matcher.py — matches detected services against DSE sections
- mandatory_content_checker.py — checks Art. 13 DSGVO mandatory fields
- legal_basis_validator.py — validates legal basis (lit. a-f)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NameError: name 're' is not defined at line 146 — the import was
accidentally removed when extracting helper functions to agent_scan_helpers.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NameError: name 're' is not defined at line 146 — the import was
accidentally removed when extracting helper functions to agent_scan_helpers.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- EmailDeliveryService: load template → find published version →
render {{variables}} → send via SMTP → audit log. Fallback to
inline HTML when no published template exists.
- Migration 117: Professional HTML/text content for all 5 DSR
templates (receipt, completion, rejection, identity, extension)
with branded styling and proper Art. references
- DSRArt11Service now uses EmailDeliveryService with dsr_rejection
template instead of hardcoded HTML
[migration-approved]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New DSRArt11Service: handles rejection with proper legal basis,
automated email notification to requester explaining Art. 11
- POST /dsr/{id}/reject-art11 endpoint
- ActionButtons.tsx: "Nicht identifizierbar (Art. 11)" button
shown when identity is not yet verified
- Also fixes: DSR export type-cast rollback handling
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tenant_id kept as string (PostgreSQL handles UUID cast)
- Einwilligungen query uses CAST(:tid AS VARCHAR) for compatibility
- Each data source query wrapped with rollback on failure to prevent
cascading "transaction aborted" errors
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DSRExportService: aggregates all CMP data about a user from
Banner Consents, Einwilligungen, Audit Trail, DSR History
- GET /dsr/{id}/export-user-data?format=json|csv|pdf endpoint
- PDF: A4 reportlab with 4 sections (Consents, Einwilligungen,
Audit-Trail, DSR-Anfragen) + cover page
- CSV: BOM-encoded for Excel with flattened data rows
- JSON: structured export with all data categories
- ActionButtons.tsx: PDF/JSON/CSV export buttons now functional
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migration 115: compliance_role_training_mapping table (org roles → training codes)
- TrainingLinkService: queries training_modules/matrix/assignments to find gaps
per person and role. Gracefully degrades when Go training tables don't exist yet.
- document_review_routes: 2 new endpoints (training-requirements, training-gaps)
- _notify_approval() now checks training gaps and sends emails to persons
with outstanding modules, linking to /sdk/training/learner
[migration-approved]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migration 111: 3 new tables (org_roles, document_reviews, document_role_mapping)
with seed data mapping all 71 doc types to 7 compliance roles
- org_role_routes.py: CRUD for roles, seed defaults, test email, mapping API
- document_review_routes.py: Review lifecycle (create→send→approve/reject)
with approval notification to all affected roles
- Migration 112: SOP template (ISO 9001 structure, 21 placeholders)
- Added standard_operating_procedure to TemplateType, doc-labels, presets
[migration-approved]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migration 110: Updated descriptions and version for 12 previously
unreviewed templates (asset_management, backup, change_management,
cloud_security, devsecops, incident_response, logging, patch_management,
secrets_management, vulnerability_management, informationspflichten,
verpflichtungserklaerung).
All templates assessed as "Very Good" quality — only incremental
updates needed (AI Act, CRA, NIS2UmsuCG references in descriptions).
informationspflichten: Kept as separate compact checklist (distinct
from the full privacy_policy DSI template).
verpflichtungserklaerung: Kept as standalone HR document (employee
signs at onboarding). Added to HR & Mitarbeiter category.
Result: 88 templates, 44 at v1.1+, 0 unreviewed remaining.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 108: Remove DSI duplicate (023 + 093 both wrote privacy_policy DE),
remove outdated EN v1, create English Privacy Notice v2 with all
modular sections (data categories table, retention periods, processor
vs. controller guidance, Art. 21 right to object highlighted)
DB now has exactly 2 privacy_policy templates: DE + EN, both v2.0.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 106: Remove AGB duplicates and obsolete templates (terms_of_service
DE/EN v1.0, liability clause) — replaced by agb v2.0
- 107: English Terms and Conditions v2 (EU-compliant, same structure
as DE version with all IF-blocks)
DB now has exactly 2 AGB templates: DE + EN, both v2.0.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend: _ensure_list() converts null/string/malformed JSONB to []
for requirements, test_procedure, evidence, open_anchors, tags.
Frontend: defensive Array.isArray() check on ControlDetail.tsx.
Fixes: TypeError: A.requirements.map is not a function
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CookieBannerOverlay: shows vendors per category with expandable tables
(Verarbeiter, Cookies, Dauer, Land) for full transparency
- Demo vendors: 4 necessary, 3 statistics, 3 marketing, 3 functional
- cookie_table_generator.py: renders {{COOKIE_TABLE}} Markdown tables
from vendor configs (DB) or service registry (fallback)
- SERVICE_COOKIES: 16 known vendor-to-cookie mappings with provider + country
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CookieBannerOverlay: shows vendors per category with expandable tables
(Verarbeiter, Cookies, Dauer, Land) for full transparency
- Demo vendors: 4 necessary, 3 statistics, 3 marketing, 3 functional
- cookie_table_generator.py: renders {{COOKIE_TABLE}} Markdown tables
from vendor configs (DB) or service registry (fallback)
- SERVICE_COOKIES: 16 known vendor-to-cookie mappings with provider + country
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fundamental architecture fix: data processing happens through APIs/scripts/
cookies — NOT through visible page text. A news site about healthcare does
NOT process health data.
Before: Qwen reads website text → guesses "health_data: true" (WRONG)
After: Google Analytics detected → tracking: true (CORRECT, deterministic)
New flow: detect services from HTML → map service categories to flags →
feed flags into UCCA assessment. No LLM needed for flag extraction.
SERVICE_TO_FLAGS maps categories: tracking→tracking, marketing→marketing+
third_party_sharing, payment→payment_data, heatmap→profiling, etc.
SPECIFIC_SERVICE_FLAGS for Klarna (Art.22), Stripe (US transfer), etc.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New templates for the Vendor Compliance module:
- 105: Transfer Impact Assessment (TIA) — Schrems II risk assessment
with country evaluation, government access assessment, supplementary
measures, risk matrix, and go/conditional/deny decision
- 105: SCC Companion Document — annexes to EU Decision 2021/914
(module selection C2C/C2P/P2P/P2C, party details, data description,
TOMs, sub-processor list)
Template recommendations: SCC+TIA triggered by tech_third_country answer
Generator: New "Drittlandtransfer" category
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3 of the Document Templates Masterplan:
- 103: 4 new security policies (information_security_policy, password_policy,
encryption_policy, access_control_policy) + updates for CRA (056) and
all 15 HR/Vendor/BCM policies (072)
New templates:
- Information Security Policy: ISMS-Leitlinie (ISO 27001, BSI, NIS2)
- Password Policy: BSI/NIST compliant (12+ chars, MFA, no forced rotation)
- Encryption Policy: BSI TR-02102, algorithms, key management, TLS config
- Access Control Policy: RBAC, Least Privilege, Zero Trust, rezertification
Updates: AI Act + NIS2UmsuCG references for CRA and all 15 HR/Vendor/BCM
Generator: 6 new categories (security, HR, data, vendor, BCM policies)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>