feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features
CI / nodejs-build (push) Successful in 2m47s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped

Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus a fits_doc_type audit (v2) and a ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered by business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the Impressum. Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over DOM extraction when it is richer (BMW: 1824 vs 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU-alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for clearly EU-based vendors (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: one structured instruction with what/why/fix_text/where/example per finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...). Meant to be pasted 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check, plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register (record of processing activities) + privacy-policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). cmp_vendors are persisted in the check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts; YouTube/Maps/video placeholder until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.
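
The country inference in the cookie_link_validator entry above can be pictured roughly as follows. This is a minimal sketch, not the committed Python implementation: the function name `inferCountry`, the regexes, and the precedence (legal-form suffix first, lookup table second) are assumptions. Only the four suffix mappings and the two vendor examples come from the commit message.

```typescript
// Hypothetical suffix table; only these four mappings are named in the commit message.
const LEGAL_FORM_COUNTRY: Array<[RegExp, string]> = [
  [/\bA\/S\.?$/i, 'DK'],
  [/\bGmbH$/i, 'DE'],
  [/\bInc\.?$/i, 'US'],
  [/\bB\.?V\.?$/i, 'NL'],
]

// Hypothetical lookup table; the two entries mirror the examples in the message.
const VENDOR_COUNTRY: Record<string, string> = {
  adform: 'DK',
  pinterest: 'IE',
}

function inferCountry(vendorName: string): string | null {
  const name = vendorName.trim()
  // 1) A legal-form suffix at the end of the name is the stronger signal (assumed order).
  for (const [pattern, country] of LEGAL_FORM_COUNTRY) {
    if (pattern.test(name)) return country
  }
  // 2) Fall back to the vendor lookup table, matched on the name prefix.
  const key = name.toLowerCase()
  for (const vendor of Object.keys(VENDOR_COUNTRY)) {
    if (key.indexOf(vendor) === 0) return VENDOR_COUNTRY[vendor]
  }
  // 3) No country could be inferred; a caller could raise the no_country flag here.
  return null
}
```

With this ordering a name such as "Pinterest Inc" would resolve via its suffix, while a bare "Pinterest" falls through to the lookup table.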

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb; similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control library frontend: defensive .map coercers in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 items in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE        7.5% -> 81-83%
  Impressum  4%   -> 100% (all 6 real MCs fulfilled)
  Cookie     0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
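
The embedding batching fix listed under the bug fixes (batches of 32 instead of one call with 165 items) boils down to chunking the input before calling the service. A minimal sketch of that chunking step, assuming a hypothetical helper name `chunk`; the real service code is not part of this diff:

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError('size must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// 165 texts with a batch size of 32 yield five full batches and one batch of 5.
const texts: string[] = []
for (let i = 0; i < 165; i++) texts.push(`text-${i}`)
const batches = chunk(texts, 32)
```

Each batch would then be sent to the embedding service in its own call and the resulting vectors concatenated in order.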
@@ -0,0 +1,55 @@
+/**
+ * Proxy: GET /api/sdk/v1/einwilligungen/export?format=csv|json&kind=consents|history
+ *   -> backend /api/compliance/einwilligungen/export/<file>
+ *
+ * Streams the backend response straight through (CSV or JSON download).
+ */
+import { NextRequest, NextResponse } from 'next/server'
+
+const BACKEND_URL = process.env.BACKEND_URL || 'http://backend-compliance:8002'
+
+function getTenantHeader(request: NextRequest): HeadersInit {
+  const uuidRegex = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
+  const clientTenantId = request.headers.get('x-tenant-id') || request.headers.get('X-Tenant-ID')
+  const tenantId = (clientTenantId && uuidRegex.test(clientTenantId))
+    ? clientTenantId
+    : (process.env.DEFAULT_TENANT_ID || '9282a473-5c95-4b3a-bf78-0ecc0ec71d3e')
+  return { 'X-Tenant-ID': tenantId }
+}
+
+export async function GET(request: NextRequest) {
+  const { searchParams } = new URL(request.url)
+  const fmt = (searchParams.get('format') || 'csv').toLowerCase()
+  const kind = (searchParams.get('kind') || 'consents').toLowerCase()
+
+  const filename = `${kind}.${fmt === 'json' ? 'json' : 'csv'}`
+  const upstreamPath = `/api/compliance/einwilligungen/export/${filename}`
+
+  const passthroughParams = new URLSearchParams()
+  for (const k of ['user_id', 'granted', 'since', 'consent_id']) {
+    const v = searchParams.get(k)
+    if (v) passthroughParams.set(k, v)
+  }
+  const qs = passthroughParams.toString()
+  const url = `${BACKEND_URL}${upstreamPath}${qs ? `?${qs}` : ''}`
+
+  try {
+    const r = await fetch(url, { headers: getTenantHeader(request) })
+    if (!r.ok) {
+      const text = await r.text()
+      return NextResponse.json({ error: text || `HTTP ${r.status}` }, { status: r.status })
+    }
+    return new NextResponse(r.body, {
+      status: 200,
+      headers: {
+        'Content-Type': r.headers.get('content-type') || 'application/octet-stream',
+        'Content-Disposition': r.headers.get('content-disposition') || `attachment; filename=${filename}`,
+      },
+    })
+  } catch (e) {
+    return NextResponse.json(
+      { error: 'Export-Proxy fehlgeschlagen', detail: String(e) },
+      { status: 503 },
+    )
+  }
+}
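In the proxy above, `kind` is lower-cased and then interpolated into the upstream path as received, so a value containing `../` would end up in the backend URL. One possible hardening, sketched here as an assumption rather than as part of this commit, is to resolve both parameters against an allow-list before building the filename. The names `pick` and `buildExportFilename` are hypothetical; the allowed values are the ones listed in the route's doc comment.

```typescript
// Allowed values, taken from the route's doc comment (format=csv|json, kind=consents|history).
const EXPORT_KINDS = ['consents', 'history'] as const
const EXPORT_FORMATS = ['csv', 'json'] as const

// Return the lower-cased value if it is on the allow-list, otherwise the fallback.
function pick<T extends string>(allowed: readonly T[], raw: string | null, fallback: T): T {
  const v = (raw ?? '').toLowerCase()
  return (allowed as readonly string[]).indexOf(v) >= 0 ? (v as T) : fallback
}

// Build the upstream filename from untrusted query parameters.
function buildExportFilename(rawKind: string | null, rawFormat: string | null): string {
  const kind = pick(EXPORT_KINDS, rawKind, 'consents')
  const format = pick(EXPORT_FORMATS, rawFormat, 'csv')
  return `${kind}.${format}`
}
```

The defaults match the existing behaviour (`consents`, `csv`); unknown values silently fall back instead of being forwarded. Returning a 400 for unknown values would be an equally reasonable choice.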
@@ -8,6 +8,23 @@ import type { CanonicalControl } from '../_types'
 import { EFFORT_LABELS } from '../_types'
 import { SeverityBadge, StateBadge, LicenseRuleBadge } from './Badges'
 
+// Defensive coercers: backend has rows where evidence/requirements/test_procedure/open_anchors
+// are JSON-encoded strings instead of arrays. .map() on a string throws — coerce here.
+function asArray<T = unknown>(v: unknown): T[] {
+  if (Array.isArray(v)) return v as T[]
+  if (typeof v === 'string' && v.trim().startsWith('[')) {
+    try { const p = JSON.parse(v); return Array.isArray(p) ? p : [] } catch { return [] }
+  }
+  return []
+}
+function asStringArray(v: unknown): string[] {
+  return asArray(v).map(x => typeof x === 'string' ? x : JSON.stringify(x))
+}
+type EvidenceItem = string | { type?: string; description?: string }
+function asEvidenceArray(v: unknown): EvidenceItem[] {
+  return asArray<EvidenceItem>(v)
+}
+
 export function ControlDetailView({
   ctrl,
   onBack,
@@ -72,31 +89,31 @@ export function ControlDetailView({
 <section>
   <h3 className="text-sm font-semibold text-gray-900 mb-2">Geltungsbereich</h3>
   <div className="grid grid-cols-3 gap-4">
-    {ctrl.scope.platforms && ctrl.scope.platforms.length > 0 && (
+    {asStringArray(ctrl.scope?.platforms).length > 0 && (
       <div>
         <p className="text-xs font-medium text-gray-500 mb-1">Plattformen</p>
         <div className="flex flex-wrap gap-1">
-          {ctrl.scope.platforms.map(p => (
+          {asStringArray(ctrl.scope?.platforms).map(p => (
             <span key={p} className="px-2 py-0.5 bg-blue-50 text-blue-700 rounded text-xs">{p}</span>
           ))}
         </div>
       </div>
     )}
-    {ctrl.scope.components && ctrl.scope.components.length > 0 && (
+    {asStringArray(ctrl.scope?.components).length > 0 && (
       <div>
         <p className="text-xs font-medium text-gray-500 mb-1">Komponenten</p>
         <div className="flex flex-wrap gap-1">
-          {ctrl.scope.components.map(c => (
+          {asStringArray(ctrl.scope?.components).map(c => (
             <span key={c} className="px-2 py-0.5 bg-purple-50 text-purple-700 rounded text-xs">{c}</span>
           ))}
         </div>
       </div>
     )}
-    {ctrl.scope.data_classes && ctrl.scope.data_classes.length > 0 && (
+    {asStringArray(ctrl.scope?.data_classes).length > 0 && (
       <div>
         <p className="text-xs font-medium text-gray-500 mb-1">Datenklassen</p>
         <div className="flex flex-wrap gap-1">
-          {ctrl.scope.data_classes.map(d => (
+          {asStringArray(ctrl.scope?.data_classes).map(d => (
             <span key={d} className="px-2 py-0.5 bg-amber-50 text-amber-700 rounded text-xs">{d}</span>
           ))}
         </div>
@@ -109,7 +126,7 @@ export function ControlDetailView({
 <section>
   <h3 className="text-sm font-semibold text-gray-900 mb-2">Anforderungen</h3>
   <ol className="space-y-2">
-    {ctrl.requirements.map((req, i) => (
+    {asStringArray(ctrl.requirements).map((req, i) => (
       <li key={i} className="flex items-start gap-2 text-sm text-gray-700">
         <span className="flex-shrink-0 w-5 h-5 bg-purple-100 text-purple-700 rounded-full flex items-center justify-center text-xs font-medium mt-0.5">{i + 1}</span>
         {req}
@@ -122,7 +139,7 @@ export function ControlDetailView({
 <section>
   <h3 className="text-sm font-semibold text-gray-900 mb-2">Pruefverfahren</h3>
   <ol className="space-y-2">
-    {ctrl.test_procedure.map((step, i) => (
+    {asStringArray(ctrl.test_procedure).map((step, i) => (
       <li key={i} className="flex items-start gap-2 text-sm text-gray-700">
         <CheckCircle2 className="w-4 h-4 text-green-500 flex-shrink-0 mt-0.5" />
         {step}
@@ -135,12 +152,18 @@ export function ControlDetailView({
 <section>
   <h3 className="text-sm font-semibold text-gray-900 mb-2">Nachweisanforderungen</h3>
   <div className="space-y-2">
-    {ctrl.evidence.map((ev, i) => (
+    {asEvidenceArray(ctrl.evidence).map((ev, i) => (
       <div key={i} className="flex items-start gap-2 p-3 bg-gray-50 rounded-lg">
         <FileText className="w-4 h-4 text-gray-400 flex-shrink-0 mt-0.5" />
         <div>
-          <span className="text-xs font-medium text-gray-500 uppercase">{ev.type}</span>
-          <p className="text-sm text-gray-700">{ev.description}</p>
+          {typeof ev === 'string' ? (
+            <p className="text-sm text-gray-700">{ev}</p>
+          ) : (
+            <>
+              {ev.type && <span className="text-xs font-medium text-gray-500 uppercase">{ev.type}</span>}
+              <p className="text-sm text-gray-700">{ev.description ?? JSON.stringify(ev)}</p>
+            </>
+          )}
         </div>
       </div>
     ))}
@@ -152,13 +175,13 @@ export function ControlDetailView({
 <div className="flex items-center gap-2 mb-3">
   <BookOpen className="w-4 h-4 text-green-700" />
   <h3 className="text-sm font-semibold text-green-900">Open-Source-Referenzen</h3>
-  <span className="text-xs text-green-600">({ctrl.open_anchors.length} Quellen)</span>
+  <span className="text-xs text-green-600">({asArray(ctrl.open_anchors).length} Quellen)</span>
 </div>
 <p className="text-xs text-green-700 mb-3">
   Dieses Control basiert auf frei verfuegbarem Wissen. Alle Referenzen sind offen und oeffentlich zugaenglich.
 </p>
 <div className="space-y-2">
-  {ctrl.open_anchors.map((anchor, i) => (
+  {asArray<{ framework?: string; ref?: string; url?: string }>(ctrl.open_anchors).map((anchor, i) => (
     <div key={i} className="flex items-start gap-3 p-2 bg-white rounded border border-green-100">
       <Scale className="w-4 h-4 text-green-600 flex-shrink-0 mt-0.5" />
       <div className="flex-1 min-w-0">
@@ -180,11 +203,11 @@ export function ControlDetailView({
 </section>
 
 {/* Tags */}
-{ctrl.tags.length > 0 && (
+{asStringArray(ctrl.tags).length > 0 && (
   <section>
     <h3 className="text-sm font-semibold text-gray-900 mb-2">Tags</h3>
     <div className="flex flex-wrap gap-1.5">
-      {ctrl.tags.map(tag => (
+      {asStringArray(ctrl.tags).map(tag => (
         <span key={tag} className="px-2 py-1 bg-gray-100 text-gray-600 rounded text-xs">{tag}</span>
       ))}
     </div>
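To make the effect of the coercers in the diff above concrete, here is a standalone restatement with a few inputs of the kind the comment describes (real arrays, JSON-encoded strings, and junk). It is a sketch for illustration: the helper bodies follow the diff closely but are slightly reformatted, and the sample values are made up.

```typescript
// Coerce "array or JSON-encoded array string" into an array; anything else becomes [].
function asArray<T = unknown>(v: unknown): T[] {
  if (Array.isArray(v)) return v as T[]
  if (typeof v === 'string' && v.trim().charAt(0) === '[') {
    try {
      const parsed: unknown = JSON.parse(v)
      return Array.isArray(parsed) ? (parsed as T[]) : []
    } catch {
      return [] // malformed JSON is treated as "no items"
    }
  }
  return []
}

// Same as asArray, but non-string items are serialized so they can be rendered as text.
function asStringArray(v: unknown): string[] {
  return asArray(v).map(x => (typeof x === 'string' ? x : JSON.stringify(x)))
}

// Hypothetical sample values for a field such as ctrl.scope.platforms:
const fromArray = asStringArray(['web', 'ios'])          // already an array
const fromJsonString = asStringArray('["web","ios"]')    // JSON-encoded string
const fromObjects = asStringArray('[{"type":"log"}]')    // objects are stringified
const fromPlainText = asStringArray('web, ios')          // not JSON: empty result
const fromNull = asStringArray(null)                     // missing field: empty result
```

Because every fallback is an empty array, `.map()` and `.length` are always safe to call, at the cost of silently hiding malformed rows.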
@@ -18,6 +18,16 @@ import { ControlRegulatorySection } from './ControlRegulatorySection'
 import { ControlSimilarControls } from './ControlSimilarControls'
 import { ControlReviewActions } from './ControlReviewActions'
 
+// Defensive coercer: some canonical_controls rows have evidence/tags/etc.
+// as JSON-encoded strings instead of arrays. .map() on a string throws.
+function toArray<T = unknown>(v: unknown): T[] {
+  if (Array.isArray(v)) return v as T[]
+  if (typeof v === 'string' && v.trim().startsWith('[')) {
+    try { const p = JSON.parse(v); return Array.isArray(p) ? p : [] } catch { return [] }
+  }
+  return []
+}
+
 interface SimilarControl {
   control_id: string; title: string; severity: string; release_state: string;
   tags: string[]; license_rule: number | null; verification_method: string | null;
@@ -186,7 +196,7 @@ export function ControlDetail({
 <ControlTraceability ctrl={ctrl} traceability={traceability} loadingTrace={loadingTrace}
   onNavigateToControl={onNavigateToControl} />
 
-{!ctrl.source_citation && ctrl.open_anchors.length > 0 && (
+{!ctrl.source_citation && toArray(ctrl.open_anchors).length > 0 && (
   <section className="bg-amber-50 border border-amber-200 rounded-lg p-3">
     <div className="flex items-center gap-2">
       <Scale className="w-4 h-4 text-amber-600" />
@@ -201,36 +211,36 @@ export function ControlDetail({
   </section>
 )}
 
-{(ctrl.scope.platforms?.length || ctrl.scope.components?.length || ctrl.scope.data_classes?.length) ? (
+{(toArray(ctrl.scope?.platforms).length || toArray(ctrl.scope?.components).length || toArray(ctrl.scope?.data_classes).length) ? (
   <section>
     <h3 className="text-sm font-semibold text-gray-900 mb-2">Geltungsbereich</h3>
     <div className="grid grid-cols-3 gap-4 text-xs">
-      {ctrl.scope.platforms?.length ? <div><span className="text-gray-500">Plattformen:</span> <span className="text-gray-700">{ctrl.scope.platforms.join(', ')}</span></div> : null}
-      {ctrl.scope.components?.length ? <div><span className="text-gray-500">Komponenten:</span> <span className="text-gray-700">{ctrl.scope.components.join(', ')}</span></div> : null}
-      {ctrl.scope.data_classes?.length ? <div><span className="text-gray-500">Datenklassen:</span> <span className="text-gray-700">{ctrl.scope.data_classes.join(', ')}</span></div> : null}
+      {toArray<string>(ctrl.scope?.platforms).length ? <div><span className="text-gray-500">Plattformen:</span> <span className="text-gray-700">{toArray<string>(ctrl.scope?.platforms).join(', ')}</span></div> : null}
+      {toArray<string>(ctrl.scope?.components).length ? <div><span className="text-gray-500">Komponenten:</span> <span className="text-gray-700">{toArray<string>(ctrl.scope?.components).join(', ')}</span></div> : null}
+      {toArray<string>(ctrl.scope?.data_classes).length ? <div><span className="text-gray-500">Datenklassen:</span> <span className="text-gray-700">{toArray<string>(ctrl.scope?.data_classes).join(', ')}</span></div> : null}
     </div>
   </section>
 ) : null}
 
-{Array.isArray(ctrl.requirements) && ctrl.requirements.length > 0 && (
+{toArray<string>(ctrl.requirements).length > 0 && (
   <section>
     <h3 className="text-sm font-semibold text-gray-900 mb-2">Anforderungen</h3>
-    <ol className="list-decimal list-inside space-y-1">{ctrl.requirements.map((r, i) => <li key={i} className="text-sm text-gray-700">{r}</li>)}</ol>
+    <ol className="list-decimal list-inside space-y-1">{toArray<string>(ctrl.requirements).map((r, i) => <li key={i} className="text-sm text-gray-700">{r}</li>)}</ol>
   </section>
 )}
 
-{Array.isArray(ctrl.test_procedure) && ctrl.test_procedure.length > 0 && (
+{toArray<string>(ctrl.test_procedure).length > 0 && (
   <section>
     <h3 className="text-sm font-semibold text-gray-900 mb-2">Pruefverfahren</h3>
-    <ol className="list-decimal list-inside space-y-1">{ctrl.test_procedure.map((s, i) => <li key={i} className="text-sm text-gray-700">{s}</li>)}</ol>
+    <ol className="list-decimal list-inside space-y-1">{toArray<string>(ctrl.test_procedure).map((s, i) => <li key={i} className="text-sm text-gray-700">{s}</li>)}</ol>
   </section>
 )}
 
-{ctrl.evidence.length > 0 && (
+{toArray(ctrl.evidence).length > 0 && (
   <section>
     <h3 className="text-sm font-semibold text-gray-900 mb-2">Nachweise</h3>
     <div className="space-y-2">
-      {ctrl.evidence.map((ev, i) => (
+      {toArray<string | { type?: string; description?: string }>(ctrl.evidence).map((ev, i) => (
         <div key={i} className="flex items-start gap-2 text-sm text-gray-700">
           <FileText className="w-4 h-4 text-gray-400 flex-shrink-0 mt-0.5" />
           {typeof ev === 'string' ? <div>{ev}</div> : <div><span className="font-medium">{ev.type}:</span> {ev.description}</div>}
@@ -243,9 +253,9 @@ export function ControlDetail({
 <section className="grid grid-cols-3 gap-4 text-xs text-gray-500">
   {ctrl.risk_score !== null && <div>Risiko-Score: <span className="text-gray-700 font-medium">{ctrl.risk_score}</span></div>}
   {ctrl.implementation_effort && <div>Aufwand: <span className="text-gray-700 font-medium">{EFFORT_LABELS[ctrl.implementation_effort] || ctrl.implementation_effort}</span></div>}
-  {ctrl.tags.length > 0 && (
+  {toArray<string>(ctrl.tags).length > 0 && (
     <div className="col-span-3 flex items-center gap-1 flex-wrap">
-      {ctrl.tags.map(t => <span key={t} className="px-2 py-0.5 bg-gray-100 text-gray-600 rounded text-xs">{t}</span>)}
+      {toArray<string>(ctrl.tags).map(t => <span key={t} className="px-2 py-0.5 bg-gray-100 text-gray-600 rounded text-xs">{t}</span>)}
     </div>
   )}
 </section>
@@ -253,11 +263,11 @@ export function ControlDetail({
 <section className="bg-green-50 border border-green-200 rounded-lg p-4">
   <div className="flex items-center gap-2 mb-3">
     <BookOpen className="w-4 h-4 text-green-700" />
-    <h3 className="text-sm font-semibold text-green-900">Open-Source-Referenzen ({ctrl.open_anchors.length})</h3>
+    <h3 className="text-sm font-semibold text-green-900">Open-Source-Referenzen ({toArray(ctrl.open_anchors).length})</h3>
   </div>
-  {ctrl.open_anchors.length > 0 ? (
+  {toArray(ctrl.open_anchors).length > 0 ? (
     <div className="space-y-2">
-      {ctrl.open_anchors.map((anchor, i) => (
+      {toArray<{ framework?: string; ref?: string; url?: string }>(ctrl.open_anchors).map((anchor, i) => (
         <div key={i} className="flex items-center gap-2 text-sm">
           <ExternalLink className="w-3.5 h-3.5 text-green-600 flex-shrink-0" />
           <span className="font-medium text-green-800">{anchor.framework}</span>
@@ -1,5 +1,7 @@
 'use client'
 
+import { useEffect } from 'react'
+import { useSearchParams } from 'next/navigation'
 import { EMPTY_CONTROL } from './components/helpers'
 import { ControlForm } from './components/ControlForm'
 import { ControlDetail } from './components/ControlDetail'
@@ -12,6 +14,24 @@ import { BACKEND_URL } from './components/helpers'
 
 export default function ControlLibraryPage() {
   const state = useControlLibraryState()
+  const searchParams = useSearchParams()
+
+  // Deep-link via /sdk/control-library?control=<id>
+  // — e.g. from /sdk/master-controls member list.
+  useEffect(() => {
+    const cid = searchParams?.get('control')
+    if (!cid || state.selectedControl?.control_id === cid) return
+    fetch(`${BACKEND_URL}?endpoint=control&id=${encodeURIComponent(cid)}`)
+      .then(r => r.ok ? r.json() : null)
+      .then(ctrl => {
+        if (ctrl?.control_id) {
+          state.setSelectedControl(ctrl)
+          state.setMode('detail')
+        }
+      })
+      .catch(() => { /* user just sees the list */ })
+    // eslint-disable-next-line react-hooks/exhaustive-deps
+  }, [searchParams])
 
   const {
     handleCreate, handleUpdate, handleDelete, handleReview, handleBulkReject,
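The deep-link handling above reads the `control` query parameter on the control library page. A sketch of the matching link-building side, which the master-controls member list would need, might look like the following. The helper names are hypothetical and the control IDs in the usage lines are made-up examples; only the path and the parameter name come from the commit.

```typescript
// Path and parameter name as used in the commit; helper names are hypothetical.
const CONTROL_LIBRARY_PATH = '/sdk/control-library'

// Build a deep link that opens one control in the library's detail view.
function controlDeepLink(controlId: string): string {
  return `${CONTROL_LIBRARY_PATH}?control=${encodeURIComponent(controlId)}`
}

// Read the control id back from a location.search string; null if absent or empty.
function controlIdFromSearch(search: string): string | null {
  const id = new URLSearchParams(search).get('control')
  return id && id.length > 0 ? id : null
}

// Made-up control ids for illustration.
const plainLink = controlDeepLink('AUTH-001')
const escapedLink = controlDeepLink('A&B/1')
```

Encoding the id on the way out and decoding it through `URLSearchParams` on the way in keeps ids with reserved characters intact across the round trip.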
@@ -57,12 +57,7 @@ export default function EinwilligungenPage() {
   explanation={stepInfo.explanation}
   tips={stepInfo.tips}
 >
-  <button className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 transition-colors">
-    <svg className="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
-      <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M4 16v1a3 3 0 003 3h10a3 3 0 003-3v-1m-4-4l-4 4m0 0l-4-4m4 4V4" />
-    </svg>
-    Export
-  </button>
+  <ConsentExportButton />
 </StepHeader>
 
 {/* Navigation Tabs */}
@@ -150,3 +145,32 @@ export default function EinwilligungenPage() {
     </div>
   )
 }
+
+// Export dropdown in the step header. Streams CSV/JSON directly from the
+// backend via the /api/sdk/v1/einwilligungen/export proxy.
+function ConsentExportButton() {
+  return (
+    <div className="relative group">
+      <button className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 transition-colors">
+        <svg className="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
+          <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M4 16v1a3 3 0 003 3h10a3 3 0 003-3v-1m-4-4l-4 4m0 0l-4-4m4 4V4" />
+        </svg>
+        Export
+      </button>
+      <div className="absolute right-0 top-full mt-1 w-60 bg-white border border-gray-200 rounded-lg shadow-lg invisible group-hover:visible opacity-0 group-hover:opacity-100 transition-all z-10">
+        <a href="/api/sdk/v1/einwilligungen/export?format=csv&kind=consents" download
+           className="block px-4 py-2 text-sm text-gray-700 hover:bg-purple-50 first:rounded-t-lg">
+          Einwilligungen als CSV
+        </a>
+        <a href="/api/sdk/v1/einwilligungen/export?format=json&kind=consents" download
+           className="block px-4 py-2 text-sm text-gray-700 hover:bg-purple-50">
+          Einwilligungen als JSON
+        </a>
+        <a href="/api/sdk/v1/einwilligungen/export?format=csv&kind=history" download
+           className="block px-4 py-2 text-sm text-gray-700 hover:bg-purple-50 last:rounded-b-lg border-t border-gray-100">
+          Aenderungs-Historie als CSV
+        </a>
+      </div>
+    </div>
+  )
+}
@@ -199,32 +199,43 @@ function MCDetail({ mc, onBack }: { mc: Record<string, unknown>; onBack: () => v
|
|||||||
</div>
|
</div>
|
||||||
) : (
|
) : (
|
||||||
<div className="divide-y divide-gray-50">
|
<div className="divide-y divide-gray-50">
|
||||||
{filtered.map((m, i) => (
|
{filtered.map((m, i) => {
|
||||||
<div key={i} className="px-4 py-3 hover:bg-gray-50">
|
const inner = (
|
||||||
<div className="flex items-center gap-2 mb-1">
|
<>
|
||||||
<span className="text-xs font-mono text-gray-400">{m.control_id}</span>
|
<div className="flex items-center gap-2 mb-1">
|
||||||
{m.severity && (
|
<span className="text-xs font-mono text-gray-400">{m.control_id}</span>
|
||||||
<span className={`px-1.5 py-0.5 rounded text-[10px] font-bold ${SEV[m.severity] || 'bg-gray-100 text-gray-600'}`}>
|
{m.severity && (
|
||||||
{m.severity}
|
<span className={`px-1.5 py-0.5 rounded text-[10px] font-bold ${SEV[m.severity] || 'bg-gray-100 text-gray-600'}`}>
|
||||||
</span>
|
{m.severity}
|
||||||
|
</span>
|
||||||
|
)}
|
||||||
|
{m.phase && (
|
||||||
|
<span className="text-[10px] text-purple-600 bg-purple-50 px-1.5 py-0.5 rounded">
|
||||||
|
{m.phase}
|
||||||
|
</span>
|
||||||
|
)}
|
||||||
|
{m.action && (
|
||||||
|
<span className="text-[10px] text-gray-400">{m.action}</span>
|
||||||
|
)}
|
||||||
|
</div>
|
||||||
|
<p className="text-sm text-gray-900">{m.title}</p>
|
||||||
|
{m.regulation_source && (
|
||||||
|
<p className="text-xs text-blue-600 mt-1">
|
||||||
|
{m.regulation_source} {m.regulation_article}
|
||||||
|
</p>
|
||||||
)}
|
)}
|
||||||
{m.phase && (
|
</>
|
||||||
<span className="text-[10px] text-purple-600 bg-purple-50 px-1.5 py-0.5 rounded">
|
)
|
||||||
{m.phase}
|
return m.control_id ? (
|
||||||
</span>
|
<a key={i}
|
||||||
)}
|
href={`/sdk/control-library?control=${encodeURIComponent(m.control_id)}`}
|
||||||
{m.action && (
|
className="block px-4 py-3 hover:bg-purple-50/40 transition-colors">
|
||||||
<span className="text-[10px] text-gray-400">{m.action}</span>
|
{inner}
|
||||||
)}
|
</a>
|
||||||
</div>
|
) : (
|
||||||
<p className="text-sm text-gray-900">{m.title}</p>
|
<div key={i} className="px-4 py-3 hover:bg-gray-50">{inner}</div>
|
||||||
{m.regulation_source && (
|
)
|
||||||
<p className="text-xs text-blue-600 mt-1">
|
})}
|
||||||
{m.regulation_source} {m.regulation_article}
|
|
||||||
</p>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
))}
|
|
||||||
{filtered.length === 0 && !loading && (
|
{filtered.length === 0 && !loading && (
|
||||||
<div className="p-8 text-center text-gray-400">Keine Controls gefunden</div>
|
<div className="p-8 text-center text-gray-400">Keine Controls gefunden</div>
|
||||||
)}
|
)}
|
||||||
|
|||||||
@@ -0,0 +1,156 @@
|
|||||||
|
/**
|
||||||
|
* Content-Blocker Generator (Borlabs-Parity).
|
||||||
|
*
|
||||||
|
* Returns a small JS snippet that scans the page for blockable third-party
|
||||||
|
* embeds (YouTube, Vimeo, Google Maps, Spotify, Twitter, Facebook) and
|
||||||
|
* replaces them with a click-to-consent placeholder until the user agrees
|
||||||
|
* to the relevant cookie category.
|
||||||
|
*
|
||||||
|
* The customer drops a SECOND script tag next to the banner:
|
||||||
|
* <script src="/cookie-banner.js"></script>
|
||||||
|
* <script src="/cookie-content-blocker.js"></script>
|
||||||
|
*
|
||||||
|
* Author writes content as either:
|
||||||
|
* <bp-consent-block category="EXTERNAL_MEDIA"
|
||||||
|
* provider="YouTube"
|
||||||
|
* src="https://www.youtube.com/embed/...">
|
||||||
|
* <!-- the original iframe / embed code -->
|
||||||
|
* </bp-consent-block>
|
||||||
|
*
|
||||||
|
* OR auto-detect: any <iframe src="https://www.youtube.com/...">
|
||||||
|
* gets wrapped on page load.
|
||||||
|
*/
|
||||||
|
|
||||||
|
const KNOWN_EMBEDS: Array<{ host: string; provider: string; category: string }> = [
|
||||||
|
{ host: 'youtube.com', provider: 'YouTube', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'youtu.be', provider: 'YouTube', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'vimeo.com', provider: 'Vimeo', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'google.com/maps', provider: 'Google Maps', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'maps.googleapis.com', provider: 'Google Maps', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'spotify.com', provider: 'Spotify', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'soundcloud.com', provider: 'SoundCloud', category: 'EXTERNAL_MEDIA' },
|
||||||
|
{ host: 'twitter.com', provider: 'Twitter / X', category: 'PERSONALIZATION' },
|
||||||
|
{ host: 'facebook.com', provider: 'Facebook', category: 'PERSONALIZATION' },
|
||||||
|
{ host: 'instagram.com', provider: 'Instagram', category: 'PERSONALIZATION' },
|
||||||
|
]
|
||||||
|
|
||||||
|
export function generateContentBlockerJS(cookieName: string = 'cookie_consent'): string {
|
||||||
|
return `(function () {
|
||||||
|
'use strict';
|
||||||
|
var COOKIE_NAME = ${JSON.stringify(cookieName)};
|
||||||
|
var KNOWN_EMBEDS = ${JSON.stringify(KNOWN_EMBEDS)};
|
||||||
|
|
||||||
|
function getConsent() {
|
||||||
|
var c = document.cookie.split('; ').find(function (r) {
|
||||||
|
return r.indexOf(COOKIE_NAME + '=') === 0;
|
||||||
|
});
|
||||||
|
if (!c) return null;
|
||||||
|
try { return JSON.parse(decodeURIComponent(c.split('=')[1])); } catch (e) { return null; }
|
||||||
|
}
|
||||||
|
|
||||||
|
function categoryGranted(cat) {
|
||||||
|
var c = getConsent();
|
||||||
|
if (!c) return false;
|
||||||
|
var k = String(cat).toLowerCase();
|
||||||
|
return c[cat] === true || c[k] === true;
|
||||||
|
}
|
||||||
|
|
||||||
|
function classifyByHost(src) {
|
||||||
|
if (!src) return null;
|
||||||
|
for (var i = 0; i < KNOWN_EMBEDS.length; i++) {
|
||||||
|
if (src.indexOf(KNOWN_EMBEDS[i].host) > -1) return KNOWN_EMBEDS[i];
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function makePlaceholder(provider, category, originalHTML, parent) {
|
||||||
|
var ph = document.createElement('div');
|
||||||
|
ph.className = 'bp-consent-placeholder';
|
||||||
|
ph.style.cssText = 'border:2px dashed #cbd5e1;background:#f8fafc;padding:24px;' +
|
||||||
|
'border-radius:8px;text-align:center;font-family:-apple-system,sans-serif;color:#475569';
|
||||||
|
ph.innerHTML =
|
||||||
|
'<div style="font-size:14px;font-weight:600;color:#1e293b;margin-bottom:8px">' +
|
||||||
|
'Inhalt von ' + provider + ' blockiert</div>' +
|
||||||
|
'<div style="font-size:12px;margin-bottom:12px">' +
|
||||||
|
'Zum Anzeigen dieses Inhalts wird Ihre Einwilligung fuer die Kategorie ' +
|
||||||
|
'<strong>' + category + '</strong> benoetigt. ' +
|
||||||
|
'Beim Akzeptieren werden Cookies von ' + provider + ' gesetzt.</div>' +
|
||||||
|
'<button class="bp-consent-load-btn" ' +
|
||||||
|
'style="background:#7c3aed;color:white;border:none;padding:8px 16px;' +
|
||||||
|
'border-radius:6px;font-size:13px;cursor:pointer;margin-right:6px">' +
|
||||||
|
'Inhalt einmalig laden</button>' +
|
||||||
|
'<button class="bp-consent-accept-btn" ' +
|
||||||
|
'style="background:#16a34a;color:white;border:none;padding:8px 16px;' +
|
||||||
|
'border-radius:6px;font-size:13px;cursor:pointer">' +
|
||||||
|
category + ' akzeptieren</button>';
|
||||||
|
ph.querySelector('.bp-consent-load-btn').addEventListener('click', function () {
|
||||||
|
var div = document.createElement('div');
|
||||||
|
div.innerHTML = originalHTML;
|
||||||
|
while (div.firstChild) parent.insertBefore(div.firstChild, ph);
|
||||||
|
ph.remove();
|
||||||
|
});
|
||||||
|
ph.querySelector('.bp-consent-accept-btn').addEventListener('click', function () {
|
||||||
|
var c = getConsent() || {};
|
||||||
|
c[category] = true;
|
||||||
|
var date = new Date();
|
||||||
|
date.setTime(date.getTime() + 180 * 86400000);
|
||||||
|
document.cookie = COOKIE_NAME + '=' + encodeURIComponent(JSON.stringify(c)) +
|
||||||
|
';expires=' + date.toUTCString() + ';path=/;SameSite=Lax';
|
||||||
|
window.dispatchEvent(new CustomEvent('cookieConsentUpdated', { detail: c }));
|
||||||
|
// Re-scan: placeholders for THIS category get replaced now
|
||||||
|
processAll();
|
||||||
|
});
|
||||||
|
return ph;
|
||||||
|
}
|
||||||
|
|
||||||
|
function processWrapped() {
|
||||||
|
var wrapped = document.querySelectorAll('bp-consent-block, [data-bp-consent-block]');
|
||||||
|
wrapped.forEach(function (el) {
|
||||||
|
var cat = el.getAttribute('category') || el.getAttribute('data-category') || 'EXTERNAL_MEDIA';
|
||||||
|
var prov = el.getAttribute('provider') || el.getAttribute('data-provider') || 'Drittanbieter';
|
||||||
|
if (categoryGranted(cat)) {
|
||||||
|
// Already consented: unwrap the inner content
|
||||||
|
var html = el.innerHTML;
|
||||||
|
var tmp = document.createElement('div');
|
||||||
|
tmp.innerHTML = html;
|
||||||
|
var parent = el.parentNode;
|
||||||
|
while (tmp.firstChild) parent.insertBefore(tmp.firstChild, el);
|
||||||
|
el.remove();
|
||||||
|
} else {
|
||||||
|
var parent = el.parentNode;
|
||||||
|
var inner = el.innerHTML;
|
||||||
|
var ph = makePlaceholder(prov, cat, inner, parent);
|
||||||
|
parent.insertBefore(ph, el);
|
||||||
|
el.remove();
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
function processBareIframes() {
|
||||||
|
var iframes = document.querySelectorAll('iframe[src]:not([data-bp-processed])');
|
||||||
|
iframes.forEach(function (f) {
|
||||||
|
var match = classifyByHost(f.getAttribute('src') || '');
|
||||||
|
if (!match) return;
|
||||||
|
f.setAttribute('data-bp-processed', '1');
|
||||||
|
if (categoryGranted(match.category)) return;
|
||||||
|
var html = f.outerHTML;
|
||||||
|
var parent = f.parentNode;
|
||||||
|
var ph = makePlaceholder(match.provider, match.category, html, parent);
|
||||||
|
parent.replaceChild(ph, f);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
function processAll() {
|
||||||
|
processWrapped();
|
||||||
|
processBareIframes();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (document.readyState === 'loading') {
|
||||||
|
document.addEventListener('DOMContentLoaded', processAll);
|
||||||
|
} else {
|
||||||
|
processAll();
|
||||||
|
}
|
||||||
|
// Re-process when consent updates
|
||||||
|
window.addEventListener('cookieConsentUpdated', processAll);
|
||||||
|
})();`
|
||||||
|
}
|
||||||
@@ -325,18 +325,25 @@ function generateJS(config: CookieBannerConfig): string {
|
|||||||
const CATEGORIES = ${JSON.stringify(categoryIds)};
|
const CATEGORIES = ${JSON.stringify(categoryIds)};
|
||||||
const REQUIRED_CATEGORIES = ${JSON.stringify(requiredCategories)};
|
const REQUIRED_CATEGORIES = ${JSON.stringify(requiredCategories)};
|
||||||
|
|
||||||
// Google Consent Mode v2: MANDATORY since March 2024 for Google services in the EEA
|
// Google Consent Mode v2: MANDATORY since March 2024 for Google services
|
||||||
// Sets default consent state to "denied" BEFORE any Google tags fire
|
// in the EEA. Shim gtag/dataLayer if the Google tag has not been initialized
|
||||||
if (typeof gtag === 'function') {
|
// yet, then set the default consent state (DENIED) immediately.
|
||||||
gtag('consent', 'default', {
|
window.dataLayer = window.dataLayer || [];
|
||||||
analytics_storage: 'denied',
|
if (typeof gtag !== 'function') {
|
||||||
ad_storage: 'denied',
|
window.gtag = function () { window.dataLayer.push(arguments); };
|
||||||
ad_user_data: 'denied',
|
|
||||||
ad_personalization: 'denied',
|
|
||||||
functionality_storage: 'granted',
|
|
||||||
security_storage: 'granted',
|
|
||||||
});
|
|
||||||
}
|
}
|
||||||
|
// wait_for_update gives the banner 500 ms so the user can
|
||||||
|
// decide before tags fire. Recommended by Google for GCM v2.
|
||||||
|
gtag('consent', 'default', {
|
||||||
|
analytics_storage: 'denied',
|
||||||
|
ad_storage: 'denied',
|
||||||
|
ad_user_data: 'denied',
|
||||||
|
ad_personalization: 'denied',
|
||||||
|
functionality_storage: 'granted',
|
||||||
|
security_storage: 'granted',
|
||||||
|
wait_for_update: 500,
|
||||||
|
region: ['EEA', 'CH', 'GB'],
|
||||||
|
});
|
||||||
|
|
||||||
function updateGoogleConsentMode(consent) {
|
function updateGoogleConsentMode(consent) {
|
||||||
if (typeof gtag !== 'function') return;
|
if (typeof gtag !== 'function') return;
|
||||||
@@ -364,10 +371,61 @@ function generateJS(config: CookieBannerConfig): string {
|
|||||||
document.cookie = COOKIE_NAME + '=' + encodeURIComponent(JSON.stringify(consent)) +
|
document.cookie = COOKIE_NAME + '=' + encodeURIComponent(JSON.stringify(consent)) +
|
||||||
';expires=' + date.toUTCString() +
|
';expires=' + date.toUTCString() +
|
||||||
';path=/;SameSite=Lax';
|
';path=/;SameSite=Lax';
|
||||||
|
// Append to local history (GDPR Art. 7(3) best practice + Borlabs parity).
|
||||||
|
// Server-side logging runs separately via consent-service.
|
||||||
|
try {
|
||||||
|
const HKEY = COOKIE_NAME + '_history';
|
||||||
|
const hist = JSON.parse(localStorage.getItem(HKEY) || '[]');
|
||||||
|
hist.push({
|
||||||
|
ts: new Date().toISOString(),
|
||||||
|
choices: consent,
|
||||||
|
});
|
||||||
|
if (hist.length > 50) hist.splice(0, hist.length - 50);
|
||||||
|
localStorage.setItem(HKEY, JSON.stringify(hist));
|
||||||
|
} catch (e) { /* localStorage blocked */ }
|
||||||
window.dispatchEvent(new CustomEvent('cookieConsentUpdated', { detail: consent }));
|
window.dispatchEvent(new CustomEvent('cookieConsentUpdated', { detail: consent }));
|
||||||
updateGoogleConsentMode(consent);
|
updateGoogleConsentMode(consent);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Borlabs parity: shows the user all of their previous consent decisions.
|
||||||
|
// Invoked via window.bpShowConsentHistory() or a click on the link in the banner footer.
|
||||||
|
window.bpShowConsentHistory = function () {
|
||||||
|
var existing = document.getElementById('bpConsentHistoryModal');
|
||||||
|
if (existing) { existing.remove(); return; }
|
||||||
|
var hist = [];
|
||||||
|
try { hist = JSON.parse(localStorage.getItem(COOKIE_NAME + '_history') || '[]'); } catch (e) {}
|
||||||
|
var rows = hist.length === 0
|
||||||
|
? '<p style="color:#94a3b8;font-style:italic">Noch keine Einwilligungen gespeichert.</p>'
|
||||||
|
: hist.slice().reverse().map(function (h) {
|
||||||
|
var d = new Date(h.ts);
|
||||||
|
var parts = Object.keys(h.choices).map(function (k) {
|
||||||
|
return '<span style="margin-right:8px;font-size:11px;color:' +
|
||||||
|
(h.choices[k] ? '#16a34a' : '#dc2626') + '">' +
|
||||||
|
(h.choices[k] ? '✓ ' : '✗ ') + k + '</span>';
|
||||||
|
}).join('');
|
||||||
|
return '<div style="border-bottom:1px solid #e5e7eb;padding:8px 0">' +
|
||||||
|
'<div style="font-size:12px;color:#64748b;margin-bottom:4px">' +
|
||||||
|
d.toLocaleString('de-DE') + '</div>' +
|
||||||
|
'<div>' + parts + '</div></div>';
|
||||||
|
}).join('');
|
||||||
|
var modal = document.createElement('div');
|
||||||
|
modal.id = 'bpConsentHistoryModal';
|
||||||
|
modal.style.cssText = 'position:fixed;inset:0;background:rgba(0,0,0,0.5);' +
|
||||||
|
'z-index:999999;display:flex;align-items:center;justify-content:center;padding:20px';
|
||||||
|
modal.innerHTML = '<div style="background:white;border-radius:8px;max-width:500px;' +
|
||||||
|
'width:100%;max-height:80vh;overflow:auto;padding:20px;font-family:-apple-system,sans-serif">' +
|
||||||
|
'<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:12px">' +
|
||||||
|
'<h3 style="margin:0;font-size:16px">Ihre Einwilligungs-Historie</h3>' +
|
||||||
|
'<button onclick="document.getElementById(\\'bpConsentHistoryModal\\').remove()" ' +
|
||||||
|
'style="background:none;border:none;font-size:24px;cursor:pointer;color:#94a3b8">×</button>' +
|
||||||
|
'</div>' +
|
||||||
|
'<p style="font-size:12px;color:#64748b;margin:0 0 12px">' +
|
||||||
|
'Lokal in Ihrem Browser gespeichert. Server-seitig laufen Audit-Logs gemaess Art. 7(1) DSGVO.</p>' +
|
||||||
|
rows + '</div>';
|
||||||
|
modal.addEventListener('click', function (e) { if (e.target === modal) modal.remove(); });
|
||||||
|
document.body.appendChild(modal);
|
||||||
|
};
|
||||||
|
|
||||||
function hasConsent(category) {
|
function hasConsent(category) {
|
||||||
const consent = getConsent();
|
const consent = getConsent();
|
||||||
if (!consent) return REQUIRED_CATEGORIES.includes(category);
|
if (!consent) return REQUIRED_CATEGORIES.includes(category);
|
||||||
|
|||||||
@@ -39,8 +39,9 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
|
|||||||
COPY --from=builder /opt/venv /opt/venv
|
COPY --from=builder /opt/venv /opt/venv
|
||||||
ENV PATH="/opt/venv/bin:$PATH"
|
ENV PATH="/opt/venv/bin:$PATH"
|
||||||
|
|
||||||
# Create non-root user
|
# Create non-root user + pre-create /data so volume mount inherits ownership
|
||||||
RUN useradd --create-home --shell /bin/bash appuser
|
RUN useradd --create-home --shell /bin/bash appuser && \
|
||||||
|
mkdir -p /data && chown appuser:appuser /data
|
||||||
|
|
||||||
# Copy application code
|
# Copy application code
|
||||||
COPY --chown=appuser:appuser . .
|
COPY --chown=appuser:appuser . .
|
||||||
|
|||||||
@@ -33,6 +33,7 @@ _ROUTER_MODULES = [
|
|||||||
"vvt_routes",
|
"vvt_routes",
|
||||||
"legal_document_routes",
|
"legal_document_routes",
|
||||||
"einwilligungen_routes",
|
"einwilligungen_routes",
|
||||||
|
"einwilligungen_export_routes",
|
||||||
"escalation_routes",
|
"escalation_routes",
|
||||||
"consent_template_routes",
|
"consent_template_routes",
|
||||||
"notfallplan_routes",
|
"notfallplan_routes",
|
||||||
|
|||||||
@@ -159,6 +159,13 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
|
|||||||
from .agent_doc_check_routes import CheckItem, DocCheckResult
|
from .agent_doc_check_routes import CheckItem, DocCheckResult
|
||||||
from .agent_doc_check_report import build_html_report
|
from .agent_doc_check_report import build_html_report
|
||||||
|
|
||||||
|
# Reset anchor-locator cache per run (avoid cross-run leak)
|
||||||
|
try:
|
||||||
|
from compliance.services.doc_anchor_locator import reset_cache
|
||||||
|
reset_cache()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
# Step 1: Resolve texts (fetch from URL if needed) — 0-30%
|
# Step 1: Resolve texts (fetch from URL if needed) — 0-30%
|
||||||
_update(check_id, "Texte werden geladen...", 1)
|
_update(check_id, "Texte werden geladen...", 1)
|
||||||
doc_texts: dict[str, str] = {}
|
doc_texts: dict[str, str] = {}
|
||||||
@@ -234,6 +241,20 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
|
|||||||
# Filter out doc_types that don't apply to this business profile
|
# Filter out doc_types that don't apply to this business profile
|
||||||
skip_types = _get_skip_types(profile)
|
skip_types = _get_skip_types(profile)
|
||||||
|
|
||||||
|
# Derive business_scope hints for the MC filter (O1 — Doc-type Scope-Flag).
|
||||||
|
# MCs that explicitly require a feature (e.g. 'biometric_processing',
|
||||||
|
# 'ai_decision_making', 'child_targeting') get dropped when the
|
||||||
|
# detected profile doesn't declare it.
|
||||||
|
business_scope: set[str] = set()
|
||||||
|
for svc in (getattr(profile, "detected_services", []) or []):
|
||||||
|
business_scope.add(str(svc).lower())
|
||||||
|
if (getattr(profile, "business_type", "") or "").lower() == "b2c":
|
||||||
|
business_scope.add("b2c")
|
||||||
|
if getattr(profile, "has_online_shop", False):
|
||||||
|
business_scope.add("ecommerce")
|
||||||
|
if getattr(profile, "is_regulated_profession", False):
|
||||||
|
business_scope.add("regulated_profession")
|
||||||
|
|
||||||
# Document checks: 40-80%
|
# Document checks: 40-80%
|
||||||
n_entries = max(1, len(doc_entries))
|
n_entries = max(1, len(doc_entries))
|
||||||
for i, entry in enumerate(doc_entries):
|
for i, entry in enumerate(doc_entries):
|
||||||
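The scope derivation added in the hunk above can be read as a pure function. The following is a minimal sketch, not the project's implementation: `derive_business_scope` mirrors the added lines, while `mc_in_scope` and its `scope_requires` argument are hypothetical names for how an MC filter might consume the resulting set (the real filter lives in `check_document_with_controls`, which is not shown here).

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Set


def derive_business_scope(profile: object) -> Set[str]:
    """Collect lower-cased scope hints from a detected business profile."""
    scope: Set[str] = set()
    # detected_services may be missing or None, so fall back to an empty list
    for svc in (getattr(profile, "detected_services", []) or []):
        scope.add(str(svc).lower())
    if (getattr(profile, "business_type", "") or "").lower() == "b2c":
        scope.add("b2c")
    if getattr(profile, "has_online_shop", False):
        scope.add("ecommerce")
    if getattr(profile, "is_regulated_profession", False):
        scope.add("regulated_profession")
    return scope


def mc_in_scope(scope_requires: Optional[Iterable[str]],
                business_scope: Optional[Set[str]]) -> bool:
    """Hypothetical filter: keep an MC only if every required feature is declared.

    Assumption: business_scope=None means "no filtering", matching the
    optional keyword argument added to _check_single.
    """
    if business_scope is None:
        return True
    return set(scope_requires or ()) <= business_scope


# Usage: a B2C shop profile without biometric processing
profile = SimpleNamespace(
    detected_services=["Google Analytics"],
    business_type="B2C",
    has_online_shop=True,
)
scope = derive_business_scope(profile)
print(sorted(scope))
print(mc_in_scope(["biometric_processing"], scope))
```

Under these assumptions an MC that requires `biometric_processing` is dropped for this profile, while an MC without requirements is always kept.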
@@ -268,6 +289,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
|
|||||||
result = await _check_single(
|
result = await _check_single(
|
||||||
text, doc_type, label, url,
|
text, doc_type, label, url,
|
||||||
entry["word_count"], use_agent_flag,
|
entry["word_count"], use_agent_flag,
|
||||||
|
business_scope=business_scope,
|
||||||
)
|
)
|
||||||
|
|
||||||
# Apply profile context filter
|
# Apply profile context filter
|
||||||
@@ -421,9 +443,42 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
|
|||||||
len(cmp_vendors))
|
len(cmp_vendors))
|
||||||
cmp_vendors = await validate_vendor_urls(cmp_vendors)
|
cmp_vendors = await validate_vendor_urls(cmp_vendors)
|
||||||
cmp_vendors = score_vendors(cmp_vendors)
|
cmp_vendors = score_vendors(cmp_vendors)
|
||||||
|
# Enrich each vendor with per-cookie functional roles
|
||||||
|
try:
|
||||||
|
from compliance.services.cookie_function_classifier import (
|
||||||
|
annotate_vendor_cookies,
|
||||||
|
)
|
||||||
|
cmp_vendors = [annotate_vendor_cookies(v) for v in cmp_vendors]
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("Cookie function classification skipped: %s", e)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning("VVT vendor extraction skipped: %s", e)
|
logger.warning("VVT vendor extraction skipped: %s", e)
|
||||||
|
|
||||||
|
# Vendor redundancy + EU alternatives + cost/savings (O4)
|
||||||
|
redundancy_report = None
|
||||||
|
try:
|
||||||
|
from compliance.services.vendor_redundancy import analyze as analyze_redundancy
|
||||||
|
from compliance.services.vendor_cost_estimator import infer_company_tier
|
||||||
|
if cmp_vendors:
|
||||||
|
# Derive the company tier from business_profile; it shifts the
|
||||||
|
# cost range so that, e.g. for DAX corporations, starter prices
|
||||||
|
# do not drag down the lower bound.
|
||||||
|
bp_dict = {
|
||||||
|
"type": getattr(profile, "business_type", ""),
|
||||||
|
"features": list(business_scope),
|
||||||
|
}
|
||||||
|
ctier = infer_company_tier(bp_dict)
|
||||||
|
redundancy_report = analyze_redundancy(cmp_vendors, company_tier=ctier)
|
||||||
|
logger.info(
|
||||||
|
"Redundanz: %d Kategorien mit Mehrfach-Anbietern, "
|
||||||
|
"Spar-Schaetzung %s pro Jahr (company_tier=%s)",
|
||||||
|
redundancy_report["summary"]["redundancy_count"],
|
||||||
|
redundancy_report["summary"]["estimated_saving_pct"],
|
||||||
|
ctier,
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("Vendor redundancy analysis skipped: %s", e)
|
||||||
|
|
||||||
summary_html = build_management_summary(results)
|
summary_html = build_management_summary(results)
|
||||||
scanned_html = build_scanned_urls_html(doc_entries)
|
scanned_html = build_scanned_urls_html(doc_entries)
|
||||||
providers_html = build_provider_list_html(banner_result, vvt_entries)
|
providers_html = build_provider_list_html(banner_result, vvt_entries)
|
||||||
@@ -468,11 +523,18 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
|
|||||||
if scorecard else ""
|
if scorecard else ""
|
||||||
)
|
)
|
||||||
|
|
||||||
report_html = build_html_report(results, None)
|
report_html = build_html_report(results, None, doc_texts)
|
||||||
profile_html = _build_profile_html(profile)
|
profile_html = _build_profile_html(profile)
|
||||||
|
|
||||||
|
# O4: vendor redundancy / EU alternatives + cost-savings block,
|
||||||
|
# placed between VVT and doc report so that management sees
|
||||||
|
# the savings before going into the detailed review.
|
||||||
|
from .agent_doc_check_redundancy import build_redundancy_html
|
||||||
|
redundancy_html = build_redundancy_html(redundancy_report)
|
||||||
|
|
||||||
full_html = (
|
full_html = (
|
||||||
summary_html + scanned_html + profile_html + scorecard_html
|
summary_html + scanned_html + profile_html + scorecard_html
|
||||||
+ providers_html + vvt_html + report_html
|
+ providers_html + vvt_html + redundancy_html + report_html
|
||||||
)
|
)
|
||||||
|
|
||||||
# Step 6: Send email — derive site name primarily from entered URL.
|
# Step 6: Send email — derive site name primarily from entered URL.
|
||||||
@@ -602,6 +664,7 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
|
|||||||
payload = resp.json()
|
payload = resp.json()
|
||||||
docs = payload.get("documents", [])
|
docs = payload.get("documents", [])
|
||||||
cmp_payloads = payload.get("cmp_payloads") or []
|
cmp_payloads = payload.get("cmp_payloads") or []
|
||||||
|
cmp_cookie_text = payload.get("cmp_cookie_text") or ""
|
||||||
if docs:
|
if docs:
|
||||||
texts = []
|
texts = []
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
@@ -609,6 +672,22 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
|
|||||||
if t and len(t) > 50:
|
if t and len(t) > 50:
|
||||||
texts.append(t)
|
texts.append(t)
|
||||||
merged = "\n\n".join(texts)
|
merged = "\n\n".join(texts)
|
||||||
|
# For cookie/dse/social_media: when CMP reconstruction is
|
||||||
|
# substantially richer than DOM extraction, use it. This
|
||||||
|
# fixes the BMW case where DOM yields ~600 words of
|
||||||
|
# navigation but the ePaaS payload reconstructs to ~1800
|
||||||
|
# words of actual cookie policy.
|
||||||
|
if (doc_type in short_extract_types
|
||||||
|
and cmp_cookie_text
|
||||||
|
and len(cmp_cookie_text.split()) > len(merged.split())):
|
||||||
|
logger.info(
|
||||||
|
"Preferring CMP-reconstructed text for %s on %s "
|
||||||
|
"(%d words CMP vs %d words DOM)",
|
||||||
|
doc_type, url,
|
||||||
|
len(cmp_cookie_text.split()),
|
||||||
|
len(merged.split()),
|
||||||
|
)
|
||||||
|
merged = cmp_cookie_text
|
||||||
if merged and len(merged.split()) > 100:
|
if merged and len(merged.split()) > 100:
|
||||||
if len(texts) > 1:
|
if len(texts) > 1:
|
||||||
logger.info("Merged %d docs from %s (%d words)",
|
logger.info("Merged %d docs from %s (%d words)",
|
||||||
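The selection rule in the hunk above (prefer the CMP-reconstructed text when it has more words than the DOM extraction) can be isolated as a small helper. This is a sketch under assumptions: `pick_richer_text` is a hypothetical name, and the default `eligible_types` set is taken from the code comment ("cookie/dse/social_media") rather than from the actual `short_extract_types` value, which is not visible in this diff.

```python
from typing import FrozenSet

# Assumption: taken from the code comment, not from short_extract_types itself
DEFAULT_ELIGIBLE: FrozenSet[str] = frozenset({"cookie", "dse", "social_media"})


def pick_richer_text(dom_text: str, cmp_text: str, doc_type: str,
                     eligible_types: FrozenSet[str] = DEFAULT_ELIGIBLE) -> str:
    """Return cmp_text only if doc_type is eligible and it has strictly more words."""
    if (doc_type in eligible_types
            and cmp_text
            and len(cmp_text.split()) > len(dom_text.split())):
        return cmp_text
    return dom_text


# Usage: a short DOM extract versus a longer CMP reconstruction
dom = "Home Models Contact"
cmp_text = "We use cookies for analytics and marketing purposes"
print(pick_richer_text(dom, cmp_text, "cookie"))
print(pick_richer_text(dom, cmp_text, "impressum"))
```

Word count is only a proxy for richness; a tie keeps the DOM text, which matches the strict `>` comparison in the diff.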
@@ -727,6 +806,7 @@ async def _autodiscover_missing(
|
|||||||
|
|
||||||
discovered: list[dict] = []
|
discovered: list[dict] = []
|
||||||
disc_payloads: list[dict] = []
|
disc_payloads: list[dict] = []
|
||||||
|
disc_cookie_texts: list[str] = []
|
||||||
for base in crawl_bases:
|
for base in crawl_bases:
|
||||||
try:
|
try:
|
||||||
async with httpx.AsyncClient(timeout=180.0) as client:
|
async with httpx.AsyncClient(timeout=180.0) as client:
|
||||||
@@ -742,8 +822,14 @@ async def _autodiscover_missing(
|
|||||||
body = resp.json()
|
body = resp.json()
|
||||||
discovered.extend(body.get("documents", []) or [])
|
discovered.extend(body.get("documents", []) or [])
|
||||||
disc_payloads.extend(body.get("cmp_payloads") or [])
|
disc_payloads.extend(body.get("cmp_payloads") or [])
|
||||||
logger.info("auto-discovery on %s: %d docs",
|
cmp_text = body.get("cmp_cookie_text") or ""
|
||||||
base, len(body.get("documents", []) or []))
|
if cmp_text:
|
||||||
|
disc_cookie_texts.append(cmp_text)
|
||||||
|
logger.info("auto-discovery on %s: %d docs, %d CMP payloads, "
|
||||||
|
"cmp_cookie_text=%d words", base,
|
||||||
|
len(body.get("documents", []) or []),
|
||||||
|
len(body.get("cmp_payloads") or []),
|
||||||
|
len(cmp_text.split()))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning("auto-discovery failed for %s: %s", base, e)
|
logger.warning("auto-discovery failed for %s: %s", base, e)
|
||||||
|
|
||||||
@@ -772,6 +858,19 @@ async def _autodiscover_missing(
|
|||||||
d = by_type.get(dt)
|
d = by_type.get(dt)
|
||||||
if d:
|
if d:
|
||||||
full = d.get("full_text") or d.get("text_preview") or ""
|
full = d.get("full_text") or d.get("text_preview") or ""
|
||||||
|
# For cookie: prefer the CMP-reconstructed text when it's
|
||||||
|
# substantially richer than the auto-discovered DOM extraction.
|
||||||
|
# BMW homepage CMP yields ~1800 words of authoritative policy;
|
||||||
|
# DOM extraction typically yields ~600 words of site chrome.
|
||||||
|
if dt == "cookie" and disc_cookie_texts:
|
||||||
|
cmp_merged = "\n\n".join(disc_cookie_texts)
|
||||||
|
if len(cmp_merged.split()) > len(full.split()):
|
||||||
|
logger.info(
|
||||||
|
"cookie: using CMP-reconstructed text (%d words) "
|
||||||
|
"instead of DOM (%d words)",
|
||||||
|
len(cmp_merged.split()), len(full.split()),
|
||||||
|
)
|
||||||
|
full = cmp_merged
|
||||||
if len(full.split()) >= 100:
|
if len(full.split()) >= 100:
|
||||||
new_entry["text"] = full
|
new_entry["text"] = full
|
||||||
new_entry["url"] = d.get("url", "")
|
new_entry["url"] = d.get("url", "")
|
||||||
@@ -829,6 +928,7 @@ def _classify_discovered_doc(title: str, url: str) -> str | None:
|
|||||||
async def _check_single(
|
async def _check_single(
|
||||||
text: str, doc_type: str, label: str, url: str,
|
text: str, doc_type: str, label: str, url: str,
|
||||||
word_count: int, use_agent: bool,
|
word_count: int, use_agent: bool,
|
||||||
|
business_scope: set[str] | None = None,
|
||||||
):
|
):
|
||||||
"""Run regex + MC checks on a single document."""
|
"""Run regex + MC checks on a single document."""
|
||||||
from compliance.services.doc_checks.runner import check_document_completeness
|
from compliance.services.doc_checks.runner import check_document_completeness
|
||||||
@@ -862,6 +962,7 @@ async def _check_single(
|
|||||||
# (top-10 FAILs) so cost stays bounded.
|
# (top-10 FAILs) so cost stays bounded.
|
||||||
mc_results = await check_document_with_controls(
|
mc_results = await check_document_with_controls(
|
||||||
text, doc_type, label, max_controls=0, use_agent=use_agent,
|
text, doc_type, label, max_controls=0, use_agent=use_agent,
|
||||||
|
business_scope=business_scope,
|
||||||
)
|
)
|
||||||
if mc_results:
|
if mc_results:
|
||||||
for mc in mc_results:
|
for mc in mc_results:
|
||||||
|
|||||||
@@ -374,11 +374,52 @@ def _render_vendor_row_full(v: dict) -> str:
|
|||||||
)
|
)
|
||||||
score_color = ("#16a34a" if score >= 80 else
|
score_color = ("#16a34a" if score >= 80 else
|
||||||
"#d97706" if score >= 50 else "#dc2626")
|
"#d97706" if score >= 50 else "#dc2626")
|
||||||
|
|
||||||
|
# Score explanation: what was evaluated, what is missing
|
||||||
|
# Assumption: score = passed criteria / total criteria * 100.
|
||||||
|
# Typically 5 criteria for EXT: country, cookies, opt_out, privacy, scoring.
|
||||||
|
# For INTERNAL/GROUP: opt_out + privacy are not evaluated (3 criteria).
|
||||||
|
n_criteria = 3 if is_own else 5
|
||||||
|
n_failed = min(len(flags), n_criteria) if flags else 0  # clamp: never exceed n_criteria
|
||||||
|
score_tooltip = (
|
||||||
|
f"{n_criteria - n_failed} von {n_criteria} Kriterien erfuellt"
|
||||||
|
+ (f" — fehlt: {', '.join(_flag_short(f) for f in flags[:3])}"
|
||||||
|
if flags else "")
|
||||||
|
)
|
||||||
|
|
||||||
|
# Inline action instructions per flag
|
||||||
|
actions_html = ""
|
||||||
|
if flags:
|
||||||
|
from compliance.services.finding_action_recipes import recipe_for
|
||||||
|
action_items = []
|
||||||
|
for f in flags:
|
||||||
|
rec = recipe_for(f)
|
||||||
|
if not rec:
|
||||||
|
continue
|
||||||
|
action_items.append(
|
||||||
|
f'<li style="margin-bottom:6px"><strong>{_flag_short(f)}:</strong> '
|
||||||
|
f'{rec.get("what", "")}<br/>'
|
||||||
|
f'<span style="color:#475569"><strong>Was tun:</strong> '
|
||||||
|
f'{((rec.get("fix_text") or "").splitlines() or [""])[0][:200]}</span><br/>'
|
||||||
|
f'<span style="color:#94a3b8;font-size:9px">Quelle: '
|
||||||
|
f'{rec.get("why", "")[:160]}</span></li>'
|
||||||
|
)
|
||||||
|
if action_items:
|
||||||
|
actions_html = (
|
||||||
|
f'<details style="margin-top:4px"><summary style="cursor:pointer;'
|
||||||
|
f'color:#dc2626;font-size:10px">Was muss ich tun? '
|
||||||
|
f'({len(action_items)} Action{"s" if len(action_items) != 1 else ""})</summary>'
|
||||||
|
f'<ul style="margin:4px 0 0 14px;padding:0;font-size:10px;color:#1e293b">'
|
||||||
|
+ "".join(action_items)
|
||||||
|
+ '</ul></details>'
|
||||||
|
)
|
||||||
|
|
||||||
flag_str = ""
|
flag_str = ""
|
||||||
if flags:
|
if flags:
|
||||||
flag_str = (
|
flag_str = (
|
||||||
f'<div style="font-size:10px;color:#94a3b8;margin-top:2px">'
|
f'<div style="font-size:10px;color:#94a3b8;margin-top:2px">'
|
||||||
f'{", ".join(flags[:4])}</div>'
|
f'{", ".join(flags[:4])}</div>'
|
||||||
|
f'{actions_html}'
|
||||||
)
|
)
|
||||||
return (
|
return (
|
||||||
f'<tr style="border-top:1px solid #e2e8f0">'
|
f'<tr style="border-top:1px solid #e2e8f0">'
|
||||||
@@ -391,11 +432,26 @@ def _render_vendor_row_full(v: dict) -> str:
|
|||||||
f'<td style="padding:6px 8px;text-align:center">{opt_status}</td>'
|
f'<td style="padding:6px 8px;text-align:center">{opt_status}</td>'
|
||||||
f'<td style="padding:6px 8px;text-align:center">{privacy_status}</td>'
|
f'<td style="padding:6px 8px;text-align:center">{privacy_status}</td>'
|
||||||
f'<td style="padding:6px 8px;text-align:right;font-weight:600;'
|
f'<td style="padding:6px 8px;text-align:right;font-weight:600;'
|
||||||
f'color:{score_color};font-size:11px">{score}%</td>'
|
f'color:{score_color};font-size:11px" title="{score_tooltip}">'
|
||||||
|
f'{score}%<div style="font-size:9px;font-weight:400;color:#94a3b8">'
|
||||||
|
f'{n_criteria - n_failed}/{n_criteria}</div></td>'
|
||||||
f'</tr>'
|
f'</tr>'
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _flag_short(f: str) -> str:
|
||||||
|
"""Readable German label for a flag token."""
|
||||||
|
labels = {
|
||||||
|
"no_cookies_listed": "Cookies fehlen",
|
||||||
|
"no_country": "Sitzland fehlt",
|
||||||
|
"no_privacy_url": "Privacy-Link fehlt",
|
||||||
|
"broken_privacy_url": "Privacy-Link broken",
|
||||||
|
"no_opt_out_url": "Opt-Out fehlt",
|
||||||
|
"broken_opt_out": "Opt-Out broken",
|
||||||
|
}
|
||||||
|
return labels.get(f, f)
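The score tooltip and the flag labels can be checked in isolation with a minimal sketch like the one below. `score_summary` and `flag_short` are hypothetical standalone names, not part of the diff; the labels are copied from `_flag_short`, and the clamp on the failed count is an added assumption so the "passed" number cannot go negative when more flags arrive than criteria exist.

```python
# Hypothetical standalone sketch; labels copied from _flag_short in the diff.
FLAG_LABELS = {
    "no_cookies_listed": "Cookies fehlen",
    "no_country": "Sitzland fehlt",
    "no_privacy_url": "Privacy-Link fehlt",
    "broken_privacy_url": "Privacy-Link broken",
    "no_opt_out_url": "Opt-Out fehlt",
    "broken_opt_out": "Opt-Out broken",
}


def flag_short(flag: str) -> str:
    """Return the readable label, falling back to the raw token."""
    return FLAG_LABELS.get(flag, flag)


def score_summary(flags: list[str], is_own: bool) -> tuple[int, int, list[str]]:
    """Return (passed, total, labels of the first three missing criteria)."""
    # 3 criteria for INTERNAL/GROUP vendors, 5 for external ones (as in the diff).
    n_criteria = 3 if is_own else 5
    # Assumption: clamp so the passed count never drops below zero.
    n_failed = min(len(flags), n_criteria)
    missing = [flag_short(f) for f in flags[:3]]
    return n_criteria - n_failed, n_criteria, missing
```

An external vendor with two flags would be reported as 3 of 5 criteria met, with both labels listed as missing.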
|
||||||
|
|
||||||
|
|
||||||
def _link_status_badge(
|
def _link_status_badge(
|
||||||
url: str | None,
|
url: str | None,
|
||||||
ok: bool | None,
|
ok: bool | None,
|
||||||
|
|||||||
@@ -0,0 +1,141 @@
|
|||||||
|
"""
|
||||||
|
Email renderer for the vendor-redundancy, EU-alternatives, and cost/savings block.
|
||||||
|
|
||||||
|
Inserted into the email body below the VVT.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
|
||||||
|
def _fmt_eur(low: int, high: int) -> str:
|
||||||
|
if not low and not high:
|
||||||
|
return "im Listpreis bundled"
|
||||||
|
if low == high:
|
||||||
|
return f"~{low:,} €".replace(",", ".")
|
||||||
|
return f"{low:,}–{high:,} €".replace(",", ".")
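For reviewers who want to see what `_fmt_eur` produces, here is a standalone copy with a few sample calls. The name `fmt_eur` and the sample values are illustrative only. The German thousands separator comes from formatting with `,` and then swapping every comma for a dot, which is safe for integers only.

```python
def fmt_eur(low: int, high: int) -> str:
    """Format a yearly EUR range with German thousands separators."""
    if not low and not high:
        # Both bounds zero or missing: no separate licence cost is shown.
        return "im Listpreis bundled"
    if low == high:
        return f"~{low:,} €".replace(",", ".")
    return f"{low:,}–{high:,} €".replace(",", ".")


print(fmt_eur(0, 0))
print(fmt_eur(12000, 12000))
print(fmt_eur(12000, 48000))
```

A float such as `1234.5` would format as `1,234.5` and then become `1.234.5`, so callers should pass whole euros.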
|
||||||
|
|
||||||
|
|
||||||
|
def build_redundancy_html(report: dict | None) -> str:
|
||||||
|
if not report:
|
||||||
|
return ""
|
||||||
|
s = report.get("summary") or {}
|
||||||
|
redundancies = report.get("redundancies") or []
|
||||||
|
eu_alts = report.get("eu_alternatives") or []
|
||||||
|
multi = report.get("multi_function_tools") or []
|
||||||
|
|
||||||
|
cur = s.get("estimated_current_year_eur") or [0, 0]
|
||||||
|
sav = s.get("estimated_saving_year_eur") or [0, 0]
|
||||||
|
pct = s.get("estimated_saving_pct") or "n/a"
|
||||||
|
|
||||||
|
parts = [
|
||||||
|
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
|
||||||
|
'max-width:700px;margin:0 auto 16px;padding:14px 18px;'
|
||||||
|
'background:#fef3c7;border:1px solid #fcd34d;border-radius:8px">',
|
||||||
|
'<h3 style="margin:0 0 6px;font-size:14px;color:#92400e">'
|
||||||
|
'Optimierungspotenzial: Redundanzen + EU-Alternativen</h3>',
|
||||||
|
f'<p style="margin:0 0 10px;font-size:11px;color:#78350f">'
|
||||||
|
f'<strong>{s.get("redundancy_count", 0)}</strong> Kategorien mit '
|
||||||
|
f'mehreren Anbietern · <strong>{s.get("consolidation_potential", 0)}</strong> '
|
||||||
|
f'Anbieter konsolidierbar · '
|
||||||
|
f'<strong>{s.get("eu_alternative_count", 0)}</strong> EU-Alternativen verfuegbar</p>',
|
||||||
|
|
||||||
|
'<div style="background:#fff;border:1px solid #fcd34d;border-radius:6px;'
|
||||||
|
'padding:10px 12px;margin-bottom:10px">',
|
||||||
|
|
||||||
|
'<div style="font-size:10px;color:#94a3b8;margin-bottom:6px;text-transform:uppercase;letter-spacing:0.5px">'
|
||||||
|
'Diese Schaetzung umfasst NUR die als redundant erkannten Tools — '
|
||||||
|
'nicht den Gesamt-Stack der Website</div>',
|
||||||
|
|
||||||
|
f'<div style="font-size:11px;color:#78350f">'
|
||||||
|
f'Listpreis-Schaetzung der <strong>redundanten</strong> Tools '
|
||||||
|
f'(Mehrfach-Anbieter in derselben Funktions-Kategorie):'
|
||||||
|
f' <strong>{_fmt_eur(*cur)}/Jahr</strong></div>',
|
||||||
|
|
||||||
|
f'<div style="font-size:11px;color:#16a34a;margin-top:4px">'
|
||||||
|
f'Sparpotenzial durch Konsolidierung auf je 1 EU-Tool pro Kategorie:'
|
||||||
|
f' <strong>{_fmt_eur(*sav)}/Jahr</strong> ({pct})</div>',
|
||||||
|
|
||||||
|
'<div style="font-size:10px;color:#94a3b8;margin-top:8px;font-style:italic">'
|
||||||
|
'<strong>Wichtige Einschraenkungen:</strong><br/>'
|
||||||
|
'• Konzern-Konditionen liegen ueblicherweise 30–50% unter Listpreis — '
|
||||||
|
'realistisches Saving entsprechend €X·0,5 bis €X·0,7.<br/>'
|
||||||
|
'• Eintraege "<em>Eigene Marke — Tool</em>" (z.B. "BMW AG — Adobe Analytics") '
|
||||||
|
'gehoeren oft zu einem einzigen Master-Vertrag, nicht zu mehreren Lizenzen.<br/>'
|
||||||
|
'• Media-Spend (Google Ads, Meta Ads) ist NICHT enthalten — nur Tooling-Lizenzen.<br/>'
|
||||||
|
'• Quelle: Gartner/Forrester 2025 + oeffentliche Listpreise.'
|
||||||
|
'</div></div>',
|
||||||
|
]
|
||||||
|
|
||||||
|
if redundancies:
|
||||||
|
parts.append(
|
||||||
|
'<table style="width:100%;border-collapse:collapse;font-size:11px;'
|
||||||
|
'margin-bottom:10px">'
|
||||||
|
'<thead><tr style="background:#fde68a;color:#78350f;text-align:left">'
|
||||||
|
'<th style="padding:6px 8px">Kategorie</th>'
|
||||||
|
'<th style="padding:6px 8px">#</th>'
|
||||||
|
'<th style="padding:6px 8px">Anbieter</th>'
|
||||||
|
'<th style="padding:6px 8px">EU-Empfehlung</th>'
|
||||||
|
'<th style="padding:6px 8px;text-align:right">Saving / Jahr</th>'
|
||||||
|
'</tr></thead><tbody>'
|
||||||
|
)
|
||||||
|
for r in redundancies[:12]:
|
||||||
|
vendors_str = ", ".join(r.get("vendors", [])[:6])
|
||||||
|
if len(r.get("vendors", [])) > 6:
|
||||||
|
vendors_str += f" (+{len(r['vendors']) - 6} weitere)"
|
||||||
|
sav_r = r.get("estimated_saving_year_eur") or [0, 0]
|
||||||
|
parts.append(
|
||||||
|
f'<tr style="border-top:1px solid #fde68a;vertical-align:top">'
|
||||||
|
f'<td style="padding:5px 8px;color:#78350f;font-weight:600">{r["category_label"]}</td>'
|
||||||
|
f'<td style="padding:5px 8px;text-align:center">{r["count"]}</td>'
|
||||||
|
f'<td style="padding:5px 8px;color:#1e293b;font-size:10px">{vendors_str}</td>'
|
||||||
|
f'<td style="padding:5px 8px;color:#16a34a;font-size:10px">{r.get("suggested_eu_tool") or "–"}</td>'
|
||||||
|
f'<td style="padding:5px 8px;text-align:right;color:#16a34a;font-weight:600">'
|
||||||
|
f'{_fmt_eur(*sav_r)}</td></tr>'
|
||||||
|
)
|
||||||
|
hint = r.get("consolidation_hint")
|
||||||
|
if hint:
|
||||||
|
parts.append(
|
||||||
|
f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px;font-style:italic">'
|
||||||
|
f'Hinweis: {hint}</td></tr>'
|
||||||
|
)
|
||||||
|
caveats = r.get("caveats") or []
|
||||||
|
if caveats:
|
||||||
|
parts.append(
|
||||||
|
f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px">'
|
||||||
|
f'<strong>Moegliche Gruende fuer Mehrfach-Einsatz:</strong> '
|
||||||
|
+ "; ".join(caveats) + '</td></tr>'
|
||||||
|
)
|
||||||
|
parts.append('</tbody></table>')
|
||||||
|
|
||||||
|
if multi:
|
||||||
|
parts.append(
|
||||||
|
'<div style="margin-top:8px"><strong style="font-size:11px;color:#78350f">'
|
||||||
|
'Multi-Funktions-Tools (1 Tool ersetzt mehrere Kategorien):</strong>'
|
||||||
|
'<ul style="margin:6px 0 0 18px;padding:0;font-size:11px;color:#78350f">'
|
||||||
|
)
|
||||||
|
for t in multi[:4]:
|
||||||
|
cats = ", ".join(t.get("replaces_categories", []))
|
||||||
|
parts.append(
|
||||||
|
f'<li style="margin-bottom:3px"><strong>{t["name"]}</strong>'
|
||||||
|
f' ({t["country"]}) — ersetzt <em>{cats}</em>'
|
||||||
|
f' ({t.get("potential_replacements", 0)} Anbieter heute)</li>'
|
||||||
|
)
|
||||||
|
parts.append('</ul></div>')
|
||||||
|
|
||||||
|
if eu_alts:
|
||||||
|
parts.append(
|
||||||
|
'<details style="margin-top:8px"><summary style="font-size:11px;color:#78350f;'
|
||||||
|
'cursor:pointer">EU-Alternativen pro Anbieter (Details)</summary>'
|
||||||
|
'<ul style="margin:6px 0 0 18px;padding:0;font-size:10px;color:#475569">'
|
||||||
|
)
|
||||||
|
for e in eu_alts[:20]:
|
||||||
|
first_alt = (e.get("alternatives") or [{}])[0]
|
||||||
|
parts.append(
|
||||||
|
f'<li style="margin-bottom:3px"><strong>{e["current_vendor"]}</strong>'
|
||||||
|
f' → {first_alt.get("name", "")} ({first_alt.get("country", "")})'
|
||||||
|
f' <span style="color:#94a3b8">— {first_alt.get("notes", "")}</span></li>'
|
||||||
|
)
|
||||||
|
parts.append('</ul></details>')
|
||||||
|
|
||||||
|
parts.append('</div>')
|
||||||
|
return "".join(parts)
|
||||||
@@ -7,8 +7,12 @@ including L1/L2 check hierarchy, progress bars, and actionable hints.
|
|||||||
|
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import re
|
||||||
from typing import TYPE_CHECKING
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
if TYPE_CHECKING:
|
||||||
from .agent_doc_check_routes import CheckItem, DocCheckResult
|
from .agent_doc_check_routes import CheckItem, DocCheckResult
|
||||||
|
|
||||||
@@ -32,12 +36,93 @@ def _icon(passed: bool, skipped: bool = False) -> str:
|
|||||||
return '<span style="color:#ef4444;font-weight:bold">✗</span>'
|
return '<span style="color:#ef4444;font-weight:bold">✗</span>'
|
||||||
|
|
||||||
|
|
||||||
def _hint_box(hint: str) -> str:
|
def _first_sentence(text: str, max_chars: int = 300) -> str:
|
||||||
return (
|
"""Return the first complete sentence rather than the first line; robust against
|
||||||
|
multi-line fix texts that start with bullet lists."""
|
||||||
|
if not text:
|
||||||
|
return ""
|
||||||
|
# Look for a sentence terminator within the first max_chars characters
|
||||||
|
snippet = text[:max_chars]
|
||||||
|
m = re.search(r"^(.+?[\.\?\!])(?:\s|$)", snippet, re.DOTALL)
|
||||||
|
if m:
|
||||||
|
first = m.group(1).strip()
|
||||||
|
# If the "sentence" is only a variant header such as "Variante A:", keep
|
||||||
|
# going; the real content only starts after it
|
||||||
|
if re.fullmatch(r"(Variante [A-Z]\s*\([^\)]+\):?|Beispiel\s*\d*:?)",
|
||||||
|
first, re.IGNORECASE):
|
||||||
|
rest = text[m.end():].lstrip()
|
||||||
|
return _first_sentence(rest, max_chars)
|
||||||
|
return first
|
||||||
|
# No sentence terminator: fall back to the first line, truncated to max_chars
|
||||||
|
line = (text.splitlines() or [""])[0]
|
||||||
|
return line[:max_chars] + ("…" if len(line) > max_chars else "")
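One observation on `_first_sentence`: as far as I can tell, the header check can never fire, because a matched sentence always ends in `.`, `?` or `!`, while the header pattern must end in `)`, `:`, a digit, or the word itself. The sketch below is one possible alternative, not the author's method. It strips leading headers before looking for the sentence end; `first_sentence`, `_HEADER` and `_SENTENCE` are hypothetical names, and requiring a colon after the header is an assumption.

```python
import re

# Assumption: a header is "Variante X", optionally "(...)", or "Beispiel N",
# always followed by a colon.
_HEADER = re.compile(
    r"\s*(?:Variante [A-Z]\s*(?:\([^)]*\))?|Beispiel\s*\d*)\s*:\s*",
    re.IGNORECASE,
)
_SENTENCE = re.compile(r"(.+?[.?!])(?:\s|$)", re.DOTALL)


def first_sentence(text: str, max_chars: int = 300) -> str:
    """Return the first sentence after any leading variant/example headers."""
    if not text:
        return ""
    # Drop every leading header so the real content comes first.
    while True:
        header = _HEADER.match(text)
        if not header:
            break
        text = text[header.end():]
    match = _SENTENCE.match(text[:max_chars])
    if match:
        return match.group(1).strip()
    # No terminator found: fall back to the first line, truncated.
    line = (text.splitlines() or [""])[0]
    return line[:max_chars] + ("…" if len(line) > max_chars else "")
```

Abbreviations such as "z.B." still end the sentence early in this sketch, as they do in the original.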
|
||||||
|
|
||||||
|
|
||||||
|
def _hint_box(hint: str, check_label: str = "", doc_text: str = "",
|
||||||
|
doc_id: str | None = None) -> str:
|
||||||
|
"""Hint block enriched with a recipe and a document anchor where possible."""
|
||||||
|
base = (
|
||||||
f'<div style="font-size:11px;color:#dc2626;margin:2px 0 4px 20px;'
|
f'<div style="font-size:11px;color:#dc2626;margin:2px 0 4px 20px;'
|
||||||
f'padding:4px 8px;background:#fef2f2;border-radius:4px;'
|
f'padding:4px 8px;background:#fef2f2;border-radius:4px;'
|
||||||
f'border-left:3px solid #fca5a5">{hint}</div>'
|
f'border-left:3px solid #fca5a5">{hint}'
|
||||||
)
|
)
|
||||||
|
# Add recipe and anchor when check_label is known
|
||||||
|
if check_label:
|
||||||
|
try:
|
||||||
|
from compliance.services.finding_action_recipes import recipe_for
|
||||||
|
from compliance.services.doc_anchor_locator import locate_anchor
|
||||||
|
rec = recipe_for(check_label)
|
||||||
|
if rec and rec.get("fix_text"):
|
||||||
|
first_sentence = _first_sentence(rec["fix_text"], 300)
|
||||||
|
full = rec["fix_text"]
|
||||||
|
# Use a simple inline block layout instead of <details>;
|
||||||
|
# it is more robust when the mail is rendered as plain text
|
||||||
|
more = ""
|
||||||
|
if len(full) > len(first_sentence) + 10:
|
||||||
|
more = (
|
||||||
|
f'<div style="margin-top:4px;padding:6px 8px;background:#fff;'
|
||||||
|
f'border:1px solid #fcd5d5;border-radius:4px;font-size:10px;'
|
||||||
|
f'white-space:pre-wrap;color:#1e293b">'
|
||||||
|
f'<strong style="display:block;margin-bottom:3px;color:#475569">'
|
||||||
|
f'Vollstaendiger Textbaustein zum Einfuegen:</strong>'
|
||||||
|
f'{full}</div>'
|
||||||
|
)
|
||||||
|
base += (
|
||||||
|
f'<div style="margin-top:6px;padding-top:6px;border-top:1px solid #fecaca">'
|
||||||
|
f'<strong style="color:#7c3aed;font-size:10px">Konkrete Massnahme:</strong> '
|
||||||
|
f'<span style="color:#1e293b">{first_sentence}</span>'
|
||||||
|
f'{more}'
|
||||||
|
)
|
||||||
|
# Anchor via embedding locator (cached by doc_id)
|
||||||
|
if doc_text:
|
||||||
|
anchor = locate_anchor(check_label, doc_text, doc_id)
|
||||||
|
if anchor and anchor.get("anchor_phrase") and anchor.get("confidence") != "low":
|
||||||
|
conf_label = anchor.get("confidence", "")
|
||||||
|
conf_badge = (
|
||||||
|
f' <span style="color:#94a3b8;font-size:9px">'
|
||||||
|
f'(Match-Konfidenz {conf_label}, '
|
||||||
|
f'Score {anchor.get("score", "—")})</span>'
|
||||||
|
)
|
||||||
|
base += (
|
||||||
|
f'<div style="margin-top:4px;color:#475569;font-size:10px">'
|
||||||
|
f'<strong>Einfuegen:</strong> {anchor["position_hint"]}'
|
||||||
|
f'{conf_badge}</div>'
|
||||||
|
)
|
||||||
|
elif rec.get("where"):
|
||||||
|
# No good anchor match: show the generic fallback
|
||||||
|
base += (
|
||||||
|
f'<div style="margin-top:4px;color:#475569;font-size:10px">'
|
||||||
|
f'<strong>Einfuegen:</strong> {rec["where"]} '
|
||||||
|
f'<span style="color:#94a3b8;font-size:9px">'
|
||||||
|
f'(kein eindeutiger Absatz im Dokument gefunden — '
|
||||||
|
f'Anweisung allgemein)</span></div>'
|
||||||
|
)
|
||||||
|
base += '</div>'
|
||||||
|
except Exception as e:
|
||||||
|
logger.debug("Hint-box enrichment failed: %s", e)
|
||||||
|
# Recipes are optional; the hint box must never crash
|
||||||
|
base += '</div>'
|
||||||
|
return base
|
||||||
|
|
||||||
|
|
||||||
def build_management_summary(results: list[DocCheckResult]) -> str:
|
def build_management_summary(results: list[DocCheckResult]) -> str:
|
||||||
@@ -158,8 +243,14 @@ def _check_to_action(doc_label: str, check_label: str, hint: str) -> str:
|
|||||||
def build_html_report(
|
def build_html_report(
|
||||||
results: list[DocCheckResult],
|
results: list[DocCheckResult],
|
||||||
cookie_result: dict | None,
|
cookie_result: dict | None,
|
||||||
|
doc_texts: dict[str, str] | None = None,
|
||||||
) -> str:
|
) -> str:
|
||||||
"""Build HTML email report styled like the frontend."""
|
"""Build HTML email report styled like the frontend.
|
||||||
|
|
||||||
|
`doc_texts` is the doc_type→text dict so hint boxes can locate the
|
||||||
|
relevant paragraph in the original document for the insertion recommendation.
|
||||||
|
"""
|
||||||
|
doc_texts = doc_texts or {}
|
||||||
ok_count = sum(1 for r in results if r.completeness_pct == 100)
|
ok_count = sum(1 for r in results if r.completeness_pct == 100)
|
||||||
html = [
|
html = [
|
||||||
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
|
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
|
||||||
@@ -170,7 +261,7 @@ def build_html_report(
|
|||||||
]
|
]
|
||||||
|
|
||||||
for r in results:
|
for r in results:
|
||||||
_render_document(html, r)
|
_render_document(html, r, doc_texts.get(r.doc_type, ""))
|
||||||
|
|
||||||
if cookie_result:
|
if cookie_result:
|
||||||
_render_cookie_banner(html, cookie_result)
|
_render_cookie_banner(html, cookie_result)
|
||||||
@@ -179,7 +270,7 @@ def build_html_report(
|
|||||||
return "\n".join(html)
|
return "\n".join(html)
|
||||||
|
|
||||||
|
|
||||||
def _render_document(html: list[str], r: DocCheckResult) -> None:
|
def _render_document(html: list[str], r: DocCheckResult, doc_text: str = "") -> None:
|
||||||
pct = r.completeness_pct
|
pct = r.completeness_pct
|
||||||
cpct = r.correctness_pct
|
cpct = r.correctness_pct
|
||||||
bar_color = "green" if pct >= 80 else "yellow" if pct >= 50 else "red"
|
bar_color = "green" if pct >= 80 else "yellow" if pct >= 50 else "red"
|
||||||
@@ -244,7 +335,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
|
|||||||
else:
|
else:
|
||||||
html.append('<div style="padding:8px 16px 12px">')
|
html.append('<div style="padding:8px 16px 12px">')
|
||||||
for c in l1_checks:
|
for c in l1_checks:
|
||||||
_render_l1_check(html, c, l2_by_parent.get(c.id, []))
|
_render_l1_check(html, c, l2_by_parent.get(c.id, []), doc_text)
|
||||||
|
|
||||||
# Master-Control aggregation: with 1874 MCs evaluated per run,
|
# Master-Control aggregation: with 1874 MCs evaluated per run,
|
||||||
# rendering every L2 check inline produces ~600 rows per doc and
|
# rendering every L2 check inline produces ~600 rows per doc and
|
||||||
@@ -289,6 +380,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
|
|||||||
|
|
||||||
def _render_l1_check(
|
def _render_l1_check(
|
||||||
html: list[str], c: CheckItem, children: list[CheckItem],
|
html: list[str], c: CheckItem, children: list[CheckItem],
|
||||||
|
doc_text: str = "",
|
||||||
) -> None:
|
) -> None:
|
||||||
l2_sub = [ch for ch in children if not ch.skipped]
|
l2_sub = [ch for ch in children if not ch.skipped]
|
||||||
l2_passed = sum(1 for ch in l2_sub if ch.passed)
|
l2_passed = sum(1 for ch in l2_sub if ch.passed)
|
||||||
@@ -301,16 +393,16 @@ def _render_l1_check(
|
|||||||
if l2_sub:
|
if l2_sub:
|
||||||
html.append(f' <span style="color:#9ca3af;font-size:11px">({l2_passed}/{len(l2_sub)})</span>')
|
html.append(f' <span style="color:#9ca3af;font-size:11px">({l2_passed}/{len(l2_sub)})</span>')
|
||||||
if not c.passed and c.hint:
|
if not c.passed and c.hint:
|
||||||
html.append(_hint_box(c.hint))
|
html.append(_hint_box(c.hint, c.label, doc_text))
|
||||||
html.append('</div>')
|
html.append('</div>')
|
||||||
|
|
||||||
for ch in children:
|
for ch in children:
|
||||||
if ch.skipped:
|
if ch.skipped:
|
||||||
continue
|
continue
|
||||||
_render_l2_check(html, ch)
|
_render_l2_check(html, ch, doc_text)
|
||||||
|
|
||||||
|
|
||||||
def _render_l2_check(html: list[str], ch: CheckItem) -> None:
|
def _render_l2_check(html: list[str], ch: CheckItem, doc_text: str = "") -> None:
|
||||||
style = "color:#dc2626;font-weight:500" if not ch.passed else "color:#6b7280"
|
style = "color:#dc2626;font-weight:500" if not ch.passed else "color:#6b7280"
|
||||||
html.append(
|
html.append(
|
||||||
f'<div style="padding:2px 0 2px 24px;border-left:2px solid #e5e7eb;margin-left:8px">'
|
f'<div style="padding:2px 0 2px 24px;border-left:2px solid #e5e7eb;margin-left:8px">'
|
||||||
@@ -324,7 +416,7 @@ def _render_l2_check(html: list[str], ch: CheckItem) -> None:
|
|||||||
f'white-space:nowrap">"...{ch.matched_text[:80]}..."</div>'
|
f'white-space:nowrap">"...{ch.matched_text[:80]}..."</div>'
|
||||||
)
|
)
|
||||||
if not ch.passed and ch.hint:
|
if not ch.passed and ch.hint:
|
||||||
html.append(_hint_box(ch.hint))
|
html.append(_hint_box(ch.hint, ch.label, doc_text))
|
||||||
html.append('</div>')
|
html.append('</div>')
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1808,6 +1808,32 @@ async def list_categories():
|
|||||||
# SIMILAR CONTROLS (Embedding-based dedup)
|
# SIMILAR CONTROLS (Embedding-based dedup)
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
|
|
||||||
|
_EMBEDDING_COL_AVAILABLE: bool | None = None
|
||||||
|
|
||||||
|
|
||||||
|
def _has_embedding_col() -> bool:
|
||||||
|
"""Cache whether canonical_controls has the embedding column.
|
||||||
|
|
||||||
|
Returns False on systems where pgvector + embedding backfill weren't
|
||||||
|
set up. Saves the per-request 500 + log spam.
|
||||||
|
"""
|
||||||
|
global _EMBEDDING_COL_AVAILABLE
|
||||||
|
if _EMBEDDING_COL_AVAILABLE is not None:
|
||||||
|
return _EMBEDDING_COL_AVAILABLE
|
||||||
|
try:
|
||||||
|
with SessionLocal() as db:
|
||||||
|
r = db.execute(text(
|
||||||
|
"SELECT 1 FROM information_schema.columns "
|
||||||
|
"WHERE table_schema='compliance' "
|
||||||
|
"AND table_name='canonical_controls' "
|
||||||
|
"AND column_name='embedding'"
|
||||||
|
)).fetchone()
|
||||||
|
_EMBEDDING_COL_AVAILABLE = bool(r)
|
||||||
|
except Exception:
|
||||||
|
_EMBEDDING_COL_AVAILABLE = False
|
||||||
|
return _EMBEDDING_COL_AVAILABLE
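The tri-state cache used by `_has_embedding_col` (`None` = not checked yet, then `True` or `False`) can be sketched without a database. The version below wraps the same idea in a small class; `CachedProbe` is a hypothetical name, and the probe callable stands in for the `information_schema` query.

```python
from typing import Callable, Optional


class CachedProbe:
    """Run an expensive boolean check once and remember the answer."""

    def __init__(self, probe: Callable[[], bool]) -> None:
        self._probe = probe
        self._result: Optional[bool] = None  # None means "not checked yet"

    def __call__(self) -> bool:
        if self._result is None:
            try:
                self._result = bool(self._probe())
            except Exception:
                # Same choice as the diff: any failure counts as "not available".
                self._result = False
        return self._result
```

One consequence of this design is that a transient failure during the first call is cached as `False` until the process restarts. Whether that trade-off is acceptable depends on how often the database is unreachable at startup.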
|
||||||
|
|
||||||
|
|
||||||
@router.get("/controls/{control_id}/similar")
|
@router.get("/controls/{control_id}/similar")
|
||||||
async def find_similar_controls(
|
async def find_similar_controls(
|
||||||
control_id: str,
|
control_id: str,
|
||||||
@@ -1815,6 +1841,8 @@ async def find_similar_controls(
|
|||||||
limit: int = Query(20, ge=1, le=100),
|
limit: int = Query(20, ge=1, le=100),
|
||||||
):
|
):
|
||||||
"""Find controls similar to the given one using embedding cosine similarity."""
|
"""Find controls similar to the given one using embedding cosine similarity."""
|
||||||
|
if not _has_embedding_col():
|
||||||
|
return []
|
||||||
with SessionLocal() as db:
|
with SessionLocal() as db:
|
||||||
# Get the target control's embedding
|
# Get the target control's embedding
|
||||||
target = db.execute(
|
target = db.execute(
|
||||||
@@ -1856,7 +1884,7 @@ async def find_similar_controls(
|
|||||||
"title": r.title,
|
"title": r.title,
|
||||||
"severity": r.severity,
|
"severity": r.severity,
|
||||||
"release_state": r.release_state,
|
"release_state": r.release_state,
|
||||||
"tags": r.tags or [],
|
"tags": _jsonish(r.tags) or [],
|
||||||
"license_rule": r.license_rule,
|
"license_rule": r.license_rule,
|
||||||
"verification_method": r.verification_method,
|
"verification_method": r.verification_method,
|
||||||
"category": r.category,
|
"category": r.category,
|
||||||
@@ -1866,6 +1894,10 @@ async def find_similar_controls(
|
|||||||
]
|
]
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning("Embedding similarity query failed (no embedding column?): %s", e)
|
logger.warning("Embedding similarity query failed (no embedding column?): %s", e)
|
||||||
|
try:
|
||||||
|
db.rollback()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
return []
|
return []
|
||||||
|
|
||||||
|
|
||||||
@@ -1946,6 +1978,22 @@ async def get_v1_matches_endpoint(control_id: str):
|
|||||||
# INTERNAL HELPERS
|
# INTERNAL HELPERS
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
|
|
||||||
|
def _jsonish(v):
|
||||||
|
"""Parse v as JSON if it's a string that looks like JSON, otherwise return as-is.
|
||||||
|
|
||||||
|
Some canonical_controls rows were inserted with jsonb columns containing
|
||||||
|
raw JSON strings (e.g. '["a","b"]' as a TEXT). The frontend expects real
|
||||||
|
arrays — coerce here so .map() works.
|
||||||
|
"""
|
||||||
|
if isinstance(v, str) and v and v[0] in "[{":
|
||||||
|
try:
|
||||||
|
import json as _j
|
||||||
|
return _j.loads(v)
|
||||||
|
except Exception:
|
||||||
|
return v
|
||||||
|
return v
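A standalone copy of the coercion helper shows which inputs are parsed and which pass through unchanged. `jsonish` is an illustrative name for the same logic, and the sample values are made up.

```python
import json
from typing import Any


def jsonish(value: Any) -> Any:
    """Parse strings that look like a JSON array/object; return anything else as-is."""
    if isinstance(value, str) and value and value[0] in "[{":
        try:
            return json.loads(value)
        except Exception:
            # Looks like JSON but is not valid: keep the raw string.
            return value
    return value


print(jsonish('["a","b"]'))   # parsed into a list
print(jsonish('{broken'))     # invalid JSON stays a string
print(jsonish(None) or [])    # the "or []" fallback used for list fields
```

Two edge cases follow from the first-character check: a string with leading whitespace is not parsed, and an unparseable string stays truthy, so `jsonish(x) or []` would still return the string rather than a list.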
|
||||||
|
|
||||||
|
|
||||||
def _control_row(r) -> dict:
|
def _control_row(r) -> dict:
|
||||||
return {
|
return {
|
||||||
"id": str(r.id),
|
"id": str(r.id),
|
||||||
@@ -1954,17 +2002,17 @@ def _control_row(r) -> dict:
|
|||||||
"title": r.title,
|
"title": r.title,
|
||||||
"objective": r.objective,
|
"objective": r.objective,
|
||||||
"rationale": r.rationale,
|
"rationale": r.rationale,
|
||||||
"scope": r.scope,
|
"scope": _jsonish(r.scope),
|
||||||
"requirements": r.requirements,
|
"requirements": _jsonish(r.requirements),
|
||||||
"test_procedure": r.test_procedure,
|
"test_procedure": _jsonish(r.test_procedure) or [],
|
||||||
"evidence": r.evidence,
|
"evidence": _jsonish(r.evidence) or [],
|
||||||
"severity": r.severity,
|
"severity": r.severity,
|
||||||
"risk_score": float(r.risk_score) if r.risk_score is not None else None,
|
"risk_score": float(r.risk_score) if r.risk_score is not None else None,
|
||||||
"implementation_effort": r.implementation_effort,
|
"implementation_effort": r.implementation_effort,
|
||||||
"evidence_confidence": float(r.evidence_confidence) if r.evidence_confidence is not None else None,
|
"evidence_confidence": float(r.evidence_confidence) if r.evidence_confidence is not None else None,
|
||||||
"open_anchors": r.open_anchors,
|
"open_anchors": _jsonish(r.open_anchors) or [],
|
||||||
"release_state": r.release_state,
|
"release_state": r.release_state,
|
||||||
"tags": r.tags or [],
|
"tags": _jsonish(r.tags) or [],
|
||||||
"license_rule": r.license_rule,
|
"license_rule": r.license_rule,
|
||||||
"source_original_text": r.source_original_text,
|
"source_original_text": r.source_original_text,
|
||||||
"source_citation": r.source_citation,
|
"source_citation": r.source_citation,
|
||||||
|
|||||||
@@ -0,0 +1,181 @@
|
|||||||
|
"""
|
||||||
|
Consent log export (Borlabs parity + DPO audit requirement).
|
||||||
|
|
||||||
|
Auditors routinely request an extract of all granted and
|
||||||
|
revoked consents per tenant; until now the DPO had to write the SQL
|
||||||
|
by hand. These endpoints provide CSV and JSON downloads directly from
|
||||||
|
the browser.
|
||||||
|
|
||||||
|
Endpoints:
|
||||||
|
GET /einwilligungen/export/consents.csv
|
||||||
|
GET /einwilligungen/export/consents.json
|
||||||
|
GET /einwilligungen/export/history.csv (change history)
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import csv
|
||||||
|
import io
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
|
||||||
|
from fastapi import APIRouter, Depends, Header, Query
|
||||||
|
from fastapi.responses import Response
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from classroom_engine.database import get_db
|
||||||
|
from ..db.einwilligungen_models import (
|
||||||
|
EinwilligungenConsentDB,
|
||||||
|
EinwilligungenConsentHistoryDB,
|
||||||
|
)
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
router = APIRouter(prefix="/einwilligungen/export", tags=["einwilligungen-export"])
|
||||||
|
|
||||||
|
|
||||||
|
def _get_tenant(x_tenant_id: str | None = Header(None, alias="X-Tenant-ID")) -> str:
|
||||||
|
if not x_tenant_id:
|
||||||
|
from .tenant_utils import get_tenant_id
|
||||||
|
return get_tenant_id()
|
||||||
|
return x_tenant_id
|
||||||
|
|
||||||
|
|
||||||
|
def _ts() -> str:
|
||||||
|
return datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
|
||||||
|
|
||||||
|
|
||||||
|
def _consent_rows(consents: list[EinwilligungenConsentDB]) -> list[dict]:
|
||||||
|
return [
|
||||||
|
{
|
||||||
|
"consent_id": str(c.id),
|
||||||
|
"user_id": c.user_id or "",
|
||||||
|
"data_point_id": c.data_point_id or "",
|
||||||
|
"granted": "yes" if c.granted else "no",
|
||||||
|
"purpose": c.purpose or "",
|
||||||
|
"consent_version": c.consent_version or "",
|
||||||
|
"ip_address": c.ip_address or "",
|
||||||
|
"user_agent": (c.user_agent or "")[:200],
|
||||||
|
"source": c.source or "",
|
||||||
|
"created_at": c.created_at.isoformat() if c.created_at else "",
|
||||||
|
"updated_at": c.updated_at.isoformat() if c.updated_at else "",
|
||||||
|
"revoked_at": c.revoked_at.isoformat() if getattr(c, "revoked_at", None) else "",
|
||||||
|
}
|
||||||
|
for c in consents
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _history_rows(entries: list[EinwilligungenConsentHistoryDB]) -> list[dict]:
|
||||||
|
return [
|
||||||
|
{
|
||||||
|
"id": str(e.id),
|
||||||
|
"consent_id": str(e.consent_id),
|
||||||
|
"action": e.action or "",
|
||||||
|
"consent_version": e.consent_version or "",
|
||||||
|
"ip_address": e.ip_address or "",
|
||||||
|
"user_agent": (e.user_agent or "")[:200],
|
||||||
|
"source": e.source or "",
|
||||||
|
"created_at": e.created_at.isoformat() if e.created_at else "",
|
||||||
|
}
|
||||||
|
for e in entries
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _csv_response(rows: list[dict], filename: str) -> Response:
|
||||||
|
if not rows:
|
||||||
|
return Response(content="", media_type="text/csv",
|
||||||
|
headers={"Content-Disposition": f"attachment; filename={filename}"})
|
||||||
|
buf = io.StringIO()
|
||||||
|
w = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
|
||||||
|
w.writeheader()
|
||||||
|
w.writerows(rows)
|
||||||
|
return Response(content=buf.getvalue(), media_type="text/csv; charset=utf-8",
|
||||||
|
headers={"Content-Disposition": f"attachment; filename={filename}"})
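The CSV body built by `_csv_response` can be reproduced without FastAPI. The sketch below uses only the standard library; `rows_to_csv` is a hypothetical name and the sample row is made up. With `csv.QUOTE_ALL`, every field (header included) is quoted and embedded double quotes are doubled.

```python
import csv
import io


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize dict rows to CSV text, quoting every field."""
    if not rows:
        return ""
    buf = io.StringIO()
    # Column order is taken from the first row, as in the diff.
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample = [{"consent_id": "1", "user_agent": 'Mozilla "test", v1'}]
print(rows_to_csv(sample))
```

Because the field names come from the first row only, an empty result set yields an empty body without a header line. If auditors expect a header even then, the column list would need to be fixed in code.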
|
||||||
|
|
||||||
|
|
||||||
|
def _json_response(payload: dict, filename: str) -> Response:
|
||||||
|
body = json.dumps(payload, ensure_ascii=False, indent=2, default=str)
|
||||||
|
return Response(content=body, media_type="application/json; charset=utf-8",
|
||||||
|
headers={"Content-Disposition": f"attachment; filename={filename}"})
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/consents.csv")
|
||||||
|
async def export_consents_csv(
|
||||||
|
user_id: str | None = Query(None, description="Filter by single user"),
|
||||||
|
granted: bool | None = Query(None),
|
||||||
|
since: str | None = Query(None, description="ISO timestamp"),
|
||||||
|
tenant_id: str = Depends(_get_tenant),
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
) -> Response:
|
||||||
|
"""Download all consent records of this tenant as CSV (auditor-ready)."""
|
||||||
|
q = db.query(EinwilligungenConsentDB).filter(
|
||||||
|
EinwilligungenConsentDB.tenant_id == tenant_id,
|
||||||
|
)
|
||||||
|
if user_id:
|
||||||
|
q = q.filter(EinwilligungenConsentDB.user_id == user_id)
|
||||||
|
if granted is not None:
|
||||||
|
q = q.filter(EinwilligungenConsentDB.granted == granted)
|
||||||
|
if since:
|
||||||
|
try:
|
||||||
|
since_dt = datetime.fromisoformat(since.rstrip("Z"))
|
||||||
|
q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
|
||||||
|
return _csv_response(rows, f"consents_{tenant_id[:8]}_{_ts()}.csv")
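The `since` filter strips a trailing `Z` with `rstrip("Z")`, which produces a naive datetime, and silently ignores unparseable input. A hedged alternative is sketched below: `parse_since` is a hypothetical helper that keeps the UTC offset and returns `None` for invalid input so the caller can decide whether to reject the request. Treating naive timestamps as UTC is an assumption; whether the `created_at` column is timezone-aware is not visible in this diff.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_since(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO timestamp; return None when it is missing or invalid."""
    if not value:
        return None
    # Map the "Z" suffix to an explicit offset so older Python versions accept it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are meant as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

For an audit export, returning an error for an invalid `since` may be safer than falling back to the unfiltered data set, since the auditor otherwise cannot tell that the filter was dropped.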
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/consents.json")
async def export_consents_json(
    user_id: str | None = Query(None),
    granted: bool | None = Query(None),
    since: str | None = Query(None),
    tenant_id: str = Depends(_get_tenant),
    db: Session = Depends(get_db),
) -> Response:
    """Same data as the CSV endpoint but JSON-shaped for further processing."""
    q = db.query(EinwilligungenConsentDB).filter(
        EinwilligungenConsentDB.tenant_id == tenant_id,
    )
    if user_id:
        q = q.filter(EinwilligungenConsentDB.user_id == user_id)
    if granted is not None:
        q = q.filter(EinwilligungenConsentDB.granted == granted)
    if since:
        try:
            since_dt = datetime.fromisoformat(since.rstrip("Z"))
            q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
        except ValueError:
            # An unparseable `since` value is ignored; no date filter is applied.
            pass
    rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
    payload = {
        "tenant_id": tenant_id,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "filter": {"user_id": user_id, "granted": granted, "since": since},
        "count": len(rows),
        "consents": rows,
    }
    return _json_response(payload, f"consents_{tenant_id[:8]}_{_ts()}.json")


@router.get("/history.csv")
async def export_history_csv(
    consent_id: str | None = Query(None, description="Limit to one consent"),
    since: str | None = Query(None),
    tenant_id: str = Depends(_get_tenant),
    db: Session = Depends(get_db),
) -> Response:
    """Download the consent-change history (Art. 7(1) GDPR burden of proof)."""
    q = db.query(EinwilligungenConsentHistoryDB).filter(
        EinwilligungenConsentHistoryDB.tenant_id == tenant_id,
    )
    if consent_id:
        q = q.filter(EinwilligungenConsentHistoryDB.consent_id == consent_id)
    if since:
        try:
            since_dt = datetime.fromisoformat(since.rstrip("Z"))
            q = q.filter(EinwilligungenConsentHistoryDB.created_at >= since_dt)
        except ValueError:
            # An unparseable `since` value is ignored; no date filter is applied.
            pass
    rows = _history_rows(q.order_by(EinwilligungenConsentHistoryDB.created_at.asc()).all())
    return _csv_response(rows, f"consent-history_{tenant_id[:8]}_{_ts()}.csv")
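The three export endpoints parse `since` with `datetime.fromisoformat(since.rstrip("Z"))`, which drops the UTC marker and silently skips the filter on bad input. A minimal sketch of a shared helper follows. The name `parse_since` is hypothetical, and treating naive timestamps as UTC is an assumption that only holds if `created_at` is stored in UTC.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_since(value: str | None) -> datetime | None:
    """Parse an ISO-8601 timestamp; return None for empty or invalid input."""
    if not value:
        return None
    text = value.strip()
    # Older Python versions reject a trailing "Z", so map it to an explicit offset.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    # Assumption: timestamps without an offset are meant as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

A caller could apply the date filter only when the helper returns a value, or answer with a 4xx error instead of exporting unfiltered data when the input is invalid.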
@@ -0,0 +1,167 @@
"""
Cookie function classifier: determines the functional purpose of each cookie.

Today we assign one category per vendor (analytics/advertising/...).
However, a vendor often sets 10-50 different cookies. Not every cookie
of a marketing platform serves advertising; many handle session management,
language preference, scroll position, etc.

This module classifies each cookie by:
  - functional_role : what the cookie technically does (session_id,
                      csrf_token, ab_test, user_id, ad_id, ...)
  - data_collected  : which data it carries (visitor_id,
                      page_view, click, conversion_event, ...)
  - blocking_impact : what happens when the cookie is blocked
                      (none, no_personalization, no_tracking, site_breaks)

This lets the vendor redundancy analyzer make precise statements such as:
"Adobe Analytics sets 55 cookies: 12 for tracking, 8 for A/B tests
and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B test
cookies, so 55 Adobe cookies become 20 Matomo cookies."
"""

from __future__ import annotations

import re
from typing import Iterable

# Pattern -> (functional_role, blocking_impact)
# Order matters: the first matching pattern wins, so more specific ones come first.
_PATTERNS: list[tuple[str, str, str]] = [
    # Session / authentication
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    # CSRF must precede the auth pattern: names such as "csrf_token" also contain "token".
    (r"^csrf|xsrf|antiforgery", "csrf_token", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token", "site_breaks"),

    # Language setting / region
    (r"lang|locale|culture|region", "preference", "no_personalization"),

    # User preferences (theme, view, bookmark)
    (r"theme|dark|mode|view|sort|filter", "ui_preference", "no_personalization"),
    (r"bookmark|favorite|favorit", "user_data", "no_personalization"),

    # The consent cookie itself
    (r"consent|gdpr|tcf|euconsent", "consent_state", "site_breaks"),

    # Tracking IDs (most analytics)
    (r"^_ga|gid|gat|google_analytic", "tracking_id", "no_tracking"),
    (r"^_pk_|matomo|piwik", "tracking_id", "no_tracking"),
    (r"^s_|s\.cc|adobesite|aam", "tracking_id", "no_tracking"),  # Adobe
    (r"hjid|hjsession|hotjar", "session_recording", "no_tracking"),
    (r"_uetsid|_uetvid|microsoft", "tracking_id", "no_tracking"),

    # Visitor identification
    (r"visitor|uid|user_id|customer_id", "visitor_id", "no_personalization"),

    # A/B test / personalisation
    (r"ab_test|abtest|variant|experiment|target|target_qa", "ab_test", "no_personalization"),
    (r"personalization|personalisation|adobe_target", "personalisation", "no_personalization"),

    # Advertising / retargeting
    (r"fbp|fbc|fb_id|facebook|meta_pixel|fr$", "ad_pixel", "no_tracking"),
    (r"adform|criteo|outbrain|taboola|tapad|adsrvr", "ad_pixel", "no_tracking"),
    (r"doubleclick|test_cookie|ide|nid|exchange_uid", "ad_pixel", "no_tracking"),
    (r"google_ad|gads|gcl", "ad_pixel", "no_tracking"),
    (r"^li_|linkedin|bcookie|bscookie", "ad_pixel", "no_tracking"),
    (r"pinterest|_pinterest_|_pin_unauth", "ad_pixel", "no_tracking"),

    # Affiliate / conversion
    (r"conversion|orderid|order_id|transaction|purchase", "conversion_event", "no_tracking"),
    (r"campaign|utm|source|medium|term", "campaign_attribution", "no_tracking"),

    # Scroll position / form helper
    (r"scroll|position|form_|form_state", "ui_state", "no_personalization"),

    # Load balancer / sticky sessions
    (r"affinity|sticky|lb_|alb-|aws-alb", "load_balancer", "site_breaks"),

    # Chat / support
    (r"chat|widget|genesys|livechat", "chat_session", "no_personalization"),

    # Captcha
    (r"hcaptcha|recaptcha|cf_|cloudflare", "bot_protection", "site_breaks"),
]

_FUNCTIONAL_LABEL = {
    "session_id": "Sitzungs-ID",
    "auth_token": "Auth-Token",
    "csrf_token": "CSRF-Schutz",
    "preference": "Sprache / Region",
    "ui_preference": "UI-Praeferenz",
    "user_data": "Nutzer-Daten",
    "consent_state": "Consent-Speicher",
    "tracking_id": "Tracking-ID",
    "session_recording": "Session-Recording",
    "visitor_id": "Besucher-ID",
    "ab_test": "A/B-Test",
    "personalisation": "Personalisierung",
    "ad_pixel": "Werbe-Pixel",
    "conversion_event": "Konversions-Tracking",
    "campaign_attribution": "Kampagnen-Attribution",
    "ui_state": "UI-Zustand (ScrollPos etc.)",
    "load_balancer": "Load-Balancer",
    "chat_session": "Chat-Session",
    "bot_protection": "Bot-Schutz",
    "unknown": "Unbekannt",
}

# Which functional_roles overlap functionally. Used by
# vendor_redundancy.analyze() to detect real consolidation opportunities
# instead of merely counting duplicate providers.
OVERLAPPING_ROLES = {
    "tracking_id": "tracking",
    "session_recording": "tracking",
    "ab_test": "personalisation",
    "personalisation": "personalisation",
    "ad_pixel": "advertising",
    "conversion_event": "advertising",
    "campaign_attribution": "advertising",
}


def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"


def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }


def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: how often does each functional role occur?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        if not roles and v.get("cookies"):
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
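The classifier returns the role of the first pattern that matches, so list order decides which role an ambiguous cookie name receives. The sketch below isolates that effect with two patterns taken from the table. The names `first_match`, `PATTERNS_SPECIFIC_FIRST` and `PATTERNS_GENERIC_FIRST` are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

import re

# Two patterns from the classifier table, in both possible orders.
PATTERNS_SPECIFIC_FIRST = [
    (r"^csrf|xsrf|antiforgery", "csrf_token"),
    (r"sso|signon|auth|login|token|jwt|bearer", "auth_token"),
]
PATTERNS_GENERIC_FIRST = list(reversed(PATTERNS_SPECIFIC_FIRST))


def first_match(name: str, patterns: list[tuple[str, str]]) -> str:
    """Return the role of the first matching pattern, else 'unknown'."""
    lowered = (name or "").lower().strip()
    for pattern, role in patterns:
        if re.search(pattern, lowered):
            return role
    return "unknown"


# "csrftoken" contains "token", so the generic auth pattern also matches it.
specific = first_match("csrftoken", PATTERNS_SPECIFIC_FIRST)
generic = first_match("csrftoken", PATTERNS_GENERIC_FIRST)
```

With the specific pattern first, `csrftoken` resolves to `csrf_token`. With the generic pattern first, it resolves to `auth_token`. A small regression test over known cookie names is one way to catch this kind of shadowing when new patterns are added.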
@@ -0,0 +1,608 @@
"""
Cookie knowledge database: the maximum extractable knowledge per cookie name.

Each entry records:
  - vendor              : the vendor setting the cookie (full company name + country)
  - exact_purpose       : what the cookie does EXACTLY (not just its category)
  - data_collected      : which data fields (client ID, timestamp, IP, etc.)
  - ip_relevant         : is the IP address captured/transmitted?
  - ip_anonymized       : anonymized by default?
  - tcf_purpose_ids     : IAB TCF v2.2 purpose IDs (1-11)
  - iab_vendor_id       : IAB Global Vendor List ID (for TCF sync)
  - typical_lifetime    : how long the cookie persists
  - reid_risk           : re-identification risk (low/medium/high)
  - technical_necessity : necessary under §25(2) TDDDG? (none/partial/full)
  - schrems_ii_status   : third-country transfer assessment
  - eugh_rulings        : relevant CJEU/CNIL/LfDI decisions
  - eu_alternative_*    : EU cookie/vendor replacement with the same function
  - notes               : further remarks (avoidance, configuration)

Sources: Cookiepedia, IAB Europe TCF v2.2 Vendor List, Cookiebot DB,
CNIL Cookies & Trackers Guidelines 2024, EDPB Cookie Guidelines 2/2023,
DSK-Orientierungshilfe Telemedien 2021, vendor documentation.

As of: 2026-05.

Extending: pull requests welcome; for the format see TEMPLATE_ENTRY at the
end of the file.
"""

from __future__ import annotations

from typing import TypedDict


class CookieKnowledge(TypedDict, total=False):
    vendor: str
    vendor_country: str
    exact_purpose: str
    data_collected: list[str]
    ip_relevant: bool
    ip_anonymized: bool
    tcf_purpose_ids: list[int]
    iab_vendor_id: int | None
    typical_lifetime: str
    reid_risk: str            # 'low' | 'medium' | 'high'
    technical_necessity: str  # 'none' | 'partial' | 'full'
    schrems_ii_status: str
    eugh_rulings: list[str]
    eu_alternative_cookies: list[str]
    eu_alternative_vendor: str
    notes: str


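Because `CookieKnowledge` is declared with `total=False`, every key is optional, and the allowed values of `reid_risk` and `technical_necessity` are documented only in comments. The sketch below shows one way to check those values at runtime. `CookieEntry`, `ALLOWED_VALUES` and `invalid_fields` are hypothetical names for a reduced stand-in, not part of the module.

```python
from __future__ import annotations

from typing import TypedDict


class CookieEntry(TypedDict, total=False):
    """Reduced stand-in for CookieKnowledge; all keys are optional."""
    vendor: str
    reid_risk: str
    technical_necessity: str


# Allowed values as stated in the comments on the original TypedDict.
ALLOWED_VALUES = {
    "reid_risk": {"low", "medium", "high"},
    "technical_necessity": {"none", "partial", "full"},
}


def invalid_fields(entry: CookieEntry) -> list[str]:
    """Return the names of constrained fields whose value is not allowed."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        # Missing keys are fine because the TypedDict is total=False.
        if field in entry and entry.get(field) not in allowed:
            problems.append(field)
    return problems
```

Running such a check over every entry in a unit test would catch typos like `"hgih"`. Declaring the fields as `Literal["low", "medium", "high"]` would move the same check to the type checker.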
# ─── Google ──────────────────────────────────────────────────────────

_GOOGLE_BASE = {
    "vendor": "Google LLC", "vendor_country": "US",
    "schrems_ii_status": "Drittlandtransfer in die USA. Mit DPF "
                         "(EU-US Data Privacy Framework, 2023) wieder zulaessig, "
                         "aber bereits Klage NOYB anhaengig (Schrems III). "
                         "Risiko-Bewertung empfohlen.",
    "eugh_rulings": [
        "EuGH C-311/18 (Schrems II, 16.07.2020) — Privacy Shield gekippt",
        "CNIL SAN-2022-002 (10.02.2022) — Google Analytics ohne Anonymisierung "
        "unzulaessig",
        "Datenschutzkonferenz (DSK) Orientierungshilfe 2024 — GA4 mit "
        "Server-Side-Tagging als Mitigation moeglich",
    ],
}

KB: dict[str, CookieKnowledge] = {
|
||||||
|
|
||||||
|
# ─── Google Analytics ─────────────────────────────────────────────
|
||||||
|
"_ga": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "Unterscheidet eindeutig Besucher; persistiert die "
|
||||||
|
"ueber alle Sessions hinweg gueltige Client-ID.",
|
||||||
|
"data_collected": ["client_id", "first_visit_timestamp", "ip_address"],
|
||||||
|
"ip_relevant": True, "ip_anonymized": False,
|
||||||
|
"tcf_purpose_ids": [8, 10],
|
||||||
|
"iab_vendor_id": 755,
|
||||||
|
"typical_lifetime": "2 Jahre",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"eu_alternative_cookies": ["_pk_id"],
|
||||||
|
"eu_alternative_vendor": "Matomo",
|
||||||
|
"notes": "Mit IP-Anonymisierung (`anonymizeIp: true`) + Server-Side-Tagging "
|
||||||
|
"DSGVO-konfigurierbar. Ohne diese Massnahmen einwilligungspflichtig.",
|
||||||
|
},
|
||||||
|
"_gid": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "Unterscheidet Besucher innerhalb einer Session "
|
||||||
|
"(24h-Bucket).",
|
||||||
|
"data_collected": ["session_id", "ip_address"],
|
||||||
|
"ip_relevant": True, "ip_anonymized": False,
|
||||||
|
"tcf_purpose_ids": [8],
|
||||||
|
"iab_vendor_id": 755,
|
||||||
|
"typical_lifetime": "24 Stunden",
|
||||||
|
"reid_risk": "medium",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"eu_alternative_cookies": ["_pk_ses"],
|
||||||
|
"eu_alternative_vendor": "Matomo",
|
||||||
|
},
|
||||||
|
"_gat": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "Throttling-Cookie — begrenzt Anzahl Requests an "
|
||||||
|
"Google Analytics pro Sekunde.",
|
||||||
|
"data_collected": ["throttle_flag"],
|
||||||
|
"ip_relevant": False, "ip_anonymized": True,
|
||||||
|
"tcf_purpose_ids": [],
|
||||||
|
"iab_vendor_id": 755,
|
||||||
|
"typical_lifetime": "1 Minute",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"notes": "Reines Performance-Steuerungs-Cookie. Trotzdem einwilligungspflichtig "
|
||||||
|
"da er Teil des GA-Trackings ist.",
|
||||||
|
},
|
||||||
|
"_gat_gtag_UA_": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "GTM-spezifisches Throttling — pro Universal-Analytics-Property.",
|
||||||
|
"data_collected": ["throttle_flag"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "1 Minute",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"notes": "Suffix nach `UA_` ist die Property-ID. Pattern-Match noetig.",
|
||||||
|
},
|
||||||
|
"_ga_*": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "GA4-Persistierung — eindeutige Stream-ID + Session-Daten.",
|
||||||
|
"data_collected": ["stream_id", "session_count", "session_start_ts"],
|
||||||
|
"ip_relevant": True, "ip_anonymized": False,
|
||||||
|
"tcf_purpose_ids": [8, 10],
|
||||||
|
"iab_vendor_id": 755,
|
||||||
|
"typical_lifetime": "2 Jahre",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"notes": "GA4-Format. Suffix `_<measurement_id>`. Server-Side-GTM "
|
||||||
|
"ist die einzige praktikable DSGVO-Mitigation.",
|
||||||
|
},
|
||||||
|
"NID": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "Personalisiert Google-Suche, AdSense, Maps; "
|
||||||
|
"speichert Praeferenzen + Sicherheits-Token.",
|
||||||
|
"data_collected": ["user_pref_id", "session_id", "security_token"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 755,
|
||||||
|
"typical_lifetime": "6 Monate",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"notes": "Klassischer Google-Werbe-Cookie. Hohe Re-ID-Gefahr da Cross-Site.",
|
||||||
|
},
|
||||||
|
"IDE": {
|
||||||
|
"vendor": "Google LLC (DoubleClick)", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Conversion-Tracking + Werbe-Targeting bei "
|
||||||
|
"Google Display Network / DoubleClick.",
|
||||||
|
"data_collected": ["doubleclick_id", "ad_interactions"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 755,
|
||||||
|
"typical_lifetime": "13 Monate",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": _GOOGLE_BASE["schrems_ii_status"],
|
||||||
|
"eugh_rulings": _GOOGLE_BASE["eugh_rulings"],
|
||||||
|
},
|
||||||
|
"test_cookie": {
|
||||||
|
**_GOOGLE_BASE,
|
||||||
|
"exact_purpose": "DoubleClick-Probe-Cookie — testet ob Browser Cookies akzeptiert.",
|
||||||
|
"data_collected": ["browser_supports_cookies"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "15 Minuten",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Meta / Facebook ──────────────────────────────────────────────
|
||||||
|
"_fbp": {
|
||||||
|
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "First-Party-Pixel von Facebook/Meta — identifiziert "
|
||||||
|
"den Browser fuer Werbe-Conversion-Tracking + Ad-Retargeting.",
|
||||||
|
"data_collected": ["browser_id", "first_visit_ts"],
|
||||||
|
"ip_relevant": True, "ip_anonymized": False,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 891,
|
||||||
|
"typical_lifetime": "90 Tage",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": "Daten gehen an Meta US trotz IE-Sitz. "
|
||||||
|
"Aktuell DPF-abgedeckt, aber Schrems III erwartet.",
|
||||||
|
"eugh_rulings": [
|
||||||
|
"EuGH C-311/18 (Schrems II)",
|
||||||
|
"EDSA Aufsichts-Statement 2023 zu Meta-Pixel",
|
||||||
|
"LDA Bayern Pruefverfuegung 2024",
|
||||||
|
],
|
||||||
|
"eu_alternative_vendor": "Smart AdServer (Equativ, FR)",
|
||||||
|
"notes": "Conversions API (CAPI) als Server-Side-Variante reduziert Cookie-"
|
||||||
|
"Abhaengigkeit. Trotzdem werden Daten an Meta gesendet.",
|
||||||
|
},
|
||||||
|
"_fbc": {
|
||||||
|
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Speichert die ClickID einer Meta-Ad-Kampagne; "
|
||||||
|
"ordnet Conversion dem urspruenglichen Ad-Klick zu.",
|
||||||
|
"data_collected": ["fbclid", "ad_campaign_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9],
|
||||||
|
"iab_vendor_id": 891,
|
||||||
|
"typical_lifetime": "90 Tage",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
"fr": {
|
||||||
|
"vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Werbe-Targeting + Cross-Site-Tracking auf "
|
||||||
|
"Facebook-Plattform.",
|
||||||
|
"data_collected": ["encrypted_user_id", "session_data"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 891,
|
||||||
|
"typical_lifetime": "3 Monate",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Adobe ────────────────────────────────────────────────────────
|
||||||
|
"s_cc": {
|
||||||
|
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Cookie-Test — prueft ob der Browser Cookies "
|
||||||
|
"akzeptiert (Adobe Analytics Bootstrap).",
|
||||||
|
"data_collected": ["browser_supports_cookies"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "Session",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "partial",
|
||||||
|
"schrems_ii_status": "Adobe-IE-Hosting, jedoch US-Datentransfer fuer "
|
||||||
|
"Cloud-Services. DPF-abgedeckt.",
|
||||||
|
},
|
||||||
|
"s_sq": {
|
||||||
|
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Speichert den letzten Klick (URL + Position) "
|
||||||
|
"fuer Click-Map-Reports.",
|
||||||
|
"data_collected": ["last_click_url", "last_click_xy"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"tcf_purpose_ids": [8],
|
||||||
|
"typical_lifetime": "Session",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
"AMCV_": {
|
||||||
|
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Adobe Marketing Cloud Visitor ID — Cross-Tool-ID fuer "
|
||||||
|
"Analytics + Target + Audience Manager.",
|
||||||
|
"data_collected": ["marketing_cloud_visitor_id", "first_visit_ts"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 8, 9, 10],
|
||||||
|
"typical_lifetime": "2 Jahre",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"notes": "Suffix nach `AMCV_` ist die Org-ID. Klassischer Cross-Tool-Track.",
|
||||||
|
},
|
||||||
|
"mbox": {
|
||||||
|
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Adobe Target — A/B-Testing, Personalisierung, "
|
||||||
|
"Audience-Targeting.",
|
||||||
|
"data_collected": ["mbox_visitor_id", "experiment_assignments"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"typical_lifetime": "2 Jahre",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
"s_target_qa": {
|
||||||
|
"vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Adobe Target Premium-Feature — QA-Modus fuer Personalisations-Tests.",
|
||||||
|
"data_collected": ["target_qa_session"],
|
||||||
|
"typical_lifetime": "Session",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"notes": "Premium-Feature-Cookie. Anwesenheit = Enterprise-Lizenz Indikator.",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Microsoft / Bing ─────────────────────────────────────────────
|
||||||
|
"MUID": {
|
||||||
|
"vendor": "Microsoft Corp.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Microsoft-User-ID fuer Bing, Microsoft Advertising, "
|
||||||
|
"Clarity Heatmaps.",
|
||||||
|
"data_collected": ["microsoft_user_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 8, 9, 10],
|
||||||
|
"iab_vendor_id": 165,
|
||||||
|
"typical_lifetime": "13 Monate",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": "US-Transfer, DPF-zertifiziert.",
|
||||||
|
},
|
||||||
|
"_uetsid": {
|
||||||
|
"vendor": "Microsoft Corp.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Universal Event Tracking — Session-Cookie fuer "
|
||||||
|
"Microsoft Advertising Conversion-Tracking.",
|
||||||
|
"data_collected": ["session_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [9],
|
||||||
|
"typical_lifetime": "30 Minuten",
|
||||||
|
"reid_risk": "medium",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
"_uetvid": {
|
||||||
|
"vendor": "Microsoft Corp.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Universal Event Tracking — persistente Visitor-ID.",
|
||||||
|
"data_collected": ["visitor_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9],
|
||||||
|
"typical_lifetime": "13 Monate",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── LinkedIn ─────────────────────────────────────────────────────
|
||||||
|
"bcookie": {
|
||||||
|
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Browser-Identifikation; wird genutzt fuer Anmelde-"
|
||||||
|
"Vorgang + LinkedIn Insight-Tag-Tracking.",
|
||||||
|
"data_collected": ["browser_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 8, 9],
|
||||||
|
"iab_vendor_id": 14,
|
||||||
|
"typical_lifetime": "1 Jahr",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": "US-Datenuebermittlung (Mutter Microsoft).",
|
||||||
|
},
|
||||||
|
"lidc": {
|
||||||
|
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Daten-Routing zwischen LinkedIn-Datacentern.",
|
||||||
|
"data_collected": ["routing_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"typical_lifetime": "1 Tag",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "partial",
|
||||||
|
},
|
||||||
|
"li_gc": {
|
||||||
|
"vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Speichert die Cookie-Einwilligung fuer LinkedIn-Embeds.",
|
||||||
|
"data_collected": ["consent_state"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "6 Monate",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Matomo (EU-Alternative) ──────────────────────────────────────
|
||||||
|
"_pk_id": {
|
||||||
|
"vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
|
||||||
|
"exact_purpose": "Matomo Visitor-ID — Pendant zu _ga, aber DSGVO-konform "
|
||||||
|
"wenn IP-Anonymisierung aktiv.",
|
||||||
|
"data_collected": ["visitor_id", "first_visit_ts"],
|
||||||
|
"ip_relevant": True, "ip_anonymized": True,
|
||||||
|
"tcf_purpose_ids": [8],
|
||||||
|
"typical_lifetime": "13 Monate",
|
||||||
|
"reid_risk": "low", # bei aktivierter Anonymisierung
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": "Bei On-Premise-Deployment KEIN Drittlandtransfer. "
|
||||||
|
"Matomo Cloud EU mit Frankfurt-Hosting verfuegbar.",
|
||||||
|
"notes": "Empfohlener Drop-in-Ersatz fuer Google Analytics.",
|
||||||
|
},
|
||||||
|
"_pk_ses": {
|
||||||
|
"vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
|
||||||
|
"exact_purpose": "Matomo Session-Cookie.",
|
||||||
|
"data_collected": ["session_id"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "30 Minuten",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Captcha ──────────────────────────────────────────────────────
|
||||||
|
"hcaptcha": {
|
||||||
|
"vendor": "Intuition Machines Inc. (hCaptcha)", "vendor_country": "US",
|
||||||
|
"exact_purpose": "hCaptcha Bot-Erkennung — Session-Token fuer Challenge-Solver.",
|
||||||
|
"data_collected": ["bot_score", "session_id", "ip_address"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"typical_lifetime": "Session",
|
||||||
|
"reid_risk": "medium",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
"schrems_ii_status": "US-Transfer (Intuition Machines, San Francisco).",
|
||||||
|
"eu_alternative_vendor": "Friendly Captcha (DE), Turnstile (Cloudflare EU)",
|
||||||
|
"notes": "Technisch erforderlich fuer Bot-Schutz, aber EU-Alternativen "
|
||||||
|
"ohne Drittland-Risiko verfuegbar.",
|
||||||
|
},
|
||||||
|
"cf_clearance": {
|
||||||
|
"vendor": "Cloudflare Inc.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Cloudflare-Bot-Management Pro — bestaetigt dass User "
|
||||||
|
"die JS-Challenge bestanden hat.",
|
||||||
|
"data_collected": ["challenge_token"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"typical_lifetime": "30 Minuten",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
"notes": "Premium-Feature-Cookie. Anwesenheit = Cloudflare Bot-Management "
|
||||||
|
"Pro im Einsatz.",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── CDN / Performance ────────────────────────────────────────────
|
||||||
|
"__cf_bm": {
|
||||||
|
"vendor": "Cloudflare Inc.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Cloudflare Bot Management — basis Bot-Erkennung.",
|
||||||
|
"data_collected": ["bot_score", "client_hash"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"typical_lifetime": "30 Minuten",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
"notes": "Strictly necessary nach §25(2) TDDDG (Sicherheit). Keine Einwilligung noetig.",
|
||||||
|
},
|
||||||
|
"aws-alb": {
|
||||||
|
"vendor": "Amazon Web Services Inc.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "AWS Application Load Balancer Sticky Sessions — "
|
||||||
|
"routet Anfragen konsistent an dieselbe Backend-Instanz.",
|
||||||
|
"data_collected": ["target_instance_id"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "1 Stunde",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
"schrems_ii_status": "AWS Frankfurt verfuegbar — bei korrektem Region-Setup "
|
||||||
|
"kein US-Transfer.",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Retargeting / Advertising ────────────────────────────────────
|
||||||
|
"_pin_unauth": {
|
||||||
|
"vendor": "Pinterest Europe Ltd.", "vendor_country": "IE/US",
|
||||||
|
"exact_purpose": "Pinterest Tag — Conversion-Tracking + Audience-Aufbau.",
|
||||||
|
"data_collected": ["pinterest_user_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 762,
|
||||||
|
"typical_lifetime": "1 Jahr",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
"cto_dna": {
|
||||||
|
"vendor": "Criteo S.A.", "vendor_country": "FR",
|
||||||
|
"exact_purpose": "Criteo Dynamic Retargeting — produktspezifische "
|
||||||
|
"Werbeauslieferung basierend auf Browser-History.",
|
||||||
|
"data_collected": ["criteo_user_id", "product_views"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 91,
|
||||||
|
"typical_lifetime": "13 Monate",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": "Criteo ist FR-basiert, aber Daten gehen auch in die USA. "
|
||||||
|
"Multi-Region-Setup pruefen.",
|
||||||
|
        "notes": "Trotz FR-Sitz nicht automatisch DSGVO-konform — CNIL-Bussgeld "
                 "EUR 40M (Juni 2023) u.a. wegen fehlendem Nachweis der Einwilligung.",
|
||||||
|
},
|
||||||
|
"afm": {
|
||||||
|
"vendor": "Adform A/S", "vendor_country": "DK",
|
||||||
|
"exact_purpose": "Adform Audience Matching — Cross-Device-Identifikation "
|
||||||
|
"fuer programmatische Werbung.",
|
||||||
|
"data_collected": ["adform_user_id", "device_signals"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"tcf_purpose_ids": [4, 9, 10],
|
||||||
|
"iab_vendor_id": 50,
|
||||||
|
"typical_lifetime": "30 Tage",
|
||||||
|
"reid_risk": "high",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
"schrems_ii_status": "Adform ist DK-basiert, EU-Hosting Standard. Keine "
|
||||||
|
"Schrems-II-Probleme bei Standard-Setup.",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Consent / Funktional (Strictly Necessary) ────────────────────
|
||||||
|
"JSESSIONID": {
|
||||||
|
"vendor": "Java EE / Tomcat (Site-Software)", "vendor_country": "N/A",
|
||||||
|
"exact_purpose": "Server-Session-ID fuer Java-basierte Anwendungen.",
|
||||||
|
"data_collected": ["session_id"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "Session",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
        "notes": "Strictly necessary. Keine Einwilligung erforderlich (§ 25 Abs. 2 Nr. 2 TDDDG).",
|
||||||
|
},
|
||||||
|
"PHPSESSID": {
|
||||||
|
"vendor": "PHP (Site-Software)", "vendor_country": "N/A",
|
||||||
|
"exact_purpose": "PHP-Session-ID — serverseitige Sitzungs-Persistierung.",
|
||||||
|
"data_collected": ["session_id"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "Session",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
},
|
||||||
|
"cookie_consent": {
|
||||||
|
"vendor": "BreakPilot Consent-Banner (own)", "vendor_country": "DE",
|
||||||
|
"exact_purpose": "Speichert die Einwilligungsentscheidung des Nutzers "
|
||||||
|
"pro Kategorie.",
|
||||||
|
"data_collected": ["consent_state_per_category", "timestamp"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"typical_lifetime": "180 Tage",
|
||||||
|
"reid_risk": "low",
|
||||||
|
"technical_necessity": "full",
|
||||||
|
"notes": "Strictly necessary nach Art. 7(1) DSGVO Nachweispflicht.",
|
||||||
|
},
|
||||||
|
|
||||||
|
# ─── Templated / pattern-based entries (Suffix variabel) ──────────
|
||||||
|
# Diese werden via regex-Lookup gefangen, der Eintrag dient als Fallback.
|
||||||
|
"_uet_": {
|
||||||
|
"vendor": "Microsoft Corp.", "vendor_country": "US",
|
||||||
|
"exact_purpose": "Microsoft Universal Event Tracking — Suffix nach `_uet_`.",
|
||||||
|
"data_collected": ["event_id"],
|
||||||
|
"ip_relevant": True,
|
||||||
|
"typical_lifetime": "30 Minuten",
|
||||||
|
"reid_risk": "medium",
|
||||||
|
"technical_necessity": "none",
|
||||||
|
},
|
||||||
|
}
# ─── Pattern lookup for cookies with a variable suffix ──────────────

_PATTERN_LOOKUPS: list[tuple[str, str]] = [
    (r"^_ga_[A-Z0-9_]+$", "_ga_*"),
    (r"^_gat_gtag_UA_", "_gat_gtag_UA_"),
    (r"^AMCV_", "AMCV_"),
    (r"^_uet[a-z]+", "_uet_"),
    (r"^aws-alb", "aws-alb"),
    (r"^_pk_id\.", "_pk_id"),
    (r"^_pk_ses\.", "_pk_ses"),
]
def lookup_cookie(name: str) -> CookieKnowledge | None:
    """Return rich knowledge for a cookie name, or None if unknown."""
    import re

    if not name:
        return None
    # 1) Direct hit on the exact cookie name.
    if name in KB:
        return KB[name]
    # 2) Pattern-based lookup for cookies with a variable suffix. A pattern
    #    whose KB key is missing is skipped instead of ending the lookup.
    for pattern, kb_key in _PATTERN_LOOKUPS:
        if re.search(pattern, name) and kb_key in KB:
            return KB[kb_key]
    # 3) Strip a domain-style suffix (".bmw.de", ".domain", ...) and retry.
    base = name.split(".", 1)[0]
    if base != name and base in KB:
        return KB[base]
    return None
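The three lookup stages (exact name, regex pattern, name without its dotted suffix) can be tried out in isolation. The following is a minimal sketch, assuming a two-entry stand-in table; `DEMO_KB`, `DEMO_PATTERNS` and `demo_lookup` are hypothetical names and the entry values are illustrative, not taken from the real knowledge base.

```python
import re

# Hypothetical two-entry stand-in for the real KB (values are illustrative).
DEMO_KB = {
    "_ga_*": {"vendor": "Google", "reid_risk": "medium"},
    "PHPSESSID": {"vendor": "PHP (Site-Software)", "reid_risk": "low"},
}
DEMO_PATTERNS = [(r"^_ga_[A-Z0-9_]+$", "_ga_*")]


def demo_lookup(name: str):
    """Exact hit, then regex pattern, then the part before the first dot."""
    if not name:
        return None
    if name in DEMO_KB:
        return DEMO_KB[name]
    for pattern, kb_key in DEMO_PATTERNS:
        if re.search(pattern, name) and kb_key in DEMO_KB:
            return DEMO_KB[kb_key]
    base = name.split(".", 1)[0]
    return DEMO_KB.get(base) if base != name else None


direct = demo_lookup("PHPSESSID")            # stage 1: exact name
templated = demo_lookup("_ga_ABC123XYZ")     # stage 2: variable suffix
suffixed = demo_lookup("PHPSESSID.bmw.de")   # stage 3: dotted suffix stripped
unknown = demo_lookup("totally_unknown")     # no stage matches
```

Only the first dot is used for the suffix split, so a cookie name that itself contains a dot before the domain part would not be resolved by stage 3.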
|
||||||
|
|
||||||
|
|
||||||
|
def enrich_vendor_with_knowledge(vendor: dict) -> dict:
|
||||||
|
"""Add per-cookie knowledge to each cookie in vendor['cookies']."""
|
||||||
|
cookies = vendor.get("cookies") or []
|
||||||
|
enriched = []
|
||||||
|
for c in cookies:
|
||||||
|
info = lookup_cookie(c.get("name", ""))
|
||||||
|
if info:
|
||||||
|
enriched.append({**c, "knowledge": info})
|
||||||
|
else:
|
||||||
|
enriched.append(c)
|
||||||
|
return {**vendor, "cookies": enriched}
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Hilfen fuer Aggregate-Reports ──────────────────────────────────
|
||||||
|
|
||||||
|
def summarize_compliance_risk(vendor: dict) -> dict:
    """Aggregate re-ID risk and third-country status across all cookies of a vendor."""
    cookies = vendor.get("cookies") or []
    risk_counts = {"high": 0, "medium": 0, "low": 0}
    schrems_affected = 0
    technical_only = 0
    for c in cookies:
        # Prefer knowledge already attached to the cookie; otherwise look it up.
        k = c.get("knowledge") or lookup_cookie(c.get("name", ""))
        if not k:
            continue
        risk = k.get("reid_risk", "low")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1
        # Heuristic upper bound: the substring test also counts statuses that
        # merely mention Schrems II, e.g. "Keine Schrems-II-Probleme ..." (Adform).
        country = k.get("vendor_country", "").lower()
        status = k.get("schrems_ii_status", "").lower()
        if "us" in country or "schrems" in status:
            schrems_affected += 1
        if k.get("technical_necessity") == "full":
            technical_only += 1
    return {
        "reid_risk_distribution": risk_counts,
        "high_risk_cookie_count": risk_counts["high"],
        "schrems_ii_affected_cookies": schrems_affected,
        "strictly_necessary_cookies": technical_only,
        "total_classified": sum(risk_counts.values()),
    }
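A stricter third-country test than a substring match could parse the ISO codes out of `vendor_country` (which may hold values such as `IE/US` or `N/A`) and compare them against the EEA member list. The sketch below is one possible approach, not the module's logic; `EEA` and `has_non_eea_seat` are hypothetical names, and countries with an adequacy decision are deliberately not special-cased here.

```python
import re

# EU-27 plus Iceland, Liechtenstein and Norway.
EEA = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE", "IS", "LI", "NO",
})


def has_non_eea_seat(vendor_country: str) -> bool:
    """True if any two-letter code in the field lies outside the EEA."""
    codes = re.findall(r"\b[A-Z]{2}\b", vendor_country.upper())
    return any(code not in EEA for code in codes)


double_seat = has_non_eea_seat("IE/US")   # one of the two codes is outside
eu_only = has_non_eea_seat("DK")
site_software = has_non_eea_seat("N/A")   # no two-letter code at all
```

Because only whole two-letter tokens are considered, placeholder values like `N/A` do not trigger a false positive.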
|
||||||
|
|
||||||
|
|
||||||
|
# ─── TEMPLATE_ENTRY (fuer Kuratierung neuer Cookies) ───────────────
|
||||||
|
|
||||||
|
TEMPLATE_ENTRY: CookieKnowledge = {
|
||||||
|
"vendor": "<Voller Firmenname>",
|
||||||
|
"vendor_country": "<ISO-2 Code oder 'IE/US' bei Doppel-Standort>",
|
||||||
|
"exact_purpose": "<1-2 Saetze was der Cookie GENAU tut>",
|
||||||
|
"data_collected": ["<feldname_1>", "<feldname_2>"],
|
||||||
|
"ip_relevant": False,
|
||||||
|
"ip_anonymized": False,
|
||||||
|
"tcf_purpose_ids": [], # TCF v2.2: 1-11
|
||||||
|
"iab_vendor_id": None, # Aus https://iabeurope.eu/tcf-vendor-list/
|
||||||
|
"typical_lifetime": "<Session | XX Tage | XX Monate | XX Jahre>",
|
||||||
|
"reid_risk": "low", # low | medium | high
|
||||||
|
"technical_necessity": "none", # none | partial | full
|
||||||
|
"schrems_ii_status": "<Drittlandtransfer-Bewertung>",
|
||||||
|
"eugh_rulings": [],
|
||||||
|
"eu_alternative_cookies": [],
|
||||||
|
"eu_alternative_vendor": "",
|
||||||
|
"notes": "",
|
||||||
|
}
|
||||||
@@ -220,10 +220,16 @@ def score_vendors(vendors: list[dict]) -> list[dict]:
             flags.append("no_purpose")
 
         # Country — only for external processors / controllers
+        # If country is empty, infer it from the legal-form suffix in the name.
         if country_required:
             max_score += 10
             if v.get("country"):
                 score += 10
+            elif _country_from_name(v.get("name", "")):
+                inferred = _country_from_name(v.get("name", ""))
+                v["country"] = inferred
+                v["country_inferred"] = True
+                score += 10
             else:
                 flags.append("no_country")
 
@@ -321,3 +327,153 @@ def build_check_items(validated: list[LinkCheck]) -> list[dict]:
|
|||||||
"hint": hint,
|
"hint": hint,
|
||||||
})
|
})
|
||||||
return items
|
return items
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Country inference from the legal-form suffix ──────────────────
#
# When a vendor's "country" field is empty, it can often be derived from
# the company suffix:
#   Adform A/S                  → DK (Denmark, Aktieselskab)
#   Pinterest Europe Ltd.       → IE (Ireland, Limited; via known-vendor override)
#   Salesforce Inc.             → US (Incorporated)
#   Adobe ... Ireland Limited   → IE
#   Genesys ... B.V.            → NL (Netherlands, Besloten Vennootschap)
#   Equativ S.A.                → FR (Société Anonyme)
#   SAP SE                      → DE (Societas Europaea, mostly registered in DE)
#
# Combined strategy, in the order _country_from_name() applies it:
#   1) Known vendor override (Google Inc, Meta Platforms Ireland Ltd, ...)
#   2) Country name inside the company name ('Ireland', 'Deutschland')
#   3) Legal-form suffix pattern

import re as _re
|
||||||
|
|
||||||
|
_SUFFIX_COUNTRY: list[tuple[str, str]] = [
    # Pattern (at the end of the name or before further tokens) → ISO code.
    # The first match wins, so specific forms are listed before the generic
    # ones they contain. Dotted suffixes end in (?!\w) instead of \b, because
    # \b cannot match between a trailing "." and the end of the string.
    (r"\bS\.A\.\sde C\.V\.(?!\w)", "MX"),
    (r"\bPte\.?\sLtd\.?\b", "SG"),
    (r"\bA/S\b", "DK"),             # Aktieselskab
    (r"\bApS\b", "DK"),             # Anpartsselskab
    (r"\bAB\b", "SE"),              # Aktiebolag
    (r"\bAS\b", "NO"),              # Aksjeselskap
    (r"\bOy\b", "FI"),              # Osakeyhtiö
    (r"\bAG\b", "DE"),              # CH/AT also possible, default DE
    (r"\bGmbH\b", "DE"),
    (r"\bUG\b", "DE"),
    (r"\beG\b", "DE"),
    (r"\bKG\b", "DE"),
    (r"\bOHG\b", "DE"),
    (r"\bSE\b", "DE"),              # Societas Europaea (e.g. SAP SE), default DE
    (r"\bS\.A\.S\.(?!\w)", "FR"),
    (r"\bS\.A\.(?!\w)", "FR"),      # ES/BE/LU also possible, default FR
    (r"\bSAS\b", "FR"),
    (r"\bSARL\b", "FR"),
    (r"\bS\.r\.l\.(?!\w)", "IT"),
    (r"\bS\.p\.A\.(?!\w)", "IT"),
    (r"\bSpA\b", "IT"),
    (r"\bB\.V\.(?!\w)", "NL"),
    (r"\bN\.V\.(?!\w)", "NL"),
    (r"\bSL\b", "ES"),
    (r"\bd\.o\.o\.(?!\w)", "SI"),   # Slovenia
    (r"\bd\.d\.(?!\w)", "HR"),      # Croatia
    (r"\bz\s?o\.o\.(?!\w)", "PL"),
    (r"\bInc\.?\b", "US"),
    (r"\bIncorporated\b", "US"),
    (r"\bCorp\.?\b", "US"),
    (r"\bCorporation\b", "US"),
    (r"\bLLC\b", "US"),
    (r"\bL\.L\.C\.(?!\w)", "US"),
    (r"\bLtd\.?\b", "GB"),          # UK Limited, default
    (r"\bLimited\b", "GB"),
    (r"\bPLC\b", "GB"),
    (r"\bPty\b", "AU"),
    (r"\bK\.K\.(?!\w)", "JP"),      # Kabushiki-Kaisha
]
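A note on suffix patterns that end in a literal dot: `\b` only matches between a word character and a non-word character, so a pattern such as `\bS\.A\.\b` cannot match when the final `.` is followed by a space or by the end of the name. The sketch below compares that form with a negative lookahead `(?!\w)`; the Mexican company name is a made-up example.

```python
import re

name = "Equativ S.A."

# \b after the final "." needs a word character on one side; at the end of
# the string there is none, so this search finds nothing.
strict = re.search(r"\bS\.A\.\b", name)

# A negative lookahead only requires that no word character follows.
relaxed = re.search(r"\bS\.A\.(?!\w)", name)

# The lookahead form also handles multi-token suffixes (hypothetical name).
mexican = re.search(r"\bS\.A\.\sde C\.V\.(?!\w)", "Ejemplo S.A. de C.V.")
```

Patterns with an optional dot, such as `\bInc\.?\b`, are not affected: the regex engine backtracks to the variant without the dot and finds the boundary right after the last letter.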
|
||||||
|
|
||||||
|
# Country-Namen im Firmen-Namen (z.B. "Adobe Systems Software Ireland Limited")
|
||||||
|
_COUNTRY_NAME_TOKENS: list[tuple[str, str]] = [
|
||||||
|
("ireland", "IE"),
|
||||||
|
("deutschland", "DE"),
|
||||||
|
("germany", "DE"),
|
||||||
|
("netherlands", "NL"),
|
||||||
|
("france", "FR"),
|
||||||
|
("united kingdom", "GB"),
|
||||||
|
("uk", "GB"),
|
||||||
|
("usa", "US"),
|
||||||
|
("united states", "US"),
|
||||||
|
("austria", "AT"),
|
||||||
|
("oesterreich", "AT"),
|
||||||
|
("schweiz", "CH"),
|
||||||
|
("switzerland", "CH"),
|
||||||
|
("luxembourg", "LU"),
|
||||||
|
("luxemburg", "LU"),
|
||||||
|
("denmark", "DK"),
|
||||||
|
("daenemark", "DK"),
|
||||||
|
("sweden", "SE"),
|
||||||
|
("schweden", "SE"),
|
||||||
|
("norway", "NO"),
|
||||||
|
("norwegen", "NO"),
|
||||||
|
("finland", "FI"),
|
||||||
|
("finnland", "FI"),
|
||||||
|
]
|
||||||
|
|
||||||
|
# Bekannte Vendors mit eindeutigem Sitz (override)
|
||||||
|
_KNOWN_VENDOR_COUNTRY: dict[str, str] = {
|
||||||
|
"google inc": "US",
|
||||||
|
"google llc": "US",
|
||||||
|
"google ireland": "IE",
|
||||||
|
"meta platforms ireland": "IE",
|
||||||
|
"facebook ireland": "IE",
|
||||||
|
"amazon.com inc": "US",
|
||||||
|
"amazon web services": "US",
|
||||||
|
"amazon web services inc": "US",
|
||||||
|
"linkedin inc": "US",
|
||||||
|
"salesforce inc": "US",
|
||||||
|
"salesforce.com": "US",
|
||||||
|
"outbrain inc": "US",
|
||||||
|
"taboola inc": "US",
|
||||||
|
"pinterest europe ltd": "IE",
|
||||||
|
"intuition machines inc": "US",
|
||||||
|
"akamai technologies inc": "US",
|
||||||
|
"criteo s.a": "FR",
|
||||||
|
"criteo sa": "FR",
|
||||||
|
"adform a/s": "DK",
|
||||||
|
"speedcurve limited": "GB",
|
||||||
|
"longtail ad solutions": "US",
|
||||||
|
"genesys cloud services b.v": "NL",
|
||||||
|
"qualtrics": "US",
|
||||||
|
"teads sa": "FR",
|
||||||
|
"teads s.a": "FR",
|
||||||
|
"salesviewer gmbh": "DE",
|
||||||
|
"baqend gmbh": "DE",
|
||||||
|
"zenweshare sas": "FR",
|
||||||
|
"nayoki gmbh": "DE",
|
||||||
|
"psyma": "DE",
|
||||||
|
"matomo": "NZ", # InnoCraft NZ aber EU-hostbar
|
||||||
|
"adobe systems software ireland": "IE",
|
||||||
|
"microsoft corporation": "US",
|
||||||
|
"microsoft corp": "US",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _country_from_name(vendor_name: str) -> str:
    """Best effort: derive the ISO-2 country code from the vendor name."""
    if not vendor_name:
        return ""
    # Vendor names often look like "<company> — <tool>"; inspect only the company part.
    firm = vendor_name.split(" — ")[0].strip()
    firm_l = firm.lower()

    # 1) Known vendor lookup (most specific)
    for k, v in _KNOWN_VENDOR_COUNTRY.items():
        if k in firm_l:
            return v
    # 2) Country name inside the company name. Match whole words only, so that
    #    "uk" or "usa" do not fire inside longer words such as "thousand".
    for token, code in _COUNTRY_NAME_TOKENS:
        if _re.search(rf"\b{_re.escape(token)}\b", firm_l):
            return code
    # 3) Legal-form suffix
    for pattern, code in _SUFFIX_COUNTRY:
        if _re.search(pattern, firm):
            return code
    return ""
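The precedence of the three stages (known vendor, then country name, then legal-form suffix) is easiest to see on a few names. This is a minimal sketch with hypothetical miniature tables (`KNOWN`, `COUNTRY_TOKENS`, `SUFFIXES`) and a hypothetical `demo_country` helper; it matches country tokens as whole words, which is an assumption of the sketch.

```python
import re

# Hypothetical miniature versions of the three lookup tables.
KNOWN = {"pinterest europe ltd": "IE"}
COUNTRY_TOKENS = [("ireland", "IE"), ("uk", "GB"), ("usa", "US")]
SUFFIXES = [(r"\bGmbH\b", "DE"), (r"\bInc\.?\b", "US"), (r"\bLtd\.?\b", "GB")]


def demo_country(vendor_name: str) -> str:
    firm = vendor_name.strip()
    firm_l = firm.lower()
    # 1) Known vendor override wins over everything else.
    for key, code in KNOWN.items():
        if key in firm_l:
            return code
    # 2) Country name as a whole word inside the company name.
    for token, code in COUNTRY_TOKENS:
        if re.search(rf"\b{re.escape(token)}\b", firm_l):
            return code
    # 3) Legal-form suffix as the last resort.
    for pattern, code in SUFFIXES:
        if re.search(pattern, firm):
            return code
    return ""


results = {
    name: demo_country(name)
    for name in [
        "Pinterest Europe Ltd.",                    # override beats "Ltd." → GB
        "Adobe Systems Software Ireland Limited",   # country name
        "Thousand Hills GmbH",                      # "usa" inside "thousand" is ignored
        "Example Widgets Inc.",                     # suffix only
        "Unknown Vendor",                           # nothing matches
    ]
}
```

With plain substring matching, the third name would resolve to `US` because "thousand" contains "usa"; the whole-word match lets the `GmbH` suffix decide instead.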
|
||||||
|
|||||||
@@ -0,0 +1,350 @@
"""
Doc-Anchor-Locator: find the best insertion point in the existing document
for a finding.

Primary strategy: BGE-M3 embedding match between a per-finding anchor query
and all paragraphs of the doc. This is real semantic matching (BMW writes
"Verarbeiter" instead of "Auftragsverarbeiter"; a keyword match would miss
it, the embedding match catches it).

Fallback: a static per-finding position hint (when no embedding service is
reachable).

Output per anchor:
- anchor_phrase : excerpt of the original text (None if nothing matched)
- position_hint : "Nach Absatz X von Y: '...'"
- confidence    : 'high' | 'medium' | 'low'
- score         : float (cosine similarity plus heading bonus; embedding methods only)
- method        : 'embedding' | 'embedding-no-match' | 'fallback'
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import threading
|
||||||
|
from typing import Iterable
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
|
||||||
|
|
||||||
|
# Pro Finding-Typ eine semantisch reiche Anchor-Query, die der Embedding-
|
||||||
|
# Matcher gegen den Doc-Text wirft. Reicher als die kurze MC-check_question.
|
||||||
|
# Sucht NICHT den Pflichttext selbst (der fehlt ja) sondern den Absatz wo
|
||||||
|
# der Fix HINEIN-soll — also den thematisch verwandten Kontext.
|
||||||
|
_ANCHOR_QUERIES: list[tuple[str, str, str]] = [
|
||||||
|
# (finding_label_partial, anchor_query, fallback_hint)
|
||||||
|
(
|
||||||
|
"Auftragsverarbeiter erwaehnt",
|
||||||
|
"Empfaenger der Daten Verarbeiter Dienstleister Cloud Hosting CRM "
|
||||||
|
"Auftragsverarbeitung Weitergabe Datenuebermittlung an Dritte",
|
||||||
|
"Im Abschnitt 'Empfaenger' oder 'Datenuebermittlung'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Automatisierte Entscheidungen",
|
||||||
|
"Betroffenenrechte automatisierte Entscheidung Profiling Logik "
|
||||||
|
"Tragweite Auswirkung Art. 22 DSGVO",
|
||||||
|
"Am Ende des Abschnitts 'Betroffenenrechte'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Konkrete Aufsichtsbehoerde",
|
||||||
|
"Beschwerderecht Datenschutzaufsicht Aufsichtsbehoerde Beschwerde "
|
||||||
|
"bei der Behoerde einreichen Recht auf Beschwerde",
|
||||||
|
"Im Abschnitt 'Beschwerderecht'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Angemessenheitsbeschluss",
|
||||||
|
"Drittlandtransfer USA Standardvertragsklauseln SCC DPF Data Privacy "
|
||||||
|
"Framework Angemessenheitsbeschluss internationale Datenuebermittlung",
|
||||||
|
"Im Abschnitt 'Drittlandtransfer'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Anschrift des Verantwortlichen",
|
||||||
|
"Verantwortlicher Verantwortliche Stelle Datenschutz Betreiber dieser "
|
||||||
|
"Website Firma Anschrift Kontakt",
|
||||||
|
"Am Anfang der Datenschutzerklaerung / Cookie-Richtlinie",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Konkrete Cookie-Namen",
|
||||||
|
"Welche Cookies verwenden wir Cookie-Tabelle Liste der Cookies "
|
||||||
|
"Cookie-Kategorien Auflistung der Cookies Name Anbieter Zweck",
|
||||||
|
"Im Abschnitt 'Welche Cookies verwenden wir?'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Konkrete Anbieter/Dienste",
|
||||||
|
"Drittanbieter Dienste Anbieter wir nutzen folgende Dienste "
|
||||||
|
"Empfaenger der Cookie-Daten Liste der Dienstleister",
|
||||||
|
"In der Drittanbieter-Liste der Cookie-Richtlinie",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Analytics-/Statistik-Tools konkret benannt",
|
||||||
|
"Statistik Analytics Reichweitenmessung Webanalyse Tracking "
|
||||||
|
"Google Analytics Matomo Adobe Analytics",
|
||||||
|
"Im Abschnitt 'Statistik / Analyse-Cookies'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Konkrete Speicherdauer",
|
||||||
|
"Speicherdauer Lebensdauer wie lange Ablauf Cookie-Tabelle Spalte "
|
||||||
|
"Speicherdauer pro Cookie",
|
||||||
|
"In der Cookie-Tabelle pro Eintrag",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Opt-Out-Links",
|
||||||
|
"Widerruf widersprechen deaktivieren Cookie-Einstellungen aendern "
|
||||||
|
"Opt-Out Einstellungen anpassen",
|
||||||
|
"Im Abschnitt 'Wie kann ich widersprechen?'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Privacy-Policy-Links",
|
||||||
|
"Datenschutzerklaerung des Drittanbieters Privacy Policy Link auf "
|
||||||
|
"Datenschutzhinweise der Drittanbieter",
|
||||||
|
"Im Drittanbieter-Listing der Cookie-Richtlinie",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Verbraucherstreitbeilegung",
|
||||||
|
"Online-Streitbeilegung OS-Plattform Verbraucherschlichtungsstelle "
|
||||||
|
"Streitbeilegung Verbraucher",
|
||||||
|
"Am Ende des Impressums (eigener Abschnitt 'Streitbeilegung')",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Rechtswidriger Haftungsausschluss",
|
||||||
|
"Haftung Disclaimer wir distanzieren uns Links zu externen Webseiten "
|
||||||
|
"Haftungsausschluss Drittinhalte",
|
||||||
|
"Am Ende des Impressums (Disclaimer-Absatz)",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Name der vertretungsberechtigten",
|
||||||
|
"Vertreten durch Vorstand Geschaeftsfuehrung Geschaeftsfuehrer "
|
||||||
|
"vertretungsberechtigt Repraesentant",
|
||||||
|
"Im Impressum nach Firmenname + Anschrift",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Zustaendige Kammer",
|
||||||
|
"Berufsrecht Kammer Aufsichtsbehoerde berufsrechtliche Angaben "
|
||||||
|
"zustaendige Kammer",
|
||||||
|
"Im Impressum im Abschnitt 'Berufsrechtliche Angaben'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Drittlaender",
|
||||||
|
"Drittland Drittlaender USA Indien China internationale Datenuebermittlung "
|
||||||
|
"Datenexport in Nicht-EU-Staaten",
|
||||||
|
"Im Abschnitt 'Drittlandtransfer'",
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Schutzgarantien",
|
||||||
|
"Schutzgarantien Schutzvorkehrungen Schutzmassnahmen SCC "
|
||||||
|
"Standardvertragsklauseln einsehen Anforderung",
|
||||||
|
"Im Abschnitt 'Drittlandtransfer / Schutzgarantien'",
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Thread-local Cache fuer Doc-Chunks + Embeddings ───────────────
|
||||||
|
# Pro Compliance-Check-Run werden die Doc-Texte einmal embedded und im
|
||||||
|
# Thread-Local-Storage gehalten, damit mehrere Findings im selben Run
|
||||||
|
# nicht jeweils neu embedded werden.
|
||||||
|
|
||||||
|
_tls = threading.local()
|
||||||
|
|
||||||
|
|
||||||
|
def _get_cache() -> dict:
|
||||||
|
if not hasattr(_tls, "cache"):
|
||||||
|
_tls.cache = {}
|
||||||
|
return _tls.cache
|
||||||
|
|
||||||
|
|
||||||
|
def reset_cache() -> None:
    """Clear the per-request cache (call this at the start of every doc-check
    run so that data from a previous run cannot leak into the next one)."""
    if hasattr(_tls, "cache"):
        _tls.cache = {}
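The effect of keeping the cache in `threading.local()` is that each thread sees its own dictionary. The following is a small self-contained sketch of that behaviour; `get_cache` mirrors the helper above, while the `doc-1` key and the worker function are made up for the demonstration.

```python
import threading

_tls = threading.local()


def get_cache() -> dict:
    # Each thread lazily creates its own dict on first access.
    if not hasattr(_tls, "cache"):
        _tls.cache = {}
    return _tls.cache


get_cache()["doc-1"] = "main-thread value"

seen_in_worker = []


def worker() -> None:
    # The worker thread starts with an empty cache of its own.
    seen_in_worker.append(dict(get_cache()))


t = threading.Thread(target=worker)
t.start()
t.join()
```

A consequence worth keeping in mind: when a thread pool reuses a thread for the next request, the previous request's entries are still there, which is why a reset at the start of each run matters.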
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Helfer ────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def _normalize(text: str) -> str:
|
||||||
|
return (text or "").lower().replace("\xad", "").replace("ß", "ss")
|
||||||
|
|
||||||
|
|
||||||
|
def _split_paragraphs(text: str) -> list[str]:
|
||||||
|
"""Split a doc into paragraphs (by double newline, fallback single)."""
|
||||||
|
if not text:
|
||||||
|
return []
|
||||||
|
paras = re.split(r"\n\s*\n", text)
|
||||||
|
if len(paras) < 3:
|
||||||
|
paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
|
||||||
|
return [p.strip() for p in paras if p.strip()]
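The two splitting modes can be checked on short inputs. The sketch repeats the splitter under the name `split_paragraphs` so that it runs on its own; the sample texts are invented.

```python
import re


def split_paragraphs(text: str) -> list:
    """Split on blank lines; fall back to sentence boundaries for flat text."""
    if not text:
        return []
    paras = re.split(r"\n\s*\n", text)
    if len(paras) < 3:
        # Fewer than three blocks: treat the text as flat and split after
        # ., ? or ! when an uppercase letter (incl. Ä/Ö/Ü) follows.
        paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
    return [p.strip() for p in paras if p.strip()]


by_blank_line = split_paragraphs("Absatz eins.\n\nAbsatz zwei.\n\nAbsatz drei.")
by_sentence = split_paragraphs(
    "Wir nutzen Cookies. Sie koennen widersprechen. Details stehen unten."
)
```

Note that a text with exactly two blank-line-separated blocks also takes the sentence path, because the fallback triggers below three paragraphs.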
|
||||||
|
|
||||||
|
|
||||||
|
def _embed_sync(texts: list[str], timeout: float = 60.0,
                batch_size: int = 32) -> list[list[float]]:
    """Synchronous batch embed call (anchor localisation runs inside the
    sync HTML render, not in an async context)."""
    if not texts:
        return []
    out: list[list[float]] = []
    with httpx.Client(timeout=timeout) as client:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                r = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
            except Exception as e:
                logger.warning("Anchor embed sub-batch [%d-%d] failed: %s",
                               i, i + len(batch), e)
                vecs = []
            # Pad or truncate so that out[j] always belongs to texts[j], even
            # if the service returns too few or too many vectors.
            out.extend((list(vecs) + [[] for _ in batch])[:len(batch)])
    return out
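Downstream code pairs paragraphs and vectors by position, so the result list has to stay aligned with the input even when a sub-batch fails. The sketch below shows that padding idea without any network access; `embed_in_batches` and the fake backend are hypothetical stand-ins for the HTTP call.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Return exactly one vector per text; failed items become empty lists."""
    out: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        try:
            vecs = list(embed_batch(batch))
        except Exception:
            vecs = []
        # Pad with empty vectors, then cut to the batch length.
        out.extend((vecs + [[] for _ in batch])[:len(batch)])
    return out


def fake_backend(batch: List[str]) -> List[List[float]]:
    # Stand-in for the embedding service: fails for one specific input.
    if "boom" in batch:
        raise RuntimeError("simulated outage")
    return [[float(len(t))] for t in batch]


vectors = embed_in_batches(["a", "b", "boom", "d", "e"], fake_backend, batch_size=2)
```

The empty lists act as markers that a caller can skip, as the scoring loop does with `if not dv: continue`.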
|
||||||
|
|
||||||
|
|
||||||
|
def _cosine(a: list[float], b: list[float]) -> float:
|
||||||
|
if not a or not b or len(a) != len(b):
|
||||||
|
return 0.0
|
||||||
|
dot = sum(x * y for x, y in zip(a, b))
|
||||||
|
na = math.sqrt(sum(x * x for x in a))
|
||||||
|
nb = math.sqrt(sum(y * y for y in b))
|
||||||
|
if na == 0 or nb == 0:
|
||||||
|
return 0.0
|
||||||
|
return dot / (na * nb)
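A quick sanity check of the cosine helper on hand-made vectors. The function is repeated under the name `cosine` so the snippet runs standalone; the vectors are arbitrary.

```python
import math


def cosine(a, b) -> float:
    # Empty or mismatched vectors count as "no similarity".
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
mismatched = cosine([1.0, 2.0], [1.0, 2.0, 3.0])
failed_embed = cosine([], [1.0, 2.0])
```

Returning `0.0` for empty input is what lets a failed embedding (an empty list) drop out of the ranking without raising.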
|
||||||
|
|
||||||
|
|
||||||
|
def _doc_paragraphs_and_vectors(
|
||||||
|
doc_id: str, doc_text: str,
|
||||||
|
) -> tuple[list[str], list[list[float]]]:
|
||||||
|
"""Lazy-cache: Absaetze + Vektoren pro Doc-ID. Wird genau einmal pro
|
||||||
|
Doc und Run berechnet."""
|
||||||
|
cache = _get_cache()
|
||||||
|
if doc_id in cache:
|
||||||
|
return cache[doc_id]
|
||||||
|
|
||||||
|
paras = _split_paragraphs(doc_text)
|
||||||
|
if not paras:
|
||||||
|
cache[doc_id] = ([], [])
|
||||||
|
return cache[doc_id]
|
||||||
|
|
||||||
|
vecs = _embed_sync(paras)
|
||||||
|
cache[doc_id] = (paras, vecs)
|
||||||
|
return cache[doc_id]
|
||||||
|
|
||||||
|
|
||||||
|
def _keyword_fallback(fl: str, doc_text: str) -> dict | None:
    """Fallback when the embedding service fails: return the static position
    hint of the first anchor entry whose label occurs in the finding label."""
    # doc_text is currently unused; it is kept so the call sites stay unchanged.
    for label_partial, _query, fallback_hint in _ANCHOR_QUERIES:
        if _normalize(label_partial) in fl:
            return {
                "anchor_phrase": None,
                "position_hint": fallback_hint,
                "confidence": "low",
                "method": "fallback",
            }
    return None
|
||||||
|
|
||||||
|
|
||||||
|
def locate_anchor(
|
||||||
|
finding_label: str,
|
||||||
|
doc_text: str,
|
||||||
|
doc_id: str | None = None,
|
||||||
|
) -> dict | None:
    """Find the best-matching anchor paragraph in the doc text for a finding.

    Primary: embedding match (cosine similarity plus a small heading bonus).
    Fallback: the static position hint when the embedding service is not
    reachable or the best score stays below the threshold.

    `doc_id` is a cache key (e.g. doc_type or url). If None, it is derived
    from a hash of doc_text.
    """
|
||||||
|
if not doc_text or not finding_label:
|
||||||
|
return None
|
||||||
|
|
||||||
|
fl = _normalize(finding_label)
|
||||||
|
|
||||||
|
# Welche Anchor-Query matched dieses Finding?
|
||||||
|
query = None
|
||||||
|
fallback_hint = None
|
||||||
|
matched_label = None
|
||||||
|
for label_partial, q, fb in _ANCHOR_QUERIES:
|
||||||
|
if _normalize(label_partial) in fl:
|
||||||
|
query, fallback_hint, matched_label = q, fb, label_partial
|
||||||
|
break
|
||||||
|
if not query:
|
||||||
|
return None
|
||||||
|
|
||||||
|
doc_id = doc_id or f"doc-{hash(doc_text) & 0xffffffff:08x}"
|
||||||
|
|
||||||
|
# 1) Embedding-Match
|
||||||
|
paras, doc_vecs = _doc_paragraphs_and_vectors(doc_id, doc_text)
|
||||||
|
if not paras:
|
||||||
|
return None
|
||||||
|
|
||||||
|
embeddings_available = any(v for v in doc_vecs)
|
||||||
|
if not embeddings_available:
|
||||||
|
return _keyword_fallback(fl, doc_text)
|
||||||
|
|
||||||
|
try:
|
||||||
|
q_vec = _embed_sync([query])[0] if query else None
|
||||||
|
except Exception:
|
||||||
|
q_vec = None
|
||||||
|
|
||||||
|
if not q_vec:
|
||||||
|
return _keyword_fallback(fl, doc_text)
|
||||||
|
|
||||||
|
# Per-Absatz Score = cosine + Heading-Bonus
|
||||||
|
best_idx = -1
|
||||||
|
best_score = 0.0
|
||||||
|
for i, (p, dv) in enumerate(zip(paras, doc_vecs)):
|
||||||
|
if not dv:
|
||||||
|
continue
|
||||||
|
sim = _cosine(q_vec, dv)
|
||||||
|
# Heading-Bonus: kurze Absaetze + Markdown-Heading-Marker
|
||||||
|
if len(p.split()) <= 8 or p.strip().startswith("#"):
|
||||||
|
sim += 0.05
|
||||||
|
if sim > best_score:
|
||||||
|
best_score = sim
|
||||||
|
best_idx = i
|
||||||
|
|
||||||
|
# Konfidenz-Schwellen — kalibriert anhand BMW-Run
|
||||||
|
if best_idx < 0 or best_score < 0.40:
|
||||||
|
# Zu schwacher Match — Fallback verwenden
|
||||||
|
return {
|
||||||
|
"anchor_phrase": None,
|
||||||
|
"position_hint": fallback_hint,
|
||||||
|
"confidence": "low",
|
||||||
|
"score": round(best_score, 3) if best_idx >= 0 else 0,
|
||||||
|
"method": "embedding-no-match",
|
||||||
|
}
|
||||||
|
|
||||||
|
if best_score >= 0.62:
|
||||||
|
confidence = "high"
|
||||||
|
elif best_score >= 0.50:
|
||||||
|
confidence = "medium"
|
||||||
|
else:
|
||||||
|
confidence = "low"
|
||||||
|
|
||||||
|
anchor = paras[best_idx]
|
||||||
|
words = anchor.split()
|
||||||
|
snippet = " ".join(words[:30]) + ("…" if len(words) > 30 else "")
|
||||||
|
return {
|
||||||
|
"anchor_phrase": snippet,
|
||||||
|
"anchor_index": best_idx,
|
||||||
|
"total_paragraphs": len(paras),
|
||||||
|
"position_hint": f"Nach Absatz {best_idx + 1} von {len(paras)}: '{snippet}'",
|
||||||
|
"confidence": confidence,
|
||||||
|
"score": round(best_score, 3),
|
||||||
|
"method": "embedding",
|
||||||
|
}
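The score thresholds used above (0.40, 0.50, 0.62) translate into a small decision table. The sketch below isolates that mapping in a hypothetical helper `classify_score`; the thresholds are the ones from the code, the function itself is not part of the module.

```python
from __future__ import annotations


def classify_score(best_score: float) -> tuple[str, str]:
    """Map a best-paragraph score to (method, confidence)."""
    if best_score < 0.40:
        # Too weak: the caller falls back to the static position hint.
        return ("embedding-no-match", "low")
    if best_score >= 0.62:
        return ("embedding", "high")
    if best_score >= 0.50:
        return ("embedding", "medium")
    return ("embedding", "low")


bands = {s: classify_score(s) for s in (0.39, 0.40, 0.50, 0.62, 0.70)}
```

Since the heading bonus of 0.05 is added before the comparison, a short heading with raw cosine 0.57 already lands in the `high` band.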
|
||||||
|
|
||||||
|
|
||||||
|
def annotate_findings_with_anchors(
|
||||||
|
findings: Iterable[dict], doc_text: str, doc_id: str | None = None,
|
||||||
|
) -> list[dict]:
|
||||||
|
"""Pro Finding den Anchor suchen und das Dict um 'anchor' erweitern."""
|
||||||
|
out = []
|
||||||
|
for f in findings:
|
||||||
|
a = locate_anchor(f.get("label") or f.get("title") or "", doc_text, doc_id)
|
||||||
|
out.append({**f, "anchor": a})
|
||||||
|
return out
|
||||||
@@ -0,0 +1,353 @@
"""
Action recipes: one actionable instruction per finding type:
WHAT to do, WHY (legal basis), HOW to phrase it (concrete text block),
WHERE to insert it (hint at the doc section).

Without recipes a finding is only a diagnosis. With a recipe the customer
knows immediately which sentence belongs in which place.

Usage:
    from compliance.services.finding_action_recipes import recipe_for
    rec = recipe_for("no_cookies_listed")  # → dict with what/why/fix_text/where/example
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from typing import TypedDict
|
||||||
|
|
||||||
|
|
||||||
|
class ActionRecipe(TypedDict, total=False):
|
||||||
|
what: str # 1-Satz Diagnose
|
||||||
|
why: str # Rechtsgrundlage / Risiko
|
||||||
|
fix_text: str # konkreter Textbaustein zum Einfuegen
|
||||||
|
where: str # in welchem Doc-Abschnitt
|
||||||
|
example: str # echtes Anwendungsbeispiel
|
||||||
|
severity: str # 'critical' | 'high' | 'medium' | 'low'
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Vendor-/Cookie-Findings (im VVT-Block) ────────────────────────
|
||||||
|
|
||||||
|
VENDOR_FINDINGS: dict[str, ActionRecipe] = {
|
||||||
|
|
||||||
|
"no_cookies_listed": {
|
||||||
|
"what": "Anbieter ist erfasst, aber es sind keine konkreten Cookies "
|
||||||
|
"dokumentiert.",
|
||||||
|
"why": "Die DSK-Orientierungshilfe Telemedien 2024 fordert pro Anbieter "
|
||||||
|
"eine Auflistung aller gesetzten Cookies mit Name + Zweck + "
|
||||||
|
"Speicherdauer. 'Verwendet Cookies' ohne Details erfuellt "
|
||||||
|
"Art. 13 Abs. 1 lit. e DSGVO nicht.",
|
||||||
|
"fix_text": "Ergaenzen Sie pro Cookie eine Zeile mit folgenden Feldern:\n"
|
||||||
|
" • Cookie-Name (z.B. _ga, _fbp, NID)\n"
|
||||||
|
" • Setzender Anbieter (Firma + Sitzland)\n"
|
||||||
|
" • Zweck (1 Satz, z.B. 'Reichweitenmessung pseudonym')\n"
|
||||||
|
" • Speicherdauer (z.B. '2 Jahre', '24h', 'Session')",
|
||||||
|
"where": "Cookie-Richtlinie unter dem Abschnitt 'Kategorien' "
|
||||||
|
"(Notwendig / Marketing / Statistik / ...).",
|
||||||
|
"example": "_ga (Google Ireland Ltd.) — Statistik — eindeutige "
|
||||||
|
"Besucher-ID — Speicherdauer 2 Jahre",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"no_country": {
|
||||||
|
"what": "Anbieter-Sitzland ist nicht dokumentiert.",
|
||||||
|
"why": "Art. 30 Abs. 1 lit. d DSGVO fordert die Kategorien der Empfaenger "
|
||||||
|
"inkl. Sitz. Bei Drittland-Empfaengern (US, IN, CN ...) ist das "
|
||||||
|
"zusaetzlich Pflicht nach Art. 13 Abs. 1 lit. f DSGVO.",
|
||||||
|
"fix_text": "Fuegen Sie hinter dem Anbieternamen das Sitzland in "
|
||||||
|
"Klammern ein. Bei Drittland (ausserhalb EU/EWR) zusaetzlich "
|
||||||
|
"den Transfer-Mechanismus (SCC, DPF, Angemessenheitsbeschluss).",
|
||||||
|
"where": "VVT-Eintrag bei 'Empfaenger' oder im Cookie-Anbieter-Listing.",
|
||||||
|
"example": "Google Ireland Ltd., Dublin, Irland (EU). Bei US-Transfer: "
|
||||||
|
"'Google LLC, Mountain View, US — DPF-zertifiziert'.",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"no_privacy_url": {
|
||||||
|
"what": "Kein direkter Link zur Datenschutzerklaerung des Anbieters.",
        "why": "Transparenzgebot nach Art. 12 Abs. 1 i.V.m. Art. 13 Abs. 1 lit. e "
               "DSGVO: der Nutzer muss die Datenverarbeitung beim Drittanbieter "
               "eigenverantwortlich nachvollziehen koennen.",
|
||||||
|
"fix_text": "Ergaenzen Sie einen klickbaren Link zur Datenschutzerklaerung "
|
||||||
|
"des Anbieters direkt neben dem Anbieternamen.",
|
||||||
|
"where": "Cookie-Richtlinie pro Vendor-Eintrag. Idealerweise als "
|
||||||
|
"letzter Spalteneintrag oder Inline-Link.",
|
||||||
|
"example": "Google Analytics — Datenschutz: https://policies.google.com/privacy",
|
||||||
|
"severity": "medium",
|
||||||
|
},
|
||||||
|
|
||||||
|
"broken_privacy_url": {
|
||||||
|
"what": "Der angegebene Privacy-Link gibt einen Fehler zurueck "
|
||||||
|
"(404 / 403 / Timeout).",
        "why": "Ein toter Link erfuellt das Transparenzgebot (Art. 12 Abs. 1 "
               "DSGVO) nicht: die Transparenz-Pflicht laeuft ins Leere.",
|
||||||
|
"fix_text": "1. Pruefen Sie den Link aus einem normalen Browser. "
|
||||||
|
"Funktioniert er dort? → Anbieter blockt automatisierte Pruefer (kein Mangel).\n"
|
||||||
|
"2. Funktioniert er nicht? → Aktualisieren Sie den Link in Ihrer "
|
||||||
|
"Cookie-Richtlinie auf die heute gueltige Privacy-URL des Anbieters.",
|
||||||
|
"where": "Cookie-Richtlinie / Drittanbieter-Liste.",
|
||||||
|
"example": "Wenn Adobe-Privacy-Link 404 gibt, ersetzen durch "
|
||||||
|
"https://www.adobe.com/privacy/policy.html",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"no_opt_out_url": {
|
||||||
|
"what": "Kein Opt-Out-Link fuer diesen Anbieter dokumentiert.",
|
||||||
|
"why": "Art. 7 Abs. 3 DSGVO: der Widerruf der Einwilligung muss so "
|
||||||
|
"einfach wie die Erteilung sein. Pro Drittanbieter muss eine "
|
||||||
|
"Opt-Out-Moeglichkeit angeboten werden.",
|
||||||
|
"fix_text": "Recherchieren Sie den offiziellen Opt-Out-Link des "
|
||||||
|
"Anbieters und ergaenzen Sie ihn. Falls Ihr Cookie-Banner "
|
||||||
|
"ein 'Einstellungen aendern' anbietet, ist das oft "
|
||||||
|
"ausreichend — der Link sollte trotzdem als Backup "
|
||||||
|
"dokumentiert sein.",
|
||||||
|
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
|
||||||
|
"example": "Google Analytics Opt-Out: https://tools.google.com/dlpage/gaoptout",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"broken_opt_out": {
|
||||||
|
"what": "Der angegebene Opt-Out-Link funktioniert nicht "
|
||||||
|
"(404 / 403 / Timeout).",
|
||||||
|
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit ohne funktionierenden "
|
||||||
|
"Link ist nicht gegeben.",
|
||||||
|
"fix_text": "1. Pruefen Sie ob der Link aus einem Browser funktioniert. "
|
||||||
|
"403 kommt oft von Bot-Schutz und ist kein realer Mangel.\n"
|
||||||
|
"2. Bei echtem 404: aktualisieren Sie auf den heute gueltigen "
|
||||||
|
"Opt-Out-Link.\n"
|
||||||
|
"3. Alternativ verlinken Sie auf Ihren eigenen Cookie-Banner "
|
||||||
|
"'Einstellungen aendern'-Trigger.",
|
||||||
|
"where": "Cookie-Richtlinie pro Vendor-Eintrag.",
|
||||||
|
"example": "Wenn Criteo-Opt-Out 403 gibt, ist das oft Cloudflare-Schutz — "
|
||||||
|
"Link aus dem Browser klickbar → kein Mangel. Alternativ: "
|
||||||
|
"https://www.youronlinechoices.com/de/",
|
||||||
|
"severity": "medium",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Doc-check findings (in the DSE/Impressum/cookie check block) ───
|
||||||
|
|
||||||
|
DOC_CHECK_FINDINGS: dict[str, ActionRecipe] = {
|
||||||
|
|
||||||
|
"Auftragsverarbeiter erwaehnt": {
|
||||||
|
"what": "Auftragsverarbeiter (Hosting, CRM, Newsletter) sind nicht "
|
||||||
|
"explizit erwaehnt + es fehlt Hinweis auf AVVs nach Art. 28 DSGVO.",
|
||||||
|
"why": "Art. 28 DSGVO Auftragsverarbeitung erfordert dokumentierte AVVs. "
|
||||||
|
"Art. 13(1)(e) DSGVO erfordert die Nennung der Empfaenger-"
|
||||||
|
"Kategorien. Fehlt der Hinweis = klassischer Pruefpunkt der "
|
||||||
|
"Aufsichtsbehoerden.",
|
||||||
|
"fix_text": "Wir setzen sorgfaeltig ausgewaehlte Auftragsverarbeiter "
|
||||||
|
"(z.B. fuer Hosting, Webanalyse, Customer Service) ein. Mit "
|
||||||
|
"allen Auftragsverarbeitern haben wir Vertraege zur "
|
||||||
|
"Auftragsverarbeitung nach Art. 28 DSGVO geschlossen. Die "
|
||||||
|
"Auftragsverarbeiter handeln ausschliesslich auf unsere "
|
||||||
|
"Weisung und sind vertraglich zu angemessenen technischen "
|
||||||
|
"und organisatorischen Massnahmen verpflichtet.",
|
||||||
|
"where": "Datenschutzerklaerung im Abschnitt 'Empfaenger' oder "
|
||||||
|
"'Datenuebermittlung'. Direkt nach der Aufzaehlung der "
|
||||||
|
"Empfaenger-Kategorien.",
|
||||||
|
"example": "Empfaenger: BMW Niederlassungen, Vertragshaendler, "
|
||||||
|
"Auftragsverarbeiter (Cloud-Hosting AWS, CRM Salesforce, "
|
||||||
|
"Webanalyse Adobe Analytics — mit allen sind AVVs nach "
|
||||||
|
"Art. 28 DSGVO geschlossen).",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Automatisierte Entscheidungen / Profiling": {
|
||||||
|
"what": "Keine Aussage zu automatisierten Einzelentscheidungen "
|
||||||
|
"oder Profiling nach Art. 22 DSGVO.",
|
||||||
|
"why": "Art. 13 Abs. 2 lit. f DSGVO: bei automatisierten "
|
||||||
|
"Einzelentscheidungen muessen Logik, Tragweite und Auswirkungen "
|
||||||
|
"erklaert werden. Bei KEINEM Profiling muss das explizit "
|
||||||
|
"verneint werden — sonst lassen Aufsichtsbehoerden Unsicherheit "
|
||||||
|
"offen.",
|
||||||
|
"fix_text": "Variante A (kein Profiling):\n"
|
||||||
|
" 'Es findet keine automatisierte Entscheidungsfindung "
|
||||||
|
"im Sinne des Art. 22 DSGVO statt. Sofern wir Profiling "
|
||||||
|
"zur Anzeige personalisierter Inhalte einsetzen, erfolgt "
|
||||||
|
"dies ausschliesslich auf Basis Ihrer Einwilligung und "
|
||||||
|
"wird im Abschnitt [X] erlaeutert.'\n\n"
|
||||||
|
"Variante B (Profiling vorhanden, z.B. fuer Werbung):\n"
|
||||||
|
" 'Wir nutzen Profiling zur Anzeige personalisierter "
|
||||||
|
"Werbung. Die Logik basiert auf [Klick-Historie / "
|
||||||
|
"Besuchsverhalten / Praeferenzen]. Tragweite: "
|
||||||
|
"Anpassung der angezeigten Anzeigen. Auswirkung: keine "
|
||||||
|
"rechtlichen oder erheblichen Auswirkungen — Sie koennen "
|
||||||
|
"jederzeit widersprechen unter [Link/Kontakt].'",
|
||||||
|
"where": "Datenschutzerklaerung am Ende des Abschnitts "
|
||||||
|
"'Betroffenenrechte' oder als eigener Absatz unter "
|
||||||
|
"'Automatisierte Entscheidungen'.",
|
||||||
|
"example": "Standardformulierung Variante A — falls Sie KEIN Profiling "
|
||||||
|
"betreiben, ist das der sichere Default-Text.",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Konkrete Aufsichtsbehoerde benannt": {
|
||||||
|
"what": "Aufsichtsbehoerde wird nicht namentlich genannt.",
|
||||||
|
"why": "Art. 13(2)(d) DSGVO: das Beschwerderecht muss konkret "
|
||||||
|
"kommuniziert werden — nicht 'die zustaendige Behoerde', sondern "
|
||||||
|
"Name + Anschrift + Website.",
|
||||||
|
"fix_text": "Sie haben das Recht, sich bei einer Datenschutz-"
|
||||||
|
"Aufsichtsbehoerde zu beschweren. Zustaendig ist:\n\n"
|
||||||
|
" [Aufsichtsbehoerde mit Namen + Strasse + PLZ + Ort + Web]\n\n"
|
||||||
|
"Beispiel: 'Bayerisches Landesamt fuer Datenschutzaufsicht "
|
||||||
|
"(BayLDA), Promenade 18, 91522 Ansbach, www.lda.bayern.de'",
|
||||||
|
"where": "Datenschutzerklaerung im Abschnitt 'Betroffenenrechte' oder "
|
||||||
|
"'Beschwerderecht'.",
|
||||||
|
"example": "Fuer BMW AG (Sitz Muenchen, Bayern): BayLDA, Promenade 18, "
|
||||||
|
"91522 Ansbach, www.lda.bayern.de",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Angemessenheitsbeschluss der Kommission": {
|
||||||
|
"what": "Drittlandtransfer erwaehnt, aber kein Hinweis auf den "
|
||||||
|
"konkreten Angemessenheitsbeschluss / DPF / SCC.",
|
||||||
|
"why": "Art. 13(1)(f) DSGVO: bei Drittlandtransfer muss der "
|
||||||
|
"Transfer-Mechanismus benannt sein (Angemessenheitsbeschluss, "
|
||||||
|
"Standardvertragsklauseln, BCR, Ausnahmen nach Art. 49).",
|
||||||
|
"fix_text": "Fuer Datenuebermittlungen in die USA stuetzen wir uns auf "
|
||||||
|
"den Angemessenheitsbeschluss der EU-Kommission vom "
|
||||||
|
"10.07.2023 (EU-US Data Privacy Framework / DPF). Sofern "
|
||||||
|
"der Empfaenger DPF-zertifiziert ist, ist die Uebermittlung "
|
||||||
|
"rechtlich abgesichert. Bei nicht-zertifizierten Empfaengern "
|
||||||
|
"ergaenzen wir EU-Standardvertragsklauseln (SCC) gemaess "
|
||||||
|
"Durchfuehrungsbeschluss 2021/914.",
|
||||||
|
"where": "Datenschutzerklaerung im Abschnitt 'Drittlandtransfer' oder "
|
||||||
|
"'Internationale Datenuebermittlung'.",
|
||||||
|
"example": "Konkret: Google LLC (USA) ist DPF-zertifiziert "
|
||||||
|
"(Zertifikat einsehbar unter dataprivacyframework.gov).",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Anschrift des Verantwortlichen": {
|
||||||
|
"what": "In der Cookie-Richtlinie fehlt die vollstaendige Anschrift.",
|
||||||
|
"why": "Art. 13(1)(a) DSGVO: der Verantwortliche muss eindeutig "
|
||||||
|
"identifizierbar sein. Cookie-Richtlinie + DSE muessen "
|
||||||
|
"konsistente Angaben enthalten.",
|
||||||
|
"fix_text": "Verantwortlich fuer die Datenverarbeitung im Sinne der "
|
||||||
|
"DSGVO ist:\n [Firmenname]\n [Strasse + Hausnummer]\n "
|
||||||
|
"[PLZ + Ort]\n [Land]\n E-Mail: [...]",
|
||||||
|
"where": "Cookie-Richtlinie am Anfang ODER Verweis-Link zur DSE.",
|
||||||
|
"example": "Bayerische Motoren Werke Aktiengesellschaft, Petuelring 130, "
|
||||||
|
"80809 Muenchen, Deutschland",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Konkrete Cookie-Namen aufgelistet": {
|
||||||
|
"what": "Cookies werden allgemein erwaehnt aber ohne konkrete Namen + "
|
||||||
|
"Speicherdauer.",
|
||||||
|
"why": "DSK-Orientierungshilfe Telemedien: Auflistung jedes einzelnen "
|
||||||
|
"Cookies mit Name. Generische Aussagen ('wir nutzen "
|
||||||
|
"Werbe-Cookies') sind unzureichend.",
|
||||||
|
"fix_text": "Erweitern Sie die Cookie-Tabelle um die Spalten:\n"
|
||||||
|
" Name | Anbieter | Zweck | Speicherdauer\n\n"
|
||||||
|
"Browser-Devtools (Application > Cookies) zeigt die "
|
||||||
|
"tatsaechlich gesetzten Namen — bitte Cookie-Liste "
|
||||||
|
"regelmaessig synchronisieren.",
|
||||||
|
"where": "Cookie-Richtlinie im Abschnitt 'Verwendete Cookies'.",
|
||||||
|
"example": "_ga | Google Ireland | Statistik (Besucher-ID) | 2 Jahre\n"
|
||||||
|
"_fbp | Meta Platforms IE | Werbung (Browser-ID) | 90 Tage",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Konkrete Speicherdauern pro Cookie": {
|
||||||
|
"what": "Speicherdauer nur pauschal oder als generischer Bereich.",
|
||||||
|
"why": "Art. 13(2)(a) DSGVO: konkrete Speicherdauer oder Kriterien "
|
||||||
|
"fuer die Festlegung. DSK fordert Speicherdauer PRO Cookie.",
|
||||||
|
"fix_text": "Pro Cookie eine konkrete Zeitangabe in der Cookie-Tabelle "
|
||||||
|
"ergaenzen: 'Session', '24 Stunden', '90 Tage', '2 Jahre'.",
|
||||||
|
"where": "Cookie-Richtlinie in der Cookie-Tabelle.",
|
||||||
|
"example": "JSESSIONID: Session — _ga: 2 Jahre — _fbp: 90 Tage",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Opt-Out-Links pro Drittanbieter": {
|
||||||
|
"what": "Pro Drittanbieter ist kein direkter Opt-Out-Link angegeben.",
|
||||||
|
"why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit. EuGH Planet49 "
|
||||||
|
"(C-673/17) bestaetigt: Widerruf muss so einfach wie Einwilligung sein.",
|
||||||
|
"fix_text": "Pro Drittanbieter eine Spalte 'Opt-Out' ergaenzen mit "
|
||||||
|
"direktem Link. Alternativ: zentralen 'Cookie-"
|
||||||
|
"Einstellungen aendern'-Button im Footer der Webseite + "
|
||||||
|
"Hinweis darauf in der Cookie-Richtlinie.",
|
||||||
|
"where": "Cookie-Richtlinie pro Vendor-Eintrag oder als zentraler "
|
||||||
|
"Abschnitt 'Wie kann ich widersprechen?'.",
|
||||||
|
"example": "Google Analytics: https://tools.google.com/dlpage/gaoptout\n"
|
||||||
|
"Meta Pixel: ueber Facebook-Konto-Einstellungen",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Privacy-Policy-Links pro Drittanbieter": {
|
||||||
|
"what": "Pro Drittanbieter ist kein direkter Datenschutz-Link gesetzt.",
|
||||||
|
"why": "Art. 13(2)(f) DSGVO Transparenzgebot — Nutzer muss die "
|
||||||
|
"Datenverarbeitung beim Drittanbieter eigenverantwortlich "
|
||||||
|
"nachvollziehen koennen.",
|
||||||
|
"fix_text": "Pro Drittanbieter Link auf dessen Datenschutzerklaerung "
|
||||||
|
"ergaenzen. Tabelle 'Anbieter | Datenschutz-Link'.",
|
||||||
|
"where": "Cookie-Richtlinie im Drittanbieter-Listing.",
|
||||||
|
"example": "Adobe Analytics: https://www.adobe.com/privacy/policy.html",
|
||||||
|
"severity": "medium",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Rechtswidriger Haftungsausschluss fuer Links": {
|
||||||
|
"what": "Klassischer Link-Disclaimer ('Wir distanzieren uns von verlinkten "
|
||||||
|
"Inhalten') ist im Impressum.",
|
||||||
|
"why": "BGH I ZR 317/01: solche Disclaimer sind rechtlich wirkungslos. "
|
||||||
|
"Sie befreien NICHT von der Stoererhaftung und koennen sogar "
|
||||||
|
"den gegenteiligen Effekt haben (Anerkennung der eigenen "
|
||||||
|
"Pruefpflicht).",
|
||||||
|
"fix_text": "Entfernen Sie pauschale Link-Disclaimer vollstaendig aus "
|
||||||
|
"dem Impressum. Bei Bedarf ein knapper, zutreffender Satz:\n"
|
||||||
|
" 'Fuer den Inhalt verlinkter externer Webseiten ist "
|
||||||
|
"ausschliesslich deren Betreiber verantwortlich.'",
|
||||||
|
"where": "Impressum am Ende des Dokuments.",
|
||||||
|
"example": "Statt 'Wir distanzieren uns ausdruecklich von allen "
|
||||||
|
"Inhalten verlinkter Seiten' — einfach nichts schreiben.",
|
||||||
|
"severity": "low",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Verbraucherstreitbeilegung / OS-Plattform": {
|
||||||
|
"what": "Kein Link zur EU-OS-Plattform bzw. keine Aussage zur "
|
||||||
|
"Streitbeilegung.",
|
||||||
|
"why": "Art. 14 EU-VO 524/2013 (B2C-Anbieter, Online-Verkauf): "
|
||||||
|
"klickbarer Link auf https://ec.europa.eu/consumers/odr "
|
||||||
|
"PFLICHT. §36 VSBG: Aussage ob teilnehmend oder nicht.",
|
||||||
|
"fix_text": "Die EU-Kommission stellt eine Plattform zur Online-"
|
||||||
|
"Streitbeilegung (OS) bereit, die Sie unter "
|
||||||
|
"<a href='https://ec.europa.eu/consumers/odr'>"
|
||||||
|
"https://ec.europa.eu/consumers/odr</a> erreichen.\n\n"
|
||||||
|
"Wir sind nicht bereit oder verpflichtet, an "
|
||||||
|
"Streitbeilegungsverfahren vor einer "
|
||||||
|
"Verbraucherschlichtungsstelle teilzunehmen.",
|
||||||
|
"where": "Impressum am Ende, eigener Abschnitt 'Streitbeilegung'.",
|
||||||
|
"example": "Standardformulierung anwendbar fuer alle B2C-Anbieter ohne "
|
||||||
|
"ODR-Teilnahme.",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
|
||||||
|
"Name der vertretungsberechtigten Person": {
|
||||||
|
"what": "Vertretungsberechtigte Person ist nicht namentlich mit "
|
||||||
|
"Funktionsbezeichnung genannt.",
|
||||||
|
"why": "§5 Abs. 1 Nr. 1 DDG (vormals TMG): bei juristischen Personen sind die "
|
||||||
|
"Vertretungsberechtigten namentlich zu nennen.",
|
||||||
|
"fix_text": "Ergaenzen Sie nach der Firmenbezeichnung:\n"
|
||||||
|
" 'Vertreten durch: [Vorstand / Geschaeftsfuehrung] "
|
||||||
|
"[Vorname Nachname]'",
|
||||||
|
"where": "Impressum direkt nach Firmenname + Anschrift.",
|
||||||
|
"example": "Vertreten durch: Vorstandsvorsitzender Oliver Zipse",
|
||||||
|
"severity": "high",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def recipe_for(finding_key: str) -> ActionRecipe | None:
|
||||||
|
"""Look up a recipe by finding key. Vendor and doc findings share one lookup."""
|
||||||
|
if finding_key in VENDOR_FINDINGS:
|
||||||
|
return VENDOR_FINDINGS[finding_key]
|
||||||
|
if finding_key in DOC_CHECK_FINDINGS:
|
||||||
|
return DOC_CHECK_FINDINGS[finding_key]
|
||||||
|
# Fuzzy match against doc findings (the label can vary)
|
||||||
|
fk = finding_key.lower()
|
||||||
|
for k, v in DOC_CHECK_FINDINGS.items():
|
||||||
|
if fk and (k.lower() in fk or fk in k.lower()):  # empty key would match every recipe
|
||||||
|
return v
|
||||||
|
return None
|
||||||
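The lookup in `recipe_for` (exact match first, then a case-insensitive substring match) can be sketched in isolation as follows. This is a minimal sketch, not the committed code: `VENDOR_RECIPES`, `DOC_RECIPES` and `lookup_recipe` are hypothetical stand-ins, and the explicit guard against an empty key is an assumption about the intended behaviour.

```python
from __future__ import annotations

# Hypothetical stand-ins for the real recipe tables; only the keys matter here.
VENDOR_RECIPES = {"broken_opt_out": {"severity": "medium"}}
DOC_RECIPES = {"Konkrete Aufsichtsbehoerde benannt": {"severity": "high"}}


def lookup_recipe(finding_key: str) -> dict | None:
    """Exact match first, then a case-insensitive substring match on doc recipes."""
    for table in (VENDOR_RECIPES, DOC_RECIPES):
        if finding_key in table:
            return table[finding_key]
    needle = finding_key.strip().lower()
    if not needle:
        # An empty string is a substring of every key, so bail out early.
        return None
    for key, recipe in DOC_RECIPES.items():
        k = key.lower()
        if k in needle or needle in k:
            return recipe
    return None
```

Because the substring test runs in both directions, a shortened label such as "aufsichtsbehoerde benannt" still resolves to the full recipe key, while an unrelated label falls through to `None`.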
@@ -0,0 +1,309 @@
|
|||||||
|
"""
|
||||||
|
MC Embedding Match — semantic fallback for the regex-based doc_check.
|
||||||
|
|
||||||
|
The Sonnet classifier filtered MCs to `check_type='text'` (matchable
|
||||||
|
against doc text). But the regex matcher is still too strict — BMW
|
||||||
|
writes "Speicherdauer 2 Jahre", the MC pattern expects
|
||||||
|
"\\d+\\s*(Tag|Jahr)". We catch these via BGE-M3 embeddings + cosine
|
||||||
|
similarity:
|
||||||
|
|
||||||
|
1. Embed the MC's check_question (once, cached in sidecar)
|
||||||
|
2. Embed the doc text in 50-word chunks
|
||||||
|
3. max over chunks of cosine(MC, chunk) ≥ threshold → MC passes via "semantic"
|
||||||
|
|
||||||
|
This recovers ~50% of failed MCs at BMW-scale (estimated).
|
||||||
|
|
||||||
|
Embeddings come from bp-core-embedding-service (BGE-M3, 1024-dim,
|
||||||
|
multilingual). Sidecar SQLite stores 1024 × 4 bytes = 4KB per MC.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import sqlite3
|
||||||
|
import struct
|
||||||
|
from typing import Iterable
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
|
||||||
|
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||||
|
DIM = 1024 # BGE-M3
|
||||||
|
SIMILARITY_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD", "0.55"))
|
||||||
|
CHUNK_SIZE_WORDS = 50
|
||||||
|
CHUNK_STRIDE = 30 # overlap so multi-sentence MCs aren't cut
|
||||||
|
|
||||||
|
# Short mandatory fields (Impressum: HRB number, VAT ID, address) get lost in
|
||||||
|
# 50-word chunks. We ADDITIONALLY scan the doc with finer 15-word windows
|
||||||
|
# and a looser threshold for Impressum/AVV types.
|
||||||
|
SHORT_FIELD_CHUNK_WORDS = 15
|
||||||
|
SHORT_FIELD_STRIDE = 8
|
||||||
|
SHORT_FIELD_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD_SHORT", "0.50"))
|
||||||
|
SHORT_FIELD_DOC_TYPES = {"impressum", "avv"}
|
||||||
|
|
||||||
|
# Doc-type-specific threshold overrides, calibrated on the BMW v7 run:
|
||||||
|
# at 0.55, DSE and cookie docs sat systematically at 93% (over-firing, because
|
||||||
|
# 8000-word texts vaguely match everything). 0.60 brings that to the real ~80%.
|
||||||
|
# Impressum has only 6 real MCs plus short-field rescue, so 0.50 is fine.
|
||||||
|
THRESHOLD_OVERRIDE = {
|
||||||
|
"impressum": 0.50,
|
||||||
|
"avv": 0.55,
|
||||||
|
"dse": 0.60,
|
||||||
|
"cookie": 0.60,
|
||||||
|
"widerruf": 0.58,
|
||||||
|
"loeschkonzept": 0.55,
|
||||||
|
"dsfa": 0.55,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _ensure_schema() -> None:
|
||||||
|
"""Add embedding column to mc_classification if not present."""
|
||||||
|
try:
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
|
||||||
|
if "embedding" not in cols:
|
||||||
|
c.execute("ALTER TABLE mc_classification ADD COLUMN embedding BLOB")
|
||||||
|
logger.info("Added embedding column to mc_classification")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("Embedding schema migration skipped: %s", e)
|
||||||
|
|
||||||
|
|
||||||
|
def _vec_to_blob(v: list[float]) -> bytes:
|
||||||
|
return struct.pack(f"{len(v)}f", *v)
|
||||||
|
|
||||||
|
|
||||||
|
def _blob_to_vec(b: bytes) -> list[float]:
|
||||||
|
return list(struct.unpack(f"{len(b)//4}f", b))
|
||||||
|
|
||||||
|
|
||||||
|
EMBED_BATCH_SIZE = 32
|
||||||
|
|
||||||
|
|
||||||
|
async def _embed_texts(texts: list[str], timeout: float = 120.0) -> list[list[float]]:
|
||||||
|
"""Call the central embedding-service in batches; returns one vector per input.
|
||||||
|
|
||||||
|
BGE-M3 hangs / times out on >100 inputs at once on a CPU-only host.
|
||||||
|
We split the inputs into batches of 32 and collect the results.
|
||||||
|
"""
|
||||||
|
if not texts:
|
||||||
|
return []
|
||||||
|
out: list[list[float]] = []
|
||||||
|
async with httpx.AsyncClient(timeout=timeout) as client:
|
||||||
|
for i in range(0, len(texts), EMBED_BATCH_SIZE):
|
||||||
|
batch = texts[i:i + EMBED_BATCH_SIZE]
|
||||||
|
try:
|
||||||
|
r = await client.post(
|
||||||
|
f"{EMBEDDING_URL}/embed", json={"texts": batch},
|
||||||
|
)
|
||||||
|
r.raise_for_status()
|
||||||
|
vecs = r.json().get("embeddings") or []
|
||||||
|
out.extend(vecs)
|
||||||
|
except httpx.HTTPError as e:
|
||||||
|
logger.warning("Embed sub-batch [%d-%d] failed: %s %s",
|
||||||
|
i, i + len(batch), type(e).__name__, e)
|
||||||
|
# Pad with empty vectors so caller can still align by index
|
||||||
|
out.extend([[] for _ in batch])
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
async def ensure_mc_embeddings(batch_size: int = 64, force: bool = False) -> int:
|
||||||
|
"""One-shot: embed every text-MC missing an embedding. Returns count.
|
||||||
|
|
||||||
|
Embeds the title + (rough) check_question for each MC to give the
|
||||||
|
BGE-M3 enough context. Title alone is too terse for the model to
|
||||||
|
discriminate against full-paragraph doc text.
|
||||||
|
|
||||||
|
Idempotent — only fills NULL rows unless force=True. Safe to call on
|
||||||
|
every run.
|
||||||
|
"""
|
||||||
|
_ensure_schema()
|
||||||
|
# Pull check_question from the PG source table once per call (needs
|
||||||
|
# context that's not in the sidecar)
|
||||||
|
try:
|
||||||
|
import psycopg2
|
||||||
|
pg = psycopg2.connect(os.environ["DATABASE_URL"])
|
||||||
|
with pg.cursor() as c:
|
||||||
|
c.execute("SELECT control_id, doc_type, title, check_question "
|
||||||
|
"FROM compliance.doc_check_controls")
|
||||||
|
pg_rows = c.fetchall()
|
||||||
|
pg.close()
|
||||||
|
pg_lookup = {(r[0], r[1] or ""): (r[2] or "", r[3] or "") for r in pg_rows}
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("ensure_mc_embeddings PG load failed: %s", e)
|
||||||
|
pg_lookup = {}
|
||||||
|
|
||||||
|
try:
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
where = ("WHERE check_type = 'text'" + ("" if force else " AND embedding IS NULL"))
|
||||||
|
rows = c.execute(
|
||||||
|
f"SELECT control_id, doc_type, title FROM mc_classification {where}"
|
||||||
|
).fetchall()
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("ensure_mc_embeddings query failed: %s", e)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
if not rows:
|
||||||
|
return 0
|
||||||
|
|
||||||
|
logger.info("Embedding %d text-MCs (force=%s) via %s ...",
|
||||||
|
len(rows), force, EMBEDDING_URL)
|
||||||
|
done = 0
|
||||||
|
for i in range(0, len(rows), batch_size):
|
||||||
|
batch = rows[i:i + batch_size]
|
||||||
|
# Compose "title. check_question" so the embedding captures both
|
||||||
|
# the topic (title) and the concrete check phrasing (question).
|
||||||
|
# That helps BMW's actual policy language land in the same vector
|
||||||
|
# neighbourhood as our control wording.
|
||||||
|
texts: list[str] = []
|
||||||
|
for cid, dt, t in batch:
|
||||||
|
title_text, question = pg_lookup.get((cid, dt or ""), (t or "", ""))
|
||||||
|
combined = f"{title_text}. {question}".strip()
|
||||||
|
texts.append(combined[:600])
|
||||||
|
try:
|
||||||
|
embs = await _embed_texts(texts)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("Embed batch failed (i=%d): %s", i, e)
|
||||||
|
continue
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
for (cid, dt, _t), vec in zip(batch, embs):
|
||||||
|
if not vec or len(vec) != DIM:
|
||||||
|
continue
|
||||||
|
c.execute(
|
||||||
|
"UPDATE mc_classification SET embedding = ? "
|
||||||
|
"WHERE control_id = ? AND doc_type = ?",
|
||||||
|
(_vec_to_blob(vec), cid, dt),
|
||||||
|
)
|
||||||
|
c.commit()
|
||||||
|
done += sum(1 for v in embs if v and len(v) == DIM)  # count only vectors actually stored
|
||||||
|
logger.info("ensure_mc_embeddings: filled %d/%d", done, len(rows))
|
||||||
|
return done
|
||||||
|
|
||||||
|
|
||||||
|
def _chunk_text(text: str, size: int = CHUNK_SIZE_WORDS,
|
||||||
|
stride: int = CHUNK_STRIDE) -> list[str]:
|
||||||
|
"""Sliding-window chunking — overlap helps catch MCs that span 2 sentences."""
|
||||||
|
words = re.findall(r"\S+", text or "")
|
||||||
|
if len(words) <= size:
|
||||||
|
return [" ".join(words)] if words else []
|
||||||
|
out: list[str] = []
|
||||||
|
i = 0
|
||||||
|
while i < len(words):
|
||||||
|
out.append(" ".join(words[i:i + size]))
|
||||||
|
i += stride
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def _cosine(a: list[float], b: list[float]) -> float:
|
||||||
|
"""Plain Python cosine — fast enough for our scale, no numpy import."""
|
||||||
|
if not a or not b or len(a) != len(b):
|
||||||
|
return 0.0
|
||||||
|
dot = sum(x * y for x, y in zip(a, b))
|
||||||
|
na = math.sqrt(sum(x * x for x in a))
|
||||||
|
nb = math.sqrt(sum(y * y for y in b))
|
||||||
|
if na == 0 or nb == 0:
|
||||||
|
return 0.0
|
||||||
|
return dot / (na * nb)
|
||||||
|
|
||||||
|
|
||||||
|
async def embedding_match(
|
||||||
|
doc_text: str,
|
||||||
|
mc_records: Iterable[dict],
|
||||||
|
doc_type: str | None = None,
|
||||||
|
threshold: float | None = None,
|
||||||
|
) -> set[str]:
|
||||||
|
"""Return the subset of MC control_ids that semantically match doc_text.
|
||||||
|
|
||||||
|
For Impressum/AVV-types we ADDITIONALLY scan the doc with smaller
|
||||||
|
15-word windows and a looser threshold so that short Pflichtfelder
|
||||||
|
(HRB, USt-IdNr, postal address) land in their own chunk and aren't
|
||||||
|
diluted by 50-word neighbourhoods of unrelated text.
|
||||||
|
"""
|
||||||
|
if not doc_text or not mc_records:
|
||||||
|
return set()
|
||||||
|
candidates = list(mc_records)
|
||||||
|
if not candidates:
|
||||||
|
return set()
|
||||||
|
|
||||||
|
cid_set = {c.get("control_id") for c in candidates if c.get("control_id")}
|
||||||
|
if not cid_set:
|
||||||
|
return set()
|
||||||
|
|
||||||
|
try:
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
placeholders = ",".join("?" * len(cid_set))
|
||||||
|
q = ("SELECT control_id, embedding FROM mc_classification "
|
||||||
|
f"WHERE control_id IN ({placeholders}) "
|
||||||
|
"AND check_type='text' AND embedding IS NOT NULL")
|
||||||
|
params = list(cid_set)
|
||||||
|
if doc_type:
|
||||||
|
q += " AND doc_type = ?"
|
||||||
|
params.append(doc_type)
|
||||||
|
rows = c.execute(q, params).fetchall()
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("embedding lookup failed: %s", e)
|
||||||
|
return set()
|
||||||
|
if not rows:
|
||||||
|
return set()
|
||||||
|
mc_embeddings = {cid: _blob_to_vec(blob) for cid, blob in rows}
|
||||||
|
|
||||||
|
effective_threshold = threshold if threshold is not None else THRESHOLD_OVERRIDE.get(
|
||||||
|
(doc_type or "").lower(), SIMILARITY_THRESHOLD)
|
||||||
|
|
||||||
|
chunks = _chunk_text(doc_text)
|
||||||
|
if not chunks:
|
||||||
|
return set()
|
||||||
|
try:
|
||||||
|
chunk_vecs = await _embed_texts(chunks)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("doc chunk embedding failed: %s %s",
|
||||||
|
type(e).__name__, e or "(empty msg)", exc_info=True)
|
||||||
|
return set()
|
||||||
|
# Filter empty vectors (failed sub-batches return [] placeholders)
|
||||||
|
chunk_vecs = [v for v in chunk_vecs if v and len(v) == DIM]
|
||||||
|
if not chunk_vecs:
|
||||||
|
logger.warning("doc chunk embedding: no usable vectors (all batches failed)")
|
||||||
|
return set()
|
||||||
|
|
||||||
|
matched: set[str] = set()
|
||||||
|
for cid, mc_vec in mc_embeddings.items():
|
||||||
|
best = max((_cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
|
||||||
|
if best >= effective_threshold:
|
||||||
|
matched.add(cid)
|
||||||
|
|
||||||
|
# Short-field rescue pass for Impressum-type docs: small windows +
|
||||||
|
# looser threshold catch one-line Pflichtfelder that 50-word chunks
|
||||||
|
# dilute (HRB-Nr, USt-IdNr, postal address). Only runs for MCs not
|
||||||
|
# yet matched in the main pass.
|
||||||
|
if (doc_type or "").lower() in SHORT_FIELD_DOC_TYPES:
|
||||||
|
unmatched = {cid: vec for cid, vec in mc_embeddings.items() if cid not in matched}
|
||||||
|
if unmatched:
|
||||||
|
short_chunks = _chunk_text(doc_text, size=SHORT_FIELD_CHUNK_WORDS,
|
||||||
|
stride=SHORT_FIELD_STRIDE)
|
||||||
|
try:
|
||||||
|
short_vecs = await _embed_texts(short_chunks)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("short-chunk embedding failed: %s", e)
|
||||||
|
short_vecs = []
|
||||||
|
if short_vecs:
|
||||||
|
short_passes = 0
|
||||||
|
for cid, mc_vec in unmatched.items():
|
||||||
|
best = max((_cosine(mc_vec, cv) for cv in short_vecs), default=0.0)
|
||||||
|
if best >= SHORT_FIELD_THRESHOLD:
|
||||||
|
matched.add(cid)
|
||||||
|
short_passes += 1
|
||||||
|
if short_passes:
|
||||||
|
logger.info(
|
||||||
|
"embedding short-field rescue for %s: +%d MCs (threshold %.2f, %d chunks)",
|
||||||
|
doc_type, short_passes, SHORT_FIELD_THRESHOLD, len(short_chunks),
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"embedding match for %s: %d/%d MCs passed semantic threshold (main=%.2f)",
|
||||||
|
doc_type or "?", len(matched), len(mc_embeddings), effective_threshold,
|
||||||
|
)
|
||||||
|
return matched
|
||||||
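The core of the matcher above (sliding-window chunking, best cosine over all chunks, and float32 blob storage) can be sketched without the embedding service or the sidecar database. This is a minimal sketch under stated assumptions: the helper names `chunk_words`, `cosine`, `passes`, `to_blob` and `from_blob` are hypothetical, and the vectors in the usage note are toy 2-dimensional values rather than real BGE-M3 output.

```python
from __future__ import annotations

import math
import struct


def chunk_words(text: str, size: int = 50, stride: int = 30) -> list[str]:
    """Overlapping word windows; a short text yields a single chunk."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(0, len(words), stride)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; mismatched or zero vectors score 0.0."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def passes(mc_vec: list[float], chunk_vecs: list[list[float]], threshold: float) -> bool:
    """An MC passes when its best-matching chunk reaches the threshold."""
    best = max((cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
    return best >= threshold


def to_blob(vec: list[float]) -> bytes:
    """Pack as float32, 4 bytes per dimension."""
    return struct.pack(f"{len(vec)}f", *vec)


def from_blob(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))
```

With a 70-word text and the default window, `chunk_words` returns three chunks starting at words 0, 30 and 60; the trailing chunks are shorter than 50 words. Since the blob stores float32, values read back from it are only bit-identical to the input when they are exactly representable in float32.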
@@ -101,11 +101,36 @@ def build_scorecard(check_results: list[dict]) -> dict:
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
_DEDUP_KEYWORDS = [
|
||||||
|
"einfache sprache", "verstaendliche sprache", "verständliche sprache",
|
||||||
|
"klare sprache", "einwilligungstexte", "einwilligungsaufforderung",
|
||||||
|
"einwilligungserklaerung", "einwilligungserklärung",
|
||||||
|
"mehrdeutige", "verstaendliche form", "verständliche form",
|
||||||
|
"fachbegriffe erklaeren", "fachbegriffe erklären",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _dedup_key(label: str) -> str:
|
||||||
|
"""Cluster label to a stable dedup-key: if it contains one of the
|
||||||
|
well-known repetitive language/consent-prompt concepts,
|
||||||
|
collapse them all to that single concept. Otherwise return original."""
|
||||||
|
l = (label or "").lower()
|
||||||
|
for kw in _DEDUP_KEYWORDS:
|
||||||
|
if kw in l:
|
||||||
|
return f"_dup:{kw}"
|
||||||
|
return label
|
||||||
|
|
||||||
|
|
||||||
def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
|
def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
|
||||||
"""Return top-N failing MCs sorted by severity then label.
|
"""Return top-N failing MCs sorted by severity then label.
|
||||||
|
|
||||||
Skipped + passed MCs are excluded. INFO severity is excluded by
|
Skipped + passed MCs are excluded. INFO severity is excluded by
|
||||||
default since those are guidance, not findings.
|
default since those are guidance, not findings.
|
||||||
|
|
||||||
|
Near-duplicates (multiple MCs that all complain about "einfache
|
||||||
|
Sprache" / "Einwilligungsaufforderung" / ...) are collapsed to ONE
|
||||||
|
representative entry; otherwise UI-language hints dominate the
|
||||||
|
top list and real leaks get buried.
|
||||||
"""
|
"""
|
||||||
fails = [
|
fails = [
|
||||||
r for r in (check_results or [])
|
r for r in (check_results or [])
|
||||||
@@ -116,7 +141,17 @@ def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
|
|||||||
_SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
|
_SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
|
||||||
r.get("label", ""),
|
r.get("label", ""),
|
||||||
))
|
))
|
||||||
return fails[:n]
|
seen_keys: set[str] = set()
|
||||||
|
deduped: list[dict] = []
|
||||||
|
for r in fails:
|
||||||
|
k = _dedup_key(r.get("label", ""))
|
||||||
|
if k in seen_keys:
|
||||||
|
continue
|
||||||
|
seen_keys.add(k)
|
||||||
|
deduped.append(r)
|
||||||
|
if len(deduped) >= n:
|
||||||
|
break
|
||||||
|
return deduped
|
||||||
|
|
||||||
|
|
||||||
def full_audit_records(
|
def full_audit_records(
|
||||||
|
|||||||
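The near-duplicate collapsing added to `top_fails` can be illustrated with a small standalone sketch. The keyword list is shortened for illustration, and `dedup_key` and `first_n_unique` are hypothetical names. The early return for `n <= 0` is an assumption added here to keep the old `fails[:n]` behaviour for that edge case; the loop in the diff appends one entry before it checks the limit.

```python
from __future__ import annotations

# Shortened, illustrative keyword list (the real list is longer).
DEDUP_KEYWORDS = ["einfache sprache", "einwilligungsaufforderung"]


def dedup_key(label: str) -> str:
    """Map labels that share a known keyword to one cluster key."""
    low = (label or "").lower()
    for kw in DEDUP_KEYWORDS:
        if kw in low:
            return f"_dup:{kw}"
    return label


def first_n_unique(fails: list[dict], n: int) -> list[dict]:
    """Keep the first entry per cluster, preserving the incoming sort order."""
    if n <= 0:
        return []
    seen: set[str] = set()
    out: list[dict] = []
    for r in fails:
        key = dedup_key(r.get("label", ""))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
        if len(out) >= n:
            break
    return out
```

Because the input is already sorted by severity and label, the surviving representative of each cluster is the highest-ranked one.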
@@ -37,6 +37,7 @@ async def check_document_with_controls(
|
|||||||
db_url: str = "",
|
db_url: str = "",
|
||||||
max_controls: int = 0, # 0 = no limit, check ALL
|
max_controls: int = 0, # 0 = no limit, check ALL
|
||||||
use_agent: bool = False, # Use LLM agent for intelligent evaluation
|
use_agent: bool = False, # Use LLM agent for intelligent evaluation
|
||||||
|
business_scope: set[str] | None = None,
|
||||||
) -> list[dict]:
|
) -> list[dict]:
|
||||||
"""Check document against ALL doc_check_controls for this doc_type.
|
"""Check document against ALL doc_check_controls for this doc_type.
|
||||||
|
|
||||||
@@ -56,7 +57,7 @@ async def check_document_with_controls(
|
|||||||
mapped_type = _map_doc_type(doc_type)
|
mapped_type = _map_doc_type(doc_type)
|
||||||
|
|
||||||
# Load ALL controls for this doc_type
|
# Load ALL controls for this doc_type
|
||||||
controls = await _load_controls(mapped_type, db_url, max_controls)
|
controls = await _load_controls(mapped_type, db_url, max_controls, business_scope)
|
||||||
if not controls:
|
if not controls:
|
||||||
logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
|
logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
|
||||||
return []
|
return []
|
||||||
@@ -71,6 +72,31 @@ async def check_document_with_controls(
|
|||||||
if result:
|
if result:
|
||||||
results.append(result)
|
results.append(result)
|
||||||
|
|
||||||
|
# Semantic fallback (Phase 3): MCs that failed via regex get a second
|
||||||
|
# chance via BGE-M3 cosine similarity. BMW writes "Speicherdauer 2
|
||||||
|
# Jahre" — the regex misses, embedding catches it.
|
||||||
|
failed_ids = {r.get("control_id") for r in results
|
||||||
|
if not r.get("passed") and r.get("control_id")}
|
||||||
|
if failed_ids:
|
||||||
|
try:
|
||||||
|
from compliance.services.mc_embedding_matcher import (
|
||||||
|
ensure_mc_embeddings, embedding_match,
|
||||||
|
)
|
||||||
|
await ensure_mc_embeddings() # idempotent: only embeds new MCs
|
||||||
|
failed_mcs = [c for c in controls if c.get("control_id") in failed_ids]
|
||||||
|
semantic_passes = await embedding_match(
|
||||||
|
text, failed_mcs, doc_type=mapped_type,
|
||||||
|
)
|
||||||
|
if semantic_passes:
|
||||||
|
for r in results:
|
||||||
|
cid = r.get("control_id")
|
||||||
|
if cid and cid in semantic_passes and not r.get("passed"):
|
||||||
|
r["passed"] = True
|
||||||
|
r["matched_text"] = "[semantischer Treffer via Embedding]"
|
||||||
|
r["hint"] = (r.get("hint") or "") + " (passed via Embedding-Match, BGE-M3 cosine)"
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("Embedding fallback skipped: %s", e, exc_info=True)
|
||||||
|
|
||||||
passed = sum(1 for r in results if r["passed"])
|
passed = sum(1 for r in results if r["passed"])
|
||||||
failed_results = [r for r in results if not r["passed"]]
|
failed_results = [r for r in results if not r["passed"]]
|
||||||
logger.info("MC results: %d passed, %d failed out of %d for '%s'",
|
logger.info("MC results: %d passed, %d failed out of %d for '%s'",
|
||||||
@@ -161,6 +187,7 @@ def _check_mc_deterministic(text_lower: str, mc: dict) -> Optional[dict]:
|
|||||||
|
|
||||||
return {
|
return {
|
||||||
"id": f"mc-{control_id}",
|
"id": f"mc-{control_id}",
|
||||||
|
"control_id": control_id,
|
||||||
"label": mc.get("title", "")[:80],
|
"label": mc.get("title", "")[:80],
|
||||||
"passed": passed,
|
"passed": passed,
|
||||||
"severity": severity,
|
"severity": severity,
|
||||||
@@ -266,11 +293,72 @@ _MC_ALIAS_FALLBACK = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
|
def _load_text_only_ids(
|
||||||
|
doc_type: str | None = None,
|
||||||
|
business_scope: set[str] | None = None,
|
||||||
|
) -> set[str]:
|
||||||
|
"""Return control_ids that the Sonnet-classifier flagged as 'text'.
|
||||||
|
|
||||||
|
Filters applied:
|
||||||
|
1. check_type='text' (only doc-text-matchable MCs)
|
||||||
|
2. doc_type matches (per-doc-type variant from v2-Sidecar)
|
||||||
|
3. fits_doc_type=1 (LLM auditor approved this MC for this doc_type)
|
||||||
|
4. scope_requires NULL or contained in business_scope
|
||||||
|
(e.g. MCs with scope_requires='biometric_processing' are skipped
|
||||||
|
on sites that don't do biometric processing; the Art. 22 FRT MC was a
|
||||||
|
false positive for BMW)
|
||||||
|
|
||||||
|
`business_scope` comes from the business_profiler (set of detected
|
||||||
|
site characteristics like 'b2c', 'shop', 'biometric_processing',
|
||||||
|
'ai_decision_making', 'child_targeting').
|
||||||
|
|
||||||
|
Returns empty set if the sidecar doesn't exist yet.
|
||||||
|
"""
|
||||||
|
import sqlite3
|
||||||
|
db_path = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||||
|
try:
|
||||||
|
with sqlite3.connect(db_path) as c:
|
||||||
|
cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
|
||||||
|
has_fit = "fits_doc_type" in cols
|
||||||
|
has_scope = "scope_requires" in cols
|
||||||
|
fit_clause = " AND (fits_doc_type IS NULL OR fits_doc_type = 1)" if has_fit else ""
|
||||||
|
base = ("SELECT control_id, scope_requires FROM mc_classification "
|
||||||
|
"WHERE check_type = 'text'" + fit_clause) if has_scope else (
|
||||||
|
"SELECT control_id, NULL FROM mc_classification "
|
||||||
|
"WHERE check_type = 'text'" + fit_clause)
|
||||||
|
params: list = []
|
||||||
|
if doc_type:
|
||||||
|
base += " AND doc_type = ?"
|
||||||
|
params.append(doc_type)
|
||||||
|
rows = c.execute(base, params).fetchall()
|
||||||
|
scope = business_scope or set()
|
||||||
|
keep: set[str] = set()
|
||||||
|
for cid, req in rows:
|
||||||
|
if not req:
|
||||||
|
keep.add(cid)
|
||||||
|
else:
|
||||||
|
# Multiple requirements separated by '|' — ALL must
|
||||||
|
# be in scope to include. Empty req tokens are skipped.
|
||||||
|
needed = {r.strip().lower() for r in req.split("|") if r.strip()}
|
||||||
|
if needed.issubset({s.lower() for s in scope}):
|
||||||
|
keep.add(cid)
|
||||||
|
return keep
|
||||||
|
except sqlite3.OperationalError:
|
||||||
|
return set()
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning("MC classification lookup failed: %s", e)
|
||||||
|
return set()
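The scope_requires rule in `_load_text_only_ids` can be exercised on its own. A minimal sketch, assuming a hypothetical helper named `scope_allows` (it is not part of the module) that mirrors the '|'-separated, all-tokens-must-match logic shown above:

```python
from __future__ import annotations


def scope_allows(scope_requires: str | None, business_scope: set[str] | None) -> bool:
    """Hypothetical helper: True if an MC's scope_requires is satisfied.

    Mirrors the loop in _load_text_only_ids: an empty/NULL requirement
    always passes; otherwise every '|'-separated token must be present
    in business_scope (case-insensitive).
    """
    if not scope_requires:
        return True
    needed = {tok.strip().lower() for tok in scope_requires.split("|") if tok.strip()}
    have = {s.lower() for s in (business_scope or set())}
    return needed.issubset(have)


if __name__ == "__main__":
    # A biometric MC is skipped on a plain B2C shop profile.
    print(scope_allows("biometric_processing", {"b2c", "shop"}))
    # It is kept once the profiler detects biometric processing.
    print(scope_allows("biometric_processing", {"b2c", "biometric_processing"}))
```

Note that a requirement consisting only of separators (for example `" | "`) yields an empty token set and therefore passes, which matches the behaviour of the original loop.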
|
||||||
|
|
||||||
|
|
||||||
|
async def _load_controls(doc_type: str, db_url: str, limit: int,
|
||||||
|
business_scope: set[str] | None = None) -> list[dict]:
|
||||||
"""Load all doc_check_controls for a doc_type from PostgreSQL.
|
"""Load all doc_check_controls for a doc_type from PostgreSQL.
|
||||||
|
|
||||||
Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
|
Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
|
||||||
type (e.g. 'nutzungsbedingungen' -> 'agb').
|
type (e.g. 'nutzungsbedingungen' -> 'agb').
|
||||||
|
|
||||||
|
Filters to only check_type='text' MCs when the classification sidecar
|
||||||
|
is present — process/review MCs are routed to other modules.
|
||||||
"""
|
"""
|
||||||
try:
|
try:
|
||||||
import asyncpg
|
import asyncpg
|
||||||
@@ -297,7 +385,17 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
|
|||||||
fallback = _MC_ALIAS_FALLBACK[doc_type]
|
fallback = _MC_ALIAS_FALLBACK[doc_type]
|
||||||
logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
|
logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
|
||||||
rows = await conn.fetch(query, fallback)
|
rows = await conn.fetch(query, fallback)
|
||||||
return [dict(r) for r in rows]
|
|
||||||
|
controls = [dict(r) for r in rows]
|
||||||
|
text_only = _load_text_only_ids(doc_type, business_scope)
|
||||||
|
if text_only:
|
||||||
|
before = len(controls)
|
||||||
|
controls = [c for c in controls if c.get("control_id") in text_only]
|
||||||
|
logger.info(
|
||||||
|
"MC filter (text only) for %s: %d/%d MCs after Sonnet check_type filter",
|
||||||
|
doc_type, len(controls), before,
|
||||||
|
)
|
||||||
|
return controls
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning("MC query failed: %s", e)
|
logger.warning("MC query failed: %s", e)
|
||||||
return []
|
return []
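The Phase-3 semantic fallback relies on `embedding_match`, whose internals are not part of this diff. The following is only a sketch of the general idea (cosine similarity against a per-doc_type threshold), assuming pre-computed vectors; the function name `semantic_pass_ids` and the default threshold are assumptions, while the 0.50 / 0.60 values come from the commit message:

```python
import math

# Per-doc_type thresholds as named in the commit message.
THRESHOLDS = {"impressum": 0.50, "dse": 0.60, "cookie": 0.60}
DEFAULT_THRESHOLD = 0.60  # assumption: not stated in the diff


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_pass_ids(chunk_vecs, mc_vecs, doc_type):
    """Return control_ids whose best chunk similarity reaches the threshold.

    chunk_vecs: list of document-chunk embeddings
    mc_vecs:    dict control_id -> MC embedding
    """
    threshold = THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    return {
        cid
        for cid, mc_vec in mc_vecs.items()
        if any(cosine(chunk, mc_vec) >= threshold for chunk in chunk_vecs)
    }


if __name__ == "__main__":
    chunks = [[1.0, 0.0]]
    mcs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.55, 0.835]}
    print(semantic_pass_ids(chunks, mcs, "impressum"))
    print(semantic_pass_ids(chunks, mcs, "dse"))
```

In this toy data, control "C" has a similarity of about 0.55, so it passes under the looser impressum threshold but not under the dse threshold.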
|
||||||
|
|||||||
@@ -0,0 +1,407 @@
|
|||||||
|
"""
|
||||||
|
Vendor cost estimator: derives a pricing tier per vendor from
|
||||||
|
cookie signals and returns an intensity-based annual cost
|
||||||
|
estimate.
|
||||||
|
|
||||||
|
Cookie signals we evaluate:
|
||||||
|
- Number of cookies per vendor (proxy for module depth)
|
||||||
|
- Premium-feature cookies (e.g. 's_target_qa', '_ab_test' → enterprise add-on)
|
||||||
|
- Edge/region cookies (multi-region → premier-tier CDN)
|
||||||
|
- Cookie persistence (multi-year → heavy-tracking licence)
|
||||||
|
|
||||||
|
Plus business_profile for company-tier inference.
|
||||||
|
|
||||||
|
Output per vendor:
|
||||||
|
- inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
|
||||||
|
- tier_signals : list of indicators that led to the tier
|
||||||
|
- cost_year_eur_range: (low, high) based on tier × vendor pricing
|
||||||
|
- confidence: 'none' (no pricing entry) | 'low' | 'medium' | 'high'
|
||||||
|
|
||||||
|
This module complements vendor_redundancy.py: the simple low/high
|
||||||
|
flat rates there are replaced here by dynamic, signal-based
|
||||||
|
values.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import re
|
||||||
|
from typing import Iterable
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Premium-feature cookies: indicator for enterprise add-ons ─────
|
||||||
|
#
|
||||||
|
# If a vendor sets these cookies, the customer is very likely
|
||||||
|
# on an enterprise plan.
|
||||||
|
|
||||||
|
_PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
|
||||||
|
# (regex, vendor_key, premium_feature_label)
|
||||||
|
(r"^s_target_qa$", "adobe analytics", "Adobe Target Add-on"),
|
||||||
|
(r"adobe.*target", "adobe target", "Personalization Enterprise"),
|
||||||
|
(r"^aam_uuid", "adobe analytics", "Audience Manager Enterprise"),
|
||||||
|
(r"^s_ecid", "adobe analytics", "Experience Cloud ID Service"),
|
||||||
|
(r"^_pcid_", "adobe analytics", "People-Based Destinations"),
|
||||||
|
|
||||||
|
(r"^_gat_gtag_UA", "google analytics", "GA360 Multi-Tracker"),
|
||||||
|
(r"^_ga_[A-Z0-9]+_[A-Z0-9]+", "google analytics", "GA4 Enterprise Stream"),
|
||||||
|
|
||||||
|
(r"^_uetmsdns", "microsoft advertising", "Custom Conversion Tracking"),
|
||||||
|
(r"^_fbp.*test", "meta pixel", "Conversions API Premium"),
|
||||||
|
(r"^_pin_unauth_premium", "pinterest", "Pinterest Premium-API"),
|
||||||
|
|
||||||
|
(r"^afm", "adform", "Affinity-Module"),
|
||||||
|
(r"^cto_dna", "criteo", "Dynamic Retargeting Premium"),
|
||||||
|
|
||||||
|
# CDN / Infra Premium
|
||||||
|
(r"^aws-alb-[a-z0-9]+", "amazon web services", "ALB + Multi-Region"),
|
||||||
|
(r"^aws-waf", "amazon web services", "WAF Enterprise"),
|
||||||
|
(r"^cf_clearance", "cloudflare", "Bot-Management Pro"),
|
||||||
|
(r"^akm_[a-z]+", "akamai", "Adaptive Media Delivery Enterprise"),
|
||||||
|
|
||||||
|
# Salesforce Customer-360
|
||||||
|
(r"^bid_n_", "salesforce", "Marketing Cloud Personalization"),
|
||||||
|
(r"^_cs_", "salesforce", "CDP Premium"),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Tier pricing per vendor (annual, EUR) ──────────────────────────
|
||||||
|
#
|
||||||
|
# 4 tiers: starter (SME), professional (mid-market), enterprise (corporate group),
|
||||||
|
# premier (global brand / heavy user).
|
||||||
|
|
||||||
|
_TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
|
||||||
|
"adobe analytics": {
|
||||||
|
"starter": ( 10_000, 30_000),
|
||||||
|
"professional": ( 60_000, 150_000),
|
||||||
|
"enterprise": (200_000, 500_000),
|
||||||
|
"premier": (500_000, 900_000),
|
||||||
|
},
|
||||||
|
"adobe target": {
|
||||||
|
"starter": ( 8_000, 25_000),
|
||||||
|
"professional": ( 40_000, 100_000),
|
||||||
|
"enterprise": (120_000, 300_000),
|
||||||
|
"premier": (300_000, 600_000),
|
||||||
|
},
|
||||||
|
"adobe campaign": {
|
||||||
|
"starter": ( 10_000, 30_000),
|
||||||
|
"professional": ( 40_000, 100_000),
|
||||||
|
"enterprise": (120_000, 280_000),
|
||||||
|
"premier": (280_000, 500_000),
|
||||||
|
},
|
||||||
|
"google analytics": {
|
||||||
|
"starter": ( 0, 0), # GA4 free
|
||||||
|
"professional": ( 0, 0),
|
||||||
|
"enterprise": ( 80_000, 150_000), # GA360
|
||||||
|
"premier": (150_000, 300_000),
|
||||||
|
},
|
||||||
|
"matomo": {
|
||||||
|
"starter": ( 0, 3_000), # On-prem free / Cloud Starter
|
||||||
|
"professional": ( 6_000, 20_000),
|
||||||
|
"enterprise": ( 20_000, 80_000),
|
||||||
|
"premier": ( 60_000, 150_000),
|
||||||
|
},
|
||||||
|
"content square": {
|
||||||
|
"starter": ( 12_000, 40_000),
|
||||||
|
"professional": ( 60_000, 150_000),
|
||||||
|
"enterprise": (150_000, 350_000),
|
||||||
|
"premier": (350_000, 700_000),
|
||||||
|
},
|
||||||
|
"contentsquare": {
|
||||||
|
"starter": ( 12_000, 40_000),
|
||||||
|
"professional": ( 60_000, 150_000),
|
||||||
|
"enterprise": (150_000, 350_000),
|
||||||
|
"premier": (350_000, 700_000),
|
||||||
|
},
|
||||||
|
"dynatrace": {
|
||||||
|
"starter": ( 5_000, 15_000),
|
||||||
|
"professional": ( 30_000, 80_000),
|
||||||
|
"enterprise": (100_000, 300_000),
|
||||||
|
"premier": (300_000, 800_000),
|
||||||
|
},
|
||||||
|
"qualtrics": {
|
||||||
|
"starter": ( 6_000, 20_000),
|
||||||
|
"professional": ( 30_000, 80_000),
|
||||||
|
"enterprise": ( 80_000, 200_000),
|
||||||
|
"premier": (200_000, 500_000),
|
||||||
|
},
|
||||||
|
|
||||||
|
# Advertising / Retargeting (licence + self-service; media spend is SEPARATE)
|
||||||
|
"criteo": {
|
||||||
|
"starter": ( 6_000, 20_000),
|
||||||
|
"professional": ( 30_000, 80_000),
|
||||||
|
"enterprise": ( 80_000, 250_000),
|
||||||
|
"premier": (250_000, 600_000),
|
||||||
|
},
|
||||||
|
"adform": {
|
||||||
|
"starter": ( 12_000, 40_000),
|
||||||
|
"professional": ( 60_000, 150_000),
|
||||||
|
"enterprise": (150_000, 400_000),
|
||||||
|
"premier": (400_000, 800_000),
|
||||||
|
},
|
||||||
|
"outbrain": {
|
||||||
|
"starter": ( 6_000, 20_000),
|
||||||
|
"professional": ( 30_000, 80_000),
|
||||||
|
"enterprise": ( 80_000, 200_000),
|
||||||
|
"premier": (200_000, 500_000),
|
||||||
|
},
|
||||||
|
"taboola": {
|
||||||
|
"starter": ( 6_000, 20_000),
|
||||||
|
"professional": ( 30_000, 80_000),
|
||||||
|
"enterprise": ( 80_000, 200_000),
|
||||||
|
"premier": (200_000, 500_000),
|
||||||
|
},
|
||||||
|
"teads": {
|
||||||
|
"starter": ( 6_000, 18_000),
|
||||||
|
"professional": ( 20_000, 60_000),
|
||||||
|
"enterprise": ( 60_000, 150_000),
|
||||||
|
"premier": (150_000, 350_000),
|
||||||
|
},
|
||||||
|
"pinterest": {
|
||||||
|
"starter": ( 3_000, 15_000),
|
||||||
|
"professional": ( 15_000, 50_000),
|
||||||
|
"enterprise": ( 50_000, 150_000),
|
||||||
|
"premier": (150_000, 400_000),
|
||||||
|
},
|
||||||
|
"linkedin insight": {
|
||||||
|
"starter": ( 3_000, 12_000),
|
||||||
|
"professional": ( 12_000, 40_000),
|
||||||
|
"enterprise": ( 40_000, 120_000),
|
||||||
|
"premier": (120_000, 300_000),
|
||||||
|
},
|
||||||
|
|
||||||
|
# CDN / Cloud
|
||||||
|
"akamai": {
|
||||||
|
"starter": ( 20_000, 60_000),
|
||||||
|
"professional": ( 80_000, 200_000),
|
||||||
|
"enterprise": (200_000, 500_000),
|
||||||
|
"premier": (500_000, 1_500_000),
|
||||||
|
},
|
||||||
|
"amazon web services": {
|
||||||
|
"starter": ( 12_000, 60_000),
|
||||||
|
"professional": ( 60_000, 300_000),
|
||||||
|
"enterprise": (300_000, 1_500_000),
|
||||||
|
"premier": (1_500_000, 8_000_000),
|
||||||
|
},
|
||||||
|
"baqend": {
|
||||||
|
"starter": ( 3_000, 12_000),
|
||||||
|
"professional": ( 12_000, 40_000),
|
||||||
|
"enterprise": ( 40_000, 120_000),
|
||||||
|
"premier": (120_000, 300_000),
|
||||||
|
},
|
||||||
|
"speedkit": {
|
||||||
|
"starter": ( 3_000, 12_000),
|
||||||
|
"professional": ( 12_000, 40_000),
|
||||||
|
"enterprise": ( 40_000, 120_000),
|
||||||
|
"premier": (120_000, 300_000),
|
||||||
|
},
|
||||||
|
"speedcurve": {
|
||||||
|
"starter": ( 1_200, 4_800),
|
||||||
|
"professional": ( 6_000, 18_000),
|
||||||
|
"enterprise": ( 18_000, 60_000),
|
||||||
|
"premier": ( 60_000, 120_000),
|
||||||
|
},
|
||||||
|
|
||||||
|
# CRM / Marketing
|
||||||
|
"salesforce": {
|
||||||
|
"starter": ( 20_000, 60_000),
|
||||||
|
"professional": ( 80_000, 250_000),
|
||||||
|
"enterprise": (250_000, 800_000),
|
||||||
|
"premier": (800_000, 2_500_000),
|
||||||
|
},
|
||||||
|
"genesys": {
|
||||||
|
"starter": ( 24_000, 80_000),
|
||||||
|
"professional": ( 80_000, 250_000),
|
||||||
|
"enterprise": (250_000, 800_000),
|
||||||
|
"premier": (800_000, 2_000_000),
|
||||||
|
},
|
||||||
|
|
||||||
|
# Captcha
|
||||||
|
"hcaptcha": {
|
||||||
|
"starter": ( 0, 2_400),
|
||||||
|
"professional": ( 2_400, 12_000),
|
||||||
|
"enterprise": ( 12_000, 40_000),
|
||||||
|
"premier": ( 40_000, 100_000),
|
||||||
|
},
|
||||||
|
|
||||||
|
# Lead-Tracking
|
||||||
|
"salesviewer": {
|
||||||
|
"starter": ( 1_200, 3_600),
|
||||||
|
"professional": ( 3_600, 12_000),
|
||||||
|
"enterprise": ( 12_000, 40_000),
|
||||||
|
"premier": ( 40_000, 100_000),
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _vendor_key(vendor_name: str) -> str | None:
|
||||||
|
"""Map a vendor name to a known pricing-table key."""
|
||||||
|
n = (vendor_name or "").lower()
|
||||||
|
for k in _TIER_PRICING:
|
||||||
|
if k in n:
|
||||||
|
return k
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def infer_company_tier(business_profile: dict | None) -> str:
|
||||||
|
"""Coarse company-tier from business profile.
|
||||||
|
|
||||||
|
Used as the baseline when vendor-specific signals are weak.
|
||||||
|
"""
|
||||||
|
if not business_profile:
|
||||||
|
return "professional"
|
||||||
|
bp = business_profile
|
||||||
|
features = {f.lower() for f in (bp.get("features") or [])}
|
||||||
|
btype = (bp.get("type") or "").lower()
|
||||||
|
# Heavy enterprise-only signals
|
||||||
|
if any(f in features for f in ("multi_country", "konzern", "enterprise",
|
||||||
|
"international", "automotive", "banking",
|
||||||
|
"luxury", "premium")):
|
||||||
|
return "premier"
|
||||||
|
# Large but maybe single-country
|
||||||
|
if "shop" in features or "konfigurator" in features or btype == "b2c":
|
||||||
|
return "enterprise"
|
||||||
|
return "professional"
|
||||||
|
|
||||||
|
|
||||||
|
def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
|
||||||
|
"""Infer pricing tier for a single vendor from its cookie footprint.
|
||||||
|
|
||||||
|
Signals (all are recorded in the returned list; only some raise the tier):
|
||||||
|
- cookie_count >= 60 → +1 tier
|
||||||
|
- cookie_count >= 30 → signal only, no tier change
|
||||||
|
- premium-feature cookie hit → +1 tier per matching pattern
|
||||||
|
- >= 60% third-party cookies (with >= 10 cookies) → signal only
|
||||||
|
- >= 3 cookies with expiry >= 2 years → signal only
|
||||||
|
"""
|
||||||
|
cookies = vendor.get("cookies") or []
|
||||||
|
n_cookies = len(cookies)
|
||||||
|
cookie_names = [c.get("name", "").lower() for c in cookies]
|
||||||
|
signals: list[str] = []
|
||||||
|
|
||||||
|
base_tiers = ["starter", "professional", "enterprise", "premier"]
|
||||||
|
# Start at company-tier as baseline
|
||||||
|
idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1
|
||||||
|
|
||||||
|
if n_cookies >= 60:
|
||||||
|
idx = min(len(base_tiers) - 1, idx + 1)
|
||||||
|
signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
|
||||||
|
elif n_cookies >= 30:
|
||||||
|
signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")
|
||||||
|
|
||||||
|
# Premium feature detection
|
||||||
|
vk = _vendor_key(vendor.get("name", ""))
|
||||||
|
for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
|
||||||
|
if vk and vk != expected_key and expected_key not in (vendor.get("name") or "").lower():
|
||||||
|
continue
|
||||||
|
for cn in cookie_names:
|
||||||
|
if re.search(pattern, cn):
|
||||||
|
idx = min(len(base_tiers) - 1, idx + 1)
|
||||||
|
signals.append(f"Premium-Feature-Cookie: {feature_label}")
|
||||||
|
break
|
||||||
|
|
||||||
|
# Heavy third-party tracking
|
||||||
|
third_party_ratio = sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
|
||||||
|
if third_party_ratio >= 0.6 and n_cookies >= 10:
|
||||||
|
signals.append(f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies — Tracking-Heavy")
|
||||||
|
|
||||||
|
# Long-lived cookies
|
||||||
|
long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
|
||||||
|
if long_lived >= 3:
|
||||||
|
signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")
|
||||||
|
|
||||||
|
return base_tiers[idx], signals
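The tier escalation in `infer_vendor_tier` is an index bump that is clamped at the top tier. A minimal sketch of just that mechanic, assuming a hypothetical helper `bump_tier` (not in the module) and non-negative bump counts:

```python
TIERS = ["starter", "professional", "enterprise", "premier"]


def bump_tier(start: str, bumps: int) -> str:
    """Hypothetical helper: raise `start` by `bumps` steps, clamped at 'premier'.

    Unknown start tiers fall back to index 1 ('professional'), as in
    infer_vendor_tier. Assumes bumps >= 0.
    """
    idx = TIERS.index(start) if start in TIERS else 1
    return TIERS[min(len(TIERS) - 1, idx + bumps)]


if __name__ == "__main__":
    # One premium-feature hit on a 'professional' baseline.
    print(bump_tier("professional", 1))
    # A 'premier' baseline cannot go higher, however many signals fire.
    print(bump_tier("premier", 2))
```

Because the baseline is the company tier, a site already classified as premier yields premier for every vendor with a pricing entry, regardless of the cookie signals.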
|
||||||
|
|
||||||
|
|
||||||
|
def _expiry_years(expiry_str: str) -> float:
|
||||||
|
"""Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → 0.25"""
|
||||||
|
s = (expiry_str or "").lower()
|
||||||
|
m = re.search(r"(\d+)\s*(jahr|year)", s)
|
||||||
|
if m: return float(m.group(1))
|
||||||
|
m = re.search(r"(\d+)\s*(monat|month)", s)
|
||||||
|
if m: return float(m.group(1)) / 12.0
|
||||||
|
m = re.search(r"(\d+)\s*(tag|day)", s)
|
||||||
|
if m: return float(m.group(1)) / 365.0
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
|
||||||
|
def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
|
||||||
|
"""Return cost estimation for one vendor incl. tier inference + signals."""
|
||||||
|
vk = _vendor_key(vendor.get("name", ""))
|
||||||
|
company_tier = infer_company_tier(business_profile)
|
||||||
|
|
||||||
|
if not vk:
|
||||||
|
return {
|
||||||
|
"vendor": vendor.get("name", ""),
|
||||||
|
"matched_pricing_key": None,
|
||||||
|
"inferred_tier": None,
|
||||||
|
"tier_signals": [],
|
||||||
|
"company_tier_baseline": company_tier,
|
||||||
|
"cost_year_eur_range": (0, 0),
|
||||||
|
"confidence": "none",
|
||||||
|
"note": "Kein Pricing-Eintrag fuer diesen Anbieter — Saving-Schaetzung uebergangen.",
|
||||||
|
}
|
||||||
|
|
||||||
|
tier, signals = infer_vendor_tier(vendor, company_tier)
|
||||||
|
pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
|
||||||
|
confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"vendor": vendor.get("name", ""),
|
||||||
|
"matched_pricing_key": vk,
|
||||||
|
"inferred_tier": tier,
|
||||||
|
"tier_signals": signals,
|
||||||
|
"company_tier_baseline": company_tier,
|
||||||
|
"cost_year_eur_range": pricing,
|
||||||
|
"confidence": confidence,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def estimate_total_stack_cost(
|
||||||
|
vendors: Iterable[dict],
|
||||||
|
business_profile: dict | None = None,
|
||||||
|
) -> dict:
|
||||||
|
"""Aggregate cost estimation over all vendors.
|
||||||
|
|
||||||
|
Returns a dict with:
|
||||||
|
- per_vendor: list with one estimate per vendor
|
||||||
|
- total_year_eur_range: summed (low, high) over all counted contracts
|
||||||
|
- master_contracts_counted: number of distinct (pricing key, INTERNAL/EXT) pairs seen
|
||||||
|
- disclaimer: caveat text for the report
|
||||||
|
Master-contract dedup: vendors with recipient_type 'INTERNAL' (site-owner
|
||||||
|
lines such as 'BMW AG — ...') are counted ONCE per pricing key.
|
||||||
|
"""
|
||||||
|
per_vendor: list[dict] = []
|
||||||
|
seen_master_keys: set[tuple[str, str]] = set()
|
||||||
|
total_low = 0
|
||||||
|
total_high = 0
|
||||||
|
|
||||||
|
for v in vendors:
|
||||||
|
est = estimate_vendor_cost(v, business_profile)
|
||||||
|
per_vendor.append(est)
|
||||||
|
if not est["matched_pricing_key"]:
|
||||||
|
continue
|
||||||
|
rtype = (v.get("recipient_type") or "").upper()
|
||||||
|
master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
|
||||||
|
if rtype == "INTERNAL" and master_key in seen_master_keys:
|
||||||
|
# Same Adobe contract serves many "BMW AG — Adobe XYZ" lines —
|
||||||
|
# count cost only ONCE per (key, internal).
|
||||||
|
est["bundled_into_master_contract"] = True
|
||||||
|
continue
|
||||||
|
seen_master_keys.add(master_key)
|
||||||
|
lo, hi = est["cost_year_eur_range"]
|
||||||
|
total_low += lo
|
||||||
|
total_high += hi
|
||||||
|
|
||||||
|
return {
|
||||||
|
"per_vendor": per_vendor,
|
||||||
|
"total_year_eur_range": (total_low, total_high),
|
||||||
|
"master_contracts_counted": len(seen_master_keys),
|
||||||
|
"disclaimer": (
|
||||||
|
"Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
|
||||||
|
"Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
|
||||||
|
"koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
|
||||||
|
"Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
|
||||||
|
),
|
||||||
|
}
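The master-contract deduplication in `estimate_total_stack_cost` can be reduced to a few lines. This sketch assumes a simplified input of `(pricing_key, recipient_type, (low, high))` tuples and a hypothetical function name `sum_ranges`; it illustrates the counting rule only and is not the module's API:

```python
def sum_ranges(estimates):
    """Sum (low, high) ranges, counting INTERNAL entries once per pricing key.

    estimates: iterable of (pricing_key, recipient_type, (low, high)).
    Non-INTERNAL entries are always added, as in the original loop.
    """
    seen_internal = set()
    total_low = 0
    total_high = 0
    for key, rtype, (lo, hi) in estimates:
        if (rtype or "").upper() == "INTERNAL":
            if key in seen_internal:
                continue  # same master contract, already counted
            seen_internal.add(key)
        total_low += lo
        total_high += hi
    return total_low, total_high


if __name__ == "__main__":
    rows = [
        ("adobe analytics", "INTERNAL", (200_000, 500_000)),
        ("adobe analytics", "INTERNAL", (200_000, 500_000)),  # bundled
        ("criteo", "PROCESSOR", (80_000, 250_000)),
        ("criteo", "PROCESSOR", (80_000, 250_000)),  # counted again
    ]
    print(sum_ranges(rows))
```

The example makes one property of the rule visible: only INTERNAL lines are deduplicated, so two external entries that map to the same pricing key are both added to the total.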
|
||||||
@@ -0,0 +1,727 @@
|
|||||||
|
"""
|
||||||
|
Vendor Redundancy + EU-Alternatives Analyzer.
|
||||||
|
|
||||||
|
Input: list of vendors from the CMP capture (e.g. 90 vendors for BMW).
|
||||||
|
Output: four structured lists that are rendered in the email and the
|
||||||
|
migration modal:
|
||||||
|
|
||||||
|
1. functional_categories : vendor → functional class (analytics,
|
||||||
|
advertising, cdn, captcha, chat, …)
|
||||||
|
2. redundancies : categories with ≥2 vendors doing the same job
|
||||||
|
→ consolidation potential
|
||||||
|
3. eu_alternatives : for each US vendor, a matching EU replacement from
|
||||||
|
a curated lookup table (Matomo instead of
|
||||||
|
Adobe Analytics, IONOS instead of AWS, etc.)
|
||||||
|
4. multi_function_tools : EU tools that cover several categories
|
||||||
|
(e.g. SAP CX = analytics + CRM + marketing)
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import re
|
||||||
|
from collections import defaultdict
|
||||||
|
from typing import Iterable
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Categorisation ───────────────────────────────────────────────────
|
||||||
|
|
||||||
|
# Substring match (lowercase) → category. First match wins.
|
||||||
|
_CATEGORY_RULES: list[tuple[str, str]] = [
|
||||||
|
# Web Analytics / Behavior
|
||||||
|
("adobe analytics", "web_analytics"),
|
||||||
|
("adobe target", "personalisation"),
|
||||||
|
("adobe campaign", "marketing_automation"),
|
||||||
|
("adobe staging library", "tag_management"),
|
||||||
|
("adobelaunch", "tag_management"),
|
||||||
|
("google analytics", "web_analytics"),
|
||||||
|
("matomo", "web_analytics"),
|
||||||
|
("hotjar", "web_analytics"),
|
||||||
|
("content square", "web_analytics"),
|
||||||
|
("contentsquare", "web_analytics"),
|
||||||
|
("dynatrace", "monitoring"),
|
||||||
|
("performance analytics", "web_analytics"),
|
||||||
|
("form analytics", "web_analytics"),
|
||||||
|
("form campaign analytics","web_analytics"),
|
||||||
|
("psyma", "survey"),
|
||||||
|
("qualtrics", "survey"),
|
||||||
|
|
||||||
|
# Tag Management
|
||||||
|
("google tag manager", "tag_management"),
|
||||||
|
("gtm", "tag_management"),
|
||||||
|
|
||||||
|
# Advertising / Retargeting
|
||||||
|
("google ads", "advertising"),
|
||||||
|
("google advertising", "advertising"),
|
||||||
|
("doubleclick", "advertising"),
|
||||||
|
("googleads", "advertising"),
|
||||||
|
("meta pixel", "advertising"),
|
||||||
|
("meta platforms", "advertising"),
|
||||||
|
("facebook", "advertising"),
|
||||||
|
("adform", "advertising"),
|
||||||
|
("criteo", "advertising"),
|
||||||
|
("outbrain", "advertising"),
|
||||||
|
("taboola", "advertising"),
|
||||||
|
("teads", "advertising"),
|
||||||
|
("pinterest", "advertising"),
|
||||||
|
("linkedin insight", "advertising"),
|
||||||
|
("youtube performance", "advertising"),
|
||||||
|
("youtube player", "external_media"),
|
||||||
|
("amazon advertising", "advertising"),
|
||||||
|
("instagram", "advertising"),
|
||||||
|
("dotaki", "advertising"),
|
||||||
|
|
||||||
|
# Video / Embeds
|
||||||
|
("youtube", "external_media"),
|
||||||
|
("vimeo", "external_media"),
|
||||||
|
("jw player", "external_media"),
|
||||||
|
("jw video", "external_media"),
|
||||||
|
("jwplayer", "external_media"),
|
||||||
|
("jwconnatix", "external_media"),
|
||||||
|
|
||||||
|
# Maps / Geo
|
||||||
|
("google maps", "maps"),
|
||||||
|
("google geolocation", "maps"),
|
||||||
|
("geolocation", "maps"),
|
||||||
|
|
||||||
|
# CDN / Infrastructure
|
||||||
|
("akamai", "cdn"),
|
||||||
|
("amazon web services", "cloud_infra"),
|
||||||
|
("aws", "cloud_infra"),
|
||||||
|
("baqend", "cdn"),
|
||||||
|
("speedkit", "cdn"),
|
||||||
|
("speedcurve", "monitoring"),
|
||||||
|
("salesforce", "crm"),
|
||||||
|
|
||||||
|
# Chat / Support
|
||||||
|
("genesys", "chat"),
|
||||||
|
("ckm", "chat"),
|
||||||
|
("chat widget", "chat"),
|
||||||
|
|
||||||
|
# Captcha / Bot-Protection
|
||||||
|
("hcaptcha", "captcha"),
|
||||||
|
("recaptcha", "captcha"),
|
||||||
|
|
||||||
|
# Sales / Lead-Tracking
|
||||||
|
("salesviewer", "lead_tracking"),
|
||||||
|
|
||||||
|
# Marketing/Sales overlay
|
||||||
|
("nayoki", "social_aggregator"),
|
||||||
|
|
||||||
|
# Site-owned functions
|
||||||
|
("infrastructure", "site_infra"),
|
||||||
|
("infrastrukturbereit", "site_infra"),
|
||||||
|
("javaserverpages", "site_infra"),
|
||||||
|
("single sign-on", "auth"),
|
||||||
|
("mybmw account", "auth"),
|
||||||
|
("sso", "auth"),
|
||||||
|
("consent", "consent_management"),
|
||||||
|
("session", "site_infra"),
|
||||||
|
("scroll", "site_infra"),
|
||||||
|
("sticky", "site_infra"),
|
||||||
|
("sidebar", "site_infra"),
|
||||||
|
("dealer search", "site_feature"),
|
||||||
|
("test drive", "site_feature"),
|
||||||
|
("vehicle configurator", "site_feature"),
|
||||||
|
("stocklocator", "site_feature"),
|
||||||
|
("eshop", "site_feature"),
|
||||||
|
("shop", "site_feature"),
|
||||||
|
("language", "site_infra"),
|
||||||
|
("sprach", "site_infra"),
|
||||||
|
("region", "site_infra"),
|
||||||
|
("ip popup", "site_infra"),
|
||||||
|
("popup", "site_infra"),
|
||||||
|
("dynatrace", "monitoring"),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def classify_vendor(name: str) -> str:
|
||||||
|
"""Map a vendor name to a functional category."""
|
||||||
|
n = (name or "").lower()
|
||||||
|
for needle, cat in _CATEGORY_RULES:
|
||||||
|
if needle in n:
|
||||||
|
return cat
|
||||||
|
return "other"
|
||||||
|
|
||||||
|
|
||||||
|
# ─── EU alternatives ─────────────────────────────────────────────────
|
||||||
|
|
||||||
|
# Curated list: for each US/non-EU vendor, suitable EU replacement(s).
|
||||||
|
# Sources: Matomo comparison, etracker SoMo study, IONOS packages,
|
||||||
|
# Friendly Captcha whitepaper, SAP CX suite, Brevo / CleverReach DE lists.
|
||||||
|
_EU_ALTERNATIVES: dict[str, list[dict]] = {
|
||||||
|
"adobe analytics": [
|
||||||
|
{"name": "Matomo (On-Premise)", "vendor": "InnoCraft", "country": "DE-self-hosted",
|
||||||
|
"license": "GPL", "notes": "100% DSGVO, keine 3rd-Country, gleicher Funktionsumfang"},
|
||||||
|
{"name": "etracker Analytics", "vendor": "etracker GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "DSGVO-konform aus Hamburg, IP-Anonymisierung"},
|
||||||
|
{"name": "Mapp Intelligence", "vendor": "Mapp Digital", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Enterprise-Alternative, Server in DE"},
|
||||||
|
],
|
||||||
|
"google analytics": [
|
||||||
|
{"name": "Matomo", "vendor": "InnoCraft", "country": "DE-self-hosted",
|
||||||
|
"license": "GPL", "notes": "Direkter Drop-in-Ersatz mit GA-Migrationspfad"},
|
||||||
|
{"name": "Plausible Analytics", "vendor": "Plausible Insights", "country": "EE",
|
||||||
|
"license": "AGPL/Commercial", "notes": "Cookielos, ohne Einwilligung nutzbar"},
|
||||||
|
{"name": "Fathom Analytics EU", "vendor": "Fathom", "country": "DE-Region",
|
||||||
|
"license": "Commercial", "notes": "Cookielos, EU-Hosting"},
|
||||||
|
],
|
||||||
|
"content square": [
|
||||||
|
{"name": "Mouseflow EU", "vendor": "Mouseflow ApS", "country": "DK",
|
||||||
|
"license": "Commercial", "notes": "Session-Recording + Heatmaps EU-Hosting"},
|
||||||
|
{"name": "Hotjar EU", "vendor": "Hotjar Ltd", "country": "MT",
|
||||||
|
"license": "Commercial", "notes": "EU-DataCenter (Frankfurt), Einwilligung erforderlich"},
|
||||||
|
],
|
||||||
|
"dynatrace": [
|
||||||
|
{"name": "Dynatrace EU", "vendor": "Dynatrace", "country": "AT",
|
||||||
|
"license": "Commercial", "notes": "Bereits EU (Linz). Cluster auf EU einstellen"},
|
||||||
|
],
|
||||||
|
"speedcurve": [
|
||||||
|
{"name": "SpeedCurve EU", "vendor": "SpeedCurve", "country": "EU-tenant",
|
||||||
|
"license": "Commercial", "notes": "Region-Tenant explizit konfigurieren"},
|
||||||
|
{"name": "Calibre", "vendor": "Calibre", "country": "AU/EU",
|
||||||
|
"license": "Commercial", "notes": "Performance Monitoring, EU-Region"},
|
||||||
|
],
|
||||||
|
"akamai": [
|
||||||
|
{"name": "Bunny CDN", "vendor": "BunnyWay d.o.o.", "country": "SI",
|
||||||
|
"license": "Commercial", "notes": "Slowenischer CDN, EU-Backbone"},
|
||||||
|
{"name": "Cloudflare EU-Only", "vendor": "Cloudflare", "country": "Multi",
|
||||||
|
"license": "Commercial", "notes": "EU-Datacenter erzwingbar via 'Geo Steering'"},
|
||||||
|
{"name": "IONOS CDN", "vendor": "IONOS SE", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "100% DE-Hosting"},
|
||||||
|
],
|
||||||
|
"amazon web services": [
|
||||||
|
{"name": "IONOS Cloud", "vendor": "IONOS SE", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "DE-Hosting, BSI C5-zertifiziert"},
|
||||||
|
{"name": "OVHcloud", "vendor": "OVH SAS", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "FR-Hosting, SecNumCloud-zertifiziert"},
|
||||||
|
{"name": "Hetzner Cloud", "vendor": "Hetzner Online GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "DE/FI-Hosting, sehr kostenguenstig"},
|
||||||
|
{"name": "STACKIT", "vendor": "Schwarz IT (Lidl-Gruppe)", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Souveraener DE-Cloud, fuer Enterprise"},
|
||||||
|
],
|
||||||
|
"salesforce": [
|
||||||
|
{"name": "SAP Customer Experience", "vendor": "SAP SE", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Vollstaendige CRM-Suite EU-Hosting"},
|
||||||
|
{"name": "weclapp", "vendor": "weclapp SE", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Cloud-CRM aus Marburg"},
|
||||||
|
],
|
||||||
|
"adobe campaign": [
|
||||||
|
{"name": "CleverReach", "vendor": "CleverReach GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "E-Mail-Marketing DE-Hosting"},
|
||||||
|
{"name": "Brevo (Sendinblue)", "vendor": "Brevo", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "Marketing-Automation EU-Hosting"},
|
||||||
|
{"name": "Inxmail", "vendor": "Inxmail GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Enterprise-E-Mail-Marketing aus Freiburg"},
|
||||||
|
],
|
||||||
|
"google ads": [
|
||||||
|
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "FR-Hosting, Programmatic + Direct-Sold"},
|
||||||
|
{"name": "Bing Ads (Microsoft Advertising EU)", "vendor": "Microsoft", "country": "Multi",
|
||||||
|
"license": "Commercial", "notes": "EU-Datacenter optional"},
|
||||||
|
],
|
||||||
|
"google maps": [
|
||||||
|
{"name": "HERE Maps", "vendor": "HERE Technologies", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Berliner Anbieter, professionelle Karten + Routing"},
|
||||||
|
{"name": "OpenStreetMap (self-host)", "vendor": "OSM Foundation", "country": "DE-self-host",
|
||||||
|
"license": "ODbL", "notes": "Frei, OSM-Tiles self-hosted oder via Maptiler EU"},
|
||||||
|
{"name": "Maptiler Cloud EU", "vendor": "MapTiler", "country": "CH",
|
||||||
|
"license": "Commercial", "notes": "Schweizer Anbieter, EU-Tiles"},
|
||||||
|
],
|
||||||
|
"criteo": [ # criteo IS EU but use as example for retargeting alts
|
||||||
|
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "Retargeting + Display, FR-Hosting"},
|
||||||
|
],
|
||||||
|
"hcaptcha": [
|
||||||
|
{"name": "Friendly Captcha", "vendor": "Friendly Captcha GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "100% DSGVO, ohne Cookie, Hosting in DE"},
|
||||||
|
{"name": "Turnstile (Cloudflare EU-Only)", "vendor": "Cloudflare", "country": "Multi",
|
||||||
|
"license": "Commercial", "notes": "Ohne Cookie, EU-Region erzwingbar"},
|
||||||
|
],
|
||||||
|
"qualtrics": [
|
||||||
|
{"name": "LamaPoll", "vendor": "Lamano GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "DSGVO-Surveys aus Berlin"},
|
||||||
|
{"name": "evasys", "vendor": "evasys GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Enterprise-Survey-Plattform aus Lueneburg"},
|
||||||
|
],
|
||||||
|
"meta pixel": [
|
||||||
|
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "EU-Alternative fuer Conversion-Tracking"},
|
||||||
|
],
|
||||||
|
"facebook": [
|
||||||
|
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "Programmatic ohne Meta"},
|
||||||
|
],
|
||||||
|
"linkedin insight": [
|
||||||
|
{"name": "Xing Insights", "vendor": "New Work SE", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "DE/AT/CH B2B-Targeting aus Hamburg"},
|
||||||
|
],
|
||||||
|
"outbrain": [
|
||||||
|
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
|
||||||
|
],
|
||||||
|
"taboola": [
|
||||||
|
{"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Native Advertising aus Berlin"},
|
||||||
|
],
|
||||||
|
"genesys": [
|
||||||
|
{"name": "Userlike", "vendor": "Userlike UG", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "Live-Chat aus Koeln, BSI-konform"},
|
||||||
|
{"name": "LiveZilla / EasyChat EU", "vendor": "LiveZilla GmbH", "country": "DE",
|
||||||
|
"license": "Commercial", "notes": "DSGVO-Live-Chat"},
|
||||||
|
],
|
||||||
|
"salesviewer": [
|
||||||
|
{"name": "Leadinfo", "vendor": "Leadinfo BV", "country": "NL",
|
||||||
|
"license": "Commercial", "notes": "B2B-Webvisitor-Tracking EU"},
|
||||||
|
{"name": "Albacross EU", "vendor": "Albacross", "country": "SE",
|
||||||
|
"license": "Commercial", "notes": "EU-Tenant verfuegbar"},
|
||||||
|
],
|
||||||
|
"youtube": [
|
||||||
|
{"name": "Vimeo Pro EU", "vendor": "Vimeo", "country": "Multi",
|
||||||
|
"license": "Commercial", "notes": "EU-Region waehlbar, weniger Tracking"},
|
||||||
|
{"name": "Self-hosted video (BunnyStream)", "vendor": "BunnyWay", "country": "SI",
|
||||||
|
"license": "Commercial", "notes": "Eigene Player + CDN ohne Drittanbieter"},
|
||||||
|
],
|
||||||
|
"amazon advertising": [
|
||||||
|
{"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
|
||||||
|
"license": "Commercial", "notes": "Retail-Media-Alternative FR"},
|
||||||
|
],
|
||||||
|
"instagram": [
|
||||||
|
{"name": "Pinterest EU + Owned-Channels", "vendor": "Mix", "country": "Multi",
|
||||||
|
"license": "Commercial", "notes": "Owned-Channels (Newsletter via CleverReach)"},
|
||||||
|
],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Cost assumptions (public list prices, estimates) ──────
#
# Format: (low_year_eur, high_year_eur, tier_assumption)
# Tier: 'sme' = <100 employees, 'mid' = 100-1000, 'ent' = >1000.
# Sources: public list prices + industry benchmarks (Gartner,
# Forrester 2025). Actual contract terms can deviate by 30-70%
# (volume discounts, bundling). The output explicitly labels these
# values as an estimate range ('Schaetzbereich').
|
||||||
|
|
||||||
|
_COST_LOOKUP: dict[str, tuple[int, int, str]] = {
|
||||||
|
"adobe analytics": (120_000, 600_000, "ent"),
|
||||||
|
"adobe target": ( 80_000, 350_000, "ent"),
|
||||||
|
"adobe campaign": ( 60_000, 250_000, "ent"),
|
||||||
|
"adobe staging library":( 0, 0, "ent"), # bundled
|
||||||
|
"google analytics": ( 0, 150_000, "ent"), # GA4 free, GA360 ~150k
|
||||||
|
"matomo": ( 6_000, 30_000, "mid"), # Cloud/On-Prem
|
||||||
|
"hotjar": ( 3_600, 18_000, "mid"),
|
||||||
|
"content square": ( 60_000, 300_000, "ent"),
|
||||||
|
"contentsquare": ( 60_000, 300_000, "ent"),
|
||||||
|
"dynatrace": ( 50_000, 400_000, "ent"), # per-host pricing
|
||||||
|
"performance analytics":( 5_000, 40_000, "mid"),
|
||||||
|
"qualtrics": ( 25_000, 150_000, "ent"),
|
||||||
|
|
||||||
|
# Self-service advertising: NO tool licence, only media spend (tracked separately).
# We count 0 here because the licence saving potential is 0.
# Consolidation would only reduce media spend, which is a separate topic.
|
||||||
|
"google ads": ( 0, 0, "ent"),
|
||||||
|
"google advertising": ( 0, 0, "ent"),
|
||||||
|
"doubleclick": ( 0, 0, "ent"),
|
||||||
|
"meta pixel": ( 0, 0, "ent"),
|
||||||
|
"facebook": ( 0, 0, "ent"),
|
||||||
|
"amazon advertising": ( 0, 0, "ent"),
|
||||||
|
"youtube performance": ( 0, 0, "ent"),
|
||||||
|
"youtube player": ( 0, 0, "ent"),
|
||||||
|
"instagram": ( 0, 0, "ent"),
|
||||||
|
# Real DSP/platform licences: here the customer pays a SaaS fee
# ON TOP of the media spend. Range deliberately kept narrow
# (high is roughly 4x low; 5x for "linkedin insight").
|
||||||
|
"adform": ( 80_000, 300_000, "ent"),
|
||||||
|
"criteo": ( 50_000, 200_000, "ent"),
|
||||||
|
"outbrain": ( 30_000, 120_000, "ent"),
|
||||||
|
"taboola": ( 30_000, 120_000, "ent"),
|
||||||
|
"teads": ( 25_000, 100_000, "ent"),
|
||||||
|
"pinterest": ( 15_000, 60_000, "ent"),
|
||||||
|
"linkedin insight": ( 10_000, 50_000, "ent"),
|
||||||
|
|
||||||
|
"google maps": ( 2_000, 30_000, "mid"),
|
||||||
|
"akamai": ( 50_000, 500_000, "ent"),
|
||||||
|
"amazon web services": (100_000, 3_000_000, "ent"),
|
||||||
|
"baqend": ( 6_000, 60_000, "mid"),
|
||||||
|
"speedkit": ( 6_000, 60_000, "mid"),
|
||||||
|
"speedcurve": ( 2_400, 24_000, "mid"),
|
||||||
|
|
||||||
|
"salesforce": (100_000, 1_500_000, "ent"), # CRM seats
|
||||||
|
"genesys": ( 80_000, 800_000, "ent"), # contact-center seats
|
||||||
|
"ckm": ( 15_000, 120_000, "mid"),
|
||||||
|
"hcaptcha": ( 0, 12_000, "sme"), # free tier OR pro
|
||||||
|
|
||||||
|
"salesviewer": ( 3_600, 18_000, "mid"),
|
||||||
|
"youtube": ( 0, 50_000, "ent"), # embed is free, production costs vary
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ─── EU alternative costs (same tier logic) ───────────────────
|
||||||
|
|
||||||
|
_EU_ALT_COSTS: dict[str, tuple[int, int]] = {
|
||||||
|
"Matomo (On-Premise)": ( 3_000, 15_000),
|
||||||
|
"Matomo (Pro / Cloud EU)": ( 6_000, 30_000),
|
||||||
|
"Matomo": ( 6_000, 30_000),
|
||||||
|
"etracker Analytics": ( 10_000, 60_000),
|
||||||
|
"Mapp Intelligence": ( 40_000, 200_000),
|
||||||
|
"Plausible Analytics": ( 240, 6_000),
|
||||||
|
"Fathom Analytics EU": ( 240, 6_000),
|
||||||
|
"Mouseflow EU": ( 12_000, 60_000),
|
||||||
|
"Hotjar EU": ( 3_600, 18_000),
|
||||||
|
"Dynatrace EU": ( 50_000, 400_000), # gleicher Preis, nur Region
|
||||||
|
"SpeedCurve EU": ( 2_400, 24_000),
|
||||||
|
"Calibre": ( 3_600, 30_000),
|
||||||
|
"Bunny CDN": ( 1_200, 12_000),
|
||||||
|
"Cloudflare EU-Only": ( 6_000, 80_000),
|
||||||
|
"IONOS CDN": ( 3_000, 30_000),
|
||||||
|
"IONOS Cloud": ( 30_000, 600_000),
|
||||||
|
"OVHcloud": ( 30_000, 600_000),
|
||||||
|
"Hetzner Cloud": ( 6_000, 120_000),
|
||||||
|
"STACKIT": ( 50_000, 800_000),
|
||||||
|
"SAP Customer Experience": ( 80_000, 1_200_000),
|
||||||
|
"weclapp": ( 12_000, 80_000),
|
||||||
|
"CleverReach": ( 2_400, 24_000),
|
||||||
|
"Brevo (Sendinblue)": ( 600, 24_000),
|
||||||
|
"Inxmail": ( 8_000, 60_000),
|
||||||
|
"Smart AdServer (Equativ)": ( 30_000, 300_000),
|
||||||
|
"Bing Ads (Microsoft Advertising EU)": ( 30_000, 3_000_000),
|
||||||
|
"HERE Maps": ( 1_200, 24_000),
|
||||||
|
"OpenStreetMap (self-host)": ( 0, 6_000), # nur Server-Kosten
|
||||||
|
"Maptiler Cloud EU": ( 600, 12_000),
|
||||||
|
"Friendly Captcha": ( 600, 9_600),
|
||||||
|
"Turnstile (Cloudflare EU-Only)": ( 0, 6_000),
|
||||||
|
"LamaPoll": ( 1_200, 24_000),
|
||||||
|
"evasys": ( 6_000, 60_000),
|
||||||
|
"Xing Insights": ( 6_000, 60_000),
|
||||||
|
"Plista": ( 20_000, 150_000),
|
||||||
|
"Userlike": ( 1_200, 30_000),
|
||||||
|
"LiveZilla / EasyChat EU": ( 600, 12_000),
|
||||||
|
"Leadinfo": ( 1_200, 12_000),
|
||||||
|
"Albacross EU": ( 3_600, 24_000),
|
||||||
|
"Vimeo Pro EU": ( 900, 6_000),
|
||||||
|
"Self-hosted video (BunnyStream)": ( 600, 12_000),
|
||||||
|
"Pinterest EU + Owned-Channels": ( 600, 24_000),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Known reasons for duplicates (consolidation should NOT be recommended) ─
|
||||||
|
|
||||||
|
_DUPLICATION_CAVEATS = {
|
||||||
|
"web_analytics": [
|
||||||
|
"A/B-Vergleich verschiedener Anbieter waehrend Migration",
|
||||||
|
"Marketing nutzt Adobe, Produkt nutzt Matomo — Inhouse-Politik",
|
||||||
|
"Regional split (Adobe fuer DE, GA fuer International)",
|
||||||
|
],
|
||||||
|
"advertising": [
|
||||||
|
"Brand-Kampagne vs Performance-Kampagne (verschiedene DSPs)",
|
||||||
|
"Saisonal: Black Friday/Super Bowl nutzt mehr Kanaele",
|
||||||
|
"Markenspezifisch: BMW M-Modelle anders targetet als 1er-Serie",
|
||||||
|
],
|
||||||
|
"cdn": [
|
||||||
|
"Multi-CDN-Strategie fuer Ausfallsicherheit (Akamai + Cloudflare)",
|
||||||
|
"Event-CDN-Spike (Auto-Show, Modell-Launch) braucht Skalierung",
|
||||||
|
"Regionale Latenz-Optimierung (Akamai APAC, AWS US)",
|
||||||
|
],
|
||||||
|
"marketing_automation": [
|
||||||
|
"Salesforce Marketing Cloud fuer B2C, Adobe Campaign fuer B2B",
|
||||||
|
"Lead-Generierung (Adobe) vs Loyalitaet (Salesforce)",
|
||||||
|
],
|
||||||
|
"monitoring": [
|
||||||
|
"APM (Dynatrace) misst Backend, RUM (SpeedCurve) misst Frontend",
|
||||||
|
],
|
||||||
|
"captcha": [
|
||||||
|
"Stufenweise Migration zu cookieless Captcha",
|
||||||
|
],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _company_tier_bounds(company_tier: str | None) -> tuple[float, float]:
|
||||||
|
"""Fraction of the list-price range to actually use, depending on the
company tier. For 'enterprise' / 'premier' we use the UPPER part of the
range (40-100%) instead of the full starter→premier span.
|
||||||
|
"""
|
||||||
|
t = (company_tier or "professional").lower()
|
||||||
|
if t == "premier": return (0.70, 1.00)
|
||||||
|
if t == "enterprise": return (0.40, 0.85)
|
||||||
|
if t == "professional": return (0.20, 0.60)
|
||||||
|
return (0.05, 0.40) # 'sme' / starter
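A worked example may make the tier bounds concrete. The sketch below applies the enterprise bounds (0.40, 0.85) to the Dynatrace list-price range from `_COST_LOOKUP`, using the same `lo + span * frac` arithmetic as `_estimate_savings_for_redundancy`. The helper name `scale_range` is hypothetical and not part of the module.

```python
def scale_range(low: int, high: int, low_frac: float, high_frac: float) -> tuple[int, int]:
    """Return the [low_frac, high_frac] slice of a list-price range (illustrative helper)."""
    span = high - low
    return int(low + span * low_frac), int(low + span * high_frac)


# Dynatrace list-price range (50_000, 400_000) with the enterprise bounds.
enterprise_range = scale_range(50_000, 400_000, 0.40, 0.85)
print(enterprise_range)
```

With these inputs the range shrinks from 50k-400k to 190k-347.5k EUR per year, so an enterprise customer is not shown the starter end of the price list.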
|
||||||
|
|
||||||
|
|
||||||
|
def _estimate_savings_for_redundancy(
|
||||||
|
redundancy: dict, vendors: Iterable[dict],
|
||||||
|
company_tier: str = "enterprise",
|
||||||
|
) -> dict:
|
||||||
|
"""Estimate range per redundancy: current cost + saving from EU consolidation.

Takes company_tier into account: for a corporation like BMW we do not
want to show the starter range as well. The realistic range is
lo + (hi - lo) × frac for the two fractions from _company_tier_bounds.
|
||||||
|
"""
|
||||||
|
low_frac, high_frac = _company_tier_bounds(company_tier)
|
||||||
|
current_low = current_high = 0
|
||||||
|
matched_vendors = []
|
||||||
|
cat_vendors = [v for v in vendors if v.get("name") in redundancy.get("vendors", [])]
|
||||||
|
for v in cat_vendors:
|
||||||
|
name = (v.get("name") or "").lower()
|
||||||
|
for k, (lo, hi, _tier) in _COST_LOOKUP.items():
|
||||||
|
if k in name:
|
||||||
|
# Tier-aware: take the low_frac..high_frac slice of the pricing range
|
||||||
|
span = hi - lo
|
||||||
|
current_low += int(lo + span * low_frac)
|
||||||
|
current_high += int(lo + span * high_frac)
|
||||||
|
matched_vendors.append(v.get("name"))
|
||||||
|
break
|
||||||
|
|
||||||
|
# Consolidation: a single EU tool replaces every vendor in the category
|
||||||
|
suggested_eu = None
|
||||||
|
suggested_low = suggested_high = 0
|
||||||
|
# 1. Multi-function tool that covers this category
|
||||||
|
for tool in _MULTI_FUNCTION_TOOLS:
|
||||||
|
if redundancy["category"] in tool["covers"]:
|
||||||
|
suggested_eu = tool["name"]
|
||||||
|
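# Some tool names carry suffixes ("... Suite", "(Compute + ...)") that are
# not keys in _EU_ALT_COSTS; fall back to a prefix match so the
# replacement cost is not silently treated as 0.
cost = _EU_ALT_COSTS.get(tool["name"]) or next(
(c for k, c in _EU_ALT_COSTS.items() if tool["name"].startswith(k)), None)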
|
||||||
|
if cost:
|
||||||
|
suggested_low, suggested_high = cost
|
||||||
|
break
|
||||||
|
# 2. Otherwise: EU alternative from the lookup entries, BUT ONLY FOR VENDORS
# IN THE CURRENT CATEGORY (otherwise Userlike would be suggested for advertising)
|
||||||
|
if not suggested_eu:
|
||||||
|
for v in cat_vendors:
|
||||||
|
n = (v.get("name") or "").lower()
|
||||||
|
for k, alts in _EU_ALTERNATIVES.items():
|
||||||
|
if k in n and alts:
|
||||||
|
suggested_eu = alts[0]["name"]
|
||||||
|
cost = _EU_ALT_COSTS.get(alts[0]["name"])
|
||||||
|
if cost:
|
||||||
|
suggested_low, suggested_high = cost
|
||||||
|
break
|
||||||
|
if suggested_eu:
|
||||||
|
break
|
||||||
|
|
||||||
|
saving_low = max(0, current_low - suggested_high)
|
||||||
|
saving_high = max(0, current_high - suggested_low)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"current_estimate_year_eur": [current_low, current_high],
|
||||||
|
"suggested_eu_tool": suggested_eu,
|
||||||
|
"suggested_estimate_year_eur": [suggested_low, suggested_high],
|
||||||
|
"estimated_saving_year_eur": [saving_low, saving_high],
|
||||||
|
"caveats": _DUPLICATION_CAVEATS.get(redundancy["category"], []),
|
||||||
|
"cost_disclaimer": (
|
||||||
|
"Schaetzbereich auf Basis oeffentlicher Listenpreise. Tatsaechliche "
|
||||||
|
"Vertragspreise koennen 30-70% niedriger liegen (Volumen, Bundling, "
|
||||||
|
"Konzern-Konditionen). Bitte mit der jeweiligen Einkaufsabteilung verifizieren."
|
||||||
|
),
|
||||||
|
}
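The saving estimate above is interval arithmetic: the pessimistic bound compares the cheapest current cost with the most expensive replacement, the optimistic bound does the opposite, and both are floored at zero. A minimal sketch, assuming a hypothetical helper `saving_interval` that is not part of the module:

```python
def saving_interval(current: tuple[int, int], replacement: tuple[int, int]) -> tuple[int, int]:
    """Yearly saving range when `replacement` supersedes `current` (illustrative helper)."""
    cur_low, cur_high = current
    rep_low, rep_high = replacement
    saving_low = max(0, cur_low - rep_high)   # worst case: cheap today, expensive tomorrow
    saving_high = max(0, cur_high - rep_low)  # best case: expensive today, cheap tomorrow
    return saving_low, saving_high


# Tier-scaled current cost vs. the Matomo (Pro / Cloud EU) range from _EU_ALT_COSTS.
print(saving_interval((190_000, 347_500), (6_000, 30_000)))
# A replacement that costs more than the current stack yields no saving.
print(saving_interval((10_000, 20_000), (30_000, 60_000)))
```

Flooring at zero means a more expensive EU alternative never shows up as a negative saving; it simply contributes nothing to the totals.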
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Multi-function tools (consolidation anchor points) ───────────
|
||||||
|
|
||||||
|
_MULTI_FUNCTION_TOOLS = [
|
||||||
|
{
|
||||||
|
"name": "Matomo (Pro / Cloud EU)",
|
||||||
|
"vendor": "InnoCraft",
|
||||||
|
"country": "DE-self-host / EU",
|
||||||
|
"covers": ["web_analytics", "tag_management", "personalisation"],
|
||||||
|
"notes": "Ersetzt Adobe Analytics + GTM + Adobe Target in einem Tool. "
|
||||||
|
"100% DSGVO ohne Einwilligung wenn IP anonymisiert.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "SAP Customer Experience Suite",
|
||||||
|
"vendor": "SAP SE",
|
||||||
|
"country": "DE",
|
||||||
|
"covers": ["crm", "marketing_automation", "personalisation", "survey"],
|
||||||
|
"notes": "Ersetzt Salesforce + Adobe Campaign + Qualtrics. EU-Hosting, "
|
||||||
|
"tiefe ERP-Integration.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "IONOS Cloud (Compute + CDN + Storage + DNS)",
|
||||||
|
"vendor": "IONOS SE",
|
||||||
|
"country": "DE",
|
||||||
|
"covers": ["cloud_infra", "cdn", "monitoring"],
|
||||||
|
"notes": "Ersetzt AWS + Akamai + zusaetzliches Monitoring in einer "
|
||||||
|
"DE-Cloud (BSI C5).",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Userlike Suite",
|
||||||
|
"vendor": "Userlike UG",
|
||||||
|
"country": "DE",
|
||||||
|
"covers": ["chat", "consent_management"],
|
||||||
|
"notes": "Ersetzt Genesys Chat. Bietet eigenes Consent-Modul.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Smart AdServer (Equativ)",
|
||||||
|
"vendor": "Equativ",
|
||||||
|
"country": "FR",
|
||||||
|
"covers": ["advertising"],
|
||||||
|
"notes": "Ersetzt Mehrfach-DSPs (Adform/Criteo/Outbrain/Taboola/Meta) "
|
||||||
|
"durch Programmatic+Direct-Sold EU-Stack.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "HERE Maps",
|
||||||
|
"vendor": "HERE Technologies",
|
||||||
|
"country": "DE",
|
||||||
|
"covers": ["maps"],
|
||||||
|
"notes": "Berliner Anbieter, professionelle Karten + Routing.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Vimeo Pro EU (oder self-hosted BunnyStream)",
|
||||||
|
"vendor": "Vimeo / BunnyWay",
|
||||||
|
"country": "Multi / SI",
|
||||||
|
"covers": ["external_media"],
|
||||||
|
"notes": "Ersetzt YouTube-Embeds + JW Player in einem Player.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "LamaPoll",
|
||||||
|
"vendor": "Lamano GmbH",
|
||||||
|
"country": "DE",
|
||||||
|
"covers": ["survey"],
|
||||||
|
"notes": "DSGVO-Surveys aus Berlin. Ersetzt Qualtrics / Psyma.",
|
||||||
|
},
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Analysis ────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def analyze(vendors: Iterable[dict], company_tier: str = "enterprise") -> dict:
|
||||||
|
"""Main entry. Returns categorised view + redundancies + EU options.
|
||||||
|
|
||||||
|
`company_tier` (starter|professional|enterprise|premier) controls the
cost range so that, e.g. for a DAX corporation, starter prices do not
end up in the lower bound.
|
||||||
|
"""
|
||||||
|
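# Materialise once: `vendors` may be a one-shot iterator, and it is walked
# several times below (classification, list(), EU-alternative lookup).
vendors = list(vendors)
by_cat: dict[str, list[dict]] = defaultdict(list)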
|
||||||
|
for v in vendors:
|
||||||
|
cat = classify_vendor(v.get("name", ""))
|
||||||
|
by_cat[cat].append(v)
|
||||||
|
|
||||||
|
# Redundancies: any category with ≥2 vendors (excl. site-internal cats)
|
||||||
|
skip_redundancy_cats = {"site_infra", "site_feature", "consent_management",
|
||||||
|
"auth", "other"}
|
||||||
|
all_vendors_list = list(vendors)
|
||||||
|
redundancies: list[dict] = []
|
||||||
|
for cat, vs in by_cat.items():
|
||||||
|
if cat in skip_redundancy_cats or len(vs) < 2:
|
||||||
|
continue
|
||||||
|
red = {
|
||||||
|
"category": cat,
|
||||||
|
"category_label": _CATEGORY_LABEL.get(cat, cat),
|
||||||
|
"count": len(vs),
|
||||||
|
"vendors": [v.get("name", "") for v in vs],
|
||||||
|
"consolidation_hint": _CONSOLIDATION_HINT.get(cat, ""),
|
||||||
|
}
|
||||||
|
red.update(_estimate_savings_for_redundancy(
|
||||||
|
red, all_vendors_list, company_tier))
|
||||||
|
redundancies.append(red)
|
||||||
|
redundancies.sort(key=lambda r: -(r.get("estimated_saving_year_eur") or [0, 0])[1])
|
||||||
|
|
||||||
|
# EU alternatives lookup
|
||||||
|
eu_alternatives: list[dict] = []
|
||||||
|
seen = set()
|
||||||
|
for v in vendors:
|
||||||
|
name = v.get("name") or ""
|
||||||
|
n_lower = name.lower()
|
||||||
|
for k, alts in _EU_ALTERNATIVES.items():
|
||||||
|
if k in n_lower and k not in seen:
|
||||||
|
eu_alternatives.append({
|
||||||
|
"current_vendor": name,
|
||||||
|
"current_recipient_type": v.get("recipient_type", ""),
|
||||||
|
"matched_key": k,
|
||||||
|
"alternatives": alts,
|
||||||
|
})
|
||||||
|
seen.add(k)
|
||||||
|
break
|
||||||
|
|
||||||
|
# Multi-function tool recommendations: only if the customer has vendors
|
||||||
|
# across the categories the tool covers
|
||||||
|
present_cats = set(by_cat.keys())
|
||||||
|
multi_function = []
|
||||||
|
for tool in _MULTI_FUNCTION_TOOLS:
|
||||||
|
covered_here = [c for c in tool["covers"] if c in present_cats]
|
||||||
|
if len(covered_here) >= 2:
|
||||||
|
# Collect vendor names instead of just summing, so duplicates are counted once
|
||||||
|
unique_vendors: set[str] = set()
|
||||||
|
for c in covered_here:
|
||||||
|
for v in by_cat[c]:
|
||||||
|
unique_vendors.add(v.get("name", ""))
|
||||||
|
multi_function.append({
|
||||||
|
**tool,
|
||||||
|
"replaces_categories": covered_here,
|
||||||
|
"potential_replacements": len(unique_vendors),
|
||||||
|
})
|
||||||
|
multi_function.sort(key=lambda t: -t["potential_replacements"])
|
||||||
|
|
||||||
|
total_current_low = sum((r.get("current_estimate_year_eur") or [0, 0])[0] for r in redundancies)
|
||||||
|
total_current_high = sum((r.get("current_estimate_year_eur") or [0, 0])[1] for r in redundancies)
|
||||||
|
total_saving_low = sum((r.get("estimated_saving_year_eur") or [0, 0])[0] for r in redundancies)
|
||||||
|
total_saving_high = sum((r.get("estimated_saving_year_eur") or [0, 0])[1] for r in redundancies)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"summary": {
|
||||||
|
"total_vendors": len(all_vendors_list),
|
||||||
|
"distinct_categories": len([c for c in by_cat if c != "other"]),
|
||||||
|
"redundancy_count": len(redundancies),
|
||||||
|
"eu_alternative_count": len(eu_alternatives),
|
||||||
|
"consolidation_potential": sum(r["count"] - 1 for r in redundancies),
|
||||||
|
"estimated_current_year_eur": [total_current_low, total_current_high],
|
||||||
|
"estimated_saving_year_eur": [total_saving_low, total_saving_high],
|
||||||
|
"estimated_saving_pct": (
|
||||||
|
# Both bounds use the same denominator (midpoint of the
# current estimate); otherwise the upper bound explodes
# when current_low is small. Capped at 95%.
|
||||||
|
(lambda mid: (
|
||||||
|
f"{min(95, int(100 * total_saving_low / mid))}–"
|
||||||
|
f"{min(95, int(100 * total_saving_high / mid))}%"
|
||||||
|
))((total_current_low + total_current_high) / 2)
|
||||||
|
if total_current_high else "n/a"
|
||||||
|
),
|
||||||
|
"cost_disclaimer": (
|
||||||
|
"Schaetzbereich auf Basis oeffentlicher Listenpreise (Gartner, Forrester 2025). "
|
||||||
|
"Vertragspreise koennen 30-70% niedriger liegen (Volumen-Rabatte, Konzern-Konditionen, "
|
||||||
|
"Bundling). Werte dienen als Diskussionsgrundlage mit dem Einkauf, NICHT als Angebot."
|
||||||
|
),
|
||||||
|
},
|
||||||
|
"by_category": {cat: [v.get("name", "") for v in vs]
|
||||||
|
for cat, vs in by_cat.items()},
|
||||||
|
"redundancies": redundancies,
|
||||||
|
"eu_alternatives": eu_alternatives,
|
||||||
|
"multi_function_tools": multi_function,
|
||||||
|
}
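The `estimated_saving_pct` expression in the summary is compact, so it may help to see it unrolled. The sketch below reproduces the same calculation (shared midpoint denominator, cap, "n/a" when there is no current cost) as a standalone function; `saving_pct` and its `cap` parameter are hypothetical names chosen for illustration.

```python
def saving_pct(saving: tuple[int, int], current: tuple[int, int], cap: int = 95) -> str:
    """Format a saving range as a percentage of the midpoint of the current estimate."""
    if not current[1]:
        return "n/a"  # nothing to compare against
    mid = (current[0] + current[1]) / 2
    low_pct = min(cap, int(100 * saving[0] / mid))
    high_pct = min(cap, int(100 * saving[1] / mid))
    return f"{low_pct}–{high_pct}%"


print(saving_pct((100_000, 300_000), (200_000, 600_000)))
print(saving_pct((0, 500_000), (100_000, 300_000)))  # upper bound hits the cap
```

Dividing both bounds by the same midpoint keeps the two percentages comparable; dividing the upper saving by `current_low` instead would inflate it whenever the lower cost estimate is small.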
|
||||||
|
|
||||||
|
|
||||||
|
_CATEGORY_LABEL = {
|
||||||
|
"web_analytics": "Web-Analytics",
|
||||||
|
"advertising": "Werbung / Retargeting",
|
||||||
|
"tag_management": "Tag-Management",
|
||||||
|
"marketing_automation": "Marketing-Automation",
|
||||||
|
"personalisation": "Personalisierung",
|
||||||
|
"external_media": "Externe Medien (Video)",
|
||||||
|
"maps": "Karten / Geo",
|
||||||
|
"cdn": "CDN",
|
||||||
|
"cloud_infra": "Cloud-Infrastruktur",
|
||||||
|
"monitoring": "Performance-Monitoring",
|
||||||
|
"crm": "CRM",
|
||||||
|
"chat": "Chat / Support",
|
||||||
|
"captcha": "Bot-Schutz",
|
||||||
|
"lead_tracking": "Lead-Tracking",
|
||||||
|
"survey": "Umfragen",
|
||||||
|
"social_aggregator": "Social-Media-Aggregation",
|
||||||
|
"consent_management": "Consent-Management",
|
||||||
|
"auth": "Authentifizierung",
|
||||||
|
"site_infra": "Eigene Infrastruktur",
|
||||||
|
"site_feature": "Eigene Features",
|
||||||
|
"other": "Sonstige",
|
||||||
|
}
|
||||||
|
|
||||||
|
_CONSOLIDATION_HINT = {
|
||||||
|
"web_analytics": "Mehrere Analytics-Tools sammeln meist redundante Daten. Ein Tool genuegt — Matomo (DE) ist DSGVO-Standard.",
|
||||||
|
"advertising": "Werbe-/Retargeting-Pixel sind oft austauschbar. Konzentration auf 2-3 Kanaele senkt Drittland-Risiko.",
|
||||||
|
"external_media": "Mehrere Video-Embeds nur wenn fachlich noetig. Self-hosted (BunnyStream/Vimeo) reduziert Tracking.",
|
||||||
|
"maps": "Eine Karten-Loesung reicht. HERE Maps (DE) als EU-Alternative zu Google Maps.",
|
||||||
|
"cdn": "Ein CDN+Performance-Stack genuegt. IONOS oder Bunny vereinen mehrere Funktionen.",
|
||||||
|
"marketing_automation": "Marketing-Cloud + separates E-Mail-Tool sind oft Dopplung — SAP CX oder CleverReach allein moeglich.",
|
||||||
|
"chat": "Ein Chat-System genuegt. Userlike (DE) ersetzt Genesys-Stack.",
|
||||||
|
"monitoring": "RUM + APM koennen in einem Tool gebuendelt werden (Dynatrace EU oder Sentry-Self-host).",
|
||||||
|
"survey": "Eine Survey-Plattform genuegt — LamaPoll (DE) oder Mapp.",
|
||||||
|
}
|
||||||
@@ -0,0 +1,229 @@
|
|||||||
|
"""
|
||||||
|
LLM audit: checks for every text MC whether it genuinely fits its
assigned doc_type. The BMW run showed, for example, that most
Impressum MCs are actually DSA/e-commerce obligations that do not
appear in a §5 TMG Impressum at all.

Output:
- doc_type fits → MC stays active (no DB update)
- doc_type does NOT fit → check_type is set to 'misclassified';
  rag_document_checker then filters those out

Sidecar SQLite. NO Postgres change (schema frozen).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sqlite3
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
import psycopg2
|
||||||
|
from psycopg2.extras import RealDictCursor
|
||||||
|
|
||||||
|
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||||
|
MODEL = "claude-sonnet-4-6"
|
||||||
|
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||||
|
BATCH_SIZE = 25
|
||||||
|
SLEEP_BETWEEN_BATCHES = 0.5
|
||||||
|
|
||||||
|
DOC_TYPE_DESCRIPTIONS = {
|
||||||
|
"agb": "Allgemeine Geschaeftsbedingungen — Vertragsbedingungen "
|
||||||
|
"zwischen Anbieter und Kunde",
|
||||||
|
"avv": "Auftragsverarbeitungsvertrag Art. 28 DSGVO — Vertrag zwischen "
|
||||||
|
"Verantwortlichem und Auftragsverarbeiter",
|
||||||
|
"cookie": "Cookie-Richtlinie — Information ueber Zwecke, Anbieter, "
|
||||||
|
"Speicherdauer und Widerruf von Cookies/Tracking-Pixeln",
|
||||||
|
"dse": "Datenschutzerklaerung Art. 13 DSGVO — Informationen ueber "
|
||||||
|
"Verantwortlichen, Zwecke, Rechtsgrundlagen, Empfaenger, "
|
||||||
|
"Speicherdauer, Betroffenenrechte, Aufsichtsbehoerde",
|
||||||
|
"dsfa": "Datenschutz-Folgenabschaetzung Art. 35 DSGVO — Risikobewertung "
|
||||||
|
"von Verarbeitungen mit hohem Risiko",
|
||||||
|
"impressum": "Impressum §5 TMG / §18 MStV — Anbieterkennzeichnung mit "
|
||||||
|
"Name, Anschrift, Vertretungsberechtigten, Handelsregister, "
|
||||||
|
"USt-IdNr., berufsrechtliche Angaben, Aufsicht",
|
||||||
|
"loeschkonzept": "Loeschkonzept DIN 66398 — Definition Aufbewahrungs- "
|
||||||
|
"und Loeschfristen pro Datenkategorie + Prozess",
|
||||||
|
"widerruf": "Widerrufsbelehrung §312 BGB — Verbraucher-Widerrufsrecht "
|
||||||
|
"bei Fernabsatz, Frist, Folgen, Muster",
|
||||||
|
}
|
||||||
|
|
||||||
|
SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung.
|
||||||
|
|
||||||
|
Fuer jeden MC bekommst du:
|
||||||
|
- den ihm zugeordneten doc_type (z.B. 'impressum', 'cookie', 'dse')
|
||||||
|
- den Titel und die check_question
|
||||||
|
|
||||||
|
Deine Aufgabe: entscheide ob der MC fachlich wirklich zu diesem doc_type passt.
|
||||||
|
|
||||||
|
Beispiele:
|
||||||
|
- MC "Wird die USt-IdNr im Impressum genannt?" + doc_type=impressum → PASST
|
||||||
|
- MC "Monatlich aktive Endnutzer fuer Online-Vermittlungsdienste messen" + doc_type=impressum → PASST NICHT
|
||||||
|
(DSA-Pflicht, gehoert nicht in ein klassisches §5-TMG-Impressum)
|
||||||
|
- MC "Mithoeren von Funknachrichten verhindern" + doc_type=cookie → PASST NICHT
|
||||||
|
(TKG-Spezialthema, nicht Cookie-Richtlinie)
|
||||||
|
|
||||||
|
Antworte als JSON-Array, eine Zeile pro MC:
|
||||||
|
[{"control_id": "<wie input>", "doc_type": "<wie input>", "fits": true|false,
|
||||||
|
"rationale": "ein kurzer satz"}, ...]
|
||||||
|
Kein Markdown."""
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_pairs_to_audit(conn) -> list[dict]:
|
||||||
|
"""All text-MCs that haven't been audited yet (fits_doc_type is still NULL)."""
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as side:
|
||||||
|
cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
|
||||||
|
if "fits_doc_type" not in cols:
|
||||||
|
side.execute("ALTER TABLE mc_classification ADD COLUMN fits_doc_type INTEGER")
|
||||||
|
side.commit()
|
||||||
|
already = set()
|
||||||
|
for cid, dt in side.execute(
|
||||||
|
"SELECT control_id, doc_type FROM mc_classification "
|
||||||
|
"WHERE fits_doc_type IS NOT NULL"
|
||||||
|
):
|
||||||
|
already.add((cid, dt or ""))
|
||||||
|
|
||||||
|
with conn.cursor(cursor_factory=RealDictCursor) as c:
|
||||||
|
c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
FROM compliance.doc_check_controls dc""")
|
||||||
|
all_rows = list(c.fetchall())
|
||||||
|
|
||||||
|
# Audit only those classified as 'text' in sidecar — process/review
|
||||||
|
# never run through doc_check anyway
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as side:
|
||||||
|
text_pairs = set()
|
||||||
|
for cid, dt in side.execute(
|
||||||
|
"SELECT control_id, doc_type FROM mc_classification "
|
||||||
|
"WHERE check_type = 'text'"
|
||||||
|
):
|
||||||
|
text_pairs.add((cid, dt or ""))
|
||||||
|
|
||||||
|
target = [r for r in all_rows
|
||||||
|
if (r["control_id"], r["doc_type"] or "") in text_pairs
|
||||||
|
and (r["control_id"], r["doc_type"] or "") not in already]
|
||||||
|
return target
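The function above adds the `fits_doc_type` column on first use by inspecting `PRAGMA table_info`. That pattern can be isolated as follows; this is a minimal sketch against an in-memory database, with `ensure_column` as a hypothetical helper name. Table and column names are interpolated into the SQL, so it assumes they come from trusted code, never from user input.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if it is missing. Returns True when the column was added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mc_classification (control_id TEXT, doc_type TEXT, check_type TEXT)")
first = ensure_column(conn, "mc_classification", "fits_doc_type", "INTEGER")
second = ensure_column(conn, "mc_classification", "fits_doc_type", "INTEGER")
print(first, second)
```

Because the second call is a no-op, the audit script can be re-run safely against a sidecar database that has already been migrated.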
|
||||||
|
|
||||||
|
|
||||||
|
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
|
||||||
|
payload = {
|
||||||
|
"model": MODEL,
|
||||||
|
"max_tokens": 4000,
|
||||||
|
"system": SYSTEM_PROMPT,
|
||||||
|
"messages": [{
|
||||||
|
"role": "user",
|
||||||
|
"content": (
|
||||||
|
"Doc-Typen-Beschreibungen:\n"
|
||||||
|
+ "\n".join(f"- {k}: {v}" for k, v in DOC_TYPE_DESCRIPTIONS.items())
|
||||||
|
+ "\n\nPruefe folgende MCs:\n\n"
|
||||||
|
+ json.dumps([
|
||||||
|
{"control_id": m["control_id"], "doc_type": m["doc_type"],
|
||||||
|
"title": m["title"], "check_question": (m["check_question"] or "")[:300]}
|
||||||
|
for m in batch
|
||||||
|
], ensure_ascii=False, indent=2)
|
||||||
|
),
|
||||||
|
}],
|
||||||
|
}
|
||||||
|
headers = {
|
||||||
|
"x-api-key": api_key,
|
||||||
|
"anthropic-version": "2023-06-01",
|
||||||
|
"content-type": "application/json",
|
||||||
|
}
|
||||||
|
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
|
||||||
|
r.raise_for_status()
|
||||||
|
txt = r.json()["content"][0]["text"].strip()
|
||||||
|
if txt.startswith("```"):
|
||||||
|
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
|
||||||
|
if txt.startswith("json"):
|
||||||
|
txt = txt[4:].strip()
|
||||||
|
return json.loads(txt)
|
||||||
|
|
||||||
|
|
||||||
|
def store_audit(rows: list[dict]) -> None:
|
||||||
|
ts = datetime.now(timezone.utc).isoformat()
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
c.executemany(
|
||||||
|
"UPDATE mc_classification SET fits_doc_type = ?, "
|
||||||
|
"rationale = COALESCE(?, rationale), classified_at = ? "
|
||||||
|
"WHERE control_id = ? AND doc_type = ?",
|
||||||
|
[
|
||||||
|
(
|
||||||
|
1 if r.get("fits") else 0,
|
||||||
|
(r.get("rationale") or "")[:500] or None,
|
||||||
|
ts,
|
||||||
|
r.get("control_id"),
|
||||||
|
r.get("doc_type") or "",
|
||||||
|
)
|
||||||
|
for r in rows
|
||||||
|
],
|
||||||
|
)
|
||||||
|
c.commit()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
ap = argparse.ArgumentParser()
|
||||||
|
ap.add_argument("--sample", action="store_true")
|
||||||
|
args = ap.parse_args()
|
||||||
|
|
||||||
|
api_key = os.environ["ANTHROPIC_API_KEY"]
|
||||||
|
conn = psycopg2.connect(os.environ["DATABASE_URL"])
|
||||||
|
pairs = fetch_pairs_to_audit(conn)
|
||||||
|
|
||||||
|
if args.sample:
|
||||||
|
for m in pairs[:5]:
|
||||||
|
print(json.dumps(m, ensure_ascii=False, indent=2))
|
||||||
|
print(f"\nTotal pairs to audit: {len(pairs)}")
|
||||||
|
return
|
||||||
|
|
||||||
|
print(f"Audit {len(pairs)} text-MCs in Batches von {BATCH_SIZE}", flush=True)
|
||||||
|
if not pairs:
|
||||||
|
print("Alles auditiert.")
|
||||||
|
return
|
||||||
|
|
||||||
|
done = 0
|
||||||
|
failed_batches = 0
|
||||||
|
t0 = time.time()
|
||||||
|
for i in range(0, len(pairs), BATCH_SIZE):
|
||||||
|
batch = pairs[i:i + BATCH_SIZE]
|
||||||
|
try:
|
||||||
|
out = call_claude(api_key, batch)
|
||||||
|
store_audit(out)
|
||||||
|
done += len(out)
|
||||||
|
elapsed = time.time() - t0
|
||||||
|
rate = done / max(elapsed, 0.01)
|
||||||
|
eta = (len(pairs) - done) / max(rate, 0.01)
|
||||||
|
print(f" [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
|
||||||
|
except Exception as e:
|
||||||
|
failed_batches += 1
|
||||||
|
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
|
||||||
|
if failed_batches >= 5:
|
||||||
|
print("Zu viele Fehler — abbrechen.", file=sys.stderr)
|
||||||
|
break
|
||||||
|
time.sleep(SLEEP_BETWEEN_BATCHES)
|
||||||
|
|
||||||
|
print(f"\nFertig: {done}/{len(pairs)} auditiert ({failed_batches} Fehlbatches).")
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
c.row_factory = sqlite3.Row
|
||||||
|
rows = c.execute(
|
||||||
|
"SELECT doc_type, "
|
||||||
|
" SUM(CASE WHEN fits_doc_type = 1 THEN 1 ELSE 0 END) AS fits, "
|
||||||
|
" SUM(CASE WHEN fits_doc_type = 0 THEN 1 ELSE 0 END) AS misfits, "
|
||||||
|
" COUNT(*) AS total "
|
||||||
|
"FROM mc_classification "
|
||||||
|
"WHERE check_type = 'text' AND fits_doc_type IS NOT NULL "
|
||||||
|
"GROUP BY doc_type ORDER BY doc_type"
|
||||||
|
).fetchall()
|
||||||
|
print("\n=== Audit-Verteilung doc_type x fits ===")
|
||||||
|
for r in rows:
|
||||||
|
print(f" {r['doc_type']:<14} fits={r['fits']:<4} misfits={r['misfits']:<4} total={r['total']}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
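The `call_claude` helper above strips a Markdown fence by splitting on the first newline, which raises `IndexError` when the reply is a single fenced line and does not cope with prose before the array. A minimal sketch of a more tolerant parser follows. The name `extract_json_array` is hypothetical and not part of this commit, and the slice from the first `[` to the last `]` assumes the reply contains no other bracketed text.

```python
import json


def extract_json_array(text: str) -> list:
    """Return the JSON array embedded in a model reply.

    Hypothetical helper, not part of the commit. It ignores code fences
    and surrounding prose by slicing from the first '[' to the last ']'.
    """
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    data = json.loads(text[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    return data
```

A malformed reply still raises (`ValueError` or `json.JSONDecodeError`), so the existing per-batch `except Exception` would keep counting it as a failed batch.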
@@ -0,0 +1,216 @@
|
|||||||
|
"""
|
||||||
|
A3-v2: sharpen which MCs actually target the CMP banner UI or a process
rather than the document TEXT.

The BMW run showed ~10 'language clear/simple' and 'button text worded' MCs
that the first audit marked as fitting their doc_type. They are NOT
checkable against the cookie-policy or privacy-policy (DSE) text, however:
they ask about the comprehensibility of the consent UI.

Sets 'scope_requires' on FRT/biometric MCs and re-classifies these UI items
to check_type='process' (which filters them out of doc_check).

Also sets scope_requires for obviously conditional MCs:
- 'biometric_processing' for FRT/face recognition
- 'ai_decision_making' for automated individual decisions
- 'child_targeting' for children's-consent MCs
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sqlite3
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
import psycopg2
|
||||||
|
from psycopg2.extras import RealDictCursor
|
||||||
|
|
||||||
|
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||||
|
MODEL = "claude-sonnet-4-6"
|
||||||
|
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||||
|
BATCH_SIZE = 20
|
||||||
|
|
||||||
|
SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung
|
||||||
|
zweiter Ordnung: jeder MC wurde bereits als 'text' klassifiziert und einem
|
||||||
|
doc_type zugeordnet. Du entscheidest:
|
||||||
|
|
||||||
|
A) Prueft der MC den TEXT des Dokuments (z.B. "Wird im Impressum die
|
||||||
|
USt-IdNr genannt?") → fits_doc_type=true, ui_only=false
|
||||||
|
B) Prueft der MC die UI/UX eines Banners oder Buttons (z.B.
|
||||||
|
"Ist der Ablehnungs-Button gleich gross wie Akzeptieren?", "Ist die
|
||||||
|
Einwilligungsanfrage in einfacher Sprache formuliert?") → ui_only=true
|
||||||
|
(Diese MCs koennen NIE im Dokumenttext stehen, weil sie sich auf eine
|
||||||
|
externe UI beziehen.)
|
||||||
|
|
||||||
|
Zusaetzlich kennzeichne SCOPE-Bedingungen, wenn der MC nur fuer bestimmte
|
||||||
|
Sites relevant ist:
|
||||||
|
- 'biometric_processing' : nur bei Sites die biometrische Daten
|
||||||
|
(Gesichtserkennung/FRT, Fingerabdruecke) verarbeiten
|
||||||
|
- 'ai_decision_making' : nur bei automatisierten Einzelentscheidungen
|
||||||
|
(Art. 22 DSGVO)
|
||||||
|
- 'child_targeting' : nur bei Sites die sich an Kinder richten
|
||||||
|
- 'ecommerce' : nur bei Webshops
|
||||||
|
- 'b2c' : nur fuer Verbraucher-Geschaeft (UWG/BGB-Pflichten)
|
||||||
|
Wenn der MC fuer ALLE Sites gilt, setze scope_requires=null.
|
||||||
|
|
||||||
|
Antworte als JSON-Array — keine Erklaerung davor/danach, kein Markdown.
|
||||||
|
Format:
|
||||||
|
[{"control_id": "<wie input>", "doc_type": "<wie input>",
|
||||||
|
"ui_only": true|false, "scope_requires": "biometric_processing|ai_decision_making|child_targeting|ecommerce|b2c|null",
|
||||||
|
"rationale": "ein kurzer satz"}, ...]"""
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_pairs_to_audit(conn) -> list[dict]:
    """All text-MCs that haven't been v2-audited yet (ui_only is still NULL)."""
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as side:
|
||||||
|
cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
|
||||||
|
added = False
|
||||||
|
if "ui_only" not in cols:
|
||||||
|
side.execute("ALTER TABLE mc_classification ADD COLUMN ui_only INTEGER")
|
||||||
|
added = True
|
||||||
|
if "scope_requires" not in cols:
|
||||||
|
side.execute("ALTER TABLE mc_classification ADD COLUMN scope_requires TEXT")
|
||||||
|
added = True
|
||||||
|
if added:
|
||||||
|
side.commit()
|
||||||
|
already = set()
|
||||||
|
for cid, dt in side.execute(
|
||||||
|
"SELECT control_id, doc_type FROM mc_classification "
|
||||||
|
"WHERE ui_only IS NOT NULL"
|
||||||
|
):
|
||||||
|
already.add((cid, dt or ""))
|
||||||
|
|
||||||
|
with conn.cursor(cursor_factory=RealDictCursor) as c:
|
||||||
|
c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
|
||||||
|
FROM compliance.doc_check_controls dc""")
|
||||||
|
all_rows = list(c.fetchall())
|
||||||
|
|
||||||
|
# Audit only pairs the sidecar classifies as 'text' and that the first audit
# did not mark as a doc_type misfit (fits_doc_type is NULL or 1)
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as side:
|
||||||
|
eligible = set()
|
||||||
|
for cid, dt in side.execute(
|
||||||
|
"SELECT control_id, doc_type FROM mc_classification "
|
||||||
|
"WHERE check_type = 'text' AND (fits_doc_type IS NULL OR fits_doc_type = 1)"
|
||||||
|
):
|
||||||
|
eligible.add((cid, dt or ""))
|
||||||
|
|
||||||
|
target = [r for r in all_rows
|
||||||
|
if (r["control_id"], r["doc_type"] or "") in eligible
|
||||||
|
and (r["control_id"], r["doc_type"] or "") not in already]
|
||||||
|
return target
|
||||||
|
|
||||||
|
|
||||||
|
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
|
||||||
|
payload = {
|
||||||
|
"model": MODEL,
|
||||||
|
"max_tokens": 4000,
|
||||||
|
"system": SYSTEM_PROMPT,
|
||||||
|
"messages": [{
|
||||||
|
"role": "user",
|
||||||
|
"content": "Pruefe folgende MCs:\n\n" + json.dumps([
|
||||||
|
{"control_id": m["control_id"], "doc_type": m["doc_type"],
|
||||||
|
"title": m["title"], "check_question": (m["check_question"] or "")[:300]}
|
||||||
|
for m in batch
|
||||||
|
], ensure_ascii=False, indent=2),
|
||||||
|
}],
|
||||||
|
}
|
||||||
|
headers = {
|
||||||
|
"x-api-key": api_key,
|
||||||
|
"anthropic-version": "2023-06-01",
|
||||||
|
"content-type": "application/json",
|
||||||
|
}
|
||||||
|
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
|
||||||
|
r.raise_for_status()
|
||||||
|
txt = r.json()["content"][0]["text"].strip()
|
||||||
|
if txt.startswith("```"):
|
||||||
|
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
|
||||||
|
if txt.startswith("json"):
|
||||||
|
txt = txt[4:].strip()
|
||||||
|
return json.loads(txt)
|
||||||
|
|
||||||
|
|
||||||
|
def store(rows: list[dict]) -> None:
    def norm_scope(value) -> str | None:
        # The model may return JSON null, "" or the literal string "null".
        s = str(value or "").strip()
        return None if s.lower() in ("", "null") else s

    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET ui_only = ?, scope_requires = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("ui_only") else 0,
                    norm_scope(r.get("scope_requires")),
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        # MCs flagged ui_only become check_type='process' so doc_check skips them
        c.executemany(
            "UPDATE mc_classification SET check_type='process' "
            "WHERE ui_only=1 AND control_id=? AND doc_type=?",
            [(r.get("control_id"), r.get("doc_type") or "") for r in rows
             if r.get("ui_only")],
        )
        c.commit()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
ap = argparse.ArgumentParser()
|
||||||
|
ap.add_argument("--sample", action="store_true")
|
||||||
|
args = ap.parse_args()
|
||||||
|
|
||||||
|
api_key = os.environ["ANTHROPIC_API_KEY"]
|
||||||
|
conn = psycopg2.connect(os.environ["DATABASE_URL"])
|
||||||
|
pairs = fetch_pairs_to_audit(conn)
|
||||||
|
|
||||||
|
if args.sample:
|
||||||
|
for m in pairs[:5]:
|
||||||
|
print(json.dumps(m, ensure_ascii=False, indent=2))
|
||||||
|
print(f"\nTotal: {len(pairs)}")
|
||||||
|
return
|
||||||
|
|
||||||
|
print(f"Audit-v2: {len(pairs)} MCs in Batches von {BATCH_SIZE}", flush=True)
|
||||||
|
if not pairs:
|
||||||
|
print("Alles geprueft.")
|
||||||
|
return
|
||||||
|
|
||||||
|
done = 0
|
||||||
|
fail = 0
|
||||||
|
t0 = time.time()
|
||||||
|
for i in range(0, len(pairs), BATCH_SIZE):
|
||||||
|
batch = pairs[i:i + BATCH_SIZE]
|
||||||
|
try:
|
||||||
|
out = call_claude(api_key, batch)
|
||||||
|
store(out)
|
||||||
|
done += len(out)
|
||||||
|
elapsed = time.time() - t0
|
||||||
|
rate = done / max(elapsed, 0.01)
|
||||||
|
eta = (len(pairs) - done) / max(rate, 0.01)
|
||||||
|
print(f" [{done:>4}/{len(pairs)}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
|
||||||
|
except Exception as e:
|
||||||
|
fail += 1
|
||||||
|
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", file=sys.stderr, flush=True)
|
||||||
|
if fail >= 5: break
|
||||||
|
time.sleep(0.5)
|
||||||
|
|
||||||
|
print(f"\nFertig: {done}/{len(pairs)} ({fail} Fehlbatches).")
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
ui = c.execute("SELECT COUNT(*) FROM mc_classification WHERE ui_only=1").fetchone()[0]
|
||||||
|
scope = c.execute(
|
||||||
|
"SELECT scope_requires, COUNT(*) FROM mc_classification "
|
||||||
|
"WHERE scope_requires IS NOT NULL GROUP BY scope_requires"
|
||||||
|
).fetchall()
|
||||||
|
print(f"\nui_only=1: {ui} MCs (jetzt 'process')")
|
||||||
|
print("scope_requires Verteilung:")
|
||||||
|
for s, n in scope:
|
||||||
|
print(f" {s}: {n}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
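The system prompt of the v2 audit lists five permitted `scope_requires` values, but `store` writes whatever string the model returns. A minimal sketch of an allow-list check follows. `ALLOWED_SCOPES` and `normalize_scope` are hypothetical names, and dropping unknown values to `None` is an assumption about the desired behaviour, not something the commit does.

```python
from __future__ import annotations

# Values taken from the SCOPE list in the v2 audit system prompt.
ALLOWED_SCOPES = {
    "biometric_processing",
    "ai_decision_making",
    "child_targeting",
    "ecommerce",
    "b2c",
}


def normalize_scope(value: object) -> str | None:
    """Map a model-supplied scope to an allowed value, else None.

    Hypothetical helper: non-strings, "", "null" and any value outside
    ALLOWED_SCOPES are all treated as 'applies to every site'.
    """
    if not isinstance(value, str):
        return None
    scope = value.strip().lower()
    return scope if scope in ALLOWED_SCOPES else None
```

Treating an unknown scope as `None` keeps the MC in every run. Rejecting the whole batch instead would be the stricter choice if silent widening is a concern.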
@@ -0,0 +1,222 @@
|
|||||||
|
"""
|
||||||
|
Classify doc_check_controls (1874 MCs) into check_type:
|
||||||
|
- text : MC asks about the DOCUMENT'S WORDING ("enthaelt die DSE..", "wird im Impressum genannt..")
|
||||||
|
- process : MC asks about a TECHNICAL OR ORGANISATIONAL CONTROL ("ist sichergestellt..", "ist implementiert..")
|
||||||
|
- review : MC asks an ABSTRACT META-QUESTION that needs human judgment ("sind alle Verarbeitungen vollstaendig?")
|
||||||
|
|
||||||
|
Output goes to sidecar SQLite at /data/mc_classification.db (no Postgres schema change
|
||||||
|
per CLAUDE.md guardrails). Schema:
|
||||||
|
|
||||||
|
CREATE TABLE mc_classification (
|
||||||
|
control_id TEXT PRIMARY KEY,
|
||||||
|
doc_type TEXT,
|
||||||
|
title TEXT,
|
||||||
|
check_type TEXT, -- text|process|review
|
||||||
|
confidence REAL, -- 0..1
|
||||||
|
rationale TEXT,
|
||||||
|
classified_at TEXT
|
||||||
|
);
|
||||||
|
|
||||||
|
Run from inside bp-compliance-backend container:
|
||||||
|
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type.py [--limit N] [--sample]
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sqlite3
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
import psycopg2
|
||||||
|
from psycopg2.extras import RealDictCursor
|
||||||
|
|
||||||
|
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||||
|
MODEL = "claude-sonnet-4-6"
|
||||||
|
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||||
|
BATCH_SIZE = 25
|
||||||
|
SLEEP_BETWEEN_BATCHES = 0.5 # sec — keep gentle for the parallel Haiku batch
|
||||||
|
|
||||||
|
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
|
||||||
|
|
||||||
|
TEXT — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments (DSE, Impressum, Cookie-Richtlinie etc.) steht.
|
||||||
|
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?" / "Wird im Impressum die USt-IdNr genannt?"
|
||||||
|
Diese MCs koennen gegen den Dokument-Text gematched werden.
|
||||||
|
|
||||||
|
PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME, die ausserhalb des Dokumenttexts liegt.
|
||||||
|
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" /
|
||||||
|
"Ist ein Loeschkonzept implementiert?" / "Wird das Consent-Tool monatlich getestet?"
|
||||||
|
Diese MCs lassen sich NICHT aus dem Dokumenttext beantworten — sie brauchen Evidence/TOM-Nachweis.
|
||||||
|
|
||||||
|
REVIEW — Der MC stellt eine abstrakte Vollstaendigkeits-/Konsistenzfrage, die menschliche Beurteilung braucht.
|
||||||
|
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?" / "Stimmen Cookie-Richtlinie und Banner ueberein?"
|
||||||
|
Diese MCs sind Checklisten-Items fuer den DSB, kein automatischer Check moeglich.
|
||||||
|
|
||||||
|
Antworte ausschliesslich als JSON-Array — keine Erklaerung davor/danach, kein Markdown-Codeblock. Format:
|
||||||
|
[{"control_id": "<wie input>", "check_type": "text|process|review", "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_mcs(conn, limit: int | None = None, only_unclassified: bool = True) -> list[dict]:
    # IDs already in the sidecar are filtered out in Python, so no temp table
    # is needed on the Postgres side and no error is silently swallowed.
    classified: set[str] = set()
    if only_unclassified:
        with sqlite3.connect(SIDECAR_DB) as side:
            classified = {r[0] for r in side.execute("SELECT control_id FROM mc_classification")}
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls
                     ORDER BY doc_type, title""")
        rows = [r for r in c.fetchall() if r["control_id"] not in classified]
    # Apply the limit after filtering, as the original LIMIT clause did.
    return rows[:limit] if limit else rows
|
||||||
|
|
||||||
|
|
||||||
|
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
|
||||||
|
payload = {
|
||||||
|
"model": MODEL,
|
||||||
|
"max_tokens": 4000,
|
||||||
|
"system": SYSTEM_PROMPT,
|
||||||
|
"messages": [{
|
||||||
|
"role": "user",
|
||||||
|
"content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
|
||||||
|
[{"control_id": m["control_id"],
|
||||||
|
"doc_type": m["doc_type"],
|
||||||
|
"title": m["title"],
|
||||||
|
"check_question": (m["check_question"] or "")[:400]}
|
||||||
|
for m in batch],
|
||||||
|
ensure_ascii=False, indent=2),
|
||||||
|
}],
|
||||||
|
}
|
||||||
|
headers = {
|
||||||
|
"x-api-key": api_key,
|
||||||
|
"anthropic-version": "2023-06-01",
|
||||||
|
"content-type": "application/json",
|
||||||
|
}
|
||||||
|
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
|
||||||
|
r.raise_for_status()
|
||||||
|
txt = r.json()["content"][0]["text"].strip()
|
||||||
|
# Strip code fences if Sonnet adds them
|
||||||
|
if txt.startswith("```"):
|
||||||
|
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
|
||||||
|
if txt.startswith("json"):
|
||||||
|
txt = txt[4:].strip()
|
||||||
|
return json.loads(txt)
|
||||||
|
|
||||||
|
|
||||||
|
def ensure_sidecar() -> None:
|
||||||
|
Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
c.executescript("""
|
||||||
|
CREATE TABLE IF NOT EXISTS mc_classification (
|
||||||
|
control_id TEXT PRIMARY KEY,
|
||||||
|
doc_type TEXT,
|
||||||
|
title TEXT,
|
||||||
|
check_type TEXT,
|
||||||
|
confidence REAL,
|
||||||
|
rationale TEXT,
|
||||||
|
classified_at TEXT
|
||||||
|
);
|
||||||
|
CREATE INDEX IF NOT EXISTS idx_doctype ON mc_classification(doc_type);
|
||||||
|
CREATE INDEX IF NOT EXISTS idx_type ON mc_classification(check_type);
|
||||||
|
""")
|
||||||
|
|
||||||
|
|
||||||
|
def store_results(rows: list[dict], lookup: dict[str, dict]) -> None:
|
||||||
|
ts = datetime.now(timezone.utc).isoformat()
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
c.executemany(
|
||||||
|
"INSERT OR REPLACE INTO mc_classification "
|
||||||
|
"(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
|
||||||
|
"VALUES (?, ?, ?, ?, ?, ?, ?)",
|
||||||
|
[
|
||||||
|
(
|
||||||
|
r.get("control_id"),
|
||||||
|
lookup.get(r.get("control_id"), {}).get("doc_type", ""),
|
||||||
|
lookup.get(r.get("control_id"), {}).get("title", ""),
|
||||||
|
(r.get("check_type") or "").lower(),
|
||||||
|
float(r.get("confidence") or 0),
|
||||||
|
(r.get("rationale") or "")[:500],
|
||||||
|
ts,
|
||||||
|
)
|
||||||
|
for r in rows
|
||||||
|
],
|
||||||
|
)
|
||||||
|
c.commit()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
ap = argparse.ArgumentParser()
|
||||||
|
ap.add_argument("--limit", type=int, default=None, help="limit number of MCs (for testing)")
|
||||||
|
ap.add_argument("--sample", action="store_true", help="just print first 5 MCs and exit")
|
||||||
|
ap.add_argument("--all", action="store_true", help="re-classify all (otherwise skips already classified)")
|
||||||
|
args = ap.parse_args()
|
||||||
|
|
||||||
|
ensure_sidecar()
|
||||||
|
api_key = os.environ["ANTHROPIC_API_KEY"]
|
||||||
|
conn = psycopg2.connect(os.environ["DATABASE_URL"])
|
||||||
|
mcs = fetch_mcs(conn, limit=args.limit, only_unclassified=not args.all)
|
||||||
|
|
||||||
|
if args.sample:
|
||||||
|
for m in mcs[:5]:
|
||||||
|
print(json.dumps(m, ensure_ascii=False, indent=2))
|
||||||
|
return
|
||||||
|
|
||||||
|
print(f"Klassifiziere {len(mcs)} MCs in Batches von {BATCH_SIZE}", flush=True)
|
||||||
|
if not mcs:
|
||||||
|
print("Nichts zu tun.")
|
||||||
|
return
|
||||||
|
|
||||||
|
lookup = {m["control_id"]: m for m in mcs}
|
||||||
|
total = len(mcs)
|
||||||
|
done = 0
|
||||||
|
failed_batches = 0
|
||||||
|
t0 = time.time()
|
||||||
|
for i in range(0, total, BATCH_SIZE):
|
||||||
|
batch = mcs[i:i + BATCH_SIZE]
|
||||||
|
try:
|
||||||
|
out = call_claude(api_key, batch)
|
||||||
|
store_results(out, lookup)
|
||||||
|
done += len(out)
|
||||||
|
elapsed = time.time() - t0
|
||||||
|
rate = done / max(elapsed, 0.01)
|
||||||
|
eta = (total - done) / max(rate, 0.01)
|
||||||
|
print(f" [{done:>5}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min",
|
||||||
|
flush=True)
|
||||||
|
except Exception as e:
|
||||||
|
failed_batches += 1
|
||||||
|
print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
|
||||||
|
if failed_batches >= 5:
|
||||||
|
print(" Zu viele Fehler — abbrechen.", file=sys.stderr)
|
||||||
|
break
|
||||||
|
time.sleep(SLEEP_BETWEEN_BATCHES)
|
||||||
|
|
||||||
|
print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
|
||||||
|
# Summary
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
c.row_factory = sqlite3.Row
|
||||||
|
rows = c.execute(
|
||||||
|
"SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
|
||||||
|
"GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
|
||||||
|
).fetchall()
|
||||||
|
print("\n=== Verteilung nach doc_type x check_type ===")
|
||||||
|
prev = None
|
||||||
|
for r in rows:
|
||||||
|
if r["doc_type"] != prev:
|
||||||
|
print(f"\n[{r['doc_type']}]")
|
||||||
|
prev = r["doc_type"]
|
||||||
|
print(f" {r['check_type']:<8} {r['n']}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
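The batch loops in these scripts count a failed batch and move on, so a transient API error leaves that batch unclassified until the next run. A minimal sketch of retrying a single call with exponential backoff follows. `with_retries` is a hypothetical helper, not part of this commit, and the injectable `sleep` parameter exists only so the example can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... and the last
    failure is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    # Only reached when attempts < 1, i.e. fn was never called.
    raise RuntimeError("attempts must be at least 1")
```

In the loop this could wrap the API call, for example `out = with_retries(lambda: call_claude(api_key, batch))`, leaving the existing failed-batch counter as the final safety net.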
@@ -0,0 +1,241 @@
|
|||||||
|
"""
|
||||||
|
v2: Re-classify only the (control_id, doc_type) PAIRS that aren't covered yet.
|
||||||
|
|
||||||
|
V1 used PK=control_id, so cross-doc-type variants (same control assigned
|
||||||
|
to e.g. AGB AND Widerruf with different check_questions) overwrote each
|
||||||
|
other. v2 migrates to PK=(control_id, doc_type) and classifies only the
|
||||||
|
~262 missing pairs.
|
||||||
|
|
||||||
|
Run from container:
|
||||||
|
docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type_v2.py [--sample]
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sqlite3
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
import psycopg2
|
||||||
|
from psycopg2.extras import RealDictCursor
|
||||||
|
|
||||||
|
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||||
|
MODEL = "claude-sonnet-4-6"
|
||||||
|
SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
|
||||||
|
BATCH_SIZE = 25
|
||||||
|
SLEEP_BETWEEN_BATCHES = 0.5
|
||||||
|
|
||||||
|
SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
|
||||||
|
|
||||||
|
TEXT — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments steht.
|
||||||
|
Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?"
|
||||||
|
Diese MCs koennen gegen den Dokument-Text gematched werden.
|
||||||
|
|
||||||
|
PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME ausserhalb des Dokumenttexts.
|
||||||
|
Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?"
|
||||||
|
|
||||||
|
REVIEW — Der MC stellt eine abstrakte Vollstaendigkeitsfrage, die menschliche Beurteilung braucht.
|
||||||
|
Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?"
|
||||||
|
|
||||||
|
Wichtig: derselbe Control kann fuer VERSCHIEDENE doc_types unterschiedlich klassifiziert sein:
|
||||||
|
mit doc_type-spezifisch angepasster check_question kann ein "text"-Check fuer ein Dok zum
|
||||||
|
"process"-Check fuer ein anderes werden.
|
||||||
|
|
||||||
|
Antworte ausschliesslich als JSON-Array — kein Markdown. Format:
|
||||||
|
[{"control_id": "<input>", "doc_type": "<input>", "check_type": "text|process|review",
|
||||||
|
"confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
|
||||||
|
|
||||||
|
|
||||||
|
def migrate_schema() -> None:
|
||||||
|
"""Migrate sidecar from PK=control_id to PK=(control_id, doc_type)."""
|
||||||
|
Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
# Check if v2 schema already in place (composite PK)
|
||||||
|
cols = c.execute("PRAGMA table_info(mc_classification)").fetchall()
|
||||||
|
if not cols:
|
||||||
|
# First run — create fresh
|
||||||
|
c.executescript("""
|
||||||
|
CREATE TABLE mc_classification (
|
||||||
|
control_id TEXT,
|
||||||
|
doc_type TEXT,
|
||||||
|
title TEXT,
|
||||||
|
check_type TEXT,
|
||||||
|
confidence REAL,
|
||||||
|
rationale TEXT,
|
||||||
|
classified_at TEXT,
|
||||||
|
PRIMARY KEY (control_id, doc_type)
|
||||||
|
);
|
||||||
|
CREATE INDEX idx_doctype ON mc_classification(doc_type);
|
||||||
|
CREATE INDEX idx_type ON mc_classification(check_type);
|
||||||
|
""")
|
||||||
|
return
|
||||||
|
|
||||||
|
# Check whether the existing table already has composite PK
|
||||||
|
pk_cols = [r[1] for r in cols if r[5] > 0]
|
||||||
|
if set(pk_cols) == {"control_id", "doc_type"}:
|
||||||
|
print("Schema already v2 (composite PK). Skipping migration.")
|
||||||
|
return
|
||||||
|
|
||||||
|
print("Migrating sidecar schema to PK(control_id, doc_type)...")
|
||||||
|
c.executescript("""
|
||||||
|
CREATE TABLE mc_classification_v2 (
|
||||||
|
control_id TEXT,
|
||||||
|
doc_type TEXT,
|
||||||
|
title TEXT,
|
||||||
|
check_type TEXT,
|
||||||
|
confidence REAL,
|
||||||
|
rationale TEXT,
|
||||||
|
classified_at TEXT,
|
||||||
|
PRIMARY KEY (control_id, doc_type)
|
||||||
|
);
|
||||||
|
INSERT INTO mc_classification_v2
|
||||||
|
(control_id, doc_type, title, check_type, confidence, rationale, classified_at)
|
||||||
|
SELECT control_id, COALESCE(doc_type, ''), title, check_type, confidence, rationale, classified_at
|
||||||
|
FROM mc_classification;
|
||||||
|
DROP TABLE mc_classification;
|
||||||
|
ALTER TABLE mc_classification_v2 RENAME TO mc_classification;
|
||||||
|
CREATE INDEX idx_doctype ON mc_classification(doc_type);
|
||||||
|
CREATE INDEX idx_type ON mc_classification(check_type);
|
||||||
|
""")
|
||||||
|
n = c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
|
||||||
|
print(f"Migrated {n} existing rows.")
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_unclassified_pairs(conn) -> list[dict]:
|
||||||
|
"""All (control_id, doc_type) pairs in PG that aren't yet in sidecar."""
|
||||||
|
side_pairs: set[tuple[str, str]] = set()
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as side:
|
||||||
|
for cid, dt in side.execute("SELECT control_id, doc_type FROM mc_classification"):
|
||||||
|
side_pairs.add((cid, dt or ""))
|
||||||
|
|
||||||
|
with conn.cursor(cursor_factory=RealDictCursor) as c:
|
||||||
|
c.execute("""SELECT control_id, doc_type, title, check_question
|
||||||
|
FROM compliance.doc_check_controls""")
|
||||||
|
all_rows = list(c.fetchall())
|
||||||
|
|
||||||
|
missing = [r for r in all_rows if (r["control_id"], r["doc_type"] or "") not in side_pairs]
|
||||||
|
return missing
|
||||||
|
|
||||||
|
|
||||||
|
def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
|
||||||
|
payload = {
|
||||||
|
"model": MODEL,
|
||||||
|
"max_tokens": 4000,
|
||||||
|
"system": SYSTEM_PROMPT,
|
||||||
|
"messages": [{
|
||||||
|
"role": "user",
|
||||||
|
"content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
|
||||||
|
[{"control_id": m["control_id"],
|
||||||
|
"doc_type": m["doc_type"],
|
||||||
|
"title": m["title"],
|
||||||
|
"check_question": (m["check_question"] or "")[:400]}
|
||||||
|
for m in batch],
|
||||||
|
ensure_ascii=False, indent=2),
|
||||||
|
}],
|
||||||
|
}
|
||||||
|
headers = {
|
||||||
|
"x-api-key": api_key,
|
||||||
|
"anthropic-version": "2023-06-01",
|
||||||
|
"content-type": "application/json",
|
||||||
|
}
|
||||||
|
r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
|
||||||
|
r.raise_for_status()
|
||||||
|
txt = r.json()["content"][0]["text"].strip()
|
||||||
|
if txt.startswith("```"):
|
||||||
|
txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
|
||||||
|
if txt.startswith("json"):
|
||||||
|
txt = txt[4:].strip()
|
||||||
|
return json.loads(txt)
|
||||||
|
|
||||||
|
|
||||||
|
def store_results(rows: list[dict], lookup: dict[tuple[str, str], dict]) -> None:
|
||||||
|
ts = datetime.now(timezone.utc).isoformat()
|
||||||
|
with sqlite3.connect(SIDECAR_DB) as c:
|
||||||
|
c.executemany(
|
||||||
|
"INSERT OR REPLACE INTO mc_classification "
|
||||||
|
"(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
|
||||||
|
"VALUES (?, ?, ?, ?, ?, ?, ?)",
|
||||||
|
[
|
||||||
|
(
|
||||||
|
r.get("control_id"),
|
||||||
|
r.get("doc_type") or "",
|
||||||
|
lookup.get((r.get("control_id"), r.get("doc_type") or ""), {}).get("title", ""),
|
||||||
|
(r.get("check_type") or "").lower(),
|
||||||
|
float(r.get("confidence") or 0),
|
||||||
|
(r.get("rationale") or "")[:500],
|
||||||
|
ts,
|
||||||
|
)
|
||||||
|
for r in rows
|
||||||
|
],
|
||||||
|
)
|
||||||
|
c.commit()
|
||||||
|
|
||||||
|
|
||||||
|
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--sample", action="store_true")
+    args = ap.parse_args()
+
+    migrate_schema()
+    api_key = os.environ["ANTHROPIC_API_KEY"]
+    conn = psycopg2.connect(os.environ["DATABASE_URL"])
+    missing = fetch_unclassified_pairs(conn)
+
+    if args.sample:
+        for m in missing[:5]:
+            print(json.dumps(m, ensure_ascii=False, indent=2))
+        print(f"\nTotal missing pairs: {len(missing)}")
+        return
+
+    print(f"Klassifiziere {len(missing)} fehlende (control_id, doc_type)-Paare in Batches von {BATCH_SIZE}", flush=True)
+    if not missing:
+        print("Alles klassifiziert. Nichts zu tun.")
+        return
+
+    lookup = {(m["control_id"], m["doc_type"] or ""): m for m in missing}
+    total = len(missing)
+    done = 0
+    failed_batches = 0
+    t0 = time.time()
+    for i in range(0, total, BATCH_SIZE):
+        batch = missing[i:i + BATCH_SIZE]
+        try:
+            out = call_claude(api_key, batch)
+            store_results(out, lookup)
+            done += len(out)
+            elapsed = time.time() - t0
+            rate = done / max(elapsed, 0.01)
+            eta = (total - done) / max(rate, 0.01)
+            print(f" [{done:>4}/{total}] {rate:.1f} MC/s ETA {eta/60:.1f}min", flush=True)
+        except Exception as e:
+            failed_batches += 1
+            print(f" FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
+            if failed_batches >= 5:
+                print(" Zu viele Fehler — abbrechen.", file=sys.stderr)
+                break
+        time.sleep(SLEEP_BETWEEN_BATCHES)
+
+    print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
+    with sqlite3.connect(SIDECAR_DB) as c:
+        c.row_factory = sqlite3.Row
+        rows = c.execute(
+            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
+            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
+        ).fetchall()
+    print("\n=== Final-Verteilung doc_type x check_type ===")
+    prev = None
+    for r in rows:
+        if r["doc_type"] != prev:
+            print(); print(f"[{r['doc_type']}]")
+            prev = r["doc_type"]
+        print(f" {r['check_type']:<8} {r['n']}")
+
+
+if __name__ == "__main__":
+    main()
@@ -172,6 +172,11 @@ class DSIDiscoveryResult:
     # Schema: [{"kind": str, "url": str, "data": dict}, ...]
     # Backend uses these to build vendor records + run per-vendor checks.
     cmp_payloads: list[dict] = field(default_factory=list)
+    # Reconstructed cookie-policy text from all captured CMP payloads
+    # (CMP-library reconstruct + heuristic generic). Backend uses this as
+    # the authoritative cookie-text so MC checks run on the real policy,
+    # not the homepage navigation that DOM extraction returns.
+    cmp_cookie_text: str = ""

 def _matches_dsi_keyword(text: str) -> tuple[bool, str]:
     """Check if text contains any DSI keyword. Returns (match, language)."""
@@ -551,8 +556,17 @@ async def discover_dsi_documents(
     result.cmp_payloads = [
         {"kind": kind, "data": data} for kind, data in cmp_capture.payloads
     ]
-    logger.info("DSI discovery complete: %d documents found in %s, %d CMP payloads",
-                result.total_found, result.languages_detected, len(result.cmp_payloads))
+    if cmp_capture.payloads:
+        try:
+            result.cmp_cookie_text = cmp_capture.reconstruct_cookie_policy()
+        except Exception as e:
+            logger.warning("CMP reconstruct on discovery failed: %s", e)
+    logger.info(
+        "DSI discovery complete: %d documents found in %s, %d CMP payloads, "
+        "cmp_cookie_text=%d words",
+        result.total_found, result.languages_detected, len(result.cmp_payloads),
+        len(result.cmp_cookie_text.split()) if result.cmp_cookie_text else 0,
+    )
     return result

 # Nav elements, not real documents
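The commit message says the backend prefers `cmp_cookie_text` over the DOM-extracted cookie text when it is "richer" (BMW: 1824 vs. 600 words). The backend code is not part of this hunk, so the following is only a minimal sketch of how such a selection might look. `pick_cookie_text` and `word_count` are hypothetical names, and treating "richer" as "more whitespace-separated words" is an assumption drawn from the word counts quoted in the commit message.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words; an empty string counts as 0."""
    return len(text.split())


def pick_cookie_text(dom_text: str, cmp_cookie_text: str) -> str:
    """Return the CMP-reconstructed text only if it has more words than the
    DOM extraction; otherwise keep the DOM text (hypothetical helper)."""
    if word_count(cmp_cookie_text) > word_count(dom_text):
        return cmp_cookie_text
    return dom_text
```

Because `cmp_cookie_text` defaults to `""` and stays empty when the reconstruct step fails, a selection of this shape falls back to the DOM text without any extra error handling.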