feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Large update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs in compliance.doc_check_controls (script + sidecar /data/mc_classification.db). rag_document_checker filters on check_type='text' for doc_check. Plus fits_doc_type audit (v2) + ui_only audit for DSA/e-commerce MCs filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are filtered by business_profile (FRT skipped for BMW etc.).
- Embedding match (BGE-M3) as phase 3 after the regex match: per-doc_type threshold override (impressum 0.50, dse/cookie 0.60), short-field rescue (15-word chunks) for mandatory fields in the Impressum. Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the CMP reconstruct; the backend prefers it over the DOM extraction when it is richer (BMW: 1824 vs. 600 words).

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorization of the CMP vendors, detection of multiple providers per category, EU-alternative lookup (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count + premium-feature cookies + third-party share → starter/professional/enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 license cost (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the upper 40-100% band of the list prices instead of the full starter→premier range.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...) with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk, schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id, ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table. Reduces false-positive no_country flags for vendors that are clearly in the EU (Adform DK, Pinterest IE).

Action recipes + doc anchor locator
- finding_action_recipes: per finding type (no_cookies_listed, no_country, broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling", ...) one structured instruction with what/why/fix_text/where/example, ready to paste 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the matching paragraph in the existing customer document for each finding. Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor per failed document check, plus a vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with 4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview, document-preview, summary). cmp_vendors are persisted in the check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb, similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control library frontend: defensive .map coercers in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in a single call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master controls click-through from /sdk/master-controls to /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (write access for the audit DB).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- A3-v2 audit reclassified 24 UI-language MCs as 'process'.

Tests
- test_migration_mappers.py (9 tests)
- test_migration_endpoints.py (4 tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE        7.5% -> 81-83%
  Impressum  4%   -> 100% (6 real MCs, all satisfied)
  Cookie     0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
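The country inference from the legal form mentioned in the commit message could look roughly like the sketch below. It is an illustration only, not the code of `cookie_link_validator`: the names `inferCountry`, `LEGAL_FORM_COUNTRY` and `VENDOR_LOOKUP` are hypothetical, the order "suffix first, lookup table second" is an assumption, and only the suffix mappings and the two lookup examples (Adform DK, Pinterest IE) come from the message.

```typescript
// Hypothetical sketch, not the cookie_link_validator implementation.
// Suffix -> country pairs are the ones named in the commit message.
const LEGAL_FORM_COUNTRY: Array<[RegExp, string]> = [
  [/\bA\/S$/i, 'DK'],
  [/\bGmbH$/i, 'DE'],
  [/\bInc\.?$/i, 'US'],
  [/\bB\.V\.?$/i, 'NL'],
]

// Assumed shape of the vendor lookup table: lowercase name prefix -> country.
const VENDOR_LOOKUP: Array<[string, string]> = [
  ['adform', 'DK'],
  ['pinterest', 'IE'],
]

function inferCountry(vendorName: string): string | null {
  const name = vendorName.trim()
  // 1) Legal-form suffix at the end of the vendor name.
  for (const [pattern, country] of LEGAL_FORM_COUNTRY) {
    if (pattern.test(name)) return country
  }
  // 2) Fall back to the lookup table for names without a legal form.
  const key = name.toLowerCase()
  for (const [vendor, country] of VENDOR_LOOKUP) {
    if (key.indexOf(vendor) === 0) return country
  }
  // 3) Unknown: the caller would raise the no_country flag here.
  return null
}
```

Returning `null` instead of a default country keeps the no_country flag meaningful for vendors that match neither rule.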
@@ -0,0 +1,55 @@
 /**
 * Proxy: GET /api/sdk/v1/einwilligungen/export?format=csv|json&kind=consents|history
 *   -> backend /api/compliance/einwilligungen/export/<file>
 *
 * Streams the backend response straight through (CSV or JSON download).
 */
 import { NextRequest, NextResponse } from 'next/server'
 const BACKEND_URL = process.env.BACKEND_URL || 'http://backend-compliance:8002'
 function getTenantHeader(request: NextRequest): HeadersInit {
  const uuidRegex = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
  const clientTenantId = request.headers.get('x-tenant-id') || request.headers.get('X-Tenant-ID')
  const tenantId = (clientTenantId && uuidRegex.test(clientTenantId))
    ? clientTenantId
    : (process.env.DEFAULT_TENANT_ID || '9282a473-5c95-4b3a-bf78-0ecc0ec71d3e')
  return { 'X-Tenant-ID': tenantId }
 }
 export async function GET(request: NextRequest) {
  const { searchParams } = new URL(request.url)
  const fmt = (searchParams.get('format') || 'csv').toLowerCase()
  const kind = (searchParams.get('kind') || 'consents').toLowerCase()
  const filename = `${kind}.${fmt === 'json' ? 'json' : 'csv'}`
  const upstreamPath = `/api/compliance/einwilligungen/export/${filename}`
  const passthroughParams = new URLSearchParams()
  for (const k of ['user_id', 'granted', 'since', 'consent_id']) {
    const v = searchParams.get(k)
    if (v) passthroughParams.set(k, v)
  }
  const qs = passthroughParams.toString()
  const url = `${BACKEND_URL}${upstreamPath}${qs ? `?${qs}` : ''}`
  try {
    const r = await fetch(url, { headers: getTenantHeader(request) })
    if (!r.ok) {
      const text = await r.text()
      return NextResponse.json({ error: text || `HTTP ${r.status}` }, { status: r.status })
    }
    return new NextResponse(r.body, {
      status: 200,
      headers: {
        'Content-Type': r.headers.get('content-type') || 'application/octet-stream',
        'Content-Disposition': r.headers.get('content-disposition') || `attachment; filename=${filename}`,
      },
    })
  } catch (e) {
    return NextResponse.json(
      { error: 'Export-Proxy fehlgeschlagen', detail: String(e) },
      { status: 503 },
    )
  }
 }
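A caller of this proxy only needs to assemble the query string. The sketch below shows one way to do that on the client; `buildExportUrl` and `ExportOptions` are hypothetical names, while the path and the parameter names (`format`, `kind`, `user_id`, `granted`, `since`, `consent_id`) are the ones the route handler reads.

```typescript
type ExportFormat = 'csv' | 'json'
type ExportKind = 'consents' | 'history'
type FilterKey = 'user_id' | 'granted' | 'since' | 'consent_id'

// Hypothetical options object; only the keys mirror the route handler.
interface ExportOptions {
  format?: ExportFormat
  kind?: ExportKind
  filters?: Partial<Record<FilterKey, string>>
}

// Same whitelist the proxy passes through to the backend.
const PASSTHROUGH_KEYS: FilterKey[] = ['user_id', 'granted', 'since', 'consent_id']

function buildExportUrl(opts: ExportOptions = {}): string {
  const params = new URLSearchParams()
  // Defaults match the proxy: csv + consents.
  params.set('format', opts.format ?? 'csv')
  params.set('kind', opts.kind ?? 'consents')
  for (const key of PASSTHROUGH_KEYS) {
    const value = opts.filters?.[key]
    if (value) params.set(key, value)
  }
  return `/api/sdk/v1/einwilligungen/export?${params.toString()}`
}
```

For example, `buildExportUrl({ format: 'json', kind: 'history', filters: { user_id: 'u-123' } })` yields a link that can be used directly as the `href` of a download anchor; the user id is a made-up value.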
@@ -8,6 +8,23 @@ import type { CanonicalControl } from '../_types'
 import { EFFORT_LABELS } from '../_types'
 import { SeverityBadge, StateBadge, LicenseRuleBadge } from './Badges'
 // Defensive coercers: backend has rows where evidence/requirements/test_procedure/open_anchors
 // are JSON-encoded strings instead of arrays. .map() on a string throws — coerce here.
 function asArray<T = unknown>(v: unknown): T[] {
  if (Array.isArray(v)) return v as T[]
  if (typeof v === 'string' && v.trim().startsWith('[')) {
    try { const p = JSON.parse(v); return Array.isArray(p) ? p : [] } catch { return [] }
  }
  return []
 }
 function asStringArray(v: unknown): string[] {
  return asArray(v).map(x => typeof x === 'string' ? x : JSON.stringify(x))
 }
 type EvidenceItem = string | { type?: string; description?: string }
 function asEvidenceArray(v: unknown): EvidenceItem[] {
  return asArray<EvidenceItem>(v)
 }
 export function ControlDetailView({
  ctrl,
  onBack,
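To make the effect of the coercers concrete, the following sketch repeats `asArray` from the hunk above in standalone form and runs it on the input shapes the comment describes. The sample values are made up for illustration.

```typescript
// Standalone copy of the coercer so the example runs on its own.
function asArray<T = unknown>(v: unknown): T[] {
  if (Array.isArray(v)) return v as T[]
  if (typeof v === 'string' && v.trim().startsWith('[')) {
    try {
      const parsed: unknown = JSON.parse(v)
      return Array.isArray(parsed) ? (parsed as T[]) : []
    } catch {
      return []
    }
  }
  return []
}

// A real array passes through unchanged.
const fromArray = asArray<string>(['web', 'ios'])
// A JSON-encoded string (the problematic backend rows) is parsed.
const fromJsonString = asArray<string>('["web","ios"]')
// Malformed JSON and non-array values degrade to an empty list instead of throwing.
const fromBrokenJson = asArray<string>('[not json')
const fromNull = asArray<string>(null)
```

Because every branch returns an array, the JSX can call `.map()` and `.length` on the result without further guards.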
@@ -72,31 +89,31 @@ export function ControlDetailView({
        <section>
          <h3 className="text-sm font-semibold text-gray-900 mb-2">Geltungsbereich</h3>
          <div className="grid grid-cols-3 gap-4">
-            {ctrl.scope.platforms && ctrl.scope.platforms.length > 0 && (
+            {asStringArray(ctrl.scope?.platforms).length > 0 && (
              <div>
                <p className="text-xs font-medium text-gray-500 mb-1">Plattformen</p>
                <div className="flex flex-wrap gap-1">
-                  {ctrl.scope.platforms.map(p => (
+                  {asStringArray(ctrl.scope?.platforms).map(p => (
                    <span key={p} className="px-2 py-0.5 bg-blue-50 text-blue-700 rounded text-xs">{p}</span>
                  ))}
                </div>
              </div>
            )}
-            {ctrl.scope.components && ctrl.scope.components.length > 0 && (
+            {asStringArray(ctrl.scope?.components).length > 0 && (
              <div>
                <p className="text-xs font-medium text-gray-500 mb-1">Komponenten</p>
                <div className="flex flex-wrap gap-1">
-                  {ctrl.scope.components.map(c => (
+                  {asStringArray(ctrl.scope?.components).map(c => (
                    <span key={c} className="px-2 py-0.5 bg-purple-50 text-purple-700 rounded text-xs">{c}</span>
                  ))}
                </div>
              </div>
            )}
-            {ctrl.scope.data_classes && ctrl.scope.data_classes.length > 0 && (
+            {asStringArray(ctrl.scope?.data_classes).length > 0 && (
              <div>
                <p className="text-xs font-medium text-gray-500 mb-1">Datenklassen</p>
                <div className="flex flex-wrap gap-1">
-                  {ctrl.scope.data_classes.map(d => (
+                  {asStringArray(ctrl.scope?.data_classes).map(d => (
                    <span key={d} className="px-2 py-0.5 bg-amber-50 text-amber-700 rounded text-xs">{d}</span>
                  ))}
                </div>
@@ -109,7 +126,7 @@ export function ControlDetailView({
        <section>
          <h3 className="text-sm font-semibold text-gray-900 mb-2">Anforderungen</h3>
          <ol className="space-y-2">
-            {ctrl.requirements.map((req, i) => (
+            {asStringArray(ctrl.requirements).map((req, i) => (
              <li key={i} className="flex items-start gap-2 text-sm text-gray-700">
                <span className="flex-shrink-0 w-5 h-5 bg-purple-100 text-purple-700 rounded-full flex items-center justify-center text-xs font-medium mt-0.5">{i + 1}</span>
                {req}
@@ -122,7 +139,7 @@ export function ControlDetailView({
        <section>
          <h3 className="text-sm font-semibold text-gray-900 mb-2">Pruefverfahren</h3>
          <ol className="space-y-2">
-            {ctrl.test_procedure.map((step, i) => (
+            {asStringArray(ctrl.test_procedure).map((step, i) => (
              <li key={i} className="flex items-start gap-2 text-sm text-gray-700">
                <CheckCircle2 className="w-4 h-4 text-green-500 flex-shrink-0 mt-0.5" />
                {step}
@@ -135,12 +152,18 @@ export function ControlDetailView({
        <section>
          <h3 className="text-sm font-semibold text-gray-900 mb-2">Nachweisanforderungen</h3>
          <div className="space-y-2">
-            {ctrl.evidence.map((ev, i) => (
+            {asEvidenceArray(ctrl.evidence).map((ev, i) => (
              <div key={i} className="flex items-start gap-2 p-3 bg-gray-50 rounded-lg">
                <FileText className="w-4 h-4 text-gray-400 flex-shrink-0 mt-0.5" />
                <div>
-                  <span className="text-xs font-medium text-gray-500 uppercase">{ev.type}</span>
-                  <p className="text-sm text-gray-700">{ev.description}</p>
+                  {typeof ev === 'string' ? (
+                    <p className="text-sm text-gray-700">{ev}</p>
+                  ) : (
+                    <>
+                      {ev.type && <span className="text-xs font-medium text-gray-500 uppercase">{ev.type}</span>}
+                      <p className="text-sm text-gray-700">{ev.description ?? JSON.stringify(ev)}</p>
+                    </>
+                  )}
                </div>
              </div>
            ))}
@@ -152,13 +175,13 @@ export function ControlDetailView({
          <div className="flex items-center gap-2 mb-3">
            <BookOpen className="w-4 h-4 text-green-700" />
            <h3 className="text-sm font-semibold text-green-900">Open-Source-Referenzen</h3>
-            <span className="text-xs text-green-600">({ctrl.open_anchors.length} Quellen)</span>
+            <span className="text-xs text-green-600">({asArray(ctrl.open_anchors).length} Quellen)</span>
          </div>
          <p className="text-xs text-green-700 mb-3">
            Dieses Control basiert auf frei verfuegbarem Wissen. Alle Referenzen sind offen und oeffentlich zugaenglich.
          </p>
          <div className="space-y-2">
-            {ctrl.open_anchors.map((anchor, i) => (
+            {asArray<{ framework?: string; ref?: string; url?: string }>(ctrl.open_anchors).map((anchor, i) => (
              <div key={i} className="flex items-start gap-3 p-2 bg-white rounded border border-green-100">
                <Scale className="w-4 h-4 text-green-600 flex-shrink-0 mt-0.5" />
                <div className="flex-1 min-w-0">
@@ -180,11 +203,11 @@ export function ControlDetailView({
        </section>
        {/* Tags */}
-        {ctrl.tags.length > 0 && (
+        {asStringArray(ctrl.tags).length > 0 && (
          <section>
            <h3 className="text-sm font-semibold text-gray-900 mb-2">Tags</h3>
            <div className="flex flex-wrap gap-1.5">
-              {ctrl.tags.map(tag => (
+              {asStringArray(ctrl.tags).map(tag => (
                <span key={tag} className="px-2 py-1 bg-gray-100 text-gray-600 rounded text-xs">{tag}</span>
              ))}
            </div>
@@ -18,6 +18,16 @@ import { ControlRegulatorySection } from './ControlRegulatorySection'
 import { ControlSimilarControls } from './ControlSimilarControls'
 import { ControlReviewActions } from './ControlReviewActions'
 // Defensive coercer: some canonical_controls rows have evidence/tags/etc.
 // as JSON-encoded strings instead of arrays. .map() on a string throws.
 function toArray<T = unknown>(v: unknown): T[] {
  if (Array.isArray(v)) return v as T[]
  if (typeof v === 'string' && v.trim().startsWith('[')) {
    try { const p = JSON.parse(v); return Array.isArray(p) ? p : [] } catch { return [] }
  }
  return []
 }
 interface SimilarControl {
  control_id: string; title: string; severity: string; release_state: string;
  tags: string[]; license_rule: number | null; verification_method: string | null;
@@ -186,7 +196,7 @@ export function ControlDetail({
        <ControlTraceability ctrl={ctrl} traceability={traceability} loadingTrace={loadingTrace}
          onNavigateToControl={onNavigateToControl} />
-        {!ctrl.source_citation && ctrl.open_anchors.length > 0 && (
+        {!ctrl.source_citation && toArray(ctrl.open_anchors).length > 0 && (
          <section className="bg-amber-50 border border-amber-200 rounded-lg p-3">
            <div className="flex items-center gap-2">
              <Scale className="w-4 h-4 text-amber-600" />
@@ -201,36 +211,36 @@ export function ControlDetail({
          </section>
        )}
-        {(ctrl.scope.platforms?.length || ctrl.scope.components?.length || ctrl.scope.data_classes?.length) ? (
+        {(toArray(ctrl.scope?.platforms).length || toArray(ctrl.scope?.components).length || toArray(ctrl.scope?.data_classes).length) ? (
          <section>
            <h3 className="text-sm font-semibold text-gray-900 mb-2">Geltungsbereich</h3>
            <div className="grid grid-cols-3 gap-4 text-xs">
-              {ctrl.scope.platforms?.length ? <div><span className="text-gray-500">Plattformen:</span> <span className="text-gray-700">{ctrl.scope.platforms.join(', ')}</span></div> : null}
+              {toArray<string>(ctrl.scope?.platforms).length ? <div><span className="text-gray-500">Plattformen:</span> <span className="text-gray-700">{toArray<string>(ctrl.scope?.platforms).join(', ')}</span></div> : null}
-              {ctrl.scope.components?.length ? <div><span className="text-gray-500">Komponenten:</span> <span className="text-gray-700">{ctrl.scope.components.join(', ')}</span></div> : null}
+              {toArray<string>(ctrl.scope?.components).length ? <div><span className="text-gray-500">Komponenten:</span> <span className="text-gray-700">{toArray<string>(ctrl.scope?.components).join(', ')}</span></div> : null}
-              {ctrl.scope.data_classes?.length ? <div><span className="text-gray-500">Datenklassen:</span> <span className="text-gray-700">{ctrl.scope.data_classes.join(', ')}</span></div> : null}
+              {toArray<string>(ctrl.scope?.data_classes).length ? <div><span className="text-gray-500">Datenklassen:</span> <span className="text-gray-700">{toArray<string>(ctrl.scope?.data_classes).join(', ')}</span></div> : null}
            </div>
          </section>
        ) : null}
-        {Array.isArray(ctrl.requirements) && ctrl.requirements.length > 0 && (
+        {toArray<string>(ctrl.requirements).length > 0 && (
          <section>
            <h3 className="text-sm font-semibold text-gray-900 mb-2">Anforderungen</h3>
-            <ol className="list-decimal list-inside space-y-1">{ctrl.requirements.map((r, i) => <li key={i} className="text-sm text-gray-700">{r}</li>)}</ol>
+            <ol className="list-decimal list-inside space-y-1">{toArray<string>(ctrl.requirements).map((r, i) => <li key={i} className="text-sm text-gray-700">{r}</li>)}</ol>
          </section>
        )}
-        {Array.isArray(ctrl.test_procedure) && ctrl.test_procedure.length > 0 && (
+        {toArray<string>(ctrl.test_procedure).length > 0 && (
          <section>
            <h3 className="text-sm font-semibold text-gray-900 mb-2">Pruefverfahren</h3>
-            <ol className="list-decimal list-inside space-y-1">{ctrl.test_procedure.map((s, i) => <li key={i} className="text-sm text-gray-700">{s}</li>)}</ol>
+            <ol className="list-decimal list-inside space-y-1">{toArray<string>(ctrl.test_procedure).map((s, i) => <li key={i} className="text-sm text-gray-700">{s}</li>)}</ol>
          </section>
        )}
-        {ctrl.evidence.length > 0 && (
+        {toArray(ctrl.evidence).length > 0 && (
          <section>
            <h3 className="text-sm font-semibold text-gray-900 mb-2">Nachweise</h3>
            <div className="space-y-2">
-              {ctrl.evidence.map((ev, i) => (
+              {toArray<string | { type?: string; description?: string }>(ctrl.evidence).map((ev, i) => (
                <div key={i} className="flex items-start gap-2 text-sm text-gray-700">
                  <FileText className="w-4 h-4 text-gray-400 flex-shrink-0 mt-0.5" />
                  {typeof ev === 'string' ? <div>{ev}</div> : <div><span className="font-medium">{ev.type}:</span> {ev.description}</div>}
@@ -243,9 +253,9 @@ export function ControlDetail({
        <section className="grid grid-cols-3 gap-4 text-xs text-gray-500">
          {ctrl.risk_score !== null && <div>Risiko-Score: <span className="text-gray-700 font-medium">{ctrl.risk_score}</span></div>}
          {ctrl.implementation_effort && <div>Aufwand: <span className="text-gray-700 font-medium">{EFFORT_LABELS[ctrl.implementation_effort] || ctrl.implementation_effort}</span></div>}
-          {ctrl.tags.length > 0 && (
+          {toArray<string>(ctrl.tags).length > 0 && (
            <div className="col-span-3 flex items-center gap-1 flex-wrap">
-              {ctrl.tags.map(t => <span key={t} className="px-2 py-0.5 bg-gray-100 text-gray-600 rounded text-xs">{t}</span>)}
+              {toArray<string>(ctrl.tags).map(t => <span key={t} className="px-2 py-0.5 bg-gray-100 text-gray-600 rounded text-xs">{t}</span>)}
            </div>
          )}
        </section>
@@ -253,11 +263,11 @@ export function ControlDetail({
        <section className="bg-green-50 border border-green-200 rounded-lg p-4">
          <div className="flex items-center gap-2 mb-3">
            <BookOpen className="w-4 h-4 text-green-700" />
-            <h3 className="text-sm font-semibold text-green-900">Open-Source-Referenzen ({ctrl.open_anchors.length})</h3>
+            <h3 className="text-sm font-semibold text-green-900">Open-Source-Referenzen ({toArray(ctrl.open_anchors).length})</h3>
          </div>
-          {ctrl.open_anchors.length > 0 ? (
+          {toArray(ctrl.open_anchors).length > 0 ? (
            <div className="space-y-2">
-              {ctrl.open_anchors.map((anchor, i) => (
+              {toArray<{ framework?: string; ref?: string; url?: string }>(ctrl.open_anchors).map((anchor, i) => (
                <div key={i} className="flex items-center gap-2 text-sm">
                  <ExternalLink className="w-3.5 h-3.5 text-green-600 flex-shrink-0" />
                  <span className="font-medium text-green-800">{anchor.framework}</span>
@@ -1,5 +1,7 @@
 'use client'
 import { useEffect } from 'react'
 import { useSearchParams } from 'next/navigation'
 import { EMPTY_CONTROL } from './components/helpers'
 import { ControlForm } from './components/ControlForm'
 import { ControlDetail } from './components/ControlDetail'
@@ -12,6 +14,24 @@ import { BACKEND_URL } from './components/helpers'
 export default function ControlLibraryPage() {
  const state = useControlLibraryState()
  const searchParams = useSearchParams()
  // Deep-link via /sdk/control-library?control=<id>
  // — e.g. from /sdk/master-controls member list.
  useEffect(() => {
    const cid = searchParams?.get('control')
    if (!cid || state.selectedControl?.control_id === cid) return
    fetch(`${BACKEND_URL}?endpoint=control&id=${encodeURIComponent(cid)}`)
      .then(r => r.ok ? r.json() : null)
      .then(ctrl => {
        if (ctrl?.control_id) {
          state.setSelectedControl(ctrl)
          state.setMode('detail')
        }
      })
      .catch(() => { /* user just sees the list */ })
  // eslint-disable-next-line react-hooks/exhaustive-deps
  }, [searchParams])
  const {
    handleCreate, handleUpdate, handleDelete, handleReview, handleBulkReject,
@@ -57,12 +57,7 @@ export default function EinwilligungenPage() {
        explanation={stepInfo.explanation}
        tips={stepInfo.tips}
      >
-        <button className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 transition-colors">
-          <svg className="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
-            <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M4 16v1a3 3 0 003 3h10a3 3 0 003-3v-1m-4-4l-4 4m0 0l-4-4m4 4V4" />
-          </svg>
-          Export
-        </button>
+        <ConsentExportButton />
      </StepHeader>
      {/* Navigation Tabs */}
@@ -150,3 +145,32 @@ export default function EinwilligungenPage() {
    </div>
  )
 }
 // Export dropdown in the step header. Streams CSV/JSON directly from the
 // backend via the /api/sdk/v1/einwilligungen/export proxy.
 function ConsentExportButton() {
  return (
    <div className="relative group">
      <button className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 transition-colors">
        <svg className="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
          <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M4 16v1a3 3 0 003 3h10a3 3 0 003-3v-1m-4-4l-4 4m0 0l-4-4m4 4V4" />
        </svg>
        Export
      </button>
      <div className="absolute right-0 top-full mt-1 w-60 bg-white border border-gray-200 rounded-lg shadow-lg invisible group-hover:visible opacity-0 group-hover:opacity-100 transition-all z-10">
        <a href="/api/sdk/v1/einwilligungen/export?format=csv&kind=consents" download
           className="block px-4 py-2 text-sm text-gray-700 hover:bg-purple-50 first:rounded-t-lg">
          Einwilligungen als CSV
        </a>
        <a href="/api/sdk/v1/einwilligungen/export?format=json&kind=consents" download
           className="block px-4 py-2 text-sm text-gray-700 hover:bg-purple-50">
          Einwilligungen als JSON
        </a>
        <a href="/api/sdk/v1/einwilligungen/export?format=csv&kind=history" download
           className="block px-4 py-2 text-sm text-gray-700 hover:bg-purple-50 last:rounded-b-lg border-t border-gray-100">
          Aenderungs-Historie als CSV
        </a>
      </div>
    </div>
  )
 }
@@ -199,32 +199,43 @@ function MCDetail({ mc, onBack }: { mc: Record<string, unknown>; onBack: () => v
          </div>
        ) : (
          <div className="divide-y divide-gray-50">
-            {filtered.map((m, i) => (
-              <div key={i} className="px-4 py-3 hover:bg-gray-50">
-                <div className="flex items-center gap-2 mb-1">
-                  <span className="text-xs font-mono text-gray-400">{m.control_id}</span>
-                  {m.severity && (
-                    <span className={`px-1.5 py-0.5 rounded text-[10px] font-bold ${SEV[m.severity] || 'bg-gray-100 text-gray-600'}`}>
-                      {m.severity}
-                    </span>
-                  )}
-                  {m.phase && (
-                    <span className="text-[10px] text-purple-600 bg-purple-50 px-1.5 py-0.5 rounded">
-                      {m.phase}
-                    </span>
-                  )}
-                  {m.action && (
-                    <span className="text-[10px] text-gray-400">{m.action}</span>
-                  )}
-                </div>
-                <p className="text-sm text-gray-900">{m.title}</p>
-                {m.regulation_source && (
-                  <p className="text-xs text-blue-600 mt-1">
-                    {m.regulation_source} {m.regulation_article}
-                  </p>
-                )}
-              </div>
-            ))}
+            {filtered.map((m, i) => {
+              const inner = (
+                <>
+                  <div className="flex items-center gap-2 mb-1">
+                    <span className="text-xs font-mono text-gray-400">{m.control_id}</span>
+                    {m.severity && (
+                      <span className={`px-1.5 py-0.5 rounded text-[10px] font-bold ${SEV[m.severity] || 'bg-gray-100 text-gray-600'}`}>
+                        {m.severity}
+                      </span>
+                    )}
+                    {m.phase && (
+                      <span className="text-[10px] text-purple-600 bg-purple-50 px-1.5 py-0.5 rounded">
+                        {m.phase}
+                      </span>
+                    )}
+                    {m.action && (
+                      <span className="text-[10px] text-gray-400">{m.action}</span>
+                    )}
+                  </div>
+                  <p className="text-sm text-gray-900">{m.title}</p>
+                  {m.regulation_source && (
+                    <p className="text-xs text-blue-600 mt-1">
+                      {m.regulation_source} {m.regulation_article}
+                    </p>
+                  )}
+                </>
+              )
+              return m.control_id ? (
+                <a key={i}
+                  href={`/sdk/control-library?control=${encodeURIComponent(m.control_id)}`}
+                  className="block px-4 py-3 hover:bg-purple-50/40 transition-colors">
+                  {inner}
+                </a>
+              ) : (
+                <div key={i} className="px-4 py-3 hover:bg-gray-50">{inner}</div>
+              )
+            })}
            {filtered.length === 0 && !loading && (
              <div className="p-8 text-center text-gray-400">Keine Controls gefunden</div>
            )}
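The click-through in this hunk and the `?control=<id>` auto-open in the control library page form a round trip: one side encodes the id into the link, the other reads it back from the query string. The sketch below shows that round trip in isolation; `controlDeepLink` and `readControlParam` are hypothetical helper names and the sample ids are made up.

```typescript
// Build the deep link the same way the <a href> in the hunk above does.
function controlDeepLink(controlId: string): string {
  return `/sdk/control-library?control=${encodeURIComponent(controlId)}`
}

// Read the id back, as the useSearchParams()-based effect would.
function readControlParam(href: string): string | null {
  // The link is relative, so URL needs a base; the host is a placeholder.
  const url = new URL(href, 'http://localhost')
  return url.searchParams.get('control')
}

const link = controlDeepLink('AUTH-001')
const roundTripped = readControlParam(controlDeepLink('A&B/1 x'))
```

Encoding with `encodeURIComponent` matters for ids that contain characters such as `&` or `/`, which would otherwise be read as part of the URL structure.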
@@ -0,0 +1,156 @@
 /**
 * Content-Blocker Generator (Borlabs-Parity).
 *
 * Returns a small JS snippet that scans the page for blockable third-party
 * embeds (YouTube, Vimeo, Google Maps, Spotify, Twitter, Facebook) and
 * replaces them with a click-to-consent placeholder until the user agrees
 * to the relevant cookie category.
 *
 * The customer drops a SECOND script tag next to the banner:
 *   <script src="/cookie-banner.js"></script>
 *   <script src="/cookie-content-blocker.js"></script>
 *
 * Author writes content as either:
 *   <bp-consent-block category="EXTERNAL_MEDIA"
 *                     provider="YouTube"
 *                     src="https://www.youtube.com/embed/...">
 *     <!-- the original iframe / embed code -->
 *   </bp-consent-block>
 *
 * OR auto-detect: any <iframe src="https://www.youtube.com/...">
 * gets wrapped on page load.
 */
 const KNOWN_EMBEDS: Array<{ host: string; provider: string; category: string }> = [
  { host: 'youtube.com', provider: 'YouTube',     category: 'EXTERNAL_MEDIA' },
  { host: 'youtu.be',    provider: 'YouTube',     category: 'EXTERNAL_MEDIA' },
  { host: 'vimeo.com',   provider: 'Vimeo',       category: 'EXTERNAL_MEDIA' },
  { host: 'google.com/maps',     provider: 'Google Maps', category: 'EXTERNAL_MEDIA' },
  { host: 'maps.googleapis.com', provider: 'Google Maps', category: 'EXTERNAL_MEDIA' },
  { host: 'spotify.com',  provider: 'Spotify',     category: 'EXTERNAL_MEDIA' },
  { host: 'soundcloud.com', provider: 'SoundCloud', category: 'EXTERNAL_MEDIA' },
  { host: 'twitter.com',  provider: 'Twitter / X', category: 'PERSONALIZATION' },
  { host: 'facebook.com', provider: 'Facebook',    category: 'PERSONALIZATION' },
  { host: 'instagram.com', provider: 'Instagram',  category: 'PERSONALIZATION' },
 ]
 export function generateContentBlockerJS(cookieName: string = 'cookie_consent'): string {
  return `(function () {
  'use strict';
  var COOKIE_NAME = ${JSON.stringify(cookieName)};
  var KNOWN_EMBEDS = ${JSON.stringify(KNOWN_EMBEDS)};
  function getConsent() {
    var c = document.cookie.split('; ').find(function (r) {
      return r.indexOf(COOKIE_NAME + '=') === 0;
    });
    if (!c) return null;
    try { return JSON.parse(decodeURIComponent(c.split('=')[1])); } catch (e) { return null; }
  }
  function categoryGranted(cat) {
    var c = getConsent();
    if (!c) return false;
    var k = String(cat).toLowerCase();
    return c[cat] === true || c[k] === true;
  }
  function classifyByHost(src) {
    if (!src) return null;
    for (var i = 0; i < KNOWN_EMBEDS.length; i++) {
      if (src.indexOf(KNOWN_EMBEDS[i].host) > -1) return KNOWN_EMBEDS[i];
    }
    return null;
  }
  function makePlaceholder(provider, category, originalHTML, parent) {
    var ph = document.createElement('div');
    ph.className = 'bp-consent-placeholder';
    ph.style.cssText = 'border:2px dashed #cbd5e1;background:#f8fafc;padding:24px;' +
      'border-radius:8px;text-align:center;font-family:-apple-system,sans-serif;color:#475569';
    ph.innerHTML =
      '<div style="font-size:14px;font-weight:600;color:#1e293b;margin-bottom:8px">' +
      'Inhalt von ' + provider + ' blockiert</div>' +
      '<div style="font-size:12px;margin-bottom:12px">' +
      'Zum Anzeigen dieses Inhalts wird Ihre Einwilligung fuer die Kategorie ' +
      '<strong>' + category + '</strong> benoetigt. ' +
      'Beim Akzeptieren werden Cookies von ' + provider + ' gesetzt.</div>' +
      '<button class="bp-consent-load-btn" ' +
      'style="background:#7c3aed;color:white;border:none;padding:8px 16px;' +
      'border-radius:6px;font-size:13px;cursor:pointer;margin-right:6px">' +
      'Inhalt einmalig laden</button>' +
      '<button class="bp-consent-accept-btn" ' +
      'style="background:#16a34a;color:white;border:none;padding:8px 16px;' +
      'border-radius:6px;font-size:13px;cursor:pointer">' +
      category + ' akzeptieren</button>';
    ph.querySelector('.bp-consent-load-btn').addEventListener('click', function () {
      var div = document.createElement('div');
      div.innerHTML = originalHTML;
      while (div.firstChild) parent.insertBefore(div.firstChild, ph);
      ph.remove();
    });
    ph.querySelector('.bp-consent-accept-btn').addEventListener('click', function () {
      var c = getConsent() || {};
      c[category] = true;
      var date = new Date();
      date.setTime(date.getTime() + 180 * 86400000);
      document.cookie = COOKIE_NAME + '=' + encodeURIComponent(JSON.stringify(c)) +
                        ';expires=' + date.toUTCString() + ';path=/;SameSite=Lax';
      window.dispatchEvent(new CustomEvent('cookieConsentUpdated', { detail: c }));
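      // Restore the content of this placeholder right away. processAll() only
      // handles elements that are still wrapped or bare, so it would leave
      // an existing placeholder untouched.
      ph.querySelector('.bp-consent-load-btn').click();
      processAll();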
    });
    return ph;
  }
  function processWrapped() {
    var wrapped = document.querySelectorAll('bp-consent-block, [data-bp-consent-block]');
    wrapped.forEach(function (el) {
      var cat = el.getAttribute('category') || el.getAttribute('data-category') || 'EXTERNAL_MEDIA';
      var prov = el.getAttribute('provider') || el.getAttribute('data-provider') || 'Drittanbieter';
      if (categoryGranted(cat)) {
        // Already consented: unwrap the inner content
        var html = el.innerHTML;
        var tmp = document.createElement('div');
        tmp.innerHTML = html;
        var parent = el.parentNode;
        while (tmp.firstChild) parent.insertBefore(tmp.firstChild, el);
        el.remove();
      } else {
        var parent = el.parentNode;
        var inner = el.innerHTML;
        var ph = makePlaceholder(prov, cat, inner, parent);
        parent.insertBefore(ph, el);
        el.remove();
      }
    });
  }
  function processBareIframes() {
    var iframes = document.querySelectorAll('iframe[src]:not([data-bp-processed])');
    iframes.forEach(function (f) {
      var match = classifyByHost(f.getAttribute('src') || '');
      if (!match) return;
      f.setAttribute('data-bp-processed', '1');
      if (categoryGranted(match.category)) return;
      var html = f.outerHTML;
      var parent = f.parentNode;
      var ph = makePlaceholder(match.provider, match.category, html, parent);
      parent.replaceChild(ph, f);
    });
  }
  function processAll() {
    processWrapped();
    processBareIframes();
  }
  if (document.readyState === 'loading') {
    document.addEventListener('DOMContentLoaded', processAll);
  } else {
    processAll();
  }
  // Re-process when consent updates
  window.addEventListener('cookieConsentUpdated', processAll);
 })();`
 }
@@ -325,18 +325,25 @@ function generateJS(config: CookieBannerConfig): string {
  const CATEGORIES = ${JSON.stringify(categoryIds)};
  const REQUIRED_CATEGORIES = ${JSON.stringify(requiredCategories)};
-  // Google Consent Mode v2: MANDATORY since March 2024 for Google services in the EEA
-  // Sets default consent state to "denied" BEFORE any Google tags fire
-  if (typeof gtag === 'function') {
-    gtag('consent', 'default', {
-      analytics_storage: 'denied',
-      ad_storage: 'denied',
-      ad_user_data: 'denied',
-      ad_personalization: 'denied',
-      functionality_storage: 'granted',
-      security_storage: 'granted',
-    });
+  // Google Consent Mode v2: MANDATORY since March 2024 for Google services
+  // in the EEA. Shim gtag/dataLayer if the Google tag has not been initialised
+  // yet, then set the default consent state (DENIED) immediately.
+  window.dataLayer = window.dataLayer || [];
+  if (typeof gtag !== 'function') {
+    window.gtag = function () { window.dataLayer.push(arguments); };
  }
  // wait_for_update gives the banner 500 ms to restore a stored choice and
  // send a consent update before tags fire. Recommended by Google for GCM v2.
  gtag('consent', 'default', {
    analytics_storage: 'denied',
    ad_storage: 'denied',
    ad_user_data: 'denied',
    ad_personalization: 'denied',
    functionality_storage: 'granted',
    security_storage: 'granted',
    wait_for_update: 500,
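    // region expects ISO 3166-2 codes; 'EEA' is not one, so the EEA member
    // states are listed explicitly.
    region: ['AT', 'BE', 'BG', 'HR', 'CY', 'CZ', 'DK', 'EE', 'FI', 'FR', 'DE',
             'GR', 'HU', 'IE', 'IT', 'LV', 'LT', 'LU', 'MT', 'NL', 'PL', 'PT',
             'RO', 'SK', 'SI', 'ES', 'SE', 'IS', 'LI', 'NO', 'CH', 'GB'],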
  });
  function updateGoogleConsentMode(consent) {
    if (typeof gtag !== 'function') return;
@@ -364,10 +371,61 @@ function generateJS(config: CookieBannerConfig): string {
    document.cookie = COOKIE_NAME + '=' + encodeURIComponent(JSON.stringify(consent)) +
                      ';expires=' + date.toUTCString() +
                      ';path=/;SameSite=Lax';
    // Append to local history (Art. 7(3) GDPR best practice + Borlabs parity).
    // Server-side logging runs separately via consent-service.
    try {
      const HKEY = COOKIE_NAME + '_history';
      const hist = JSON.parse(localStorage.getItem(HKEY) || '[]');
      hist.push({
        ts: new Date().toISOString(),
        choices: consent,
      });
      if (hist.length > 50) hist.splice(0, hist.length - 50);
      localStorage.setItem(HKEY, JSON.stringify(hist));
    } catch (e) { /* localStorage blocked */ }
    window.dispatchEvent(new CustomEvent('cookieConsentUpdated', { detail: consent }));
    updateGoogleConsentMode(consent);
  }
  // Borlabs parity: shows the user all of their previous consent decisions.
  // Call via window.bpShowConsentHistory() or by clicking the link in the banner footer.
  window.bpShowConsentHistory = function () {
    var existing = document.getElementById('bpConsentHistoryModal');
    if (existing) { existing.remove(); return; }
    var hist = [];
    try { hist = JSON.parse(localStorage.getItem(COOKIE_NAME + '_history') || '[]'); } catch (e) {}
    var rows = hist.length === 0
      ? '<p style="color:#94a3b8;font-style:italic">Noch keine Einwilligungen gespeichert.</p>'
      : hist.slice().reverse().map(function (h) {
          var d = new Date(h.ts);
          var parts = Object.keys(h.choices).map(function (k) {
            return '<span style="margin-right:8px;font-size:11px;color:' +
              (h.choices[k] ? '#16a34a' : '#dc2626') + '">' +
              (h.choices[k] ? '✓ ' : '✗ ') + k + '</span>';
          }).join('');
          return '<div style="border-bottom:1px solid #e5e7eb;padding:8px 0">' +
                 '<div style="font-size:12px;color:#64748b;margin-bottom:4px">' +
                 d.toLocaleString('de-DE') + '</div>' +
                 '<div>' + parts + '</div></div>';
        }).join('');
    var modal = document.createElement('div');
    modal.id = 'bpConsentHistoryModal';
    modal.style.cssText = 'position:fixed;inset:0;background:rgba(0,0,0,0.5);' +
      'z-index:999999;display:flex;align-items:center;justify-content:center;padding:20px';
    modal.innerHTML = '<div style="background:white;border-radius:8px;max-width:500px;' +
      'width:100%;max-height:80vh;overflow:auto;padding:20px;font-family:-apple-system,sans-serif">' +
      '<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:12px">' +
      '<h3 style="margin:0;font-size:16px">Ihre Einwilligungs-Historie</h3>' +
      '<button onclick="document.getElementById(\\'bpConsentHistoryModal\\').remove()" ' +
      'style="background:none;border:none;font-size:24px;cursor:pointer;color:#94a3b8">×</button>' +
      '</div>' +
      '<p style="font-size:12px;color:#64748b;margin:0 0 12px">' +
      'Lokal in Ihrem Browser gespeichert. Server-seitig laufen Audit-Logs gemaess Art. 7(1) DSGVO.</p>' +
      rows + '</div>';
    modal.addEventListener('click', function (e) { if (e.target === modal) modal.remove(); });
    document.body.appendChild(modal);
  };
  function hasConsent(category) {
    const consent = getConsent();
    if (!consent) return REQUIRED_CATEGORIES.includes(category);
@@ -39,8 +39,9 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 COPY --from=builder /opt/venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
-# Create non-root user
+# Create non-root user + pre-create /data so volume mount inherits ownership
-RUN useradd --create-home --shell /bin/bash appuser
+RUN useradd --create-home --shell /bin/bash appuser && \
    mkdir -p /data && chown appuser:appuser /data
 # Copy application code
 COPY --chown=appuser:appuser . .
@@ -33,6 +33,7 @@ _ROUTER_MODULES = [
    "vvt_routes",
    "legal_document_routes",
    "einwilligungen_routes",
    "einwilligungen_export_routes",
    "escalation_routes",
    "consent_template_routes",
    "notfallplan_routes",
@@ -159,6 +159,13 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
        from .agent_doc_check_routes import CheckItem, DocCheckResult
        from .agent_doc_check_report import build_html_report
        # Reset anchor-locator cache per run (avoid cross-run leak)
        try:
            from compliance.services.doc_anchor_locator import reset_cache
            reset_cache()
        except Exception:
            pass
        # Step 1: Resolve texts (fetch from URL if needed) — 0-30%
        _update(check_id, "Texte werden geladen...", 1)
        doc_texts: dict[str, str] = {}
@@ -234,6 +241,20 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
        # Filter out doc_types that don't apply to this business profile
        skip_types = _get_skip_types(profile)
        # Derive business_scope hints for the MC filter (O1 — Doc-type Scope-Flag).
        # MCs that explicitly require a feature (e.g. 'biometric_processing',
        # 'ai_decision_making', 'child_targeting') get dropped when the
        # detected profile doesn't declare it.
        business_scope: set[str] = set()
        for svc in (getattr(profile, "detected_services", []) or []):
            business_scope.add(str(svc).lower())
        if (getattr(profile, "business_type", "") or "").lower() == "b2c":
            business_scope.add("b2c")
        if getattr(profile, "has_online_shop", False):
            business_scope.add("ecommerce")
        if getattr(profile, "is_regulated_profession", False):
            business_scope.add("regulated_profession")
        # Document checks: 40-80%
        n_entries = max(1, len(doc_entries))
        for i, entry in enumerate(doc_entries):
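
The hunk above only builds `business_scope`; how the MC filter consumes it is not shown in this diff. A minimal sketch of such a filter, assuming each control is a dict with an optional `scope_requires` list of lowercase feature tokens (the `filter_controls` helper and the dict shape are hypothetical, only the `scope_requires` name comes from the commit message):

```python
from __future__ import annotations


def filter_controls(controls: list[dict], business_scope: set[str]) -> list[dict]:
    """Keep controls whose required features are all declared in the scope.

    Hypothetical helper: assumes an optional "scope_requires" list per control.
    Controls without requirements always pass.
    """
    kept = []
    for mc in controls:
        required = {str(r).lower() for r in (mc.get("scope_requires") or [])}
        # Subset test: every required feature must be present in the scope.
        if required <= business_scope:
            kept.append(mc)
    return kept


controls = [
    {"id": "MC-1", "scope_requires": ["biometric_processing"]},
    {"id": "MC-2", "scope_requires": []},
    {"id": "MC-3", "scope_requires": ["ecommerce", "b2c"]},
]
scope = {"b2c", "ecommerce"}
kept_ids = [mc["id"] for mc in filter_controls(controls, scope)]
print(kept_ids)
```

With this shape, a biometric control is dropped for a profile that never declares biometric processing, while controls without requirements are always checked.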
@@ -268,6 +289,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
            result = await _check_single(
                text, doc_type, label, url,
                entry["word_count"], use_agent_flag,
                business_scope=business_scope,
            )
            # Apply profile context filter
@@ -421,9 +443,42 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                            len(cmp_vendors))
                cmp_vendors = await validate_vendor_urls(cmp_vendors)
                cmp_vendors = score_vendors(cmp_vendors)
                # Enrich each vendor with per-cookie functional roles
                try:
                    from compliance.services.cookie_function_classifier import (
                        annotate_vendor_cookies,
                    )
                    cmp_vendors = [annotate_vendor_cookies(v) for v in cmp_vendors]
                except Exception as e:
                    logger.warning("Cookie function classification skipped: %s", e)
        except Exception as e:
            logger.warning("VVT vendor extraction skipped: %s", e)
        # Vendor redundancy + EU alternatives + cost/savings (O4)
        redundancy_report = None
        try:
            from compliance.services.vendor_redundancy import analyze as analyze_redundancy
            from compliance.services.vendor_cost_estimator import infer_company_tier
            if cmp_vendors:
                # Derive the company tier from business_profile. It shifts the
                # cost range so that, e.g., starter prices do not drag down
                # the lower bound for DAX corporations.
                bp_dict = {
                    "type": getattr(profile, "business_type", ""),
                    "features": list(business_scope),
                }
                ctier = infer_company_tier(bp_dict)
                redundancy_report = analyze_redundancy(cmp_vendors, company_tier=ctier)
                logger.info(
                    "Redundanz: %d Kategorien mit Mehrfach-Anbietern, "
                    "Spar-Schaetzung %s pro Jahr (company_tier=%s)",
                    redundancy_report["summary"]["redundancy_count"],
                    redundancy_report["summary"]["estimated_saving_pct"],
                    ctier,
                )
        except Exception as e:
            logger.warning("Vendor redundancy analysis skipped: %s", e)
        summary_html = build_management_summary(results)
        scanned_html = build_scanned_urls_html(doc_entries)
        providers_html = build_provider_list_html(banner_result, vvt_entries)
@@ -468,11 +523,18 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
            if scorecard else ""
        )
-        report_html = build_html_report(results, None)
+        report_html = build_html_report(results, None, doc_texts)
        profile_html = _build_profile_html(profile)
        # O4: vendor redundancy / EU alternatives + cost-savings block,
        # placed between VVT and doc report so that management sees the
        # savings before going into the detailed check.
        from .agent_doc_check_redundancy import build_redundancy_html
        redundancy_html = build_redundancy_html(redundancy_report)
        full_html = (
            summary_html + scanned_html + profile_html + scorecard_html
-            + providers_html + vvt_html + report_html
+            + providers_html + vvt_html + redundancy_html + report_html
        )
        # Step 6: Send email — derive site name primarily from entered URL.
@@ -602,6 +664,7 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
                payload = resp.json()
                docs = payload.get("documents", [])
                cmp_payloads = payload.get("cmp_payloads") or []
                cmp_cookie_text = payload.get("cmp_cookie_text") or ""
                if docs:
                    texts = []
                    for doc in docs:
@@ -609,6 +672,22 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
                        if t and len(t) > 50:
                            texts.append(t)
                    merged = "\n\n".join(texts)
                    # For cookie/dse/social_media: when the CMP reconstruction
                    # has more words than the DOM extraction, use it. This
                    # fixes the BMW case where DOM yields ~600 words of
                    # navigation but the ePaaS payload reconstructs to ~1800
                    # words of actual cookie policy.
                    if (doc_type in short_extract_types
                            and cmp_cookie_text
                            and len(cmp_cookie_text.split()) > len(merged.split())):
                        logger.info(
                            "Preferring CMP-reconstructed text for %s on %s "
                            "(%d words CMP vs %d words DOM)",
                            doc_type, url,
                            len(cmp_cookie_text.split()),
                            len(merged.split()),
                        )
                        merged = cmp_cookie_text
                    if merged and len(merged.split()) > 100:
                        if len(texts) > 1:
                            logger.info("Merged %d docs from %s (%d words)",
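
The selection rule in this hunk boils down to a word-count comparison followed by the existing minimum-length gate. A minimal standalone sketch, assuming a hypothetical `prefer_richer_text` helper (not part of the codebase) and reusing the `> 100` word threshold from `_fetch_text`:

```python
from __future__ import annotations


def prefer_richer_text(dom_text: str, cmp_text: str,
                       min_words: int = 100) -> str | None:
    """Pick the CMP-reconstructed text when it has more words than the DOM text.

    Returns None when the chosen text does not exceed min_words, mirroring
    the length gate that follows the selection in the hunk above.
    """
    candidate = dom_text
    if cmp_text and len(cmp_text.split()) > len(dom_text.split()):
        candidate = cmp_text
    return candidate if len(candidate.split()) > min_words else None


dom = "nav " * 600        # stand-in for ~600 words of site chrome
cmp = "policy " * 1800    # stand-in for ~1800 words of cookie policy
chosen = prefer_richer_text(dom, cmp)
print(len(chosen.split()))
```

Note that the comparison is a plain "more words" test. A ratio threshold would be needed if only substantially longer CMP texts should win.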
@@ -727,6 +806,7 @@ async def _autodiscover_missing(
    discovered: list[dict] = []
    disc_payloads: list[dict] = []
    disc_cookie_texts: list[str] = []
    for base in crawl_bases:
        try:
            async with httpx.AsyncClient(timeout=180.0) as client:
@@ -742,8 +822,14 @@ async def _autodiscover_missing(
                body = resp.json()
                discovered.extend(body.get("documents", []) or [])
                disc_payloads.extend(body.get("cmp_payloads") or [])
-                logger.info("auto-discovery on %s: %d docs",
+                cmp_text = body.get("cmp_cookie_text") or ""
-                            base, len(body.get("documents", []) or []))
+                if cmp_text:
                    disc_cookie_texts.append(cmp_text)
                logger.info("auto-discovery on %s: %d docs, %d CMP payloads, "
                            "cmp_cookie_text=%d words", base,
                            len(body.get("documents", []) or []),
                            len(body.get("cmp_payloads") or []),
                            len(cmp_text.split()))
        except Exception as e:
            logger.warning("auto-discovery failed for %s: %s", base, e)
@@ -772,6 +858,19 @@ async def _autodiscover_missing(
        d = by_type.get(dt)
        if d:
            full = d.get("full_text") or d.get("text_preview") or ""
            # For cookie: prefer the CMP-reconstructed text when it has
            # more words than the auto-discovered DOM extraction.
            # BMW homepage CMP yields ~1800 words of authoritative policy;
            # DOM extraction typically yields ~600 words of site chrome.
            if dt == "cookie" and disc_cookie_texts:
                cmp_merged = "\n\n".join(disc_cookie_texts)
                if len(cmp_merged.split()) > len(full.split()):
                    logger.info(
                        "cookie: using CMP-reconstructed text (%d words) "
                        "instead of DOM (%d words)",
                        len(cmp_merged.split()), len(full.split()),
                    )
                    full = cmp_merged
            if len(full.split()) >= 100:
                new_entry["text"] = full
                new_entry["url"] = d.get("url", "")
@@ -829,6 +928,7 @@ def _classify_discovered_doc(title: str, url: str) -> str | None:
 async def _check_single(
    text: str, doc_type: str, label: str, url: str,
    word_count: int, use_agent: bool,
    business_scope: set[str] | None = None,
 ):
    """Run regex + MC checks on a single document."""
    from compliance.services.doc_checks.runner import check_document_completeness
@@ -862,6 +962,7 @@ async def _check_single(
        # (top-10 FAILs) so cost stays bounded.
        mc_results = await check_document_with_controls(
            text, doc_type, label, max_controls=0, use_agent=use_agent,
            business_scope=business_scope,
        )
        if mc_results:
            for mc in mc_results:
@@ -374,11 +374,52 @@ def _render_vendor_row_full(v: dict) -> str:
    )
    score_color = ("#16a34a" if score >= 80 else
                   "#d97706" if score >= 50 else "#dc2626")
    # Score explanation: what was evaluated, what is missing.
    # Assumption: score = criteria passed / total criteria * 100.
    # Typically 5 criteria for EXT: country, cookies, opt_out, privacy, scoring.
    # For INTERNAL/GROUP: opt_out + privacy are not evaluated (3 criteria).
    n_criteria = 3 if is_own else 5
    # Clamp so the "x of n" display never goes negative when more flags
    # than criteria are present.
    n_failed = min(len(flags), n_criteria) if flags else 0
    score_tooltip = (
        f"{n_criteria - n_failed} von {n_criteria} Kriterien erfuellt"
        + (f" — fehlt: {', '.join(_flag_short(f) for f in flags[:3])}"
           if flags else "")
    )
    # Inline action instructions per flag
    actions_html = ""
    if flags:
        from compliance.services.finding_action_recipes import recipe_for
        action_items = []
        for f in flags:
            rec = recipe_for(f)
            if not rec:
                continue
            action_items.append(
                f'<li style="margin-bottom:6px"><strong>{_flag_short(f)}:</strong> '
                f'{rec.get("what", "")}<br/>'
                f'<span style="color:#475569"><strong>Was tun:</strong> '
                f'{((rec.get("fix_text") or "").splitlines() or [""])[0][:200]}</span><br/>'
                f'<span style="color:#94a3b8;font-size:9px">Quelle: '
                f'{rec.get("why", "")[:160]}</span></li>'
            )
        if action_items:
            actions_html = (
                f'<details style="margin-top:4px"><summary style="cursor:pointer;'
                f'color:#dc2626;font-size:10px">Was muss ich tun? '
                f'({len(action_items)} Action{"s" if len(action_items) != 1 else ""})</summary>'
                f'<ul style="margin:4px 0 0 14px;padding:0;font-size:10px;color:#1e293b">'
                + "".join(action_items)
                + '</ul></details>'
            )
    flag_str = ""
    if flags:
        flag_str = (
            f'<div style="font-size:10px;color:#94a3b8;margin-top:2px">'
            f'{", ".join(flags[:4])}</div>'
            f'{actions_html}'
        )
    return (
        f'<tr style="border-top:1px solid #e2e8f0">'
@@ -391,11 +432,26 @@ def _render_vendor_row_full(v: dict) -> str:
        f'<td style="padding:6px 8px;text-align:center">{opt_status}</td>'
        f'<td style="padding:6px 8px;text-align:center">{privacy_status}</td>'
        f'<td style="padding:6px 8px;text-align:right;font-weight:600;'
-        f'color:{score_color};font-size:11px">{score}%</td>'
+        f'color:{score_color};font-size:11px" title="{score_tooltip}">'
        f'{score}%<div style="font-size:9px;font-weight:400;color:#94a3b8">'
        f'{n_criteria - n_failed}/{n_criteria}</div></td>'
        f'</tr>'
    )
 def _flag_short(f: str) -> str:
    """Readable German label for a flag token."""
    labels = {
        "no_cookies_listed": "Cookies fehlen",
        "no_country":        "Sitzland fehlt",
        "no_privacy_url":    "Privacy-Link fehlt",
        "broken_privacy_url": "Privacy-Link broken",
        "no_opt_out_url":    "Opt-Out fehlt",
        "broken_opt_out":    "Opt-Out broken",
    }
    return labels.get(f, f)
 def _link_status_badge(
    url: str | None,
    ok: bool | None,
@@ -0,0 +1,141 @@
 """
 Email renderer for the vendor redundancy + EU alternatives + cost/savings block.
 Embedded in the email body below the VVT.
 """
 from __future__ import annotations
 def _fmt_eur(low: int, high: int) -> str:
    if not low and not high:
        return "im Listpreis bundled"
    if low == high:
        return f"~{low:,} €".replace(",", ".")
    return f"{low:,}–{high:,} €".replace(",", ".")
 def build_redundancy_html(report: dict | None) -> str:
    if not report:
        return ""
    s = report.get("summary") or {}
    redundancies = report.get("redundancies") or []
    eu_alts = report.get("eu_alternatives") or []
    multi = report.get("multi_function_tools") or []
    cur = s.get("estimated_current_year_eur") or [0, 0]
    sav = s.get("estimated_saving_year_eur") or [0, 0]
    pct = s.get("estimated_saving_pct") or "n/a"
    parts = [
        '<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
        'max-width:700px;margin:0 auto 16px;padding:14px 18px;'
        'background:#fef3c7;border:1px solid #fcd34d;border-radius:8px">',
        '<h3 style="margin:0 0 6px;font-size:14px;color:#92400e">'
        'Optimierungspotenzial: Redundanzen + EU-Alternativen</h3>',
        f'<p style="margin:0 0 10px;font-size:11px;color:#78350f">'
        f'<strong>{s.get("redundancy_count", 0)}</strong> Kategorien mit '
        f'mehreren Anbietern · <strong>{s.get("consolidation_potential", 0)}</strong> '
        f'Anbieter konsolidierbar · '
        f'<strong>{s.get("eu_alternative_count", 0)}</strong> EU-Alternativen verfuegbar</p>',
        '<div style="background:#fff;border:1px solid #fcd34d;border-radius:6px;'
        'padding:10px 12px;margin-bottom:10px">',
        '<div style="font-size:10px;color:#94a3b8;margin-bottom:6px;text-transform:uppercase;letter-spacing:0.5px">'
        'Diese Schaetzung umfasst NUR die als redundant erkannten Tools — '
        'nicht den Gesamt-Stack der Website</div>',
        f'<div style="font-size:11px;color:#78350f">'
        f'Listpreis-Schaetzung der <strong>redundanten</strong> Tools '
        f'(Mehrfach-Anbieter in derselben Funktions-Kategorie):'
        f' <strong>{_fmt_eur(*cur)}/Jahr</strong></div>',
        f'<div style="font-size:11px;color:#16a34a;margin-top:4px">'
        f'Sparpotenzial durch Konsolidierung auf je 1 EU-Tool pro Kategorie:'
        f' <strong>{_fmt_eur(*sav)}/Jahr</strong> ({pct})</div>',
        '<div style="font-size:10px;color:#94a3b8;margin-top:8px;font-style:italic">'
        '<strong>Wichtige Einschraenkungen:</strong><br/>'
        '• Konzern-Konditionen liegen ueblicherweise 30–50% unter Listpreis — '
        'realistisches Saving entsprechend €X·0,5 bis €X·0,7.<br/>'
        '• Eintraege "<em>Eigene Marke — Tool</em>" (z.B. "BMW AG — Adobe Analytics") '
        'gehoeren oft zu einem einzigen Master-Vertrag, nicht zu mehreren Lizenzen.<br/>'
        '• Media-Spend (Google Ads, Meta Ads) ist NICHT enthalten — nur Tooling-Lizenzen.<br/>'
        '• Quelle: Gartner/Forrester 2025 + oeffentliche Listpreise.'
        '</div></div>',
    ]
    if redundancies:
        parts.append(
            '<table style="width:100%;border-collapse:collapse;font-size:11px;'
            'margin-bottom:10px">'
            '<thead><tr style="background:#fde68a;color:#78350f;text-align:left">'
            '<th style="padding:6px 8px">Kategorie</th>'
            '<th style="padding:6px 8px">#</th>'
            '<th style="padding:6px 8px">Anbieter</th>'
            '<th style="padding:6px 8px">EU-Empfehlung</th>'
            '<th style="padding:6px 8px;text-align:right">Saving / Jahr</th>'
            '</tr></thead><tbody>'
        )
        for r in redundancies[:12]:
            vendors_str = ", ".join(r.get("vendors", [])[:6])
            if len(r.get("vendors", [])) > 6:
                vendors_str += f" (+{len(r['vendors']) - 6} weitere)"
            sav_r = r.get("estimated_saving_year_eur") or [0, 0]
            parts.append(
                f'<tr style="border-top:1px solid #fde68a;vertical-align:top">'
                f'<td style="padding:5px 8px;color:#78350f;font-weight:600">{r["category_label"]}</td>'
                f'<td style="padding:5px 8px;text-align:center">{r["count"]}</td>'
                f'<td style="padding:5px 8px;color:#1e293b;font-size:10px">{vendors_str}</td>'
                f'<td style="padding:5px 8px;color:#16a34a;font-size:10px">{r.get("suggested_eu_tool") or "–"}</td>'
                f'<td style="padding:5px 8px;text-align:right;color:#16a34a;font-weight:600">'
                f'{_fmt_eur(*sav_r)}</td></tr>'
            )
            hint = r.get("consolidation_hint")
            if hint:
                parts.append(
                    f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px;font-style:italic">'
                    f'Hinweis: {hint}</td></tr>'
                )
            caveats = r.get("caveats") or []
            if caveats:
                parts.append(
                    f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px">'
                    f'<strong>Moegliche Gruende fuer Mehrfach-Einsatz:</strong> '
                    + "; ".join(caveats) + '</td></tr>'
                )
        parts.append('</tbody></table>')
    if multi:
        parts.append(
            '<div style="margin-top:8px"><strong style="font-size:11px;color:#78350f">'
            'Multi-Funktions-Tools (1 Tool ersetzt mehrere Kategorien):</strong>'
            '<ul style="margin:6px 0 0 18px;padding:0;font-size:11px;color:#78350f">'
        )
        for t in multi[:4]:
            cats = ", ".join(t.get("replaces_categories", []))
            parts.append(
                f'<li style="margin-bottom:3px"><strong>{t["name"]}</strong>'
                f' ({t["country"]}) — ersetzt <em>{cats}</em>'
                f' ({t.get("potential_replacements", 0)} Anbieter heute)</li>'
            )
        parts.append('</ul></div>')
    if eu_alts:
        parts.append(
            '<details style="margin-top:8px"><summary style="font-size:11px;color:#78350f;'
            'cursor:pointer">EU-Alternativen pro Anbieter (Details)</summary>'
            '<ul style="margin:6px 0 0 18px;padding:0;font-size:10px;color:#475569">'
        )
        for e in eu_alts[:20]:
            first_alt = (e.get("alternatives") or [{}])[0]
            parts.append(
                f'<li style="margin-bottom:3px"><strong>{e["current_vendor"]}</strong>'
                f' → {first_alt.get("name", "")} ({first_alt.get("country", "")})'
                f' <span style="color:#94a3b8">— {first_alt.get("notes", "")}</span></li>'
            )
        parts.append('</ul></details>')
    parts.append('</div>')
    return "".join(parts)
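
The caveat text in the block above states that group conditions are usually 30-50% below list price, i.e. a realistic saving of X·0.5 to X·0.7. A small worked sketch of that arithmetic, assuming a hypothetical `realistic_saving` helper that is not part of this module:

```python
def realistic_saving(list_low: int, list_high: int) -> tuple[int, int]:
    """Scale a list-price saving range by the 30-50% enterprise discount.

    Hypothetical helper: the low end gets the deeper discount (factor 0.5),
    the high end the shallower one (factor 0.7). Integer arithmetic avoids
    floating-point rounding surprises.
    """
    return list_low * 50 // 100, list_high * 70 // 100


low, high = realistic_saving(10_000, 40_000)
print(low, high)
```

For a list-price saving of 10,000 to 40,000 EUR per year this yields 5,000 to 28,000 EUR. The rendered report only shows the list-price range.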
@@ -7,8 +7,12 @@ including L1/L2 check hierarchy, progress bars, and actionable hints.
 from __future__ import annotations
 import logging
 import re
 from typing import TYPE_CHECKING
 logger = logging.getLogger(__name__)
 if TYPE_CHECKING:
    from .agent_doc_check_routes import CheckItem, DocCheckResult
@@ -32,12 +36,93 @@ def _icon(passed: bool, skipped: bool = False) -> str:
    return '<span style="color:#ef4444;font-weight:bold">&#10007;</span>'
-def _hint_box(hint: str) -> str:
+def _first_sentence(text: str, max_chars: int = 300) -> str:
-    return (
+    """First complete sentence instead of first line; robust against
    multi-line fix texts that start with bullet lists."""
    if not text:
        return ""
    # Skip a leading variant header such as "Variante A (...):" or
    # "Beispiel 1:"; the actual content only starts after it.
    header = re.match(
        r"\s*(?:Variante [A-Z]\s*\([^\)]+\)|Beispiel\s*\d*)\s*:\s*",
        text, re.IGNORECASE)
    if header:
        text = text[header.end():]
    # Look for a sentence terminator within max_chars
    snippet = text[:max_chars]
    m = re.search(r"^(.+?[\.\?\!])(?:\s|$)", snippet, re.DOTALL)
    if m:
        return m.group(1).strip()
    # No sentence terminator: fall back to the first line, capped at max_chars
    line = (text.splitlines() or [""])[0]
    return line[:max_chars] + ("…" if len(line) > max_chars else "")
 def _hint_box(hint: str, check_label: str = "", doc_text: str = "",
              doc_id: str | None = None) -> str:
    """Hint block with enriched recipe + doc anchor where possible."""
    base = (
        f'<div style="font-size:11px;color:#dc2626;margin:2px 0 4px 20px;'
        f'padding:4px 8px;background:#fef2f2;border-radius:4px;'
-        f'border-left:3px solid #fca5a5">{hint}</div>'
+        f'border-left:3px solid #fca5a5">{hint}'
    )
    # Add recipe + anchor when check_label is known
    if check_label:
        try:
            from compliance.services.finding_action_recipes import recipe_for
            from compliance.services.doc_anchor_locator import locate_anchor
            rec = recipe_for(check_label)
            if rec and rec.get("fix_text"):
                first_sentence = _first_sentence(rec["fix_text"], 300)
                full = rec["fix_text"]
                # Simple inline block layout instead of <details>:
                # more robust when mail clients render as plain text
                more = ""
                if len(full) > len(first_sentence) + 10:
                    more = (
                        f'<div style="margin-top:4px;padding:6px 8px;background:#fff;'
                        f'border:1px solid #fcd5d5;border-radius:4px;font-size:10px;'
                        f'white-space:pre-wrap;color:#1e293b">'
                        f'<strong style="display:block;margin-bottom:3px;color:#475569">'
                        f'Vollstaendiger Textbaustein zum Einfuegen:</strong>'
                        f'{full}</div>'
                    )
                # Build the enrichment in a local string and append it to
                # `base` only once it is complete, so an exception below
                # cannot leave an unclosed <div> in the output.
                section = (
                    f'<div style="margin-top:6px;padding-top:6px;border-top:1px solid #fecaca">'
                    f'<strong style="color:#7c3aed;font-size:10px">Konkrete Massnahme:</strong> '
                    f'<span style="color:#1e293b">{first_sentence}</span>'
                    f'{more}'
                )
                # Anchor via the embedding locator (cached per doc_id)
                if doc_text:
                    anchor = locate_anchor(check_label, doc_text, doc_id)
                    if anchor and anchor.get("anchor_phrase") and anchor.get("confidence") != "low":
                        conf_label = anchor.get("confidence", "")
                        conf_badge = (
                            f' <span style="color:#94a3b8;font-size:9px">'
                            f'(Match-Konfidenz {conf_label}, '
                            f'Score {anchor.get("score", "—")})</span>'
                        )
                        section += (
                            f'<div style="margin-top:4px;color:#475569;font-size:10px">'
                            f'<strong>Einfuegen:</strong> {anchor["position_hint"]}'
                            f'{conf_badge}</div>'
                        )
                    elif rec.get("where"):
                        # No good anchor match: show the generic fallback
                        section += (
                            f'<div style="margin-top:4px;color:#475569;font-size:10px">'
                            f'<strong>Einfuegen:</strong> {rec["where"]} '
                            f'<span style="color:#94a3b8;font-size:9px">'
                            f'(kein eindeutiger Absatz im Dokument gefunden — '
                            f'Anweisung allgemein)</span></div>'
                        )
                base += section + '</div>'
        except Exception as e:
            # Recipes are optional; the hint box must never crash.
            logger.debug("Hint-box enrichment failed: %s", e)
    base += '</div>'
    return base
 def build_management_summary(results: list[DocCheckResult]) -> str:
@@ -158,8 +243,14 @@ def _check_to_action(doc_label: str, check_label: str, hint: str) -> str:
 def build_html_report(
    results: list[DocCheckResult],
    cookie_result: dict | None,
    doc_texts: dict[str, str] | None = None,
 ) -> str:
-    """Build HTML email report styled like the frontend."""
+    """Build HTML email report styled like the frontend.
    `doc_texts` maps doc_type to the document text so hint boxes can locate
    the relevant paragraph in the original document for the insertion hint.
    """
    doc_texts = doc_texts or {}
    ok_count = sum(1 for r in results if r.completeness_pct == 100)
    html = [
        '<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
@@ -170,7 +261,7 @@ def build_html_report(
    ]
    for r in results:
-        _render_document(html, r)
+        _render_document(html, r, doc_texts.get(r.doc_type, ""))
    if cookie_result:
        _render_cookie_banner(html, cookie_result)
@@ -179,7 +270,7 @@ def build_html_report(
    return "\n".join(html)
-def _render_document(html: list[str], r: DocCheckResult) -> None:
+def _render_document(html: list[str], r: DocCheckResult, doc_text: str = "") -> None:
    pct = r.completeness_pct
    cpct = r.correctness_pct
    bar_color = "green" if pct >= 80 else "yellow" if pct >= 50 else "red"
@@ -244,7 +335,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
    else:
        html.append('<div style="padding:8px 16px 12px">')
        for c in l1_checks:
-            _render_l1_check(html, c, l2_by_parent.get(c.id, []))
+            _render_l1_check(html, c, l2_by_parent.get(c.id, []), doc_text)
        # Master-Control aggregation: with 1874 MCs evaluated per run,
        # rendering every L2 check inline produces ~600 rows per doc and
@@ -289,6 +380,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
 def _render_l1_check(
    html: list[str], c: CheckItem, children: list[CheckItem],
    doc_text: str = "",
 ) -> None:
    l2_sub = [ch for ch in children if not ch.skipped]
    l2_passed = sum(1 for ch in l2_sub if ch.passed)
@@ -301,16 +393,16 @@ def _render_l1_check(
    if l2_sub:
        html.append(f' <span style="color:#9ca3af;font-size:11px">({l2_passed}/{len(l2_sub)})</span>')
    if not c.passed and c.hint:
-        html.append(_hint_box(c.hint))
+        html.append(_hint_box(c.hint, c.label, doc_text))
    html.append('</div>')
    for ch in children:
        if ch.skipped:
            continue
-        _render_l2_check(html, ch)
+        _render_l2_check(html, ch, doc_text)
-def _render_l2_check(html: list[str], ch: CheckItem) -> None:
+def _render_l2_check(html: list[str], ch: CheckItem, doc_text: str = "") -> None:
    style = "color:#dc2626;font-weight:500" if not ch.passed else "color:#6b7280"
    html.append(
        f'<div style="padding:2px 0 2px 24px;border-left:2px solid #e5e7eb;margin-left:8px">'
@@ -324,7 +416,7 @@ def _render_l2_check(html: list[str], ch: CheckItem) -> None:
            f'white-space:nowrap">"...{ch.matched_text[:80]}..."</div>'
        )
    if not ch.passed and ch.hint:
-        html.append(_hint_box(ch.hint))
+        html.append(_hint_box(ch.hint, ch.label, doc_text))
    html.append('</div>')
@@ -1808,6 +1808,32 @@ async def list_categories():
 # SIMILAR CONTROLS (Embedding-based dedup)
 # =============================================================================
 _EMBEDDING_COL_AVAILABLE: bool | None = None
 def _has_embedding_col() -> bool:
    """Cache whether canonical_controls has the embedding column.
    Returns False on systems where pgvector and the embedding backfill were
    never set up. This avoids a failing query (HTTP 500) and the resulting
    log spam on every request.
    """
    global _EMBEDDING_COL_AVAILABLE
    if _EMBEDDING_COL_AVAILABLE is not None:
        return _EMBEDDING_COL_AVAILABLE
    try:
        with SessionLocal() as db:
            r = db.execute(text(
                "SELECT 1 FROM information_schema.columns "
                "WHERE table_schema='compliance' "
                "AND table_name='canonical_controls' "
                "AND column_name='embedding'"
            )).fetchone()
            _EMBEDDING_COL_AVAILABLE = bool(r)
    except Exception:
        # A failed probe may be transient (e.g. DB briefly unreachable):
        # report False for this request but do not cache it.
        return False
    return _EMBEDDING_COL_AVAILABLE
@router.get("/controls/{control_id}/similar")
 async def find_similar_controls(
    control_id: str,
@@ -1815,6 +1841,8 @@ async def find_similar_controls(
    limit: int = Query(20, ge=1, le=100),
 ):
    """Find controls similar to the given one using embedding cosine similarity."""
    if not _has_embedding_col():
        return []
    with SessionLocal() as db:
        # Get the target control's embedding
        target = db.execute(
@@ -1856,7 +1884,7 @@ async def find_similar_controls(
                    "title": r.title,
                    "severity": r.severity,
                    "release_state": r.release_state,
-                    "tags": r.tags or [],
+                    "tags": _jsonish(r.tags) or [],
                    "license_rule": r.license_rule,
                    "verification_method": r.verification_method,
                    "category": r.category,
@@ -1866,6 +1894,10 @@ async def find_similar_controls(
            ]
        except Exception as e:
            logger.warning("Embedding similarity query failed (no embedding column?): %s", e)
            try:
                db.rollback()
            except Exception:
                pass
            return []
@@ -1946,6 +1978,22 @@ async def get_v1_matches_endpoint(control_id: str):
 # INTERNAL HELPERS
 # =============================================================================
 def _jsonish(v):
    """Parse v as JSON if it's a string that looks like JSON, otherwise return as-is.
    Some canonical_controls rows were inserted with jsonb columns containing
    raw JSON strings (e.g. '["a","b"]' as a TEXT). The frontend expects real
    arrays — coerce here so .map() works.
    """
    if isinstance(v, str) and v and v[0] in "[{":
        try:
            import json as _j
            return _j.loads(v)
        except Exception:
            return v
    return v
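The coercion described in the `_jsonish` docstring can be illustrated with a standalone sketch. The helper is copied here under the hypothetical name `jsonish` so the snippet runs on its own; the sample values are made up for illustration.

```python
import json


def jsonish(v):
    """Standalone copy of the helper above, for illustration only."""
    if isinstance(v, str) and v and v[0] in "[{":
        try:
            return json.loads(v)
        except Exception:
            # Looks like JSON but does not parse: hand the value back unchanged.
            return v
    return v


print(jsonish('["a","b"]'))   # JSON text stored as a string becomes a real list
print(jsonish("[not json"))   # unparseable input is returned as-is
print(jsonish(None) or [])    # the `or []` fallback used by the row builders
```

Note that the check looks only at the first character, so a string with leading whitespace or a quoted JSON scalar is passed through untouched.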
 def _control_row(r) -> dict:
    return {
        "id": str(r.id),
@@ -1954,17 +2002,17 @@ def _control_row(r) -> dict:
        "title": r.title,
        "objective": r.objective,
        "rationale": r.rationale,
-        "scope": r.scope,
+        "scope": _jsonish(r.scope),
-        "requirements": r.requirements,
+        "requirements": _jsonish(r.requirements),
-        "test_procedure": r.test_procedure,
+        "test_procedure": _jsonish(r.test_procedure) or [],
-        "evidence": r.evidence,
+        "evidence": _jsonish(r.evidence) or [],
        "severity": r.severity,
        "risk_score": float(r.risk_score) if r.risk_score is not None else None,
        "implementation_effort": r.implementation_effort,
        "evidence_confidence": float(r.evidence_confidence) if r.evidence_confidence is not None else None,
-        "open_anchors": r.open_anchors,
+        "open_anchors": _jsonish(r.open_anchors) or [],
        "release_state": r.release_state,
-        "tags": r.tags or [],
+        "tags": _jsonish(r.tags) or [],
        "license_rule": r.license_rule,
        "source_original_text": r.source_original_text,
        "source_citation": r.source_citation,
@@ -0,0 +1,181 @@
 """
 Consent log export (Borlabs parity + DPO audit requirement).
 Auditors routinely request an extract of all granted and revoked
 consents per tenant. Until now the DPO had to write SQL by hand for
 this. These endpoints make CSV and JSON downloadable directly from
 the browser.
 Endpoints:
  GET  /einwilligungen/export/consents.csv
  GET  /einwilligungen/export/consents.json
  GET  /einwilligungen/export/history.csv  (change history)
 """
 from __future__ import annotations
 import csv
 import io
 import json
 import logging
 from datetime import datetime, timezone
 from fastapi import APIRouter, Depends, Header, Query
 from fastapi.responses import Response
 from sqlalchemy.orm import Session
 from classroom_engine.database import get_db
 from ..db.einwilligungen_models import (
    EinwilligungenConsentDB,
    EinwilligungenConsentHistoryDB,
 )
 logger = logging.getLogger(__name__)
 router = APIRouter(prefix="/einwilligungen/export", tags=["einwilligungen-export"])
 def _get_tenant(x_tenant_id: str | None = Header(None, alias="X-Tenant-ID")) -> str:
    if not x_tenant_id:
        from .tenant_utils import get_tenant_id
        return get_tenant_id()
    return x_tenant_id
 def _ts() -> str:
    return datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
 def _consent_rows(consents: list[EinwilligungenConsentDB]) -> list[dict]:
    return [
        {
            "consent_id": str(c.id),
            "user_id": c.user_id or "",
            "data_point_id": c.data_point_id or "",
            "granted": "yes" if c.granted else "no",
            "purpose": c.purpose or "",
            "consent_version": c.consent_version or "",
            "ip_address": c.ip_address or "",
            "user_agent": (c.user_agent or "")[:200],
            "source": c.source or "",
            "created_at": c.created_at.isoformat() if c.created_at else "",
            "updated_at": c.updated_at.isoformat() if c.updated_at else "",
            "revoked_at": c.revoked_at.isoformat() if getattr(c, "revoked_at", None) else "",
        }
        for c in consents
    ]
 def _history_rows(entries: list[EinwilligungenConsentHistoryDB]) -> list[dict]:
    return [
        {
            "id": str(e.id),
            "consent_id": str(e.consent_id),
            "action": e.action or "",
            "consent_version": e.consent_version or "",
            "ip_address": e.ip_address or "",
            "user_agent": (e.user_agent or "")[:200],
            "source": e.source or "",
            "created_at": e.created_at.isoformat() if e.created_at else "",
        }
        for e in entries
    ]
 def _csv_response(rows: list[dict], filename: str) -> Response:
    if not rows:
        return Response(content="", media_type="text/csv; charset=utf-8",
                        headers={"Content-Disposition": f"attachment; filename={filename}"})
    buf = io.StringIO()
    w = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
    w.writeheader()
    w.writerows(rows)
    return Response(content=buf.getvalue(), media_type="text/csv; charset=utf-8",
                    headers={"Content-Disposition": f"attachment; filename={filename}"})
 def _json_response(payload: dict, filename: str) -> Response:
    body = json.dumps(payload, ensure_ascii=False, indent=2, default=str)
    return Response(content=body, media_type="application/json; charset=utf-8",
                    headers={"Content-Disposition": f"attachment; filename={filename}"})
@router.get("/consents.csv")
 async def export_consents_csv(
    user_id: str | None = Query(None, description="Filter by single user"),
    granted: bool | None = Query(None),
    since: str | None = Query(None, description="ISO timestamp"),
    tenant_id: str = Depends(_get_tenant),
    db: Session = Depends(get_db),
 ) -> Response:
    """Download all consent records of this tenant as CSV (auditor-ready)."""
    q = db.query(EinwilligungenConsentDB).filter(
        EinwilligungenConsentDB.tenant_id == tenant_id,
    )
    if user_id:
        q = q.filter(EinwilligungenConsentDB.user_id == user_id)
    if granted is not None:
        q = q.filter(EinwilligungenConsentDB.granted == granted)
    if since:
        try:
            since_dt = datetime.fromisoformat(since.rstrip("Z"))
            q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
        except Exception:
            pass
    rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
    return _csv_response(rows, f"consents_{tenant_id[:8]}_{_ts()}.csv")
@router.get("/consents.json")
 async def export_consents_json(
    user_id: str | None = Query(None),
    granted: bool | None = Query(None),
    since: str | None = Query(None),
    tenant_id: str = Depends(_get_tenant),
    db: Session = Depends(get_db),
 ) -> Response:
    """Same data as the CSV endpoint but JSON-shaped for further processing."""
    q = db.query(EinwilligungenConsentDB).filter(
        EinwilligungenConsentDB.tenant_id == tenant_id,
    )
    if user_id:
        q = q.filter(EinwilligungenConsentDB.user_id == user_id)
    if granted is not None:
        q = q.filter(EinwilligungenConsentDB.granted == granted)
    if since:
        try:
            since_dt = datetime.fromisoformat(since.rstrip("Z"))
            q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
        except Exception:
            pass
    rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
    payload = {
        "tenant_id": tenant_id,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "filter": {"user_id": user_id, "granted": granted, "since": since},
        "count": len(rows),
        "consents": rows,
    }
    return _json_response(payload, f"consents_{tenant_id[:8]}_{_ts()}.json")
@router.get("/history.csv")
 async def export_history_csv(
    consent_id: str | None = Query(None, description="Limit to one consent"),
    since: str | None = Query(None),
    tenant_id: str = Depends(_get_tenant),
    db: Session = Depends(get_db),
 ) -> Response:
    """Download the consent-change history (proof of consent, Art. 7(1) GDPR)."""
    q = db.query(EinwilligungenConsentHistoryDB).filter(
        EinwilligungenConsentHistoryDB.tenant_id == tenant_id,
    )
    if consent_id:
        q = q.filter(EinwilligungenConsentHistoryDB.consent_id == consent_id)
    if since:
        try:
            since_dt = datetime.fromisoformat(since.rstrip("Z"))
            q = q.filter(EinwilligungenConsentHistoryDB.created_at >= since_dt)
        except Exception:
            pass
    rows = _history_rows(q.order_by(EinwilligungenConsentHistoryDB.created_at.asc()).all())
    return _csv_response(rows, f"consent-history_{tenant_id[:8]}_{_ts()}.csv")
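All three endpoints parse `since` with `datetime.fromisoformat(since.rstrip("Z"))` and silently ignore bad input, which drops the UTC marker and turns an invalid filter into an unfiltered export. A minimal sketch of a stricter alternative follows. `parse_since` is a hypothetical helper, not part of this diff, and treating naive timestamps as UTC is an assumption.

```python
from datetime import datetime, timezone


def parse_since(value: str) -> datetime:
    """Parse an ISO-8601 timestamp into an aware UTC datetime.

    Hypothetical helper. Raises ValueError on malformed input instead of
    silently dropping the filter.
    """
    text = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit UTC offset.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are meant as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


print(parse_since("2024-05-01T12:00:00Z").isoformat())
print(parse_since("2024-05-01T14:00:00+02:00").isoformat())
```

In an endpoint, the `ValueError` could be turned into an HTTP 400 so an auditor notices a mistyped filter. Whether `created_at` is stored timezone-aware is not visible in this diff, so the comparison side would need checking before adopting this.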
@@ -0,0 +1,167 @@
 """
 Cookie function classifier: determines what each individual cookie does.
 Today we store one category per vendor (analytics/advertising/...).
 But a vendor often sets 10-50 different cookies, and not every cookie
 of a marketing platform serves advertising. Many handle session
 management, language preference, scroll position, etc.
 This module classifies per cookie:
  - functional_role : what the cookie does technically (session_id,
    csrf_token, ab_test, user_id, ad_id, …)
  - data_collected  : which data it carries (visitor_id,
    page_view, click, conversion_event, …)
  - blocking_impact : what happens when the cookie is blocked
    (none, no_personalization, no_tracking, site_breaks)
 This lets the vendor redundancy analyzer state precisely:
  "Adobe Analytics sets 55 cookies: 12 for tracking, 8 for A/B testing
   and 35 for internal performance. Matomo covers the 12 tracking + 8 A/B
   test cookies, so 55 Adobe cookies become 20 Matomo cookies."
 """
 from __future__ import annotations
 import re
 from typing import Iterable
 # Pattern → (functional_role, blocking_impact)
 # Order matters: the first matching pattern wins, so more specific ones come first.
 _PATTERNS: list[tuple[str, str, str]] = [
    # Session / authentication
    (r"^(jsessionid|phpsessid|sessionid|sid|connect\.sid)$", "session_id", "site_breaks"),
    (r"sso|signon|auth|login|token|jwt|bearer",              "auth_token", "site_breaks"),
    (r"^csrf|xsrf|antiforgery",                              "csrf_token", "site_breaks"),
    # Language setting / region
    (r"lang|locale|culture|region",                          "preference", "no_personalization"),
    # User preferences (theme, view, bookmark)
    (r"theme|dark|mode|view|sort|filter",                    "ui_preference", "no_personalization"),
    (r"bookmark|favorite|favorit",                           "user_data", "no_personalization"),
    # The consent cookie itself
    (r"consent|gdpr|tcf|euconsent",                          "consent_state", "site_breaks"),
    # Tracking IDs (most analytics)
    (r"^_ga|gid|gat|google_analytic",                        "tracking_id", "no_tracking"),
    (r"^_pk_|matomo|piwik",                                  "tracking_id", "no_tracking"),
    (r"^s_|s\.cc|adobesite|aam",                             "tracking_id", "no_tracking"),  # Adobe
    (r"hjid|hjsession|hotjar",                               "session_recording", "no_tracking"),
    (r"_uetsid|_uetvid|microsoft",                           "tracking_id", "no_tracking"),
    # Visitor identification
    (r"visitor|uid|user_id|customer_id",                     "visitor_id", "no_personalization"),
    # A/B-Test / Personalisation
    (r"ab_test|abtest|variant|experiment|target|target_qa",  "ab_test", "no_personalization"),
    (r"personalization|personalisation|adobe_target",        "personalisation", "no_personalization"),
    # Advertising / retargeting
    (r"fbp|fbc|fb_id|facebook|meta_pixel|fr$",               "ad_pixel", "no_tracking"),
    (r"adform|criteo|outbrain|taboola|tapad|adsrvr",         "ad_pixel", "no_tracking"),
    (r"doubleclick|test_cookie|ide|nid|exchange_uid",        "ad_pixel", "no_tracking"),
    (r"google_ad|gads|gcl",                                  "ad_pixel", "no_tracking"),
    (r"^li_|linkedin|bcookie|bscookie",                      "ad_pixel", "no_tracking"),
    (r"pinterest|_pinterest_|_pin_unauth",                   "ad_pixel", "no_tracking"),
    # Affiliate / Conversion
    (r"conversion|orderid|order_id|transaction|purchase",    "conversion_event", "no_tracking"),
    (r"campaign|utm|source|medium|term",                     "campaign_attribution", "no_tracking"),
    # ScrollPosition / Form-Helper
    (r"scroll|position|form_|form_state",                    "ui_state", "no_personalization"),
    # Loadbalancer / Sticky
    (r"affinity|sticky|lb_|alb-|aws-alb",                    "load_balancer", "site_breaks"),
    # Chat / Support
    (r"chat|widget|genesys|livechat",                        "chat_session", "no_personalization"),
    # Captcha
    (r"hcaptcha|recaptcha|cf_|cloudflare",                   "bot_protection", "site_breaks"),
 ]
 _FUNCTIONAL_LABEL = {
    "session_id":          "Sitzungs-ID",
    "auth_token":          "Auth-Token",
    "csrf_token":          "CSRF-Schutz",
    "preference":          "Sprache / Region",
    "ui_preference":       "UI-Praeferenz",
    "user_data":           "Nutzer-Daten",
    "consent_state":       "Consent-Speicher",
    "tracking_id":         "Tracking-ID",
    "session_recording":   "Session-Recording",
    "visitor_id":          "Besucher-ID",
    "ab_test":             "A/B-Test",
    "personalisation":     "Personalisierung",
    "ad_pixel":            "Werbe-Pixel",
    "conversion_event":    "Konversions-Tracking",
    "campaign_attribution":"Kampagnen-Attribution",
    "ui_state":            "UI-Zustand (ScrollPos etc.)",
    "load_balancer":       "Load-Balancer",
    "chat_session":        "Chat-Session",
    "bot_protection":      "Bot-Schutz",
    "unknown":             "Unbekannt",
 }
 # Which functional_roles overlap functionally. Used by
 # vendor_redundancy.analyze() to detect real consolidation
 # opportunities instead of merely counting duplicate providers.
 OVERLAPPING_ROLES = {
    "tracking_id":         "tracking",
    "session_recording":   "tracking",
    "ab_test":             "personalisation",
    "personalisation":     "personalisation",
    "ad_pixel":            "advertising",
    "conversion_event":    "advertising",
    "campaign_attribution":"advertising",
 }
 def classify_cookie(cookie_name: str) -> tuple[str, str]:
    """Return (functional_role, blocking_impact) for a cookie name."""
    n = (cookie_name or "").lower().strip()
    for pattern, role, impact in _PATTERNS:
        if re.search(pattern, n):
            return role, impact
    return "unknown", "no_tracking"
 def annotate_vendor_cookies(vendor: dict) -> dict:
    """Enrich a vendor record with functional_role per cookie."""
    cookies = vendor.get("cookies") or []
    annotated = []
    role_counts: dict[str, int] = {}
    for c in cookies:
        role, impact = classify_cookie(c.get("name", ""))
        annotated.append({**c, "functional_role": role, "blocking_impact": impact})
        role_counts[role] = role_counts.get(role, 0) + 1
    return {
        **vendor,
        "cookies": annotated,
        "role_distribution": role_counts,
        "role_labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in role_counts},
    }
 def aggregate_cookie_purposes(vendors: Iterable[dict]) -> dict:
    """Tenant-wide distribution: how often does each functional role occur?"""
    total: dict[str, int] = {}
    by_vendor: dict[str, dict[str, int]] = {}
    for v in vendors:
        roles = v.get("role_distribution") or {}
        if not roles and v.get("cookies"):
            v = annotate_vendor_cookies(v)
            roles = v["role_distribution"]
        for r, n in roles.items():
            total[r] = total.get(r, 0) + n
        by_vendor[v.get("name", "")] = roles
    return {
        "total_per_role": total,
        "labels": {r: _FUNCTIONAL_LABEL.get(r, r) for r in total},
        "vendors_per_role": {
            r: [v for v, rd in by_vendor.items() if rd.get(r, 0) > 0]
            for r in total
        },
    }
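Because `classify_cookie` uses `re.search`, short unanchored alternatives such as `ide` or `nid` match anywhere inside a name. The sketch below takes one pattern from `_PATTERNS` and compares it with a stricter variant. `STRICT` is a hypothetical rewrite, and the sample cookie names are invented for illustration.

```python
import re

# Pattern copied verbatim from _PATTERNS above.
LOOSE = r"doubleclick|test_cookie|ide|nid|exchange_uid"

# Hypothetical stricter variant: the short tokens "ide" and "nid" must be
# delimited by the start/end of the name or by a non-alphanumeric
# character such as "_" or "-".
STRICT = (
    r"doubleclick|test_cookie|exchange_uid"
    r"|(?<![a-z0-9])(?:ide|nid)(?![a-z0-9])"
)

for name in ("IDE", "NID", "video_quality", "sidebar_open"):
    n = name.lower()
    loose = bool(re.search(LOOSE, n))
    strict = bool(re.search(STRICT, n))
    print(f"{name:15} loose={loose} strict={strict}")
```

With the loose pattern, names like `video_quality` contain `ide` and would be labelled `ad_pixel`. The delimiter-based variant keeps the real DoubleClick cookies while rejecting such substrings. The same idea would apply to other short tokens in the list, but which names occur in practice is something only real scan data can show.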
@@ -0,0 +1,608 @@
 """
 Cookie knowledge database: as much extractable knowledge as possible per cookie name.
 Each entry records:
  - vendor             : company setting the cookie (full name + country of establishment)
  - exact_purpose      : what the cookie does EXACTLY (not just its category)
  - data_collected     : which data fields (client ID, timestamp, IP, etc.)
  - ip_relevant        : is the IP address collected/transmitted?
  - ip_anonymized      : anonymized by default?
  - tcf_purpose_ids    : IAB TCF v2.2 purpose IDs (1-11)
  - iab_vendor_id      : IAB Global Vendor List ID (for TCF sync)
  - typical_lifetime   : how long it persists
  - reid_risk          : re-identification risk (low/medium/high)
  - technical_necessity: required under §25(2) TDDDG? (none/partial/full)
  - schrems_ii_status  : third-country transfer assessment
  - eugh_rulings       : relevant CJEU/CNIL/LfDI decisions
  - eu_alternative_*   : EU cookie/vendor replacement with the same function
  - notes              : other remarks (avoidance, configuration)
 Sources: Cookiepedia, IAB Europe TCF v2.2 Vendor List, Cookiebot DB,
 CNIL Cookies & Trackers Guidelines 2024, EDPB Cookie Guidelines 2/2023,
 DSK-Orientierungshilfe Telemedien 2021, vendor documentation.
 As of: 2026-05.
 Contributing: pull requests welcome. For the format see TEMPLATE_ENTRY at
 the end of the file.
 """
 from __future__ import annotations
 from typing import TypedDict
 class CookieKnowledge(TypedDict, total=False):
    vendor: str
    vendor_country: str
    exact_purpose: str
    data_collected: list[str]
    ip_relevant: bool
    ip_anonymized: bool
    tcf_purpose_ids: list[int]
    iab_vendor_id: int | None
    typical_lifetime: str
    reid_risk: str  # 'low' | 'medium' | 'high'
    technical_necessity: str  # 'none' | 'partial' | 'full'
    schrems_ii_status: str
    eugh_rulings: list[str]
    eu_alternative_cookies: list[str]
    eu_alternative_vendor: str
    notes: str
 # ─── Google ──────────────────────────────────────────────────────────
 _GOOGLE_BASE = {
    "vendor": "Google LLC", "vendor_country": "US",
    "schrems_ii_status": "Drittlandtransfer in die USA. Mit DPF "
                         "(EU-US Data Privacy Framework, 2023) wieder zulaessig, "
                         "aber bereits Klage NOYB anhaengig (Schrems III). "
                         "Risiko-Bewertung empfohlen.",
    "eugh_rulings": [
        "EuGH C-311/18 (Schrems II, 16.07.2020) — Privacy Shield gekippt",
        "CNIL SAN-2022-002 (10.02.2022) — Google Analytics ohne Anonymisierung "
        "unzulaessig",
        "Datenschutzkonferenz (DSK) Orientierungshilfe 2024 — GA4 mit "
        "Server-Side-Tagging als Mitigation moeglich",
    ],
 }
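Some keys in the `KB` dict are not literal cookie names: `_ga_*` is a wildcard and `_gat_gtag_UA_` is a prefix whose entry notes that pattern matching is needed. The diff does not show how lookups resolve these, so the following is only a sketch of one possible resolution order. `lookup_cookie` and `MINI_KB` are hypothetical, and reading keys that end in `_` as prefixes is an assumption.

```python
from fnmatch import fnmatchcase

# Tiny stand-in for the real KB; only the key shapes matter here.
MINI_KB = {
    "_ga": {"exact_purpose": "client id"},
    "_ga_*": {"exact_purpose": "GA4 stream"},
    "_gat_gtag_UA_": {"exact_purpose": "per-property throttle"},
}


def lookup_cookie(name, kb=MINI_KB):
    """Hypothetical lookup: exact key, then glob key, then prefix key."""
    if name in kb:
        return kb[name]
    # Keys containing "*" are treated as case-sensitive glob patterns.
    for key, entry in kb.items():
        if "*" in key and fnmatchcase(name, key):
            return entry
    # Assumption: keys ending in "_" are prefixes; prefer the longest match.
    prefixes = [k for k in kb if k.endswith("_") and name.startswith(k)]
    if prefixes:
        return kb[max(prefixes, key=len)]
    return None


print(lookup_cookie("_ga"))
print(lookup_cookie("_ga_ABC123"))
print(lookup_cookie("_gat_gtag_UA_12345_1"))
print(lookup_cookie("unrelated_cookie"))
```

Checking exact keys first keeps `_ga` from being swallowed by a broader pattern, and matching is case-sensitive because keys such as `NID` and `IDE` are upper-case in the KB.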
 KB: dict[str, CookieKnowledge] = {
    # ─── Google Analytics ─────────────────────────────────────────────
    "_ga": {
        **_GOOGLE_BASE,
        "exact_purpose": "Unterscheidet eindeutig Besucher; persistiert die "
                         "ueber alle Sessions hinweg gueltige Client-ID.",
        "data_collected": ["client_id", "first_visit_timestamp", "ip_address"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [8, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "2 Jahre",
        "reid_risk": "high",
        "technical_necessity": "none",
        "eu_alternative_cookies": ["_pk_id"],
        "eu_alternative_vendor": "Matomo",
        "notes": "Mit IP-Anonymisierung (`anonymizeIp: true`) + Server-Side-Tagging "
                 "DSGVO-konfigurierbar. Ohne diese Massnahmen einwilligungspflichtig.",
    },
    "_gid": {
        **_GOOGLE_BASE,
        "exact_purpose": "Unterscheidet Besucher innerhalb einer Session "
                         "(24h-Bucket).",
        "data_collected": ["session_id", "ip_address"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [8],
        "iab_vendor_id": 755,
        "typical_lifetime": "24 Stunden",
        "reid_risk": "medium",
        "technical_necessity": "none",
        "eu_alternative_cookies": ["_pk_ses"],
        "eu_alternative_vendor": "Matomo",
    },
    "_gat": {
        **_GOOGLE_BASE,
        "exact_purpose": "Throttling-Cookie — begrenzt Anzahl Requests an "
                         "Google Analytics pro Sekunde.",
        "data_collected": ["throttle_flag"],
        "ip_relevant": False, "ip_anonymized": True,
        "tcf_purpose_ids": [],
        "iab_vendor_id": 755,
        "typical_lifetime": "1 Minute",
        "reid_risk": "low",
        "technical_necessity": "none",
        "notes": "Reines Performance-Steuerungs-Cookie. Trotzdem einwilligungspflichtig "
                 "da er Teil des GA-Trackings ist.",
    },
    "_gat_gtag_UA_": {
        **_GOOGLE_BASE,
        "exact_purpose": "GTM-spezifisches Throttling — pro Universal-Analytics-Property.",
        "data_collected": ["throttle_flag"],
        "ip_relevant": False,
        "typical_lifetime": "1 Minute",
        "reid_risk": "low",
        "technical_necessity": "none",
        "notes": "Suffix nach `UA_` ist die Property-ID. Pattern-Match noetig.",
    },
    "_ga_*": {
        **_GOOGLE_BASE,
        "exact_purpose": "GA4-Persistierung — eindeutige Stream-ID + Session-Daten.",
        "data_collected": ["stream_id", "session_count", "session_start_ts"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [8, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "2 Jahre",
        "reid_risk": "high",
        "technical_necessity": "none",
        "notes": "GA4-Format. Suffix `_<measurement_id>`. Server-Side-GTM "
                 "ist die einzige praktikable DSGVO-Mitigation.",
    },
    "NID": {
        **_GOOGLE_BASE,
        "exact_purpose": "Personalisiert Google-Suche, AdSense, Maps; "
                         "speichert Praeferenzen + Sicherheits-Token.",
        "data_collected": ["user_pref_id", "session_id", "security_token"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "6 Monate",
        "reid_risk": "high",
        "technical_necessity": "none",
        "notes": "Klassischer Google-Werbe-Cookie. Hohe Re-ID-Gefahr da Cross-Site.",
    },
    "IDE": {
        "vendor": "Google LLC (DoubleClick)", "vendor_country": "US",
        "exact_purpose": "Conversion-Tracking + Werbe-Targeting bei "
                         "Google Display Network / DoubleClick.",
        "data_collected": ["doubleclick_id", "ad_interactions"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 755,
        "typical_lifetime": "13 Monate",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": _GOOGLE_BASE["schrems_ii_status"],
        "eugh_rulings": _GOOGLE_BASE["eugh_rulings"],
    },
    "test_cookie": {
        **_GOOGLE_BASE,
        "exact_purpose": "DoubleClick-Probe-Cookie — testet ob Browser Cookies akzeptiert.",
        "data_collected": ["browser_supports_cookies"],
        "ip_relevant": False,
        "typical_lifetime": "15 Minuten",
        "reid_risk": "low",
        "technical_necessity": "none",
    },
    # ─── Meta / Facebook ──────────────────────────────────────────────
    "_fbp": {
        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "First-Party-Pixel von Facebook/Meta — identifiziert "
                         "den Browser fuer Werbe-Conversion-Tracking + Ad-Retargeting.",
        "data_collected": ["browser_id", "first_visit_ts"],
        "ip_relevant": True, "ip_anonymized": False,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 891,
        "typical_lifetime": "90 Tage",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "Daten gehen an Meta US trotz IE-Sitz. "
                             "Aktuell DPF-abgedeckt, aber Schrems III erwartet.",
        "eugh_rulings": [
            "EuGH C-311/18 (Schrems II)",
            "EDSA Aufsichts-Statement 2023 zu Meta-Pixel",
            "LDA Bayern Pruefverfuegung 2024",
        ],
        "eu_alternative_vendor": "Smart AdServer (Equativ, FR)",
        "notes": "Conversions API (CAPI) als Server-Side-Variante reduziert Cookie-"
                 "Abhaengigkeit. Trotzdem werden Daten an Meta gesendet.",
    },
    "_fbc": {
        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "Speichert die ClickID einer Meta-Ad-Kampagne; "
                         "ordnet Conversion dem urspruenglichen Ad-Klick zu.",
        "data_collected": ["fbclid", "ad_campaign_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9],
        "iab_vendor_id": 891,
        "typical_lifetime": "90 Tage",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    "fr": {
        "vendor": "Meta Platforms Ireland Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "Werbe-Targeting + Cross-Site-Tracking auf "
                         "Facebook-Plattform.",
        "data_collected": ["encrypted_user_id", "session_data"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 891,
        "typical_lifetime": "3 Monate",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    # ─── Adobe ────────────────────────────────────────────────────────
    "s_cc": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Cookie-Test — prueft ob der Browser Cookies "
                         "akzeptiert (Adobe Analytics Bootstrap).",
        "data_collected": ["browser_supports_cookies"],
        "ip_relevant": False,
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "partial",
        "schrems_ii_status": "Adobe-IE-Hosting, jedoch US-Datentransfer fuer "
                             "Cloud-Services. DPF-abgedeckt.",
    },
    "s_sq": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Speichert den letzten Klick (URL + Position) "
                         "fuer Click-Map-Reports.",
        "data_collected": ["last_click_url", "last_click_xy"],
        "ip_relevant": False,
        "tcf_purpose_ids": [8],
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "none",
    },
    "AMCV_": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Adobe Marketing Cloud Visitor ID — Cross-Tool-ID fuer "
                         "Analytics + Target + Audience Manager.",
        "data_collected": ["marketing_cloud_visitor_id", "first_visit_ts"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 8, 9, 10],
        "typical_lifetime": "2 Jahre",
        "reid_risk": "high",
        "technical_necessity": "none",
        "notes": "Suffix nach `AMCV_` ist die Org-ID. Klassischer Cross-Tool-Track.",
    },
    "mbox": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Adobe Target — A/B-Testing, Personalisierung, "
                         "Audience-Targeting.",
        "data_collected": ["mbox_visitor_id", "experiment_assignments"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "typical_lifetime": "2 Jahre",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    "s_target_qa": {
        "vendor": "Adobe Systems Software Ireland Limited", "vendor_country": "IE/US",
        "exact_purpose": "Adobe Target Premium-Feature — QA-Modus fuer Personalisations-Tests.",
        "data_collected": ["target_qa_session"],
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "none",
        "notes": "Premium-Feature-Cookie. Anwesenheit = Enterprise-Lizenz Indikator.",
    },
    # ─── Microsoft / Bing ─────────────────────────────────────────────
    "MUID": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Microsoft-User-ID fuer Bing, Microsoft Advertising, "
                         "Clarity Heatmaps.",
        "data_collected": ["microsoft_user_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 8, 9, 10],
        "iab_vendor_id": 165,
        "typical_lifetime": "13 Monate",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "US-Transfer, DPF-zertifiziert.",
    },
    "_uetsid": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Universal Event Tracking — Session-Cookie fuer "
                         "Microsoft Advertising Conversion-Tracking.",
        "data_collected": ["session_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [9],
        "typical_lifetime": "30 Minuten",
        "reid_risk": "medium",
        "technical_necessity": "none",
    },
    "_uetvid": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Universal Event Tracking — persistente Visitor-ID.",
        "data_collected": ["visitor_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9],
        "typical_lifetime": "13 Monate",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    # ─── LinkedIn ─────────────────────────────────────────────────────
    "bcookie": {
        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
        "exact_purpose": "Browser-Identifikation; wird genutzt fuer Anmelde-"
                         "Vorgang + LinkedIn Insight-Tag-Tracking.",
        "data_collected": ["browser_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 8, 9],
        "iab_vendor_id": 14,
        "typical_lifetime": "1 Jahr",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "US-Datenuebermittlung (Mutter Microsoft).",
    },
    "lidc": {
        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
        "exact_purpose": "Daten-Routing zwischen LinkedIn-Datacentern.",
        "data_collected": ["routing_id"],
        "ip_relevant": True,
        "typical_lifetime": "1 Tag",
        "reid_risk": "low",
        "technical_necessity": "partial",
    },
    "li_gc": {
        "vendor": "LinkedIn Ireland Unlimited Company", "vendor_country": "IE/US",
        "exact_purpose": "Speichert die Cookie-Einwilligung fuer LinkedIn-Embeds.",
        "data_collected": ["consent_state"],
        "ip_relevant": False,
        "typical_lifetime": "6 Monate",
        "reid_risk": "low",
        "technical_necessity": "full",
    },
    # ─── Matomo (EU-Alternative) ──────────────────────────────────────
    "_pk_id": {
        "vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
        "exact_purpose": "Matomo Visitor-ID — Pendant zu _ga, aber DSGVO-konform "
                         "wenn IP-Anonymisierung aktiv.",
        "data_collected": ["visitor_id", "first_visit_ts"],
        "ip_relevant": True, "ip_anonymized": True,
        "tcf_purpose_ids": [8],
        "typical_lifetime": "13 Monate",
        "reid_risk": "low",  # with IP anonymization enabled
        "technical_necessity": "none",
        "schrems_ii_status": "Bei On-Premise-Deployment KEIN Drittlandtransfer. "
                             "Matomo Cloud EU mit Frankfurt-Hosting verfuegbar.",
        "notes": "Empfohlener Drop-in-Ersatz fuer Google Analytics.",
    },
    "_pk_ses": {
        "vendor": "InnoCraft Ltd (Matomo)", "vendor_country": "NZ-mit-EU-self-hosting",
        "exact_purpose": "Matomo Session-Cookie.",
        "data_collected": ["session_id"],
        "ip_relevant": False,
        "typical_lifetime": "30 Minuten",
        "reid_risk": "low",
        "technical_necessity": "none",
    },
    # ─── Captcha ──────────────────────────────────────────────────────
    "hcaptcha": {
        "vendor": "Intuition Machines Inc. (hCaptcha)", "vendor_country": "US",
        "exact_purpose": "hCaptcha Bot-Erkennung — Session-Token fuer Challenge-Solver.",
        "data_collected": ["bot_score", "session_id", "ip_address"],
        "ip_relevant": True,
        "typical_lifetime": "Session",
        "reid_risk": "medium",
        "technical_necessity": "full",
        "schrems_ii_status": "US-Transfer (Intuition Machines, San Francisco).",
        "eu_alternative_vendor": "Friendly Captcha (DE), Turnstile (Cloudflare EU)",
        "notes": "Technisch erforderlich fuer Bot-Schutz, aber EU-Alternativen "
                 "ohne Drittland-Risiko verfuegbar.",
    },
    "cf_clearance": {
        "vendor": "Cloudflare Inc.", "vendor_country": "US",
        "exact_purpose": "Cloudflare-Bot-Management Pro — bestaetigt dass User "
                         "die JS-Challenge bestanden hat.",
        "data_collected": ["challenge_token"],
        "ip_relevant": True,
        "typical_lifetime": "30 Minuten",
        "reid_risk": "low",
        "technical_necessity": "full",
        "notes": "Premium-Feature-Cookie. Anwesenheit = Cloudflare Bot-Management "
                 "Pro im Einsatz.",
    },
    # ─── CDN / Performance ────────────────────────────────────────────
    "__cf_bm": {
        "vendor": "Cloudflare Inc.", "vendor_country": "US",
        "exact_purpose": "Cloudflare Bot Management — basis Bot-Erkennung.",
        "data_collected": ["bot_score", "client_hash"],
        "ip_relevant": True,
        "typical_lifetime": "30 Minuten",
        "reid_risk": "low",
        "technical_necessity": "full",
        "notes": "Strictly necessary nach §25(2) TDDDG (Sicherheit). Keine Einwilligung noetig.",
    },
    "aws-alb": {
        "vendor": "Amazon Web Services Inc.", "vendor_country": "US",
        "exact_purpose": "AWS Application Load Balancer Sticky Sessions — "
                         "routet Anfragen konsistent an dieselbe Backend-Instanz.",
        "data_collected": ["target_instance_id"],
        "ip_relevant": False,
        "typical_lifetime": "1 Stunde",
        "reid_risk": "low",
        "technical_necessity": "full",
        "schrems_ii_status": "AWS Frankfurt verfuegbar — bei korrektem Region-Setup "
                             "kein US-Transfer.",
    },
    # ─── Retargeting / Advertising ────────────────────────────────────
    "_pin_unauth": {
        "vendor": "Pinterest Europe Ltd.", "vendor_country": "IE/US",
        "exact_purpose": "Pinterest Tag — Conversion-Tracking + Audience-Aufbau.",
        "data_collected": ["pinterest_user_id"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 762,
        "typical_lifetime": "1 Jahr",
        "reid_risk": "high",
        "technical_necessity": "none",
    },
    "cto_dna": {
        "vendor": "Criteo S.A.", "vendor_country": "FR",
        "exact_purpose": "Criteo Dynamic Retargeting — produktspezifische "
                         "Werbeauslieferung basierend auf Browser-History.",
        "data_collected": ["criteo_user_id", "product_views"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 91,
        "typical_lifetime": "13 Monate",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "Criteo ist FR-basiert, aber Daten gehen auch in die USA. "
                             "Multi-Region-Setup pruefen.",
        "notes": "Trotz FR-Sitz nicht automatisch DSGVO-konform: CNIL-Bussgeld "
                 "EUR 40M (Juni 2023) u.a. wegen fehlenden Einwilligungs-Nachweises.",
    },
    "afm": {
        "vendor": "Adform A/S", "vendor_country": "DK",
        "exact_purpose": "Adform Audience Matching — Cross-Device-Identifikation "
                         "fuer programmatische Werbung.",
        "data_collected": ["adform_user_id", "device_signals"],
        "ip_relevant": True,
        "tcf_purpose_ids": [4, 9, 10],
        "iab_vendor_id": 50,
        "typical_lifetime": "30 Tage",
        "reid_risk": "high",
        "technical_necessity": "none",
        "schrems_ii_status": "Adform ist DK-basiert, EU-Hosting Standard. Keine "
                             "Schrems-II-Probleme bei Standard-Setup.",
    },
    # ─── Consent / Funktional (Strictly Necessary) ────────────────────
    "JSESSIONID": {
        "vendor": "Java EE / Tomcat (Site-Software)", "vendor_country": "N/A",
        "exact_purpose": "Server-Session-ID fuer Java-basierte Anwendungen.",
        "data_collected": ["session_id"],
        "ip_relevant": False,
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "full",
        "notes": "Strictly necessary. Keine Einwilligung nach §25(2) TDDDG.",
    },
    "PHPSESSID": {
        "vendor": "PHP (Site-Software)", "vendor_country": "N/A",
        "exact_purpose": "PHP-Session-ID — serverseitige Sitzungs-Persistierung.",
        "data_collected": ["session_id"],
        "ip_relevant": False,
        "typical_lifetime": "Session",
        "reid_risk": "low",
        "technical_necessity": "full",
    },
    "cookie_consent": {
        "vendor": "BreakPilot Consent-Banner (own)", "vendor_country": "DE",
        "exact_purpose": "Speichert die Einwilligungsentscheidung des Nutzers "
                         "pro Kategorie.",
        "data_collected": ["consent_state_per_category", "timestamp"],
        "ip_relevant": False,
        "typical_lifetime": "180 Tage",
        "reid_risk": "low",
        "technical_necessity": "full",
        "notes": "Strictly necessary nach Art. 7(1) DSGVO Nachweispflicht.",
    },
    # ─── Templated / pattern-based entries (variable suffix) ──────────
    # These are resolved via the regex lookup; the entry serves as the fallback target.
    "_uet_": {
        "vendor": "Microsoft Corp.", "vendor_country": "US",
        "exact_purpose": "Microsoft Universal Event Tracking — Suffix nach `_uet_`.",
        "data_collected": ["event_id"],
        "ip_relevant": True,
        "typical_lifetime": "30 Minuten",
        "reid_risk": "medium",
        "technical_necessity": "none",
    },
 }
 # ─── Pattern lookup for cookies with a variable suffix ──────────────
 _PATTERN_LOOKUPS: list[tuple[str, str]] = [
    (r"^_ga_[A-Z0-9_]+$",     "_ga_*"),
    (r"^_gat_gtag_UA_",       "_gat_gtag_UA_"),
    (r"^AMCV_",               "AMCV_"),
    (r"^_uet[a-z]+",          "_uet_"),
    (r"^aws-alb",             "aws-alb"),
    (r"^_pk_id\.",            "_pk_id"),
    (r"^_pk_ses\.",           "_pk_ses"),
 ]
 def lookup_cookie(name: str) -> CookieKnowledge | None:
    """Return rich knowledge for a cookie name, or None if unknown."""
    import re
    if not name:
        return None
    # Direct hit
    if name in KB:
        return KB[name]
    # Pattern-based
    for pattern, kb_key in _PATTERN_LOOKUPS:
        if re.search(pattern, name):
            return KB.get(kb_key)
    # Strip common suffixes (.bmw.de, .domain etc.)
    base = name.split(".", 1)[0]
    if base != name and base in KB:
        return KB[base]
    return None
 def enrich_vendor_with_knowledge(vendor: dict) -> dict:
    """Add per-cookie knowledge to each cookie in vendor['cookies']."""
    cookies = vendor.get("cookies") or []
    enriched = []
    for c in cookies:
        info = lookup_cookie(c.get("name", ""))
        if info:
            enriched.append({**c, "knowledge": info})
        else:
            enriched.append(c)
    return {**vendor, "cookies": enriched}
 # ─── Helpers for aggregate reports ──────────────────────────────────
 def summarize_compliance_risk(vendor: dict) -> dict:
    """Aggregate re-identification risk and third-country status across all cookies of a vendor."""
    cookies = vendor.get("cookies") or []
    risk_counts = {"high": 0, "medium": 0, "low": 0}
    schrems_affected = 0
    technical_only = 0
    for c in cookies:
        k = c.get("knowledge") or lookup_cookie(c.get("name", ""))
        if not k:
            continue
        risk = k.get("reid_risk", "low")
        risk_counts[risk] = risk_counts.get(risk, 0) + 1
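        # Token match on the country code ("IE/US" → IE, US) instead of a substring test.
        country_codes = k.get("vendor_country", "").upper().replace("-", "/").split("/")
        status = k.get("schrems_ii_status", "").lower()
        # A status that explicitly rules out Schrems II issues must not count as affected.
        status_flags_transfer = "schrems" in status and "keine schrems" not in status
        if "US" in country_codes or status_flags_transfer:
            schrems_affected += 1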
        if k.get("technical_necessity") == "full":
            technical_only += 1
    return {
        "reid_risk_distribution": risk_counts,
        "high_risk_cookie_count": risk_counts["high"],
        "schrems_ii_affected_cookies": schrems_affected,
        "strictly_necessary_cookies": technical_only,
        "total_classified": sum(risk_counts.values()),
    }
 # ─── TEMPLATE_ENTRY (for curating new cookies) ─────────────────────
 TEMPLATE_ENTRY: CookieKnowledge = {
    "vendor": "<Voller Firmenname>",
    "vendor_country": "<ISO-2 Code oder 'IE/US' bei Doppel-Standort>",
    "exact_purpose": "<1-2 Saetze was der Cookie GENAU tut>",
    "data_collected": ["<feldname_1>", "<feldname_2>"],
    "ip_relevant": False,
    "ip_anonymized": False,
    "tcf_purpose_ids": [],   # TCF v2.2: 1-11
    "iab_vendor_id": None,   # From https://iabeurope.eu/tcf-vendor-list/
    "typical_lifetime": "<Session | XX Tage | XX Monate | XX Jahre>",
    "reid_risk": "low",      # low | medium | high
    "technical_necessity": "none",  # none | partial | full
    "schrems_ii_status": "<Drittlandtransfer-Bewertung>",
    "eugh_rulings": [],
    "eu_alternative_cookies": [],
    "eu_alternative_vendor": "",
    "notes": "",
 }
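Reviewer sketch: `lookup_cookie` resolves a name in three stages (direct hit, then pattern, then suffix strip). That order can be exercised in isolation. The snippet below is a minimal sketch under stated assumptions: `MINI_KB`, `MINI_PATTERNS` and `mini_lookup` are hypothetical stand-ins with two entries, not part of the module.

```python
import re

# Hypothetical two-entry stand-in for the real KB dict (illustration only).
MINI_KB = {
    "_pk_id": {"vendor": "InnoCraft Ltd (Matomo)", "reid_risk": "low"},
    "_fbp": {"vendor": "Meta Platforms Ireland Ltd.", "reid_risk": "high"},
}
# Hypothetical stand-in for _PATTERN_LOOKUPS, precompiled once at import time.
MINI_PATTERNS = [(re.compile(r"^_pk_id\."), "_pk_id")]


def mini_lookup(name):
    """Resolve a cookie name: direct hit, then pattern, then suffix strip."""
    if not name:
        return None
    if name in MINI_KB:                      # 1) exact key
        return MINI_KB[name]
    for pattern, key in MINI_PATTERNS:       # 2) variable-suffix pattern
        if pattern.search(name):
            return MINI_KB.get(key)
    base = name.split(".", 1)[0]             # 3) strip ".bmw.de"-style suffix
    if base != name and base in MINI_KB:
        return MINI_KB[base]
    return None
```

A name such as `_pk_id.3.a1b2` resolves at stage 2, while `_fbp.bmw.de` only resolves at stage 3. Unknown names return `None`, so callers can leave the cookie unenriched.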
@@ -220,10 +220,16 @@ def score_vendors(vendors: list[dict]) -> list[dict]:
            flags.append("no_purpose")
        # Country — only for external processors / controllers
        # If country is empty, derive it from the legal-form suffix in the name.
        if country_required:
            max_score += 10
            if v.get("country"):
                score += 10
            elif inferred := _country_from_name(v.get("name", "")):
                # Derived from a known-vendor entry, a country name, or the legal-form suffix.
                v["country"] = inferred
                v["country_inferred"] = True
                score += 10
            else:
                flags.append("no_country")
@@ -321,3 +327,153 @@ def build_check_items(validated: list[LinkCheck]) -> list[dict]:
            "hint": hint,
        })
    return items
 # ─── Country inference from the legal-form suffix ──────────────────
 #
 # When a vendor's "country" field is empty, it can often be derived from
 # the company name:
 #   Adform A/S          → DK (Denmark, Aktieselskab)
 #   Pinterest Europe Ltd. → IE (known-vendor override; plain "Ltd." defaults to GB)
 #   Salesforce Inc.     → US (Incorporated)
 #   Adobe ... Ireland Limited → IE (country name in the company name)
 #   Genesys ... B.V.    → NL (Netherlands, Besloten Vennootschap)
 #   Equativ S.A.        → FR (Société Anonyme)
 #   SAP SE              → DE (Societas Europaea; usually DE-registered)
 #
 # Combined strategy, most specific first (the order _country_from_name applies):
 #   1) Known vendor (Google LLC, Meta Platforms Ireland Ltd, ...)
 #   2) Country name in the company name ('Ireland', 'Deutschland')
 #   3) Legal-form suffix pattern
 import re as _re
 _SUFFIX_COUNTRY: list[tuple[str, str]] = [
    # Pattern → ISO code. First match wins, so the more specific form must come
    # before its prefix. Dotted forms end in (?!\w) instead of \b: \b cannot
    # match between a trailing "." and a space or the end of the string.
    (r"\bA/S\b",                          "DK"),  # Aktieselskab
    (r"\bApS\b",                          "DK"),  # Anpartsselskab
    (r"\bAB\b",                           "SE"),  # Aktiebolag
    (r"\bAS\b(?!\w)",                     "NO"),  # Aksjeselskap
    (r"\bOy\b",                           "FI"),  # Osakeyhtiö
    (r"\bAG\b(?!\w)",                     "DE"),  # CH/AT also possible; default DE
    (r"\bGmbH\b",                         "DE"),
    (r"\bUG\b",                           "DE"),
    (r"\beG\b",                           "DE"),
    (r"\bKG\b",                           "DE"),
    (r"\bOHG\b",                          "DE"),
    (r"\bSE\b",                           "DE"),  # Societas Europaea; usually DE-registered (e.g. SAP SE)
    (r"\bS\.A\.\sde C\.V\.(?!\w)",        "MX"),  # must precede plain S.A.
    (r"\bS\.A\.(?!\w)",                   "FR"),  # also used in ES/BE/LU; default FR
    (r"\bSAS\b",                          "FR"),
    (r"\bS\.A\.S\.(?!\w)",                "FR"),
    (r"\bSARL\b",                         "FR"),
    (r"\bS\.r\.l\.(?!\w)",                "IT"),
    (r"\bS\.p\.A\.(?!\w)",                "IT"),
    (r"\bSpA\b",                          "IT"),
    (r"\bB\.V\.(?!\w)",                   "NL"),
    (r"\bN\.V\.(?!\w)",                   "NL"),
    (r"\bSL\b",                           "ES"),
    (r"\bd\.o\.o\.(?!\w)",                "SI"),  # Slovenia
    (r"\bd\.d\.(?!\w)",                   "HR"),  # Croatia
    (r"\bz\s?o\.o\.(?!\w)",               "PL"),
    (r"\bInc\.?\b",                       "US"),
    (r"\bIncorporated\b",                 "US"),
    (r"\bCorp\.?\b",                      "US"),
    (r"\bCorporation\b",                  "US"),
    (r"\bLLC\b",                          "US"),
    (r"\bL\.L\.C\.(?!\w)",                "US"),
    (r"\bPte\.?\sLtd\.?\b",               "SG"),  # must precede plain Ltd.
    (r"\bPty\b",                          "AU"),  # must precede plain Ltd. ("Pty Ltd")
    (r"\bLtd\.?\b",                       "GB"),  # UK Limited, default
    (r"\bLimited\b",                      "GB"),
    (r"\bPLC\b",                          "GB"),
    (r"\bK\.K\.(?!\w)",                   "JP"),  # Kabushiki-Kaisha
 ]
 # Country names inside the company name (e.g. "Adobe Systems Software Ireland Limited")
 _COUNTRY_NAME_TOKENS: list[tuple[str, str]] = [
    ("ireland",          "IE"),
    ("deutschland",      "DE"),
    ("germany",          "DE"),
    ("netherlands",      "NL"),
    ("france",           "FR"),
    ("united kingdom",   "GB"),
    ("uk",               "GB"),
    ("usa",              "US"),
    ("united states",    "US"),
    ("austria",          "AT"),
    ("oesterreich",      "AT"),
    ("schweiz",          "CH"),
    ("switzerland",      "CH"),
    ("luxembourg",       "LU"),
    ("luxemburg",        "LU"),
    ("denmark",          "DK"),
    ("daenemark",        "DK"),
    ("sweden",           "SE"),
    ("schweden",         "SE"),
    ("norway",           "NO"),
    ("norwegen",         "NO"),
    ("finland",          "FI"),
    ("finnland",         "FI"),
 ]
 # Known vendors with an unambiguous seat (override)
 _KNOWN_VENDOR_COUNTRY: dict[str, str] = {
    "google inc":                      "US",
    "google llc":                      "US",
    "google ireland":                  "IE",
    "meta platforms ireland":          "IE",
    "facebook ireland":                "IE",
    "amazon.com inc":                  "US",
    "amazon web services":             "US",
    "amazon web services inc":         "US",
    "linkedin inc":                    "US",
    "salesforce inc":                  "US",
    "salesforce.com":                  "US",
    "outbrain inc":                    "US",
    "taboola inc":                     "US",
    "pinterest europe ltd":            "IE",
    "intuition machines inc":          "US",
    "akamai technologies inc":         "US",
    "criteo s.a":                      "FR",
    "criteo sa":                       "FR",
    "adform a/s":                      "DK",
    "speedcurve limited":              "GB",
    "longtail ad solutions":           "US",
    "genesys cloud services b.v":      "NL",
    "qualtrics":                       "US",
    "teads sa":                        "FR",
    "teads s.a":                       "FR",
    "salesviewer gmbh":                "DE",
    "baqend gmbh":                     "DE",
    "zenweshare sas":                  "FR",
    "nayoki gmbh":                     "DE",
    "psyma":                           "DE",
    "matomo":                          "NZ",   # InnoCraft is NZ-based, but Matomo can be EU-hosted
    "adobe systems software ireland":  "IE",
    "microsoft corporation":           "US",
    "microsoft corp":                  "US",
 }
 def _country_from_name(vendor_name: str) -> str:
    """Best effort: derive an ISO-2 country code from the vendor name."""
    if not vendor_name:
        return ""
    # Vendor names often look like "<company> — <tool>"; only inspect the company part.
    firm = vendor_name.split(" — ")[0].strip()
    firm_l = firm.lower()
    # 1) Known vendor lookup (most specific)
    for k, v in _KNOWN_VENDOR_COUNTRY.items():
        if k in firm_l:
            return v
    # 2) Country name inside the company name. Match whole words only, so that
    #    short tokens such as "uk" or "usa" do not fire inside "Suzuki" or "Thousand".
    for token, code in _COUNTRY_NAME_TOKENS:
        if _re.search(rf"\b{_re.escape(token)}\b", firm_l):
            return code
    # 3) Legal-form suffix
    for pattern, code in _SUFFIX_COUNTRY:
        if _re.search(pattern, firm):
            return code
    return ""
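Reviewer sketch: two `re` behaviours matter for name-based country inference. First, a pattern that ends in a literal dot followed by `\b` only matches when a word character follows the dot, so it misses `"Criteo S.A."`. A negative lookahead `(?!\w)` behaves as intended. Second, a plain substring test on short tokens fires inside unrelated words. The helpers below are hypothetical and exist only to demonstrate both points.

```python
import re


def matches_with_boundary(name: str) -> bool:
    """Suffix test using a trailing \\b after the final dot."""
    return re.search(r"\bS\.A\.\b", name) is not None


def matches_with_lookahead(name: str) -> bool:
    """Suffix test using a negative lookahead after the final dot."""
    return re.search(r"\bS\.A\.(?!\w)", name) is not None


def has_substring(name: str, token: str) -> bool:
    """Naive containment test, case-insensitive."""
    return token in name.lower()


def has_whole_word(name: str, token: str) -> bool:
    """Containment test that requires word boundaries around the token."""
    return re.search(rf"\b{re.escape(token)}\b", name.lower()) is not None
```

With these helpers, `matches_with_boundary("Criteo S.A.")` is false while `matches_with_lookahead("Criteo S.A.")` is true, and the lookahead variant still rejects `"Foo S.A.S."`. Likewise `has_substring("ThousandEyes Inc.", "usa")` is true although the name has nothing to do with the USA, whereas `has_whole_word` rejects it and still accepts `"Acme USA Inc."`.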
@@ -0,0 +1,350 @@
 """
 Doc anchor locator: for a finding, find the best insertion point in the
 existing document.
 Primary strategy: BGE-M3 embedding match between a per-finding anchor
 query and every paragraph of the doc. This is genuine semantic matching
 (BMW writes "Verarbeiter" instead of "Auftragsverarbeiter"; a keyword
 match would miss it, the embedding match catches it).
 Fallback: keyword match (when no embedding service is reachable).
 Output per anchor:
  - anchor_phrase     : excerpt from the original text
  - position_hint     : "Nach Absatz X von Y: '...'"
  - confidence        : 'high' | 'medium' | 'low'
  - score             : float (cosine similarity or keyword rank)
  - method            : 'embedding' | 'keyword' | 'fallback'
 """
 from __future__ import annotations
 import logging
 import math
 import os
 import re
 import threading
 from typing import Iterable
 import httpx
 logger = logging.getLogger(__name__)
 EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
 # One semantically rich anchor query per finding type, which the embedding
 # matcher runs against the doc text. Richer than the short MC check_question.
 # It does NOT search for the mandatory text itself (that is what is missing)
 # but for the paragraph the fix belongs IN, i.e. the thematically related context.
 _ANCHOR_QUERIES: list[tuple[str, str, str]] = [
    # (finding_label_partial, anchor_query, fallback_hint)
    (
        "Auftragsverarbeiter erwaehnt",
        "Empfaenger der Daten Verarbeiter Dienstleister Cloud Hosting CRM "
        "Auftragsverarbeitung Weitergabe Datenuebermittlung an Dritte",
        "Im Abschnitt 'Empfaenger' oder 'Datenuebermittlung'",
    ),
    (
        "Automatisierte Entscheidungen",
        "Betroffenenrechte automatisierte Entscheidung Profiling Logik "
        "Tragweite Auswirkung Art. 22 DSGVO",
        "Am Ende des Abschnitts 'Betroffenenrechte'",
    ),
    (
        "Konkrete Aufsichtsbehoerde",
        "Beschwerderecht Datenschutzaufsicht Aufsichtsbehoerde Beschwerde "
        "bei der Behoerde einreichen Recht auf Beschwerde",
        "Im Abschnitt 'Beschwerderecht'",
    ),
    (
        "Angemessenheitsbeschluss",
        "Drittlandtransfer USA Standardvertragsklauseln SCC DPF Data Privacy "
        "Framework Angemessenheitsbeschluss internationale Datenuebermittlung",
        "Im Abschnitt 'Drittlandtransfer'",
    ),
    (
        "Anschrift des Verantwortlichen",
        "Verantwortlicher Verantwortliche Stelle Datenschutz Betreiber dieser "
        "Website Firma Anschrift Kontakt",
        "Am Anfang der Datenschutzerklaerung / Cookie-Richtlinie",
    ),
    (
        "Konkrete Cookie-Namen",
        "Welche Cookies verwenden wir Cookie-Tabelle Liste der Cookies "
        "Cookie-Kategorien Auflistung der Cookies Name Anbieter Zweck",
        "Im Abschnitt 'Welche Cookies verwenden wir?'",
    ),
    (
        "Konkrete Anbieter/Dienste",
        "Drittanbieter Dienste Anbieter wir nutzen folgende Dienste "
        "Empfaenger der Cookie-Daten Liste der Dienstleister",
        "In der Drittanbieter-Liste der Cookie-Richtlinie",
    ),
    (
        "Analytics-/Statistik-Tools konkret benannt",
        "Statistik Analytics Reichweitenmessung Webanalyse Tracking "
        "Google Analytics Matomo Adobe Analytics",
        "Im Abschnitt 'Statistik / Analyse-Cookies'",
    ),
    (
        "Konkrete Speicherdauer",
        "Speicherdauer Lebensdauer wie lange Ablauf Cookie-Tabelle Spalte "
        "Speicherdauer pro Cookie",
        "In der Cookie-Tabelle pro Eintrag",
    ),
    (
        "Opt-Out-Links",
        "Widerruf widersprechen deaktivieren Cookie-Einstellungen aendern "
        "Opt-Out Einstellungen anpassen",
        "Im Abschnitt 'Wie kann ich widersprechen?'",
    ),
    (
        "Privacy-Policy-Links",
        "Datenschutzerklaerung des Drittanbieters Privacy Policy Link auf "
        "Datenschutzhinweise der Drittanbieter",
        "Im Drittanbieter-Listing der Cookie-Richtlinie",
    ),
    (
        "Verbraucherstreitbeilegung",
        "Online-Streitbeilegung OS-Plattform Verbraucherschlichtungsstelle "
        "Streitbeilegung Verbraucher",
        "Am Ende des Impressums (eigener Abschnitt 'Streitbeilegung')",
    ),
    (
        "Rechtswidriger Haftungsausschluss",
        "Haftung Disclaimer wir distanzieren uns Links zu externen Webseiten "
        "Haftungsausschluss Drittinhalte",
        "Am Ende des Impressums (Disclaimer-Absatz)",
    ),
    (
        "Name der vertretungsberechtigten",
        "Vertreten durch Vorstand Geschaeftsfuehrung Geschaeftsfuehrer "
        "vertretungsberechtigt Repraesentant",
        "Im Impressum nach Firmenname + Anschrift",
    ),
    (
        "Zustaendige Kammer",
        "Berufsrecht Kammer Aufsichtsbehoerde berufsrechtliche Angaben "
        "zustaendige Kammer",
        "Im Impressum im Abschnitt 'Berufsrechtliche Angaben'",
    ),
    (
        "Drittlaender",
        "Drittland Drittlaender USA Indien China internationale Datenuebermittlung "
        "Datenexport in Nicht-EU-Staaten",
        "Im Abschnitt 'Drittlandtransfer'",
    ),
    (
        "Schutzgarantien",
        "Schutzgarantien Schutzvorkehrungen Schutzmassnahmen SCC "
        "Standardvertragsklauseln einsehen Anforderung",
        "Im Abschnitt 'Drittlandtransfer / Schutzgarantien'",
    ),
 ]
 # ─── Thread-local cache for doc chunks + embeddings ────────────────
 # Doc texts are embedded once per compliance-check run and kept in
 # thread-local storage, so that several findings in the same run do not
 # each trigger a fresh embedding call.
 _tls = threading.local()
 def _get_cache() -> dict:
    if not hasattr(_tls, "cache"):
        _tls.cache = {}
    return _tls.cache
 def reset_cache() -> None:
    """Clear the per-request cache (should be called at the start of every
    doc-check run so that data from a previous run does not leak)."""
    if hasattr(_tls, "cache"):
        _tls.cache = {}
 # ─── Helpers ───────────────────────────────────────────────────────
 def _normalize(text: str) -> str:
    return (text or "").lower().replace("\xad", "").replace("ß", "ss")
 def _split_paragraphs(text: str) -> list[str]:
    """Split a doc into paragraphs (blank-line separated; falls back to sentence splitting when that yields fewer than 3)."""
    if not text:
        return []
    paras = re.split(r"\n\s*\n", text)
    if len(paras) < 3:
        paras = re.split(r"(?<=[\.\?\!])\s+(?=[A-ZÄÖÜ])", text)
    return [p.strip() for p in paras if p.strip()]
 def _embed_sync(texts: list[str], timeout: float = 60.0,
                batch_size: int = 32) -> list[list[float]]:
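    """Synchronous batched embed call (anchor location runs inside the
    synchronous HTML render, not in an async context).
    Always returns exactly one vector per input text; a failed or short
    sub-batch yields empty vectors, which `_cosine` scores as 0.0."""
    if not texts:
        return []
    out: list[list[float]] = []
    with httpx.Client(timeout=timeout) as client:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                r = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                r.raise_for_status()
                vecs = r.json().get("embeddings") or []
                # Guard against a short response, which would shift every
                # later vector onto the wrong paragraph.
                if len(vecs) != len(batch):
                    raise ValueError(
                        f"expected {len(batch)} embeddings, got {len(vecs)}")
                out.extend(vecs)
            except Exception as e:
                logger.warning("Anchor embed sub-batch [%d-%d] failed: %s",
                               i, i + len(batch), e)
                out.extend([[] for _ in batch])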
    return out
 def _cosine(a: list[float], b: list[float]) -> float:
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
 def _doc_paragraphs_and_vectors(
    doc_id: str, doc_text: str,
 ) -> tuple[list[str], list[list[float]]]:
    """Lazy cache: paragraphs + vectors per doc ID. Computed exactly once per
    doc and run."""
    cache = _get_cache()
    if doc_id in cache:
        return cache[doc_id]
    paras = _split_paragraphs(doc_text)
    if not paras:
        cache[doc_id] = ([], [])
        return cache[doc_id]
    vecs = _embed_sync(paras)
    cache[doc_id] = (paras, vecs)
    return cache[doc_id]
 def _keyword_fallback(fl: str, doc_text: str) -> dict | None:
    """Fallback when the embedding service is down or the match quality is too low."""
    # Use the old _ANCHOR_QUERIES list — extract just the fallback hint
    for label_partial, _query, fallback_hint in _ANCHOR_QUERIES:
        if _normalize(label_partial) in fl:
            return {
                "anchor_phrase": None,
                "position_hint": fallback_hint,
                "confidence": "low",
                "method": "fallback",
            }
    return None
 def locate_anchor(
    finding_label: str,
    doc_text: str,
    doc_id: str | None = None,
 ) -> dict | None:
    """Find the best-matching anchor paragraph in the doc text for a finding.
    Primary: embedding match, with a small bonus for heading-like paragraphs.
    Fallback: keyword-based position hint if the embedding service is unreachable.
    `doc_id` is a cache key (e.g. doc_type or URL). If None, it is derived
    from a hash of doc_text.
    """
    if not doc_text or not finding_label:
        return None
    fl = _normalize(finding_label)
    # Which anchor query matches this finding?
    query = None
    fallback_hint = None
    matched_label = None
    for label_partial, q, fb in _ANCHOR_QUERIES:
        if _normalize(label_partial) in fl:
            query, fallback_hint, matched_label = q, fb, label_partial
            break
    if not query:
        return None
    doc_id = doc_id or f"doc-{hash(doc_text) & 0xffffffff:08x}"
    # 1) Embedding-Match
    paras, doc_vecs = _doc_paragraphs_and_vectors(doc_id, doc_text)
    if not paras:
        return None
    embeddings_available = any(v for v in doc_vecs)
    if not embeddings_available:
        return _keyword_fallback(fl, doc_text)
    try:
        q_vec = _embed_sync([query])[0] if query else None
    except Exception:
        q_vec = None
    if not q_vec:
        return _keyword_fallback(fl, doc_text)
    # Per-paragraph score = cosine + heading bonus
    best_idx = -1
    best_score = 0.0
    for i, (p, dv) in enumerate(zip(paras, doc_vecs)):
        if not dv:
            continue
        sim = _cosine(q_vec, dv)
        # Heading bonus: short paragraphs + Markdown heading markers
        if len(p.split()) <= 8 or p.strip().startswith("#"):
            sim += 0.05
        if sim > best_score:
            best_score = sim
            best_idx = i
    # Confidence thresholds, calibrated on the BMW run
    if best_idx < 0 or best_score < 0.40:
        # Match too weak: return the fallback hint instead
        return {
            "anchor_phrase": None,
            "position_hint": fallback_hint,
            "confidence": "low",
            "score": round(best_score, 3) if best_idx >= 0 else 0,
            "method": "embedding-no-match",
        }
    if best_score >= 0.62:
        confidence = "high"
    elif best_score >= 0.50:
        confidence = "medium"
    else:
        confidence = "low"
    anchor = paras[best_idx]
    words = anchor.split()
    snippet = " ".join(words[:30]) + ("…" if len(words) > 30 else "")
    return {
        "anchor_phrase": snippet,
        "anchor_index": best_idx,
        "total_paragraphs": len(paras),
        "position_hint": f"Nach Absatz {best_idx + 1} von {len(paras)}: '{snippet}'",
        "confidence": confidence,
        "score": round(best_score, 3),
        "method": "embedding",
    }
 def annotate_findings_with_anchors(
    findings: Iterable[dict], doc_text: str, doc_id: str | None = None,
 ) -> list[dict]:
    """For each finding, locate the anchor and extend the dict with an 'anchor' key."""
    out = []
    for f in findings:
        a = locate_anchor(f.get("label") or f.get("title") or "", doc_text, doc_id)
        out.append({**f, "anchor": a})
    return out
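The scoring inside `locate_anchor` can be illustrated in isolation. What follows is a minimal sketch, not the module's code: `pick_anchor` and `confidence_band` are hypothetical names, the toy two-dimensional vectors stand in for real BGE-M3 embeddings, and the thresholds (0.40 / 0.50 / 0.62) and the 0.05 heading bonus are taken from the diff above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for empty, mismatched, or zero vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pick_anchor(query_vec, paragraphs, vectors, heading_bonus=0.05):
    """Return (index, score) of the best paragraph, or (-1, 0.0) if none."""
    best_idx, best_score = -1, 0.0
    for i, (para, vec) in enumerate(zip(paragraphs, vectors)):
        if not vec:  # embedding failed for this paragraph
            continue
        score = cosine(query_vec, vec)
        # Short paragraphs and Markdown headings get a small bonus
        if len(para.split()) <= 8 or para.strip().startswith("#"):
            score += heading_bonus
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score


def confidence_band(score: float) -> str:
    """Map a score to a band (thresholds assumed from the diff)."""
    if score < 0.40:
        return "no-match"
    if score >= 0.62:
        return "high"
    if score >= 0.50:
        return "medium"
    return "low"


paragraphs = [
    "# Cookies",
    "We keep analytics data for two years and delete it automatically afterwards.",
    "Contact us.",
]
vectors = [[1.0, 0.0], [0.6, 0.8], []]  # toy vectors; the last one "failed"
idx, score = pick_anchor([0.6, 0.8], paragraphs, vectors)
print(idx, confidence_band(score))
```

The heading bonus is small enough that a clearly better body paragraph still wins over a heading, which is what the toy data exercises.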
@@ -0,0 +1,353 @@
 """
 Action recipes: one actionable instruction per finding type, covering
 WHAT to do, WHY (legal basis), HOW to phrase it (concrete text block),
 WHERE to insert it (hint at the document section).
 Without a recipe, a finding is only a diagnosis. With a recipe, the
 customer knows immediately which sentence belongs in which place.
 Usage:
  from compliance.services.finding_action_recipes import recipe_for
  rec = recipe_for("no_cookies_listed")   # → dict with what/why/fix_text/where/example
 """
 from __future__ import annotations
 from typing import TypedDict
 class ActionRecipe(TypedDict, total=False):
    what: str          # one-sentence diagnosis
    why: str           # legal basis / risk
    fix_text: str      # concrete text block to insert
    where: str         # which document section
    example: str       # real-world usage example
    severity: str      # 'critical' | 'high' | 'medium' | 'low'
 # ─── Vendor/cookie findings (in the VVT block) ─────────────────────
 VENDOR_FINDINGS: dict[str, ActionRecipe] = {
    "no_cookies_listed": {
        "what": "Anbieter ist erfasst, aber es sind keine konkreten Cookies "
                "dokumentiert.",
        "why": "Die DSK-Orientierungshilfe Telemedien 2024 fordert pro Anbieter "
               "eine Auflistung aller gesetzten Cookies mit Name + Zweck + "
               "Speicherdauer. 'Verwendet Cookies' ohne Details erfuellt "
               "Art. 13 Abs. 1 lit. e DSGVO nicht.",
        "fix_text": "Ergaenzen Sie pro Cookie eine Zeile mit folgenden Feldern:\n"
                    "  • Cookie-Name (z.B. _ga, _fbp, NID)\n"
                    "  • Setzender Anbieter (Firma + Sitzland)\n"
                    "  • Zweck (1 Satz, z.B. 'Reichweitenmessung pseudonym')\n"
                    "  • Speicherdauer (z.B. '2 Jahre', '24h', 'Session')",
        "where": "Cookie-Richtlinie unter dem Abschnitt 'Kategorien' "
                 "(Notwendig / Marketing / Statistik / ...).",
        "example": "_ga (Google Ireland Ltd.) — Statistik — eindeutige "
                   "Besucher-ID — Speicherdauer 2 Jahre",
        "severity": "high",
    },
    "no_country": {
        "what": "Anbieter-Sitzland ist nicht dokumentiert.",
        "why": "Art. 30 Abs. 1 lit. d DSGVO fordert die Kategorien der Empfaenger "
               "inkl. Sitz. Bei Drittland-Empfaengern (US, IN, CN ...) ist das "
               "zusaetzlich Pflicht nach Art. 13 Abs. 1 lit. f DSGVO.",
        "fix_text": "Fuegen Sie hinter dem Anbieternamen das Sitzland in "
                    "Klammern ein. Bei Drittland (ausserhalb EU/EWR) zusaetzlich "
                    "den Transfer-Mechanismus (SCC, DPF, Angemessenheitsbeschluss).",
        "where": "VVT-Eintrag bei 'Empfaenger' oder im Cookie-Anbieter-Listing.",
        "example": "Google Ireland Ltd., Dublin, Irland (EU). Bei US-Transfer: "
                   "'Google LLC, Mountain View, US — DPF-zertifiziert'.",
        "severity": "high",
    },
    "no_privacy_url": {
        "what": "Kein direkter Link zur Datenschutzerklaerung des Anbieters.",
        "why": "Art. 12 Abs. 1 DSGVO Transparenzgebot: der Nutzer muss "
               "die Datenverarbeitung beim Drittanbieter eigenverantwortlich "
               "nachvollziehen koennen.",
        "fix_text": "Ergaenzen Sie einen klickbaren Link zur Datenschutzerklaerung "
                    "des Anbieters direkt neben dem Anbieternamen.",
        "where": "Cookie-Richtlinie pro Vendor-Eintrag. Idealerweise als "
                 "letzter Spalteneintrag oder Inline-Link.",
        "example": "Google Analytics — Datenschutz: https://policies.google.com/privacy",
        "severity": "medium",
    },
    "broken_privacy_url": {
        "what": "Der angegebene Privacy-Link gibt einen Fehler zurueck "
                "(404 / 403 / Timeout).",
        "why": "Ein toter Link erfuellt das Transparenzgebot aus Art. 12(1) "
               "DSGVO nicht: die Transparenz-Pflicht laeuft ins Leere.",
        "fix_text": "1. Pruefen Sie den Link aus einem normalen Browser. "
                    "Funktioniert er dort? → Anbieter blockt automatisierte Pruefer (kein Mangel).\n"
                    "2. Funktioniert er nicht? → Aktualisieren Sie den Link in Ihrer "
                    "Cookie-Richtlinie auf die heute gueltige Privacy-URL des Anbieters.",
        "where": "Cookie-Richtlinie / Drittanbieter-Liste.",
        "example": "Wenn Adobe-Privacy-Link 404 gibt, ersetzen durch "
                   "https://www.adobe.com/privacy/policy.html",
        "severity": "high",
    },
    "no_opt_out_url": {
        "what": "Kein Opt-Out-Link fuer diesen Anbieter dokumentiert.",
        "why": "Art. 7 Abs. 3 DSGVO: der Widerruf der Einwilligung muss so "
               "einfach wie die Erteilung sein. Pro Drittanbieter muss eine "
               "Opt-Out-Moeglichkeit angeboten werden.",
        "fix_text": "Recherchieren Sie den offiziellen Opt-Out-Link des "
                    "Anbieters und ergaenzen Sie ihn. Falls Ihr Cookie-Banner "
                    "ein 'Einstellungen aendern' anbietet, ist das oft "
                    "ausreichend — der Link sollte trotzdem als Backup "
                    "dokumentiert sein.",
        "where": "Cookie-Richtlinie pro Vendor-Eintrag.",
        "example": "Google Analytics Opt-Out: https://tools.google.com/dlpage/gaoptout",
        "severity": "high",
    },
    "broken_opt_out": {
        "what": "Der angegebene Opt-Out-Link funktioniert nicht "
                "(404 / 403 / Timeout).",
        "why": "Art. 7(3) DSGVO Widerrufsmoeglichkeit ohne funktionierenden "
               "Link ist nicht gegeben.",
        "fix_text": "1. Pruefen Sie ob der Link aus einem Browser funktioniert. "
                    "403 kommt oft von Bot-Schutz und ist kein realer Mangel.\n"
                    "2. Bei echtem 404: aktualisieren Sie auf den heute gueltigen "
                    "Opt-Out-Link.\n"
                    "3. Alternativ verlinken Sie auf Ihren eigenen Cookie-Banner "
                    "'Einstellungen aendern'-Trigger.",
        "where": "Cookie-Richtlinie pro Vendor-Eintrag.",
        "example": "Wenn Criteo-Opt-Out 403 gibt, ist das oft Cloudflare-Schutz — "
                   "Link aus dem Browser klickbar → kein Mangel. Alternativ: "
                   "https://www.youronlinechoices.com/de/",
        "severity": "medium",
    },
 }
 # ─── Doc-check findings (in the DSE/Impressum/cookie check block) ──
 DOC_CHECK_FINDINGS: dict[str, ActionRecipe] = {
    "Auftragsverarbeiter erwaehnt": {
        "what": "Auftragsverarbeiter (Hosting, CRM, Newsletter) sind nicht "
                "explizit erwaehnt + es fehlt Hinweis auf AVVs nach Art. 28 DSGVO.",
        "why": "Art. 28 DSGVO Auftragsverarbeitung erfordert dokumentierte AVVs. "
               "Art. 13(1)(e) DSGVO erfordert die Nennung der Empfaenger-"
               "Kategorien. Fehlt der Hinweis = klassischer Pruefpunkt der "
               "Aufsichtsbehoerden.",
        "fix_text": "Wir setzen sorgfaeltig ausgewaehlte Auftragsverarbeiter "
                    "(z.B. fuer Hosting, Webanalyse, Customer Service) ein. Mit "
                    "allen Auftragsverarbeitern haben wir Vertraege zur "
                    "Auftragsverarbeitung nach Art. 28 DSGVO geschlossen. Die "
                    "Auftragsverarbeiter handeln ausschliesslich auf unsere "
                    "Weisung und sind vertraglich zu angemessenen technischen "
                    "und organisatorischen Massnahmen verpflichtet.",
        "where": "Datenschutzerklaerung im Abschnitt 'Empfaenger' oder "
                 "'Datenuebermittlung'. Direkt nach der Aufzaehlung der "
                 "Empfaenger-Kategorien.",
        "example": "Empfaenger: BMW Niederlassungen, Vertragshaendler, "
                   "Auftragsverarbeiter (Cloud-Hosting AWS, CRM Salesforce, "
                   "Webanalyse Adobe Analytics — mit allen sind AVVs nach "
                   "Art. 28 DSGVO geschlossen).",
        "severity": "high",
    },
    "Automatisierte Entscheidungen / Profiling": {
        "what": "Keine Aussage zu automatisierten Einzelentscheidungen "
                "oder Profiling nach Art. 22 DSGVO.",
        "why": "Art. 13 Abs. 2 lit. f DSGVO: bei automatisierten "
               "Einzelentscheidungen muessen Logik, Tragweite und Auswirkungen "
               "erklaert werden. Bei KEINEM Profiling muss das explizit "
               "verneint werden — sonst lassen Aufsichtsbehoerden Unsicherheit "
               "offen.",
        "fix_text": "Variante A (kein Profiling):\n"
                    "  'Es findet keine automatisierte Entscheidungsfindung "
                    "im Sinne des Art. 22 DSGVO statt. Sofern wir Profiling "
                    "zur Anzeige personalisierter Inhalte einsetzen, erfolgt "
                    "dies ausschliesslich auf Basis Ihrer Einwilligung und "
                    "wird im Abschnitt [X] erlaeutert.'\n\n"
                    "Variante B (Profiling vorhanden, z.B. fuer Werbung):\n"
                    "  'Wir nutzen Profiling zur Anzeige personalisierter "
                    "Werbung. Die Logik basiert auf [Klick-Historie / "
                    "Besuchsverhalten / Praeferenzen]. Tragweite: "
                    "Anpassung der angezeigten Anzeigen. Auswirkung: keine "
                    "rechtlichen oder erheblichen Auswirkungen — Sie koennen "
                    "jederzeit widersprechen unter [Link/Kontakt].'",
        "where": "Datenschutzerklaerung am Ende des Abschnitts "
                 "'Betroffenenrechte' oder als eigener Absatz unter "
                 "'Automatisierte Entscheidungen'.",
        "example": "Standardformulierung Variante A — falls Sie KEIN Profiling "
                   "betreiben, ist das der sichere Default-Text.",
        "severity": "high",
    },
    "Konkrete Aufsichtsbehoerde benannt": {
        "what": "Aufsichtsbehoerde wird nicht namentlich genannt.",
        "why": "Art. 13(2)(d) DSGVO: das Beschwerderecht muss konkret "
               "kommuniziert werden — nicht 'die zustaendige Behoerde', sondern "
               "Name + Anschrift + Website.",
        "fix_text": "Sie haben das Recht, sich bei einer Datenschutz-"
                    "Aufsichtsbehoerde zu beschweren. Zustaendig ist:\n\n"
                    "  [Aufsichtsbehoerde mit Namen + Strasse + PLZ + Ort + Web]\n\n"
                    "Beispiel: 'Bayerisches Landesamt fuer Datenschutzaufsicht "
                    "(BayLDA), Promenade 18, 91522 Ansbach, www.lda.bayern.de'",
        "where": "Datenschutzerklaerung im Abschnitt 'Betroffenenrechte' oder "
                 "'Beschwerderecht'.",
        "example": "Fuer BMW AG (Sitz Muenchen, Bayern): BayLDA, Promenade 18, "
                   "91522 Ansbach, www.lda.bayern.de",
        "severity": "high",
    },
    "Angemessenheitsbeschluss der Kommission": {
        "what": "Drittlandtransfer erwaehnt, aber kein Hinweis auf den "
                "konkreten Angemessenheitsbeschluss / DPF / SCC.",
        "why": "Art. 13(1)(f) DSGVO: bei Drittlandtransfer muss der "
               "Transfer-Mechanismus benannt sein (Angemessenheitsbeschluss, "
               "Standardvertragsklauseln, BCR, Ausnahmen nach Art. 49).",
        "fix_text": "Fuer Datenuebermittlungen in die USA stuetzen wir uns auf "
                    "den Angemessenheitsbeschluss der EU-Kommission vom "
                    "10.07.2023 (EU-US Data Privacy Framework / DPF). Sofern "
                    "der Empfaenger DPF-zertifiziert ist, ist die Uebermittlung "
                    "rechtlich abgesichert. Bei nicht-zertifizierten Empfaengern "
                    "ergaenzen wir EU-Standardvertragsklauseln (SCC) gemaess "
                    "Durchfuehrungsbeschluss 2021/914.",
        "where": "Datenschutzerklaerung im Abschnitt 'Drittlandtransfer' oder "
                 "'Internationale Datenuebermittlung'.",
        "example": "Konkret: Google LLC (USA) ist DPF-zertifiziert "
                   "(Zertifikat einsehbar unter dataprivacyframework.gov).",
        "severity": "high",
    },
    "Anschrift des Verantwortlichen": {
        "what": "In der Cookie-Richtlinie fehlt die vollstaendige Anschrift.",
        "why": "Art. 13(1)(a) DSGVO: der Verantwortliche muss eindeutig "
               "identifizierbar sein. Cookie-Richtlinie + DSE muessen "
               "konsistente Angaben enthalten.",
        "fix_text": "Verantwortlich fuer die Datenverarbeitung im Sinne der "
                    "DSGVO ist:\n  [Firmenname]\n  [Strasse + Hausnummer]\n  "
                    "[PLZ + Ort]\n  [Land]\n  E-Mail: [...]",
        "where": "Cookie-Richtlinie am Anfang ODER Verweis-Link zur DSE.",
        "example": "Bayerische Motoren Werke Aktiengesellschaft, Petuelring 130, "
                   "80809 Muenchen, Deutschland",
        "severity": "high",
    },
    "Konkrete Cookie-Namen aufgelistet": {
        "what": "Cookies werden allgemein erwaehnt aber ohne konkrete Namen + "
                "Speicherdauer.",
        "why": "DSK-Orientierungshilfe Telemedien: Auflistung jedes einzelnen "
               "Cookies mit Name. Generische Aussagen ('wir nutzen "
               "Werbe-Cookies') sind unzureichend.",
        "fix_text": "Erweitern Sie die Cookie-Tabelle um die Spalten:\n"
                    "  Name | Anbieter | Zweck | Speicherdauer\n\n"
                    "Browser-Devtools (Application > Cookies) zeigt die "
                    "tatsaechlich gesetzten Namen — bitte Cookie-Liste "
                    "regelmaessig synchronisieren.",
        "where": "Cookie-Richtlinie im Abschnitt 'Verwendete Cookies'.",
        "example": "_ga | Google Ireland | Statistik (Besucher-ID) | 2 Jahre\n"
                   "_fbp | Meta Platforms IE | Werbung (Browser-ID) | 90 Tage",
        "severity": "high",
    },
    "Konkrete Speicherdauern pro Cookie": {
        "what": "Speicherdauer nur pauschal oder als generischer Bereich.",
        "why": "Art. 13(2)(a) DSGVO: konkrete Speicherdauer oder Kriterien "
               "fuer die Festlegung. DSK fordert Speicherdauer PRO Cookie.",
        "fix_text": "Pro Cookie eine konkrete Zeitangabe in der Cookie-Tabelle "
                    "ergaenzen: 'Session', '24 Stunden', '90 Tage', '2 Jahre'.",
        "where": "Cookie-Richtlinie in der Cookie-Tabelle.",
        "example": "JSESSIONID: Session — _ga: 2 Jahre — _fbp: 90 Tage",
        "severity": "high",
    },
    "Opt-Out-Links pro Drittanbieter": {
        "what": "Pro Drittanbieter ist kein direkter Opt-Out-Link angegeben.",
        "why": "Art. 7(3) DSGVO: der Widerruf muss so einfach sein wie die "
               "Erteilung der Einwilligung. EuGH Planet49 (C-673/17) verlangt "
               "zudem eine aktive, informierte Einwilligung fuer Cookies.",
        "fix_text": "Pro Drittanbieter eine Spalte 'Opt-Out' ergaenzen mit "
                    "direktem Link. Alternativ: zentralen 'Cookie-"
                    "Einstellungen aendern'-Button im Footer der Webseite + "
                    "Hinweis darauf in der Cookie-Richtlinie.",
        "where": "Cookie-Richtlinie pro Vendor-Eintrag oder als zentraler "
                 "Abschnitt 'Wie kann ich widersprechen?'.",
        "example": "Google Analytics: https://tools.google.com/dlpage/gaoptout\n"
                   "Meta Pixel: ueber Facebook-Konto-Einstellungen",
        "severity": "high",
    },
    "Privacy-Policy-Links pro Drittanbieter": {
        "what": "Pro Drittanbieter ist kein direkter Datenschutz-Link gesetzt.",
        "why": "Art. 12(1) DSGVO Transparenzgebot: der Nutzer muss die "
               "Datenverarbeitung beim Drittanbieter eigenverantwortlich "
               "nachvollziehen koennen.",
        "fix_text": "Pro Drittanbieter Link auf dessen Datenschutzerklaerung "
                    "ergaenzen. Tabelle 'Anbieter | Datenschutz-Link'.",
        "where": "Cookie-Richtlinie im Drittanbieter-Listing.",
        "example": "Adobe Analytics: https://www.adobe.com/privacy/policy.html",
        "severity": "medium",
    },
    "Rechtswidriger Haftungsausschluss fuer Links": {
        "what": "Klassischer Link-Disclaimer ('Wir distanzieren uns von verlinkten "
                "Inhalten') ist im Impressum.",
        "why": "BGH I ZR 317/01: solche Disclaimer sind rechtlich wirkungslos. "
               "Sie befreien NICHT von der Stoererhaftung und koennen sogar "
               "den gegenteiligen Effekt haben (Anerkennung der eigenen "
               "Pruefpflicht).",
        "fix_text": "Entfernen Sie pauschale Link-Disclaimer vollstaendig aus "
                    "dem Impressum. Bei Bedarf ein knapper, zutreffender Satz:\n"
                    "  'Fuer den Inhalt verlinkter externer Webseiten ist "
                    "ausschliesslich deren Betreiber verantwortlich.'",
        "where": "Impressum am Ende des Dokuments.",
        "example": "Statt 'Wir distanzieren uns ausdruecklich von allen "
                   "Inhalten verlinkter Seiten' — einfach nichts schreiben.",
        "severity": "low",
    },
    "Verbraucherstreitbeilegung / OS-Plattform": {
        "what": "Kein Link zur EU-OS-Plattform bzw. keine Aussage zur "
                "Streitbeilegung.",
        "why": "Art. 14 EU-VO 524/2013 (B2C-Anbieter, Online-Verkauf): "
               "klickbarer Link auf https://ec.europa.eu/consumers/odr "
               "PFLICHT. §36 VSBG: Aussage ob teilnehmend oder nicht.",
        "fix_text": "Die EU-Kommission stellt eine Plattform zur Online-"
                    "Streitbeilegung (OS) bereit, die Sie unter "
                    "<a href='https://ec.europa.eu/consumers/odr'>"
                    "https://ec.europa.eu/consumers/odr</a> erreichen.\n\n"
                    "Wir sind nicht bereit oder verpflichtet, an "
                    "Streitbeilegungsverfahren vor einer "
                    "Verbraucherschlichtungsstelle teilzunehmen.",
        "where": "Impressum am Ende, eigener Abschnitt 'Streitbeilegung'.",
        "example": "Standardformulierung anwendbar fuer alle B2C-Anbieter ohne "
                   "ODR-Teilnahme.",
        "severity": "high",
    },
    "Name der vertretungsberechtigten Person": {
        "what": "Vertretungsberechtigte Person ist nicht namentlich mit "
                "Funktionsbezeichnung genannt.",
        "why": "§5 Abs. 1 Nr. 1 DDG (vormals §5 TMG): bei juristischen Personen "
               "sind die Vertretungsberechtigten namentlich zu nennen.",
        "fix_text": "Ergaenzen Sie nach der Firmenbezeichnung:\n"
                    "  'Vertreten durch: [Vorstand / Geschaeftsfuehrung] "
                    "[Vorname Nachname]'",
        "where": "Impressum direkt nach Firmenname + Anschrift.",
        "example": "Vertreten durch: Vorstandsvorsitzender Oliver Zipse",
        "severity": "high",
    },
 }
 def recipe_for(finding_key: str) -> ActionRecipe | None:
    """Look up a recipe by finding key. Vendor and doc findings share one lookup."""
    if finding_key in VENDOR_FINDINGS:
        return VENDOR_FINDINGS[finding_key]
    if finding_key in DOC_CHECK_FINDINGS:
        return DOC_CHECK_FINDINGS[finding_key]
    # Fuzzy match on doc findings (the label can vary). An empty key must not
    # match: "" is a substring of every label and would return the first recipe.
    fk = (finding_key or "").strip().lower()
    if not fk:
        return None
    for k, v in DOC_CHECK_FINDINGS.items():
        if k.lower() in fk or fk in k.lower():
            return v
    return None
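A caller will typically turn a recipe into text for a report. The sketch below shows one way to do that. `render_recipe` is a hypothetical helper that does not appear in the diff, and the toy dict stands in for an entry returned by `recipe_for`. Because `ActionRecipe` is declared with `total=False`, the sketch treats every field as optional.

```python
def render_recipe(recipe: dict) -> str:
    """Render the filled-in recipe fields as labelled lines, skipping missing ones."""
    labels = [
        ("what", "What"),
        ("why", "Why"),
        ("fix_text", "Fix"),
        ("where", "Where"),
        ("example", "Example"),
    ]
    severity = recipe.get("severity", "unknown")
    lines = [f"[{severity.upper()}]"]
    for key, label in labels:
        value = recipe.get(key)
        if value:  # optional fields may be absent
            lines.append(f"{label}: {value}")
    return "\n".join(lines)


# Toy recipe with only some fields set (illustrative data, not from the module)
toy = {"what": "Provider country is missing.", "where": "VVT entry", "severity": "high"}
text = render_recipe(toy)
print(text)
```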
@@ -0,0 +1,309 @@
 """
 MC Embedding Match — semantic fallback for the regex-based doc_check.
 The Sonnet classifier filtered MCs to `check_type='text'` (matchable
 against doc text). But the regex matcher is still too strict: a policy
 can state a requirement in wording that the MC's regex pattern does not
 anticipate. We catch these cases via BGE-M3 embeddings + cosine
 similarity:
  1. Embed the MC's title + check_question (once, cached in the sidecar)
  2. Embed the doc text in overlapping 50-word chunks
  3. max over chunks of cosine(MC, chunk) ≥ threshold → MC passes via "semantic"
 This recovers an estimated ~50% of failed MCs at BMW scale.
 Embeddings come from bp-core-embedding-service (BGE-M3, 1024-dim,
 multilingual). Sidecar SQLite stores 1024 × 4 bytes = 4KB per MC.
 """
 from __future__ import annotations
 import logging
 import math
 import os
 import re
 import sqlite3
 import struct
 from typing import Iterable
 import httpx
 logger = logging.getLogger(__name__)
 EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
 SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
 DIM = 1024  # BGE-M3
 SIMILARITY_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD", "0.55"))
 CHUNK_SIZE_WORDS = 50
 CHUNK_STRIDE = 30  # overlap so multi-sentence MCs aren't cut
 # Short mandatory fields (Impressum: HRB no., VAT ID, address) get lost in
 # 50-word chunks. We ADDITIONALLY scan the doc with finer 15-word windows
 # and a looser threshold for Impressum/AVV types.
 SHORT_FIELD_CHUNK_WORDS = 15
 SHORT_FIELD_STRIDE = 8
 SHORT_FIELD_THRESHOLD = float(os.getenv("MC_EMBEDDING_THRESHOLD_SHORT", "0.50"))
 SHORT_FIELD_DOC_TYPES = {"impressum", "avv"}
 # Doc-type-specific threshold overrides, calibrated on the BMW v7 run:
 # at 0.55, DSE + cookie docs systematically came out at 93% (over-firing,
 # because 8000-word texts vaguely match everything). 0.60 brings them down
 # to the realistic ~80%. Impressum has only 6 real MCs + short-field
 # rescue → 0.50 is fine.
 THRESHOLD_OVERRIDE = {
    "impressum": 0.50,
    "avv":       0.55,
    "dse":       0.60,
    "cookie":    0.60,
    "widerruf":  0.58,
    "loeschkonzept": 0.55,
    "dsfa":      0.55,
 }
 def _ensure_schema() -> None:
    """Add embedding column to mc_classification if not present."""
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
            if "embedding" not in cols:
                c.execute("ALTER TABLE mc_classification ADD COLUMN embedding BLOB")
                logger.info("Added embedding column to mc_classification")
    except Exception as e:
        logger.warning("Embedding schema migration skipped: %s", e)
 def _vec_to_blob(v: list[float]) -> bytes:
    return struct.pack(f"{len(v)}f", *v)
 def _blob_to_vec(b: bytes) -> list[float]:
    return list(struct.unpack(f"{len(b)//4}f", b))
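The two helpers above store vectors as packed 32-bit floats. Here is a small round-trip sketch with local copies of the helpers under hypothetical names. It shows two consequences of that choice: a 1024-dimensional vector occupies 4096 bytes, and values that are not exactly representable in float32 come back slightly changed.

```python
import struct


def vec_to_blob(v: list[float]) -> bytes:
    """Pack floats as native-order float32, 4 bytes per component."""
    return struct.pack(f"{len(v)}f", *v)


def blob_to_vec(b: bytes) -> list[float]:
    """Unpack a float32 blob back into a list of Python floats."""
    return list(struct.unpack(f"{len(b) // 4}f", b))


vec = [0.5, -0.25, 0.1]
blob = vec_to_blob(vec)
restored = blob_to_vec(blob)

print(len(blob))                         # 3 components * 4 bytes
print(len(vec_to_blob([0.0] * 1024)))    # size of one 1024-dim embedding
print(restored)
```

Comparisons against freshly computed float64 vectors therefore carry a float32 rounding error, which is far below the similarity thresholds used here.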
 EMBED_BATCH_SIZE = 32
 async def _embed_texts(texts: list[str], timeout: float = 120.0) -> list[list[float]]:
    """Call the central embedding-service in batches; returns one vector per input.
    BGE-M3 hangs / times out on >100 inputs at once on a CPU-only host.
    We split the input into batches of EMBED_BATCH_SIZE (32) and collect the results.
    """
    if not texts:
        return []
    out: list[list[float]] = []
    async with httpx.AsyncClient(timeout=timeout) as client:
        for i in range(0, len(texts), EMBED_BATCH_SIZE):
            batch = texts[i:i + EMBED_BATCH_SIZE]
            try:
                r = await client.post(
                    f"{EMBEDDING_URL}/embed", json={"texts": batch},
                )
                r.raise_for_status()
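                vecs = list(r.json().get("embeddings") or [])[:len(batch)]
                # Pad short responses so callers can still align by index
                vecs += [[] for _ in range(len(batch) - len(vecs))]
                out.extend(vecs)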
            except httpx.HTTPError as e:
                logger.warning("Embed sub-batch [%d-%d] failed: %s %s",
                               i, i + len(batch), type(e).__name__, e)
                # Pad with empty vectors so caller can still align by index
                out.extend([[] for _ in batch])
    return out
 async def ensure_mc_embeddings(batch_size: int = 64, force: bool = False) -> int:
    """One-shot: embed every text-MC missing an embedding. Returns count.
    Embeds the title + (rough) check_question for each MC to give the
    BGE-M3 enough context. Title alone is too terse for the model to
    discriminate against full-paragraph doc text.
    Idempotent — only fills NULL rows unless force=True. Safe to call on
    every run.
    """
    _ensure_schema()
    # Pull check_question from the PG source table once per call (needs
    # context that's not in the sidecar)
    try:
        import psycopg2
        pg = psycopg2.connect(os.environ["DATABASE_URL"])
        with pg.cursor() as c:
            c.execute("SELECT control_id, doc_type, title, check_question "
                      "FROM compliance.doc_check_controls")
            pg_rows = c.fetchall()
        pg.close()
        pg_lookup = {(r[0], r[1] or ""): (r[2] or "", r[3] or "") for r in pg_rows}
    except Exception as e:
        logger.warning("ensure_mc_embeddings PG load failed: %s", e)
        pg_lookup = {}
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            where = ("WHERE check_type = 'text'" + ("" if force else " AND embedding IS NULL"))
            rows = c.execute(
                f"SELECT control_id, doc_type, title FROM mc_classification {where}"
            ).fetchall()
    except Exception as e:
        logger.warning("ensure_mc_embeddings query failed: %s", e)
        return 0
    if not rows:
        return 0
    logger.info("Embedding %d text-MCs (force=%s) via %s ...",
                len(rows), force, EMBEDDING_URL)
    done = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Compose "title — check_question" so the embedding captures both
        # the topic (title) and the concrete check phrasing (question).
        # That helps BMW's actual policy language land in the same vector
        # neighbourhood as our control wording.
        texts: list[str] = []
        for cid, dt, t in batch:
            title_text, question = pg_lookup.get((cid, dt or ""), (t or "", ""))
            combined = f"{title_text}. {question}".strip()
            texts.append(combined[:600])
        try:
            embs = await _embed_texts(texts)
        except Exception as e:
            logger.warning("Embed batch failed (i=%d): %s", i, e)
            continue
        with sqlite3.connect(SIDECAR_DB) as c:
            for (cid, dt, _t), vec in zip(batch, embs):
                if not vec or len(vec) != DIM:
                    continue
                # `IS` instead of `=` so rows with a NULL doc_type are updated too
                cur = c.execute(
                    "UPDATE mc_classification SET embedding = ? "
                    "WHERE control_id = ? AND doc_type IS ?",
                    (_vec_to_blob(vec), cid, dt),
                )
                # Count only rows that were actually written
                done += cur.rowcount
            c.commit()
    logger.info("ensure_mc_embeddings: filled %d/%d", done, len(rows))
    return done
 def _chunk_text(text: str, size: int = CHUNK_SIZE_WORDS,
                stride: int = CHUNK_STRIDE) -> list[str]:
    """Sliding-window chunking — overlap helps catch MCs that span 2 sentences."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    out: list[str] = []
    i = 0
    while i < len(words):
        out.append(" ".join(words[i:i + size]))
        i += stride
    return out
 def _cosine(a: list[float], b: list[float]) -> float:
    """Plain Python cosine — fast enough for our scale, no numpy import."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
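Steps 2 and 3 of the module docstring (chunk the document, then compare the best chunk against the threshold) can be sketched end to end. This is an illustration under stated assumptions: `toy_embed` is a hypothetical word-count stand-in for the BGE-M3 service, `matches` is a hypothetical name, and the window sizes are shrunk so the toy document produces several chunks.

```python
import math
import re


def chunk_text(text: str, size: int = 50, stride: int = 30) -> list[str]:
    """Sliding-window chunks of `size` words, advancing by `stride` words."""
    words = re.findall(r"\S+", text or "")
    if len(words) <= size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(0, len(words), stride)]


def cosine(a: list[float], b: list[float]) -> float:
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text: str, vocab: list[str]) -> list[float]:
    """Hypothetical stand-in for the embedding service: word counts over `vocab`."""
    tokens = re.findall(r"\w+", text.lower())
    return [float(tokens.count(w)) for w in vocab]


def matches(mc_vec, chunk_vecs, threshold: float) -> bool:
    """An MC passes if its best chunk similarity reaches the threshold."""
    best = max((cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
    return best >= threshold


vocab = ["cookies", "stored", "years", "contact"]
doc = "Contact our team any time. Cookies are stored for two years on your device."
chunks = chunk_text(doc, size=6, stride=4)
chunk_vecs = [toy_embed(c, vocab) for c in chunks]
mc_vec = toy_embed("Cookies stored years", vocab)

print(len(chunks), matches(mc_vec, chunk_vecs, 0.6), matches(mc_vec, chunk_vecs, 0.9))
```

Taking the maximum over chunks means a single well-matching passage is enough. That is also why long documents tend to over-fire at a low threshold, as the calibration comment on `THRESHOLD_OVERRIDE` notes.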
 async def embedding_match(
    doc_text: str,
    mc_records: Iterable[dict],
    doc_type: str | None = None,
    threshold: float | None = None,
 ) -> set[str]:
    """Return the subset of MC control_ids that semantically match doc_text.
    For Impressum/AVV-types we ADDITIONALLY scan the doc with smaller
    15-word windows and a looser threshold so that short mandatory fields
    (HRB, USt-IdNr, postal address) land in their own chunk and aren't
    diluted by 50-word neighbourhoods of unrelated text.
    """
    if not doc_text or not mc_records:
        return set()
    candidates = list(mc_records)
    if not candidates:
        return set()
    cid_set = {c.get("control_id") for c in candidates if c.get("control_id")}
    if not cid_set:
        return set()
    try:
        with sqlite3.connect(SIDECAR_DB) as c:
            placeholders = ",".join("?" * len(cid_set))
            q = ("SELECT control_id, embedding FROM mc_classification "
                 f"WHERE control_id IN ({placeholders}) "
                 "AND check_type='text' AND embedding IS NOT NULL")
            params = list(cid_set)
            if doc_type:
                q += " AND doc_type = ?"
                params.append(doc_type)
            rows = c.execute(q, params).fetchall()
    except Exception as e:
        logger.warning("embedding lookup failed: %s", e)
        return set()
    if not rows:
        return set()
    mc_embeddings = {cid: _blob_to_vec(blob) for cid, blob in rows}
    # Use an explicit None check so that a caller-supplied 0.0 is honoured.
    effective_threshold = threshold if threshold is not None else THRESHOLD_OVERRIDE.get(
        (doc_type or "").lower(), SIMILARITY_THRESHOLD)
    chunks = _chunk_text(doc_text)
    if not chunks:
        return set()
    try:
        chunk_vecs = await _embed_texts(chunks)
    except Exception as e:
        logger.warning("doc chunk embedding failed: %s %s",
                       type(e).__name__, e or "(empty msg)", exc_info=True)
        return set()
    # Filter empty vectors (failed sub-batches return [] placeholders)
    chunk_vecs = [v for v in chunk_vecs if v and len(v) == DIM]
    if not chunk_vecs:
        logger.warning("doc chunk embedding: no usable vectors (all batches failed)")
        return set()
    matched: set[str] = set()
    for cid, mc_vec in mc_embeddings.items():
        best = max((_cosine(mc_vec, cv) for cv in chunk_vecs), default=0.0)
        if best >= effective_threshold:
            matched.add(cid)
    # Short-field rescue pass for Impressum-type docs: small windows +
    # looser threshold catch one-line Pflichtfelder that 50-word chunks
    # dilute (HRB-Nr, USt-IdNr, postal address). Only runs for MCs not
    # yet matched in the main pass.
    if (doc_type or "").lower() in SHORT_FIELD_DOC_TYPES:
        unmatched = {cid: vec for cid, vec in mc_embeddings.items() if cid not in matched}
        if unmatched:
            short_chunks = _chunk_text(doc_text, size=SHORT_FIELD_CHUNK_WORDS,
                                       stride=SHORT_FIELD_STRIDE)
            try:
                short_vecs = await _embed_texts(short_chunks)
            except Exception as e:
                logger.warning("short-chunk embedding failed: %s", e)
                short_vecs = []
            if short_vecs:
                short_passes = 0
                for cid, mc_vec in unmatched.items():
                    best = max((_cosine(mc_vec, cv) for cv in short_vecs), default=0.0)
                    if best >= SHORT_FIELD_THRESHOLD:
                        matched.add(cid)
                        short_passes += 1
                if short_passes:
                    logger.info(
                        "embedding short-field rescue for %s: +%d MCs (threshold %.2f, %d chunks)",
                        doc_type, short_passes, SHORT_FIELD_THRESHOLD, len(short_chunks),
                    )
    logger.info(
        "embedding match for %s: %d/%d MCs passed semantic threshold (main=%.2f)",
        doc_type or "?", len(matched), len(mc_embeddings), effective_threshold,
    )
    return matched
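The two-pass logic of `embedding_match` can be sketched with toy 2-dimensional vectors instead of real BGE-M3 embeddings. All vectors, thresholds (0.8 main, 0.6 rescue) and the helper names `cosine` and `match_ids` are assumptions made for illustration; they are not the values used by the module.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain-Python cosine similarity; returns 0.0 for unusable input."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_ids(mc_vecs: dict, chunk_vecs: list, threshold: float) -> set:
    """An MC matches if its best chunk similarity reaches the threshold."""
    return {
        cid for cid, vec in mc_vecs.items()
        if max((cosine(vec, cv) for cv in chunk_vecs), default=0.0) >= threshold
    }


mc_vecs = {"MC-A": [1.0, 0.0], "MC-B": [0.0, 1.0]}

# Main pass: large chunks, stricter threshold.
main_chunks = [[0.9, 0.1], [0.7, 0.7]]
matched = match_ids(mc_vecs, main_chunks, threshold=0.8)

# Rescue pass: only MCs still unmatched, smaller chunks, looser threshold.
unmatched = {cid: v for cid, v in mc_vecs.items() if cid not in matched}
short_chunks = [[0.2, 0.98]]
rescued = match_ids(unmatched, short_chunks, threshold=0.6)

print(sorted(matched), sorted(rescued))
```

Running the rescue pass only over unmatched MCs keeps the extra cosine work proportional to what the main pass missed.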
@@ -101,11 +101,36 @@ def build_scorecard(check_results: list[dict]) -> dict:
    }
 _DEDUP_KEYWORDS = [
    "einfache sprache", "verstaendliche sprache", "verständliche sprache",
    "klare sprache", "einwilligungstexte", "einwilligungsaufforderung",
    "einwilligungserklaerung", "einwilligungserklärung",
    "mehrdeutige", "verstaendliche form", "verständliche form",
    "fachbegriffe erklaeren", "fachbegriffe erklären",
 ]
 def _dedup_key(label: str) -> str:
    """Map a label to a stable dedup key: if it contains one of the
    well-known repetitive plain-language / consent-prompt concepts,
    collapse it to that single concept. Otherwise return the original label."""
    l = (label or "").lower()
    for kw in _DEDUP_KEYWORDS:
        if kw in l:
            return f"_dup:{kw}"
    return label
 def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
    """Return top-N failing MCs sorted by severity then label.
    Skipped + passed MCs are excluded. INFO severity is excluded by
    default since those are guidance, not findings.
    Near-duplicates (multiple MCs that all complain about "einfache
    Sprache" / "Einwilligungsaufforderung" / ...) are collapsed to ONE
    representative entry; otherwise UI-language hints dominate the
    top list and real leaks get buried.
    """
    fails = [
        r for r in (check_results or [])
@@ -116,7 +141,17 @@ def top_fails(check_results: list[dict], n: int = 10) -> list[dict]:
        _SEV_RANK.get((r.get("severity") or "MEDIUM").upper(), 5),
        r.get("label", ""),
    ))
-    return fails[:n]
+    seen_keys: set[str] = set()
    deduped: list[dict] = []
    for r in fails:
        k = _dedup_key(r.get("label", ""))
        if k in seen_keys:
            continue
        seen_keys.add(k)
        deduped.append(r)
        if len(deduped) >= n:
            break
    return deduped
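A short sketch of how the keyword-based dedup collapses near-duplicate findings. The labels and the two-entry keyword list are invented for illustration, and `dedup_key` / `dedupe` are simplified stand-ins for `_dedup_key` and the loop in `top_fails`, assuming the input is already sorted by severity.

```python
DEDUP_KEYWORDS = ["einfache sprache", "einwilligungsaufforderung"]


def dedup_key(label: str) -> str:
    """Collapse labels that mention a known repetitive concept."""
    low = (label or "").lower()
    for kw in DEDUP_KEYWORDS:
        if kw in low:
            return f"_dup:{kw}"
    return label


def dedupe(fails: list[dict], n: int) -> list[dict]:
    """Keep the first entry per dedup key, up to n entries."""
    seen: set[str] = set()
    out: list[dict] = []
    for record in fails:
        key = dedup_key(record.get("label", ""))
        if key in seen:
            continue
        seen.add(key)
        out.append(record)
        if len(out) >= n:
            break
    return out


fails = [
    {"label": "Einfache Sprache im Banner"},
    {"label": "Speicherdauer fehlt"},
    {"label": "Texte in einfache Sprache fassen"},
    {"label": "Einwilligungsaufforderung unklar"},
]
kept = [r["label"] for r in dedupe(fails, n=10)]
print(kept)
```

Because the list is sorted before deduplication, the representative that survives is the most severe member of each cluster.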
 def full_audit_records(
@@ -37,6 +37,7 @@ async def check_document_with_controls(
    db_url: str = "",
    max_controls: int = 0,  # 0 = no limit, check ALL
    use_agent: bool = False,  # Use LLM agent for intelligent evaluation
    business_scope: set[str] | None = None,
 ) -> list[dict]:
    """Check document against ALL doc_check_controls for this doc_type.
@@ -56,7 +57,7 @@ async def check_document_with_controls(
    mapped_type = _map_doc_type(doc_type)
    # Load ALL controls for this doc_type
-    controls = await _load_controls(mapped_type, db_url, max_controls)
+    controls = await _load_controls(mapped_type, db_url, max_controls, business_scope)
    if not controls:
        logger.info("No MCs for doc_type '%s' (%s)", mapped_type, doc_title)
        return []
@@ -71,6 +72,31 @@ async def check_document_with_controls(
        if result:
            results.append(result)
    # Semantic fallback (Phase 3): MCs that failed via regex get a second
    # chance via BGE-M3 cosine similarity. BMW writes "Speicherdauer 2
    # Jahre" — the regex misses, embedding catches it.
    failed_ids = {r.get("control_id") for r in results
                  if not r.get("passed") and r.get("control_id")}
    if failed_ids:
        try:
            from compliance.services.mc_embedding_matcher import (
                ensure_mc_embeddings, embedding_match,
            )
            await ensure_mc_embeddings()  # idempotent: only embeds new MCs
            failed_mcs = [c for c in controls if c.get("control_id") in failed_ids]
            semantic_passes = await embedding_match(
                text, failed_mcs, doc_type=mapped_type,
            )
            if semantic_passes:
                for r in results:
                    cid = r.get("control_id")
                    if cid and cid in semantic_passes and not r.get("passed"):
                        r["passed"] = True
                        r["matched_text"] = "[semantischer Treffer via Embedding]"
                        r["hint"] = (r.get("hint") or "") + " (passed via Embedding-Match, BGE-M3 cosine)"
        except Exception as e:
            logger.warning("Embedding fallback skipped: %s", e, exc_info=True)
    passed = sum(1 for r in results if r["passed"])
    failed_results = [r for r in results if not r["passed"]]
    logger.info("MC results: %d passed, %d failed out of %d for '%s'",
@@ -161,6 +187,7 @@ def _check_mc_deterministic(text_lower: str, mc: dict) -> Optional[dict]:
    return {
        "id": f"mc-{control_id}",
        "control_id": control_id,
        "label": mc.get("title", "")[:80],
        "passed": passed,
        "severity": severity,
@@ -266,11 +293,72 @@ _MC_ALIAS_FALLBACK = {
 }
-async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
+def _load_text_only_ids(
    doc_type: str | None = None,
    business_scope: set[str] | None = None,
 ) -> set[str]:
    """Return control_ids that the Sonnet-classifier flagged as 'text'.
    Filters applied:
    1. check_type='text' (only doc-text-matchable MCs)
    2. doc_type matches (per-doc-type variant from v2-Sidecar)
    3. fits_doc_type=1 (LLM auditor approved this MC for this doc_type)
    4. scope_requires NULL or contained in business_scope
       (e.g. MCs with scope_requires='biometric_processing' are skipped
       on sites that don't do biometric processing; the Art. 22 FRT MC
       was a false positive for BMW)
    `business_scope` comes from the business_profiler (set of detected
    site characteristics like 'b2c', 'shop', 'biometric_processing',
    'ai_decision_making', 'child_targeting').
    Returns empty set if the sidecar doesn't exist yet.
    """
    import sqlite3
    db_path = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
    try:
        with sqlite3.connect(db_path) as c:
            cols = [r[1] for r in c.execute("PRAGMA table_info(mc_classification)")]
            has_fit = "fits_doc_type" in cols
            has_scope = "scope_requires" in cols
            fit_clause = " AND (fits_doc_type IS NULL OR fits_doc_type = 1)" if has_fit else ""
            base = ("SELECT control_id, scope_requires FROM mc_classification "
                    "WHERE check_type = 'text'" + fit_clause) if has_scope else (
                   "SELECT control_id, NULL FROM mc_classification "
                   "WHERE check_type = 'text'" + fit_clause)
            params: list = []
            if doc_type:
                base += " AND doc_type = ?"
                params.append(doc_type)
            rows = c.execute(base, params).fetchall()
            scope = business_scope or set()
            keep: set[str] = set()
            for cid, req in rows:
                if not req:
                    keep.add(cid)
                else:
                    # Multiple requirements separated by '|' — ALL must
                    # be in scope to include. Empty req tokens are skipped.
                    needed = {r.strip().lower() for r in req.split("|") if r.strip()}
                    if needed.issubset({s.lower() for s in scope}):
                        keep.add(cid)
            return keep
    except sqlite3.OperationalError:
        return set()
    except Exception as e:
        logger.warning("MC classification lookup failed: %s", e)
        return set()
 async def _load_controls(doc_type: str, db_url: str, limit: int,
                         business_scope: set[str] | None = None) -> list[dict]:
    """Load all doc_check_controls for a doc_type from PostgreSQL.
    Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
    type (e.g. 'nutzungsbedingungen' -> 'agb').
    Filters to only check_type='text' MCs when the classification sidecar
    is present — process/review MCs are routed to other modules.
    """
    try:
        import asyncpg
@@ -297,7 +385,17 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
            fallback = _MC_ALIAS_FALLBACK[doc_type]
            logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
            rows = await conn.fetch(query, fallback)
-        return [dict(r) for r in rows]
+
        controls = [dict(r) for r in rows]
        text_only = _load_text_only_ids(doc_type, business_scope)
        if text_only:
            before = len(controls)
            controls = [c for c in controls if c.get("control_id") in text_only]
            logger.info(
                "MC filter (text only) for %s: %d/%d MCs after Sonnet check_type filter",
                doc_type, len(controls), before,
            )
        return controls
    except Exception as e:
        logger.warning("MC query failed: %s", e)
        return []
@@ -0,0 +1,407 @@
 """
 Vendor cost estimator: derives a pricing tier per vendor from cookie
 signals and returns an intensity-based estimate of annual cost.
 Cookie signals evaluated:
  - Number of cookies per vendor (proxy for module depth)
  - Premium-feature cookies (e.g. 's_target_qa', '_ab_test' → enterprise add-on)
  - Edge/region cookies (multi-region → premier-tier CDN)
  - Cookie persistence (multi-year → heavy-tracking licence)
 Plus business_profile for company-tier inference.
 Output per vendor:
  - inferred_tier: 'starter' | 'professional' | 'enterprise' | 'premier'
  - tier_signals : list of indicators that led to the tier
  - cost_year_eur_range: (low, high) based on tier × vendor pricing
  - confidence: 'low' | 'medium' | 'high' ('none' if no pricing entry matches)
 This module complements vendor_redundancy.py: the simple low/high flat
 rates there are replaced here by dynamic, signal-based values.
 """
 from __future__ import annotations
 import logging
 import re
 from typing import Iterable
 logger = logging.getLogger(__name__)
 # ─── Premium-feature cookies: indicator of enterprise add-ons ──────
 #
 # If a vendor sets these cookies, the customer is very likely on an
 # enterprise plan.
 _PREMIUM_FEATURE_PATTERNS: list[tuple[str, str, str]] = [
    # (regex, vendor_key, premium_feature_label)
    (r"^s_target_qa$",             "adobe analytics", "Adobe Target Add-on"),
    (r"adobe.*target",             "adobe target",    "Personalization Enterprise"),
    (r"^aam_uuid",                 "adobe analytics", "Audience Manager Enterprise"),
    (r"^s_ecid",                   "adobe analytics", "Experience Cloud ID Service"),
    (r"^_pcid_",                   "adobe analytics", "People-Based Destinations"),
    (r"^_gat_gtag_UA",             "google analytics", "GA360 Multi-Tracker"),
    (r"^_ga_[A-Z0-9]+_[A-Z0-9]+",  "google analytics", "GA4 Enterprise Stream"),
    (r"^_uetmsdns",                "microsoft advertising", "Custom Conversion Tracking"),
    (r"^_fbp.*test",               "meta pixel",      "Conversions API Premium"),
    (r"^_pin_unauth_premium",      "pinterest",       "Pinterest Premium-API"),
    (r"^afm",                      "adform",          "Affinity-Module"),
    (r"^cto_dna",                  "criteo",          "Dynamic Retargeting Premium"),
    # CDN / Infra Premium
    (r"^aws-alb-[a-z0-9]+",        "amazon web services", "ALB + Multi-Region"),
    (r"^aws-waf",                  "amazon web services", "WAF Enterprise"),
    (r"^cf_clearance",             "cloudflare",      "Bot-Management Pro"),
    (r"^akm_[a-z]+",               "akamai",          "Adaptive Media Delivery Enterprise"),
    # Salesforce Customer-360
    (r"^bid_n_",                   "salesforce",      "Marketing Cloud Personalization"),
    (r"^_cs_",                     "salesforce",      "CDP Premium"),
 ]
 # ─── Tier pricing per vendor (annual, EUR) ──────────────────────────
 #
 # 4 tiers: starter (SME), professional (mid-market), enterprise (large
 # corporate group), premier (global brand / heavy user).
 _TIER_PRICING: dict[str, dict[str, tuple[int, int]]] = {
    "adobe analytics": {
        "starter":      ( 10_000,  30_000),
        "professional": ( 60_000, 150_000),
        "enterprise":   (200_000, 500_000),
        "premier":      (500_000, 900_000),
    },
    "adobe target": {
        "starter":      (  8_000,  25_000),
        "professional": ( 40_000, 100_000),
        "enterprise":   (120_000, 300_000),
        "premier":      (300_000, 600_000),
    },
    "adobe campaign": {
        "starter":      ( 10_000,  30_000),
        "professional": ( 40_000, 100_000),
        "enterprise":   (120_000, 280_000),
        "premier":      (280_000, 500_000),
    },
    "google analytics": {
        "starter":      (      0,      0),  # GA4 free
        "professional": (      0,      0),
        "enterprise":   ( 80_000, 150_000),  # GA360
        "premier":      (150_000, 300_000),
    },
    "matomo": {
        "starter":      (      0,   3_000),  # On-prem free / Cloud Starter
        "professional": (  6_000,  20_000),
        "enterprise":   ( 20_000,  80_000),
        "premier":      ( 60_000, 150_000),
    },
    "content square": {
        "starter":      ( 12_000,  40_000),
        "professional": ( 60_000, 150_000),
        "enterprise":   (150_000, 350_000),
        "premier":      (350_000, 700_000),
    },
    "contentsquare": {
        "starter":      ( 12_000,  40_000),
        "professional": ( 60_000, 150_000),
        "enterprise":   (150_000, 350_000),
        "premier":      (350_000, 700_000),
    },
    "dynatrace": {
        "starter":      (  5_000,  15_000),
        "professional": ( 30_000,  80_000),
        "enterprise":   (100_000, 300_000),
        "premier":      (300_000, 800_000),
    },
    "qualtrics": {
        "starter":      (  6_000,  20_000),
        "professional": ( 30_000,  80_000),
        "enterprise":   ( 80_000, 200_000),
        "premier":      (200_000, 500_000),
    },
    # Advertising / Retargeting (licence + self-service; media spend SEPARATE)
    "criteo": {
        "starter":      (  6_000,  20_000),
        "professional": ( 30_000,  80_000),
        "enterprise":   ( 80_000, 250_000),
        "premier":      (250_000, 600_000),
    },
    "adform": {
        "starter":      ( 12_000,  40_000),
        "professional": ( 60_000, 150_000),
        "enterprise":   (150_000, 400_000),
        "premier":      (400_000, 800_000),
    },
    "outbrain": {
        "starter":      (  6_000,  20_000),
        "professional": ( 30_000,  80_000),
        "enterprise":   ( 80_000, 200_000),
        "premier":      (200_000, 500_000),
    },
    "taboola": {
        "starter":      (  6_000,  20_000),
        "professional": ( 30_000,  80_000),
        "enterprise":   ( 80_000, 200_000),
        "premier":      (200_000, 500_000),
    },
    "teads": {
        "starter":      (  6_000,  18_000),
        "professional": ( 20_000,  60_000),
        "enterprise":   ( 60_000, 150_000),
        "premier":      (150_000, 350_000),
    },
    "pinterest": {
        "starter":      (  3_000,  15_000),
        "professional": ( 15_000,  50_000),
        "enterprise":   ( 50_000, 150_000),
        "premier":      (150_000, 400_000),
    },
    "linkedin insight": {
        "starter":      (  3_000,  12_000),
        "professional": ( 12_000,  40_000),
        "enterprise":   ( 40_000, 120_000),
        "premier":      (120_000, 300_000),
    },
    # CDN / Cloud
    "akamai": {
        "starter":      ( 20_000,  60_000),
        "professional": ( 80_000, 200_000),
        "enterprise":   (200_000, 500_000),
        "premier":      (500_000, 1_500_000),
    },
    "amazon web services": {
        "starter":      ( 12_000,  60_000),
        "professional": ( 60_000, 300_000),
        "enterprise":   (300_000, 1_500_000),
        "premier":      (1_500_000, 8_000_000),
    },
    "baqend": {
        "starter":      (  3_000,  12_000),
        "professional": ( 12_000,  40_000),
        "enterprise":   ( 40_000, 120_000),
        "premier":      (120_000, 300_000),
    },
    "speedkit": {
        "starter":      (  3_000,  12_000),
        "professional": ( 12_000,  40_000),
        "enterprise":   ( 40_000, 120_000),
        "premier":      (120_000, 300_000),
    },
    "speedcurve": {
        "starter":      (  1_200,   4_800),
        "professional": (  6_000,  18_000),
        "enterprise":   ( 18_000,  60_000),
        "premier":      ( 60_000, 120_000),
    },
    # CRM / Marketing
    "salesforce": {
        "starter":      ( 20_000,  60_000),
        "professional": ( 80_000, 250_000),
        "enterprise":   (250_000, 800_000),
        "premier":      (800_000, 2_500_000),
    },
    "genesys": {
        "starter":      ( 24_000,  80_000),
        "professional": ( 80_000, 250_000),
        "enterprise":   (250_000, 800_000),
        "premier":      (800_000, 2_000_000),
    },
    # Captcha
    "hcaptcha": {
        "starter":      (      0,   2_400),
        "professional": (  2_400,  12_000),
        "enterprise":   ( 12_000,  40_000),
        "premier":      ( 40_000, 100_000),
    },
    # Lead-Tracking
    "salesviewer": {
        "starter":      (  1_200,   3_600),
        "professional": (  3_600,  12_000),
        "enterprise":   ( 12_000,  40_000),
        "premier":      ( 40_000, 100_000),
    },
 }
 def _vendor_key(vendor_name: str) -> str | None:
    """Map a vendor name to a known pricing-table key."""
    n = (vendor_name or "").lower()
    for k in _TIER_PRICING:
        if k in n:
            return k
    return None
 def infer_company_tier(business_profile: dict | None) -> str:
    """Coarse company-tier from business profile.
    Used as the baseline when vendor-specific signals are weak.
    """
    if not business_profile:
        return "professional"
    bp = business_profile
    features = {f.lower() for f in (bp.get("features") or [])}
    btype = (bp.get("type") or "").lower()
    # Heavy enterprise-only signals
    if any(f in features for f in ("multi_country", "konzern", "enterprise",
                                    "international", "automotive", "banking",
                                    "luxury", "premium")):
        return "premier"
    # Large but maybe single-country
    if "shop" in features or "konfigurator" in features or btype == "b2c":
        return "enterprise"
    return "professional"
 def infer_vendor_tier(vendor: dict, company_tier: str) -> tuple[str, list[str]]:
    """Infer pricing tier for a single vendor from its cookie footprint.
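    Signals (more signals raise confidence; only some raise the tier):
      - cookie_count >= 60           → +1 tier
      - cookie_count >= 30           → signal only, no tier change
      - premium-feature cookie hit   → +1 tier per matching pattern
      - >= 60% third-party cookies (with >= 10 cookies) → signal only
      - >= 3 cookies with expiry >= 2 years → signal only
    The tier index starts at company_tier and is clamped at 'premier'.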
    """
    cookies = vendor.get("cookies") or []
    n_cookies = len(cookies)
    cookie_names = [c.get("name", "").lower() for c in cookies]
    signals: list[str] = []
    base_tiers = ["starter", "professional", "enterprise", "premier"]
    # Start at company-tier as baseline
    idx = base_tiers.index(company_tier) if company_tier in base_tiers else 1
    if n_cookies >= 60:
        idx = min(len(base_tiers) - 1, idx + 1)
        signals.append(f"{n_cookies} Cookies (sehr hohe Modul-Tiefe)")
    elif n_cookies >= 30:
        signals.append(f"{n_cookies} Cookies (hohe Modul-Tiefe)")
    # Premium feature detection
    vk = _vendor_key(vendor.get("name", ""))
    for pattern, expected_key, feature_label in _PREMIUM_FEATURE_PATTERNS:
        if vk and vk != expected_key and expected_key not in (vendor.get("name") or "").lower():
            continue
        for cn in cookie_names:
            if re.search(pattern, cn):
                idx = min(len(base_tiers) - 1, idx + 1)
                signals.append(f"Premium-Feature-Cookie: {feature_label}")
                break
    # Heavy third-party tracking
    third_party_ratio = sum(1 for c in cookies if c.get("is_third_party")) / max(n_cookies, 1)
    if third_party_ratio >= 0.6 and n_cookies >= 10:
        signals.append(f"{int(third_party_ratio * 100)}% Drittanbieter-Cookies — Tracking-Heavy")
    # Long-lived cookies
    long_lived = sum(1 for c in cookies if _expiry_years(c.get("expiry", "")) >= 2)
    if long_lived >= 3:
        signals.append(f"{long_lived} Cookies mit ≥2 Jahre Speicherdauer")
    return base_tiers[idx], signals
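The tier arithmetic in `infer_vendor_tier` comes down to a clamped index bump over an ordered tier list. The sketch below isolates that step; `bump_tier` is a hypothetical helper, and the fallback to index 1 (`professional`) for unknown tiers mirrors the baseline handling above.

```python
TIERS = ["starter", "professional", "enterprise", "premier"]


def bump_tier(tier: str, steps: int) -> str:
    """Raise a tier by `steps`, never going past the highest tier."""
    idx = TIERS.index(tier) if tier in TIERS else 1  # unknown → professional
    return TIERS[min(len(TIERS) - 1, idx + steps)]


print(bump_tier("professional", 1))
print(bump_tier("enterprise", 3))
print(bump_tier("unknown", 0))
```

Clamping means that several premium-feature hits on an already high baseline all land on `premier`; the number of signals then only affects the confidence label.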
 def _expiry_years(expiry_str: str) -> float:
    """Rough parse: '2 Jahre' → 2.0, '24 Monate' → 2.0, '90 Tage' → ~0.25 (90/365).
    Only the first matching unit is used; unparseable input returns 0.0."""
    s = (expiry_str or "").lower()
    m = re.search(r"(\d+)\s*(jahr|year)", s)
    if m: return float(m.group(1))
    m = re.search(r"(\d+)\s*(monat|month)", s)
    if m: return float(m.group(1)) / 12.0
    m = re.search(r"(\d+)\s*(tag|day)", s)
    if m: return float(m.group(1)) / 365.0
    return 0.0
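A compact, table-driven equivalent of the expiry parser, shown here to make its edge cases visible. `expiry_years` is a hypothetical re-implementation for illustration and follows the same unit order (years, then months, then days) as `_expiry_years`.

```python
import re

# (pattern, units per year), checked in this order; first hit wins.
_UNITS = (
    (r"(\d+)\s*(jahr|year)", 1.0),
    (r"(\d+)\s*(monat|month)", 12.0),
    (r"(\d+)\s*(tag|day)", 365.0),
)


def expiry_years(expiry: str) -> float:
    """Convert a human-readable expiry string to (approximate) years."""
    s = (expiry or "").lower()
    for pattern, per_year in _UNITS:
        m = re.search(pattern, s)
        if m:
            return float(m.group(1)) / per_year
    return 0.0


for text in ("2 Jahre", "24 Monate", "90 Tage", "Session", "1 Jahr 6 Monate"):
    print(text, "->", round(expiry_years(text), 3))
```

Compound values such as "1 Jahr 6 Monate" are read as 1.0 years because only the first matching unit counts, which is acceptable for the coarse `>= 2` years check.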
 def estimate_vendor_cost(vendor: dict, business_profile: dict | None = None) -> dict:
    """Return cost estimation for one vendor incl. tier inference + signals."""
    vk = _vendor_key(vendor.get("name", ""))
    company_tier = infer_company_tier(business_profile)
    if not vk:
        return {
            "vendor": vendor.get("name", ""),
            "matched_pricing_key": None,
            "inferred_tier": None,
            "tier_signals": [],
            "company_tier_baseline": company_tier,
            "cost_year_eur_range": (0, 0),
            "confidence": "none",
            "note": "Kein Pricing-Eintrag fuer diesen Anbieter — Saving-Schaetzung uebergangen.",
        }
    tier, signals = infer_vendor_tier(vendor, company_tier)
    pricing = _TIER_PRICING[vk].get(tier) or (0, 0)
    confidence = "high" if len(signals) >= 2 else ("medium" if signals else "low")
    return {
        "vendor": vendor.get("name", ""),
        "matched_pricing_key": vk,
        "inferred_tier": tier,
        "tier_signals": signals,
        "company_tier_baseline": company_tier,
        "cost_year_eur_range": pricing,
        "confidence": confidence,
    }
 def estimate_total_stack_cost(
    vendors: Iterable[dict],
    business_profile: dict | None = None,
 ) -> dict:
    """Aggregate cost estimation over all vendors.
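    Returns a dict with:
      - per_vendor: one estimate per vendor
      - total_year_eur_range: summed (low, high) range
      - master_contracts_counted: number of distinct contract keys counted
      - disclaimer: user-facing caveat text
    Master-contract dedup: vendors with recipient_type 'INTERNAL' (the
    site owner's own lines, e.g. 'BMW AG — ...') are bundled into ONE
    master contract per pricing key and are not double-counted.
    External vendors are summed once per entry.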
    """
    per_vendor: list[dict] = []
    seen_master_keys: set[tuple[str, str]] = set()
    total_low = 0
    total_high = 0
    for v in vendors:
        est = estimate_vendor_cost(v, business_profile)
        per_vendor.append(est)
        if not est["matched_pricing_key"]:
            continue
        rtype = (v.get("recipient_type") or "").upper()
        master_key = (est["matched_pricing_key"], rtype if rtype == "INTERNAL" else "EXT")
        if rtype == "INTERNAL" and master_key in seen_master_keys:
            # Same Adobe contract serves many "BMW AG — Adobe XYZ" lines —
            # count cost only ONCE per (key, internal).
            est["bundled_into_master_contract"] = True
            continue
        seen_master_keys.add(master_key)
        lo, hi = est["cost_year_eur_range"]
        total_low += lo
        total_high += hi
    return {
        "per_vendor": per_vendor,
        "total_year_eur_range": (total_low, total_high),
        "master_contracts_counted": len(seen_master_keys),
        "disclaimer": (
            "Schaetzung basiert auf Cookie-Signalen (Anzahl, Premium-Feature-Detection, "
            "Drittanbieter-Quote, Lebensdauer) + Listpreisen pro Tier. Konzern-Konditionen "
            "koennen 30-50% darunter liegen. Eintraege derselben Eigenmarke werden zu EINEM "
            "Master-Vertrag aggregiert. Media-Spend ist NICHT enthalten."
        ),
    }
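The master-contract rule can be sketched with pre-computed estimates instead of full vendor dicts. The tuple layout `(pricing_key, recipient_type, (low, high))` and the helper name `total_range` are assumptions for illustration; the price ranges are taken from the pricing table above.

```python
def total_range(estimates: list[tuple[str, str, tuple[int, int]]]) -> tuple[int, int]:
    """Sum cost ranges, counting INTERNAL lines once per pricing key."""
    seen: set[tuple[str, str]] = set()
    low_sum = high_sum = 0
    for key, rtype, (low, high) in estimates:
        master = (key, "INTERNAL" if rtype == "INTERNAL" else "EXT")
        if rtype == "INTERNAL" and master in seen:
            continue  # same contract already counted
        seen.add(master)
        low_sum += low
        high_sum += high
    return low_sum, high_sum


estimates = [
    ("adobe analytics", "INTERNAL", (200_000, 500_000)),
    ("adobe analytics", "INTERNAL", (200_000, 500_000)),  # bundled
    ("criteo", "PROCESSOR", (6_000, 20_000)),
    ("criteo", "PROCESSOR", (6_000, 20_000)),             # counted again
]
print(total_range(estimates))
```

Only `INTERNAL` lines are deduplicated, so two external entries that resolve to the same pricing key are still summed twice.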
@@ -0,0 +1,727 @@
 """
 Vendor Redundancy + EU-Alternatives Analyzer.
 Input: list of vendors from the CMP capture (e.g. BMW: 90 vendors).
 Output: four structured lists that are rendered in the email and the
 migration modal:
  1. functional_categories : vendor → functional class (analytics,
     advertising, cdn, captcha, chat, …)
  2. redundancies          : categories with ≥2 vendors that do the same job
                             → consolidation potential
  3. eu_alternatives       : for each US vendor, a suitable EU replacement
                             from a curated lookup table (Matomo instead of
                             Adobe Analytics, IONOS instead of AWS, etc.)
  4. multi_function_tools  : EU tools that cover several categories
                             (e.g. SAP CX = analytics + CRM + marketing)
 """
 from __future__ import annotations
 import logging
 import re
 from collections import defaultdict
 from typing import Iterable
 logger = logging.getLogger(__name__)
 # ─── Categorisation ───────────────────────────────────────────────────
 # Substring match (lowercase) → category. First match wins, so more
 # specific needles must come before more general ones.
 _CATEGORY_RULES: list[tuple[str, str]] = [
    # Web Analytics / Behavior
    ("adobe analytics",        "web_analytics"),
    ("adobe target",           "personalisation"),
    ("adobe campaign",         "marketing_automation"),
    ("adobe staging library",  "tag_management"),
    ("adobelaunch",            "tag_management"),
    ("google analytics",       "web_analytics"),
    ("matomo",                 "web_analytics"),
    ("hotjar",                 "web_analytics"),
    ("content square",         "web_analytics"),
    ("contentsquare",          "web_analytics"),
    ("dynatrace",              "monitoring"),
    ("performance analytics",  "web_analytics"),
    ("form analytics",         "web_analytics"),
    ("form campaign analytics","web_analytics"),
    ("psyma",                  "survey"),
    ("qualtrics",              "survey"),
    # Tag Management
    ("google tag manager",     "tag_management"),
    ("gtm",                    "tag_management"),
    # Advertising / Retargeting
    ("google ads",             "advertising"),
    ("google advertising",     "advertising"),
    ("doubleclick",            "advertising"),
    ("googleads",              "advertising"),
    ("meta pixel",             "advertising"),
    ("meta platforms",         "advertising"),
    ("facebook",               "advertising"),
    ("adform",                 "advertising"),
    ("criteo",                 "advertising"),
    ("outbrain",               "advertising"),
    ("taboola",                "advertising"),
    ("teads",                  "advertising"),
    ("pinterest",              "advertising"),
    ("linkedin insight",       "advertising"),
    ("youtube performance",    "advertising"),
    ("youtube player",         "external_media"),
    ("amazon advertising",     "advertising"),
    ("instagram",              "advertising"),
    ("dotaki",                 "advertising"),
    # Video / Embeds
    ("youtube",                "external_media"),
    ("vimeo",                  "external_media"),
    ("jw player",              "external_media"),
    ("jw video",               "external_media"),
    ("jwplayer",               "external_media"),
    ("jwconnatix",             "external_media"),
    # Maps / Geo
    ("google maps",            "maps"),
    ("google geolocation",     "maps"),
    ("geolocation",            "maps"),
    # CDN / Infrastructure
    ("akamai",                 "cdn"),
    ("amazon web services",    "cloud_infra"),
    ("aws",                    "cloud_infra"),
    ("baqend",                 "cdn"),
    ("speedkit",               "cdn"),
    ("speedcurve",             "monitoring"),
    ("salesforce",             "crm"),
    # Chat / Support
    ("genesys",                "chat"),
    ("ckm",                    "chat"),
    ("chat widget",            "chat"),
    # Captcha / Bot-Protection
    ("hcaptcha",               "captcha"),
    ("recaptcha",              "captcha"),
    # Sales / Lead-Tracking
    ("salesviewer",            "lead_tracking"),
    # Marketing/Sales overlay
    ("nayoki",                 "social_aggregator"),
    # Site-owned functions
    ("infrastructure",         "site_infra"),
    ("infrastrukturbereit",    "site_infra"),
    ("javaserverpages",        "site_infra"),
    ("single sign-on",         "auth"),
    ("mybmw account",          "auth"),
    ("sso",                    "auth"),
    ("consent",                "consent_management"),
    ("session",                "site_infra"),
    ("scroll",                 "site_infra"),
    ("sticky",                 "site_infra"),
    ("sidebar",                "site_infra"),
    ("dealer search",          "site_feature"),
    ("test drive",             "site_feature"),
    ("vehicle configurator",   "site_feature"),
    ("stocklocator",           "site_feature"),
    ("eshop",                  "site_feature"),
    ("shop",                   "site_feature"),
    ("language",               "site_infra"),
    ("sprach",                 "site_infra"),
    ("region",                 "site_infra"),
    ("ip popup",               "site_infra"),
    ("popup",                  "site_infra"),
    ("dynatrace",              "monitoring"),
 ]
 def classify_vendor(name: str) -> str:
    """Map a vendor name to a functional category."""
    n = (name or "").lower()
    for needle, cat in _CATEGORY_RULES:
        if needle in n:
            return cat
    return "other"
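# Illustrative sketch (not part of the original module): classify_vendor
# matches by plain substring, so short needles such as "sso" or "aws" can
# also hit inside unrelated words ("Associates" contains "sso"). The
# hypothetical helper below shows a word-boundary variant on a tiny sample
# rule list; the rule subset and the function name are assumptions made for
# this example only.
import re

_SAMPLE_RULES = [("single sign-on", "auth"), ("sso", "auth"), ("aws", "cloud_infra")]

def classify_vendor_word_boundary(name, rules=_SAMPLE_RULES):
    """Like classify_vendor, but a needle must match as a whole word."""
    n = (name or "").lower()
    for needle, cat in rules:
        # re.escape keeps needles with punctuation (e.g. "sign-on") literal
        if re.search(rf"\b{re.escape(needle)}\b", n):
            return cat
    return "other"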
 # ─── EU alternatives ─────────────────────────────────────────────────
 # Curated list: for each US/non-EU vendor, one or more suitable EU replacements.
 # Sources: Matomo comparison, etracker SoMo study, IONOS packages,
 # Friendly Captcha whitepaper, SAP CX suite, Brevo / CleverReach DE lists.
 _EU_ALTERNATIVES: dict[str, list[dict]] = {
    "adobe analytics": [
        {"name": "Matomo (On-Premise)", "vendor": "InnoCraft", "country": "DE-self-hosted",
         "license": "GPL", "notes": "100% DSGVO, keine 3rd-Country, gleicher Funktionsumfang"},
        {"name": "etracker Analytics", "vendor": "etracker GmbH", "country": "DE",
         "license": "Commercial", "notes": "DSGVO-konform aus Hamburg, IP-Anonymisierung"},
        {"name": "Mapp Intelligence", "vendor": "Mapp Digital", "country": "DE",
         "license": "Commercial", "notes": "Enterprise-Alternative, Server in DE"},
    ],
    "google analytics": [
        {"name": "Matomo", "vendor": "InnoCraft", "country": "DE-self-hosted",
         "license": "GPL", "notes": "Direkter Drop-in-Ersatz mit GA-Migrationspfad"},
        {"name": "Plausible Analytics", "vendor": "Plausible Insights", "country": "EE",
         "license": "AGPL/Commercial", "notes": "Cookielos, ohne Einwilligung nutzbar"},
        {"name": "Fathom Analytics EU", "vendor": "Fathom", "country": "DE-Region",
         "license": "Commercial", "notes": "Cookielos, EU-Hosting"},
    ],
    "content square": [
        {"name": "Mouseflow EU", "vendor": "Mouseflow ApS", "country": "DK",
         "license": "Commercial", "notes": "Session-Recording + Heatmaps EU-Hosting"},
        {"name": "Hotjar EU", "vendor": "Hotjar Ltd", "country": "MT",
         "license": "Commercial", "notes": "EU-DataCenter (Frankfurt), Einwilligung erforderlich"},
    ],
    "dynatrace": [
        {"name": "Dynatrace EU", "vendor": "Dynatrace", "country": "AT",
         "license": "Commercial", "notes": "Bereits EU (Linz). Cluster auf EU einstellen"},
    ],
    "speedcurve": [
        {"name": "SpeedCurve EU", "vendor": "SpeedCurve", "country": "EU-tenant",
         "license": "Commercial", "notes": "Region-Tenant explizit konfigurieren"},
        {"name": "Calibre", "vendor": "Calibre", "country": "AU/EU",
         "license": "Commercial", "notes": "Performance Monitoring, EU-Region"},
    ],
    "akamai": [
        {"name": "Bunny CDN", "vendor": "BunnyWay d.o.o.", "country": "SI",
         "license": "Commercial", "notes": "Slowenischer CDN, EU-Backbone"},
        {"name": "Cloudflare EU-Only", "vendor": "Cloudflare", "country": "Multi",
         "license": "Commercial", "notes": "EU-Datacenter erzwingbar via 'Geo Steering'"},
        {"name": "IONOS CDN", "vendor": "IONOS SE", "country": "DE",
         "license": "Commercial", "notes": "100% DE-Hosting"},
    ],
    "amazon web services": [
        {"name": "IONOS Cloud", "vendor": "IONOS SE", "country": "DE",
         "license": "Commercial", "notes": "DE-Hosting, BSI C5-zertifiziert"},
        {"name": "OVHcloud", "vendor": "OVH SAS", "country": "FR",
         "license": "Commercial", "notes": "FR-Hosting, SecNumCloud-zertifiziert"},
        {"name": "Hetzner Cloud", "vendor": "Hetzner Online GmbH", "country": "DE",
         "license": "Commercial", "notes": "DE/FI-Hosting, sehr kostenguenstig"},
        {"name": "STACKIT", "vendor": "Schwarz IT (Lidl-Gruppe)", "country": "DE",
         "license": "Commercial", "notes": "Souveraener DE-Cloud, fuer Enterprise"},
    ],
    "salesforce": [
        {"name": "SAP Customer Experience", "vendor": "SAP SE", "country": "DE",
         "license": "Commercial", "notes": "Vollstaendige CRM-Suite EU-Hosting"},
        {"name": "weclapp", "vendor": "weclapp SE", "country": "DE",
         "license": "Commercial", "notes": "Cloud-CRM aus Marburg"},
    ],
    "adobe campaign": [
        {"name": "CleverReach", "vendor": "CleverReach GmbH", "country": "DE",
         "license": "Commercial", "notes": "E-Mail-Marketing DE-Hosting"},
        {"name": "Brevo (Sendinblue)", "vendor": "Brevo", "country": "FR",
         "license": "Commercial", "notes": "Marketing-Automation EU-Hosting"},
        {"name": "Inxmail", "vendor": "Inxmail GmbH", "country": "DE",
         "license": "Commercial", "notes": "Enterprise-E-Mail-Marketing aus Freiburg"},
    ],
    "google ads": [
        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
         "license": "Commercial", "notes": "FR-Hosting, Programmatic + Direct-Sold"},
        {"name": "Bing Ads (Microsoft Advertising EU)", "vendor": "Microsoft", "country": "Multi",
         "license": "Commercial", "notes": "EU-Datacenter optional"},
    ],
    "google maps": [
        {"name": "HERE Maps", "vendor": "HERE Technologies", "country": "DE",
         "license": "Commercial", "notes": "Berliner Anbieter, professionelle Karten + Routing"},
        {"name": "OpenStreetMap (self-host)", "vendor": "OSM Foundation", "country": "DE-self-host",
         "license": "ODbL", "notes": "Frei, OSM-Tiles self-hosted oder via Maptiler EU"},
        {"name": "Maptiler Cloud EU", "vendor": "MapTiler", "country": "CH",
         "license": "Commercial", "notes": "Schweizer Anbieter, EU-Tiles"},
    ],
    "criteo": [  # criteo IS EU but use as example for retargeting alts
        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
         "license": "Commercial", "notes": "Retargeting + Display, FR-Hosting"},
    ],
    "hcaptcha": [
        {"name": "Friendly Captcha", "vendor": "Friendly Captcha GmbH", "country": "DE",
         "license": "Commercial", "notes": "100% DSGVO, ohne Cookie, Hosting in DE"},
        {"name": "Turnstile (Cloudflare EU-Only)", "vendor": "Cloudflare", "country": "Multi",
         "license": "Commercial", "notes": "Ohne Cookie, EU-Region erzwingbar"},
    ],
    "qualtrics": [
        {"name": "LamaPoll", "vendor": "Lamano GmbH", "country": "DE",
         "license": "Commercial", "notes": "DSGVO-Surveys aus Berlin"},
        {"name": "evasys", "vendor": "evasys GmbH", "country": "DE",
         "license": "Commercial", "notes": "Enterprise-Survey-Plattform aus Lueneburg"},
    ],
    "meta pixel": [
        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
         "license": "Commercial", "notes": "EU-Alternative fuer Conversion-Tracking"},
    ],
    "facebook": [
        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
         "license": "Commercial", "notes": "Programmatic ohne Meta"},
    ],
    "linkedin insight": [
        {"name": "Xing Insights", "vendor": "New Work SE", "country": "DE",
         "license": "Commercial", "notes": "DE/AT/CH B2B-Targeting aus Hamburg"},
    ],
    "outbrain": [
        {"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
         "license": "Commercial", "notes": "Native Advertising aus Berlin"},
    ],
    "taboola": [
        {"name": "Plista", "vendor": "Plista GmbH", "country": "DE",
         "license": "Commercial", "notes": "Native Advertising aus Berlin"},
    ],
    "genesys": [
        {"name": "Userlike", "vendor": "Userlike UG", "country": "DE",
         "license": "Commercial", "notes": "Live-Chat aus Koeln, BSI-konform"},
        {"name": "LiveZilla / EasyChat EU", "vendor": "LiveZilla GmbH", "country": "DE",
         "license": "Commercial", "notes": "DSGVO-Live-Chat"},
    ],
    "salesviewer": [
        {"name": "Leadinfo", "vendor": "Leadinfo BV", "country": "NL",
         "license": "Commercial", "notes": "B2B-Webvisitor-Tracking EU"},
        {"name": "Albacross EU", "vendor": "Albacross", "country": "SE",
         "license": "Commercial", "notes": "EU-Tenant verfuegbar"},
    ],
    "youtube": [
        {"name": "Vimeo Pro EU", "vendor": "Vimeo", "country": "Multi",
         "license": "Commercial", "notes": "EU-Region waehlbar, weniger Tracking"},
        {"name": "Self-hosted video (BunnyStream)", "vendor": "BunnyWay", "country": "SI",
         "license": "Commercial", "notes": "Eigene Player + CDN ohne Drittanbieter"},
    ],
    "amazon advertising": [
        {"name": "Smart AdServer (Equativ)", "vendor": "Equativ", "country": "FR",
         "license": "Commercial", "notes": "Retail-Media-Alternative FR"},
    ],
    "instagram": [
        {"name": "Pinterest EU + Owned-Channels", "vendor": "Mix", "country": "Multi",
         "license": "Commercial", "notes": "Owned-Channels (Newsletter via CleverReach)"},
    ],
 }
 # ─── Cost assumptions (public list prices, estimates) ─────────────
 #
 # Format: (low_year_eur, high_year_eur, tier_assumption)
 # Tier: 'sme' = <100 employees, 'mid' = 100-1000, 'ent' = >1000.
 # Sources: public list prices + industry benchmarks (Gartner,
 # Forrester 2025). Actual contract terms can deviate by 30-70%
 # (volume discounts, bundling). The output explicitly labels these
 # values as an estimate range ('Schaetzbereich').
 _COST_LOOKUP: dict[str, tuple[int, int, str]] = {
    "adobe analytics":      (120_000, 600_000, "ent"),
    "adobe target":         ( 80_000, 350_000, "ent"),
    "adobe campaign":       ( 60_000, 250_000, "ent"),
    "adobe staging library":(      0,       0, "ent"),  # bundled
    "google analytics":     (      0, 150_000, "ent"),  # GA4 free, GA360 ~150k
    "matomo":               (  6_000,  30_000, "mid"),  # Cloud/On-Prem
    "hotjar":               (  3_600,  18_000, "mid"),
    "content square":       ( 60_000, 300_000, "ent"),
    "contentsquare":        ( 60_000, 300_000, "ent"),
    "dynatrace":            ( 50_000, 400_000, "ent"),  # per-host pricing
    "performance analytics":(  5_000,  40_000, "mid"),
    "qualtrics":            ( 25_000, 150_000, "ent"),
    # Self-service advertising: NO tool licence, only media spend (tracked separately).
    # We count 0 here because the licence saving potential is 0.
    # Consolidation would only reduce media spend, which is a separate topic.
    "google ads":           (      0,       0, "ent"),
    "google advertising":   (      0,       0, "ent"),
    "doubleclick":          (      0,       0, "ent"),
    "meta pixel":           (      0,       0, "ent"),
    "facebook":             (      0,       0, "ent"),
    "amazon advertising":   (      0,       0, "ent"),
    "youtube performance":  (      0,       0, "ent"),
    "youtube player":       (      0,       0, "ent"),
    "instagram":            (      0,       0, "ent"),
    # Real DSP/platform licences: here the customer pays a SaaS fee
    # ON TOP of the media spend. Range deliberately kept narrow (factor of at most 5x).
    "adform":               ( 80_000,  300_000, "ent"),
    "criteo":               ( 50_000,  200_000, "ent"),
    "outbrain":             ( 30_000,  120_000, "ent"),
    "taboola":              ( 30_000,  120_000, "ent"),
    "teads":                ( 25_000,  100_000, "ent"),
    "pinterest":            ( 15_000,   60_000, "ent"),
    "linkedin insight":     ( 10_000,   50_000, "ent"),
    "google maps":          (  2_000,  30_000, "mid"),
    "akamai":               ( 50_000, 500_000, "ent"),
    "amazon web services":  (100_000, 3_000_000, "ent"),
    "baqend":               (  6_000,  60_000, "mid"),
    "speedkit":             (  6_000,  60_000, "mid"),
    "speedcurve":           (  2_400,  24_000, "mid"),
    "salesforce":           (100_000, 1_500_000, "ent"),  # CRM seats
    "genesys":              ( 80_000, 800_000, "ent"),  # contact-center seats
    "ckm":                  ( 15_000, 120_000, "mid"),
    "hcaptcha":             (      0,  12_000, "sme"),  # free tier OR pro
    "salesviewer":          (  3_600,  18_000, "mid"),
    "youtube":              (      0,  50_000, "ent"),  # embed kostenlos, Production-Kosten variieren
 }
 # ─── EU alternative costs: (low_year_eur, high_year_eur), no tier field ───
 _EU_ALT_COSTS: dict[str, tuple[int, int]] = {
    "Matomo (On-Premise)":          (  3_000,   15_000),
    "Matomo (Pro / Cloud EU)":      (  6_000,   30_000),
    "Matomo":                       (  6_000,   30_000),
    "etracker Analytics":           ( 10_000,   60_000),
    "Mapp Intelligence":            ( 40_000,  200_000),
    "Plausible Analytics":          (    240,    6_000),
    "Fathom Analytics EU":          (    240,    6_000),
    "Mouseflow EU":                 ( 12_000,   60_000),
    "Hotjar EU":                    (  3_600,   18_000),
    "Dynatrace EU":                 ( 50_000,  400_000),  # gleicher Preis, nur Region
    "SpeedCurve EU":                (  2_400,   24_000),
    "Calibre":                      (  3_600,   30_000),
    "Bunny CDN":                    (  1_200,   12_000),
    "Cloudflare EU-Only":           (  6_000,   80_000),
    "IONOS CDN":                    (  3_000,   30_000),
    "IONOS Cloud":                  ( 30_000,  600_000),
    "OVHcloud":                     ( 30_000,  600_000),
    "Hetzner Cloud":                (  6_000,  120_000),
    "STACKIT":                      ( 50_000,  800_000),
    "SAP Customer Experience":      ( 80_000, 1_200_000),
    "weclapp":                      ( 12_000,   80_000),
    "CleverReach":                  (  2_400,   24_000),
    "Brevo (Sendinblue)":           (    600,   24_000),
    "Inxmail":                      (  8_000,   60_000),
    "Smart AdServer (Equativ)":     ( 30_000,  300_000),
    "Bing Ads (Microsoft Advertising EU)": ( 30_000, 3_000_000),
    "HERE Maps":                    (  1_200,   24_000),
    "OpenStreetMap (self-host)":    (      0,    6_000),  # nur Server-Kosten
    "Maptiler Cloud EU":            (    600,   12_000),
    "Friendly Captcha":             (    600,    9_600),
    "Turnstile (Cloudflare EU-Only)": (    0,    6_000),
    "LamaPoll":                     (  1_200,   24_000),
    "evasys":                       (  6_000,   60_000),
    "Xing Insights":                (  6_000,   60_000),
    "Plista":                       ( 20_000,  150_000),
    "Userlike":                     (  1_200,   30_000),
    "LiveZilla / EasyChat EU":      (    600,   12_000),
    "Leadinfo":                     (  1_200,   12_000),
    "Albacross EU":                 (  3_600,   24_000),
    "Vimeo Pro EU":                 (    900,    6_000),
    "Self-hosted video (BunnyStream)": (   600,   12_000),
    "Pinterest EU + Owned-Channels": (   600,   24_000),
 }
 # ─── Known reasons for duplicates (cases where consolidation should NOT be recommended) ─
 _DUPLICATION_CAVEATS = {
    "web_analytics": [
        "A/B-Vergleich verschiedener Anbieter waehrend Migration",
        "Marketing nutzt Adobe, Produkt nutzt Matomo — Inhouse-Politik",
        "Regional split (Adobe fuer DE, GA fuer International)",
    ],
    "advertising": [
        "Brand-Kampagne vs Performance-Kampagne (verschiedene DSPs)",
        "Saisonal: Black Friday/Super Bowl nutzt mehr Kanaele",
        "Markenspezifisch: BMW M-Modelle anders targetet als 1er-Serie",
    ],
    "cdn": [
        "Multi-CDN-Strategie fuer Ausfallsicherheit (Akamai + Cloudflare)",
        "Event-CDN-Spike (Auto-Show, Modell-Launch) braucht Skalierung",
        "Regionale Latenz-Optimierung (Akamai APAC, AWS US)",
    ],
    "marketing_automation": [
        "Salesforce Marketing Cloud fuer B2C, Adobe Campaign fuer B2B",
        "Lead-Generierung (Adobe) vs Loyalitaet (Salesforce)",
    ],
    "monitoring": [
        "APM (Dynatrace) misst Backend, RUM (SpeedCurve) misst Frontend",
    ],
    "captcha": [
        "Stufenweise Migration zu cookieless Captcha",
    ],
 }
 def _company_tier_bounds(company_tier: str | None) -> tuple[float, float]:
    """Return the fraction of the list-price range to use for a company tier.
    For 'enterprise' / 'premier' we use the upper part of the range
    (0.40-0.85 and 0.70-1.00 respectively) instead of the full
    starter→premier span. A missing tier defaults to 'professional';
    unknown tiers fall back to the starter band.
    """
    t = (company_tier or "professional").lower()
    bounds = {
        "premier":      (0.70, 1.00),
        "enterprise":   (0.40, 0.85),
        "professional": (0.20, 0.60),
    }
    return bounds.get(t, (0.05, 0.40))  # 'sme' / starter
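# Worked example (illustrative sketch, not part of the original module):
# how a tier band narrows a list-price range via lo + (hi - lo) * frac.
# The helper name `narrow_price_range` is hypothetical; the numbers reuse the
# "adobe analytics" list-price entry and the 'enterprise' band (0.40, 0.85).
def narrow_price_range(lo, hi, low_frac, high_frac):
    """Return the (low, high) sub-range selected by the two fractions."""
    span = hi - lo
    return int(lo + span * low_frac), int(lo + span * high_frac)

# 120_000 + 480_000 * 0.40 = 312_000; 120_000 + 480_000 * 0.85 = 528_000
_EXAMPLE_ENTERPRISE_RANGE = narrow_price_range(120_000, 600_000, 0.40, 0.85)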
 def _estimate_savings_for_redundancy(
    redundancy: dict, vendors: Iterable[dict],
    company_tier: str = "enterprise",
 ) -> dict:
    """Estimate range per redundancy: current costs + EU consolidation saving.
    Takes company_tier into account: for a corporation like BMW we do not
    want to show the starter range. The realistic range is
    lo + (hi - lo) * frac for each of the two tier-bound fractions.
    """
    low_frac, high_frac = _company_tier_bounds(company_tier)
    current_low = current_high = 0
    matched_vendors = []
    cat_vendors = [v for v in vendors if v.get("name") in redundancy.get("vendors", [])]
    for v in cat_vendors:
        name = (v.get("name") or "").lower()
        for k, (lo, hi, _tier) in _COST_LOOKUP.items():
            if k in name:
                # Tier-aware: take low_frac..high_frac of the pricing range
                span = hi - lo
                current_low  += int(lo + span * low_frac)
                current_high += int(lo + span * high_frac)
                matched_vendors.append(v.get("name"))
                break
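    # Consolidation: a single EU tool replaces all vendors in the category
    suggested_eu = None
    suggested_low = suggested_high = 0
    # 1. Multi-function tool that covers this category
    for tool in _MULTI_FUNCTION_TOOLS:
        if redundancy["category"] in tool["covers"]:
            suggested_eu = tool["name"]
            # Some tool names carry suffixes ("... Suite", "... (Compute + ...)")
            # that are not keys in _EU_ALT_COSTS; fall back to a prefix match
            # so the suggested cost is not silently left at 0.
            cost = _EU_ALT_COSTS.get(tool["name"]) or next(
                (c for k, c in _EU_ALT_COSTS.items()
                 if tool["name"].startswith(k)), None)
            if cost:
                suggested_low, suggested_high = cost
            break
    # 2. Otherwise: EU alternative from the lookup, BUT ONLY FOR VENDORS
    #    FROM THE CURRENT CATEGORY (otherwise Userlike shows up for advertising)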
    if not suggested_eu:
        for v in cat_vendors:
            n = (v.get("name") or "").lower()
            for k, alts in _EU_ALTERNATIVES.items():
                if k in n and alts:
                    suggested_eu = alts[0]["name"]
                    cost = _EU_ALT_COSTS.get(alts[0]["name"])
                    if cost:
                        suggested_low, suggested_high = cost
                    break
            if suggested_eu:
                break
    saving_low  = max(0, current_low  - suggested_high)
    saving_high = max(0, current_high - suggested_low)
    return {
        "current_estimate_year_eur": [current_low, current_high],
        "suggested_eu_tool": suggested_eu,
        "suggested_estimate_year_eur": [suggested_low, suggested_high],
        "estimated_saving_year_eur": [saving_low, saving_high],
        "caveats": _DUPLICATION_CAVEATS.get(redundancy["category"], []),
        "cost_disclaimer": (
            "Schaetzbereich auf Basis oeffentlicher Listenpreise. Tatsaechliche "
            "Vertragspreise koennen 30-70% niedriger liegen (Volumen, Bundling, "
            "Konzern-Konditionen). Bitte mit der jeweiligen Einkaufsabteilung verifizieren."
        ),
    }
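# Illustrative sketch (not part of the original module): the saving range is
# an interval difference. The pessimistic bound pairs the lowest current cost
# with the most expensive replacement, the optimistic bound does the reverse,
# and both are clamped at 0. Function name and sample figures are hypothetical.
def saving_interval(current, suggested):
    """Return (saving_low, saving_high) for two (low, high) cost ranges."""
    cur_lo, cur_hi = current
    sug_lo, sug_hi = suggested
    return max(0, cur_lo - sug_hi), max(0, cur_hi - sug_lo)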
 # ─── Multi-function tools (consolidation anchor points) ────────────
 _MULTI_FUNCTION_TOOLS = [
    {
        "name": "Matomo (Pro / Cloud EU)",
        "vendor": "InnoCraft",
        "country": "DE-self-host / EU",
        "covers": ["web_analytics", "tag_management", "personalisation"],
        "notes": "Ersetzt Adobe Analytics + GTM + Adobe Target in einem Tool. "
                 "100% DSGVO ohne Einwilligung wenn IP anonymisiert.",
    },
    {
        "name": "SAP Customer Experience Suite",
        "vendor": "SAP SE",
        "country": "DE",
        "covers": ["crm", "marketing_automation", "personalisation", "survey"],
        "notes": "Ersetzt Salesforce + Adobe Campaign + Qualtrics. EU-Hosting, "
                 "tiefe ERP-Integration.",
    },
    {
        "name": "IONOS Cloud (Compute + CDN + Storage + DNS)",
        "vendor": "IONOS SE",
        "country": "DE",
        "covers": ["cloud_infra", "cdn", "monitoring"],
        "notes": "Ersetzt AWS + Akamai + zusaetzliches Monitoring in einer "
                 "DE-Cloud (BSI C5).",
    },
    {
        "name": "Userlike Suite",
        "vendor": "Userlike UG",
        "country": "DE",
        "covers": ["chat", "consent_management"],
        "notes": "Ersetzt Genesys Chat. Bietet eigenes Consent-Modul.",
    },
    {
        "name": "Smart AdServer (Equativ)",
        "vendor": "Equativ",
        "country": "FR",
        "covers": ["advertising"],
        "notes": "Ersetzt Mehrfach-DSPs (Adform/Criteo/Outbrain/Taboola/Meta) "
                 "durch Programmatic+Direct-Sold EU-Stack.",
    },
    {
        "name": "HERE Maps",
        "vendor": "HERE Technologies",
        "country": "DE",
        "covers": ["maps"],
        "notes": "Berliner Anbieter, professionelle Karten + Routing.",
    },
    {
        "name": "Vimeo Pro EU (oder self-hosted BunnyStream)",
        "vendor": "Vimeo / BunnyWay",
        "country": "Multi / SI",
        "covers": ["external_media"],
        "notes": "Ersetzt YouTube-Embeds + JW Player in einem Player.",
    },
    {
        "name": "LamaPoll",
        "vendor": "Lamano GmbH",
        "country": "DE",
        "covers": ["survey"],
        "notes": "DSGVO-Surveys aus Berlin. Ersetzt Qualtrics / Psyma.",
    },
 ]
 # ─── Analysis ────────────────────────────────────────────────────────
 def analyze(vendors: Iterable[dict], company_tier: str = "enterprise") -> dict:
    """Main entry. Returns categorised view + redundancies + EU options.
    `company_tier` (starter|professional|enterprise|premier) controls the
    cost range so that, e.g., a DAX corporation does not get starter prices
    as its lower bound.
    """
    # Materialise once: `vendors` may be a one-shot iterator and is
    # traversed several times below.
    all_vendors_list = list(vendors)
    by_cat: dict[str, list[dict]] = defaultdict(list)
    for v in all_vendors_list:
        cat = classify_vendor(v.get("name", ""))
        by_cat[cat].append(v)
    # Redundancies: any category with ≥2 vendors (excl. site-internal cats)
    skip_redundancy_cats = {"site_infra", "site_feature", "consent_management",
                            "auth", "other"}
    redundancies: list[dict] = []
    for cat, vs in by_cat.items():
        if cat in skip_redundancy_cats or len(vs) < 2:
            continue
        red = {
            "category": cat,
            "category_label": _CATEGORY_LABEL.get(cat, cat),
            "count": len(vs),
            "vendors": [v.get("name", "") for v in vs],
            "consolidation_hint": _CONSOLIDATION_HINT.get(cat, ""),
        }
        red.update(_estimate_savings_for_redundancy(
            red, all_vendors_list, company_tier))
        redundancies.append(red)
    redundancies.sort(key=lambda r: -(r.get("estimated_saving_year_eur") or [0, 0])[1])
    # EU alternatives lookup
    eu_alternatives: list[dict] = []
    seen = set()
    for v in all_vendors_list:  # not `vendors`: a generator would already be exhausted
        name = v.get("name") or ""
        n_lower = name.lower()
        for k, alts in _EU_ALTERNATIVES.items():
            if k in n_lower and k not in seen:
                eu_alternatives.append({
                    "current_vendor": name,
                    "current_recipient_type": v.get("recipient_type", ""),
                    "matched_key": k,
                    "alternatives": alts,
                })
                seen.add(k)
                break
    # Multi-function tool recommendations: only if the customer has vendors
    # across the categories the tool covers
    present_cats = set(by_cat.keys())
    multi_function = []
    for tool in _MULTI_FUNCTION_TOOLS:
        covered_here = [c for c in tool["covers"] if c in present_cats]
        if len(covered_here) >= 2:
            # Collect vendor names instead of just summing counts, so duplicates are removed
            unique_vendors: set[str] = set()
            for c in covered_here:
                for v in by_cat[c]:
                    unique_vendors.add(v.get("name", ""))
            multi_function.append({
                **tool,
                "replaces_categories": covered_here,
                "potential_replacements": len(unique_vendors),
            })
    multi_function.sort(key=lambda t: -t["potential_replacements"])
    total_current_low = sum((r.get("current_estimate_year_eur") or [0, 0])[0] for r in redundancies)
    total_current_high = sum((r.get("current_estimate_year_eur") or [0, 0])[1] for r in redundancies)
    total_saving_low = sum((r.get("estimated_saving_year_eur") or [0, 0])[0] for r in redundancies)
    total_saving_high = sum((r.get("estimated_saving_year_eur") or [0, 0])[1] for r in redundancies)
    return {
        "summary": {
            "total_vendors": len(all_vendors_list),
            "distinct_categories": len([c for c in by_cat if c != "other"]),
            "redundancy_count": len(redundancies),
            "eu_alternative_count": len(eu_alternatives),
            "consolidation_potential": sum(r["count"] - 1 for r in redundancies),
            "estimated_current_year_eur": [total_current_low, total_current_high],
            "estimated_saving_year_eur": [total_saving_low, total_saving_high],
            "estimated_saving_pct": (
                # Both bounds use the same denominator (midpoint of the
                # current estimate); otherwise the upper bound explodes
                # when current_low is small. Capped at 95%.
                (lambda mid: (
                    f"{min(95, int(100 * total_saving_low / mid))}–"
                    f"{min(95, int(100 * total_saving_high / mid))}%"
                ))((total_current_low + total_current_high) / 2)
                if total_current_high else "n/a"
            ),
            "cost_disclaimer": (
                "Schaetzbereich auf Basis oeffentlicher Listenpreise (Gartner, Forrester 2025). "
                "Vertragspreise koennen 30-70% niedriger liegen (Volumen-Rabatte, Konzern-Konditionen, "
                "Bundling). Werte dienen als Diskussionsgrundlage mit dem Einkauf, NICHT als Angebot."
            ),
        },
        "by_category": {cat: [v.get("name", "") for v in vs]
                        for cat, vs in by_cat.items()},
        "redundancies": redundancies,
        "eu_alternatives": eu_alternatives,
        "multi_function_tools": multi_function,
    }
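# Illustrative sketch (not part of the original module): the percentage label
# from the summary, pulled out of the inline lambda for readability. Both
# bounds are divided by the midpoint of the current estimate and capped at 95.
# The function name is hypothetical, and an ASCII hyphen is used as separator
# here for simplicity.
def saving_pct_label(current, saving):
    """Format a saving range as a percentage of the current-cost midpoint."""
    cur_lo, cur_hi = current
    if not cur_hi:
        return "n/a"
    mid = (cur_lo + cur_hi) / 2
    lo_pct = min(95, int(100 * saving[0] / mid))
    hi_pct = min(95, int(100 * saving[1] / mid))
    return f"{lo_pct}-{hi_pct}%"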
 _CATEGORY_LABEL = {
    "web_analytics":       "Web-Analytics",
    "advertising":         "Werbung / Retargeting",
    "tag_management":      "Tag-Management",
    "marketing_automation": "Marketing-Automation",
    "personalisation":     "Personalisierung",
    "external_media":      "Externe Medien (Video)",
    "maps":                "Karten / Geo",
    "cdn":                 "CDN",
    "cloud_infra":         "Cloud-Infrastruktur",
    "monitoring":          "Performance-Monitoring",
    "crm":                 "CRM",
    "chat":                "Chat / Support",
    "captcha":             "Bot-Schutz",
    "lead_tracking":       "Lead-Tracking",
    "survey":              "Umfragen",
    "social_aggregator":   "Social-Media-Aggregation",
    "consent_management":  "Consent-Management",
    "auth":                "Authentifizierung",
    "site_infra":          "Eigene Infrastruktur",
    "site_feature":        "Eigene Features",
    "other":               "Sonstige",
 }
 _CONSOLIDATION_HINT = {
    "web_analytics":       "Mehrere Analytics-Tools sammeln meist redundante Daten. Ein Tool genuegt — Matomo (DE) ist DSGVO-Standard.",
    "advertising":         "Werbe-/Retargeting-Pixel sind oft austauschbar. Konzentration auf 2-3 Kanaele senkt Drittland-Risiko.",
    "external_media":      "Mehrere Video-Embeds nur wenn fachlich noetig. Self-hosted (BunnyStream/Vimeo) reduziert Tracking.",
    "maps":                "Eine Karten-Loesung reicht. HERE Maps (DE) als EU-Alternative zu Google Maps.",
    "cdn":                 "Ein CDN+Performance-Stack genuegt. IONOS oder Bunny vereinen mehrere Funktionen.",
    "marketing_automation": "Marketing-Cloud + separates E-Mail-Tool sind oft Dopplung — SAP CX oder CleverReach allein moeglich.",
    "chat":                "Ein Chat-System genuegt. Userlike (DE) ersetzt Genesys-Stack.",
    "monitoring":          "RUM + APM koennen in einem Tool gebuendelt werden (Dynatrace EU oder Sentry-Self-host).",
    "survey":              "Eine Survey-Plattform genuegt — LamaPoll (DE) oder Mapp.",
 }
@@ -0,0 +1,229 @@
 """
 LLM audit: checks for every text MC whether it really fits the doc_type
 it is assigned to. The BMW run showed, for example, that most
 impressum MCs are actually DSA/e-commerce obligations that do not
 belong in a §5 TMG Impressum at all.
 Output:
 - doc_type fits → MC stays active (no DB update)
 - doc_type does NOT fit → check_type is set to 'misclassified';
  rag_document_checker then filters these out
 Sidecar SQLite. NO Postgres changes (schema frozen).
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sqlite3
 import sys
 import time
 from datetime import datetime, timezone
 import httpx
 import psycopg2
 from psycopg2.extras import RealDictCursor
 ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
 MODEL = "claude-sonnet-4-6"
 SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
 BATCH_SIZE = 25
 SLEEP_BETWEEN_BATCHES = 0.5
 DOC_TYPE_DESCRIPTIONS = {
    "agb": "Allgemeine Geschaeftsbedingungen — Vertragsbedingungen "
           "zwischen Anbieter und Kunde",
    "avv": "Auftragsverarbeitungsvertrag Art. 28 DSGVO — Vertrag zwischen "
           "Verantwortlichem und Auftragsverarbeiter",
    "cookie": "Cookie-Richtlinie — Information ueber Zwecke, Anbieter, "
              "Speicherdauer und Widerruf von Cookies/Tracking-Pixeln",
    "dse": "Datenschutzerklaerung Art. 13 DSGVO — Informationen ueber "
           "Verantwortlichen, Zwecke, Rechtsgrundlagen, Empfaenger, "
           "Speicherdauer, Betroffenenrechte, Aufsichtsbehoerde",
    "dsfa": "Datenschutz-Folgenabschaetzung Art. 35 DSGVO — Risikobewertung "
            "von Verarbeitungen mit hohem Risiko",
    "impressum": "Impressum §5 TMG / §18 MStV — Anbieterkennzeichnung mit "
                 "Name, Anschrift, Vertretungsberechtigten, Handelsregister, "
                 "USt-IdNr., berufsrechtliche Angaben, Aufsicht",
    "loeschkonzept": "Loeschkonzept DIN 66398 — Definition Aufbewahrungs- "
                     "und Loeschfristen pro Datenkategorie + Prozess",
    "widerruf": "Widerrufsbelehrung §312 BGB — Verbraucher-Widerrufsrecht "
                "bei Fernabsatz, Frist, Folgen, Muster",
 }
 SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung.
 Fuer jeden MC bekommst du:
 - den ihm zugeordneten doc_type (z.B. 'impressum', 'cookie', 'dse')
 - den Titel und die check_question
 Deine Aufgabe: entscheide ob der MC fachlich wirklich zu diesem doc_type passt.
 Beispiele:
 - MC "Wird die USt-IdNr im Impressum genannt?" + doc_type=impressum → PASST
 - MC "Monatlich aktive Endnutzer fuer Online-Vermittlungsdienste messen" + doc_type=impressum → PASST NICHT
  (DSA-Pflicht, gehoert nicht in ein klassisches §5-TMG-Impressum)
 - MC "Mithoeren von Funknachrichten verhindern" + doc_type=cookie → PASST NICHT
  (TKG-Spezialthema, nicht Cookie-Richtlinie)
 Antworte als JSON-Array, eine Zeile pro MC:
 [{"control_id": "<wie input>", "doc_type": "<wie input>", "fits": true|false,
  "rationale": "ein kurzer satz"}, ...]
 Kein Markdown."""
 def fetch_pairs_to_audit(conn) -> list[dict]:
    """All text MCs that haven't been audited yet (fits_doc_type IS NULL)."""
    with sqlite3.connect(SIDECAR_DB) as side:
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        if "fits_doc_type" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN fits_doc_type INTEGER")
            side.commit()
        already = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE fits_doc_type IS NOT NULL"
        ):
            already.add((cid, dt or ""))
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc""")
        all_rows = list(c.fetchall())
    # Audit only those classified as 'text' in sidecar — process/review
    # never run through doc_check anyway
    with sqlite3.connect(SIDECAR_DB) as side:
        text_pairs = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE check_type = 'text'"
        ):
            text_pairs.add((cid, dt or ""))
    target = [r for r in all_rows
              if (r["control_id"], r["doc_type"] or "") in text_pairs
              and (r["control_id"], r["doc_type"] or "") not in already]
    return target
 def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": (
                "Doc-Typen-Beschreibungen:\n"
                + "\n".join(f"- {k}: {v}" for k, v in DOC_TYPE_DESCRIPTIONS.items())
                + "\n\nPruefe folgende MCs:\n\n"
                + json.dumps([
                    {"control_id": m["control_id"], "doc_type": m["doc_type"],
                     "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
                    for m in batch
                ], ensure_ascii=False, indent=2)
            ),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)
 def store_audit(rows: list[dict]) -> None:
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET fits_doc_type = ?, "
            "rationale = COALESCE(?, rationale), classified_at = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("fits") else 0,
                    (r.get("rationale") or "")[:500] or None,
                    ts,
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        c.commit()
 def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    pairs = fetch_pairs_to_audit(conn)
    if args.sample:
        for m in pairs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal pairs to audit: {len(pairs)}")
        return
    print(f"Audit {len(pairs)} text-MCs in Batches von {BATCH_SIZE}", flush=True)
    if not pairs:
        print("Alles auditiert.")
        return
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, len(pairs), BATCH_SIZE):
        batch = pairs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_audit(out)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (len(pairs) - done) / max(rate, 0.01)
            print(f"  [{done:>4}/{len(pairs)}] {rate:.1f} MC/s  ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            failed_batches += 1
            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print("Zu viele Fehler — abbrechen.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    print(f"\nFertig: {done}/{len(pairs)} auditiert ({failed_batches} Fehlbatches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, "
            "  SUM(CASE WHEN fits_doc_type = 1 THEN 1 ELSE 0 END) AS fits, "
            "  SUM(CASE WHEN fits_doc_type = 0 THEN 1 ELSE 0 END) AS misfits, "
            "  COUNT(*) AS total "
            "FROM mc_classification "
            "WHERE check_type = 'text' AND fits_doc_type IS NOT NULL "
            "GROUP BY doc_type ORDER BY doc_type"
        ).fetchall()
        print("\n=== Audit-Verteilung doc_type x fits ===")
        for r in rows:
            print(f"  {r['doc_type']:<14}  fits={r['fits']:<4} misfits={r['misfits']:<4} total={r['total']}")
 if __name__ == "__main__":
    main()
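
Review note: `call_claude` strips code fences with `txt.split("\n", 1)[1]`, which raises an `IndexError` when the reply is a fenced block on a single line, and it does not tolerate prose around the array. A minimal sketch of a more forgiving parser follows; the helper name `parse_json_array` is hypothetical and not part of this commit.

```python
import json


def parse_json_array(reply: str) -> list[dict]:
    """Extract a JSON array of objects from a model reply (hypothetical helper)."""
    txt = reply.strip()
    if txt.startswith("```"):
        # Drop the opening fence line (with optional language tag) if there is one.
        first_newline = txt.find("\n")
        txt = txt[first_newline + 1:] if first_newline != -1 else txt[3:]
        # Drop the closing fence, if present.
        if txt.rstrip().endswith("```"):
            txt = txt.rstrip()[:-3]
    # Use the outermost brackets so stray text around the array is ignored.
    start, end = txt.find("["), txt.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    data = json.loads(txt[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    # Keep only object rows; anything else cannot be stored anyway.
    return [row for row in data if isinstance(row, dict)]
```

Under this assumption, `call_claude` could end with `return parse_json_array(txt)` in all four scripts instead of repeating the fence-stripping lines.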
@@ -0,0 +1,216 @@
 """
 A3-v2: sharpen which MCs actually target the CMP banner UI or a process
 rather than the document TEXT.
 The BMW run showed ~10 'clear/simple language' and 'button text wording' MCs
 that the first audit marked as fitting their doc_type. However, they cannot
 be checked against the cookie-policy or privacy-policy (DSE) text: they ask
 about the comprehensibility of the consent UI.
 Sets 'scope_requires' on FRT/biometric MCs and re-classifies these UI items
 to check_type='process' (which filters them out of doc_check).
 Also sets scope_requires for obviously conditional MCs:
  - 'biometric_processing' for FRT / facial recognition
  - 'ai_decision_making' for automated individual decisions
  - 'child_targeting' for child-consent MCs
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sqlite3
 import sys
 import time
 import httpx
 import psycopg2
 from psycopg2.extras import RealDictCursor
 ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
 MODEL = "claude-sonnet-4-6"
 SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
 BATCH_SIZE = 20
 SYSTEM_PROMPT = """Du pruefst Master-Controls (MCs) der Compliance-Pruefung
 zweiter Ordnung: jeder MC wurde bereits als 'text' klassifiziert und einem
 doc_type zugeordnet. Du entscheidest:
 A) Prueft der MC den TEXT des Dokuments (z.B. "Wird im Impressum die
   USt-IdNr genannt?") → fits_doc_type=true, ui_only=false
 B) Prueft der MC die UI/UX eines Banners oder Buttons (z.B.
   "Ist der Ablehnungs-Button gleich gross wie Akzeptieren?", "Ist die
   Einwilligungsanfrage in einfacher Sprache formuliert?") → ui_only=true
   (Diese MCs koennen NIE im Dokumenttext stehen, weil sie sich auf eine
   externe UI beziehen.)
 Zusaetzlich kennzeichne SCOPE-Bedingungen, wenn der MC nur fuer bestimmte
 Sites relevant ist:
  - 'biometric_processing' : nur bei Sites die biometrische Daten
    (Gesichtserkennung/FRT, Fingerabdruecke) verarbeiten
  - 'ai_decision_making'   : nur bei automatisierten Einzelentscheidungen
    (Art. 22 DSGVO)
  - 'child_targeting'      : nur bei Sites die sich an Kinder richten
  - 'ecommerce'            : nur bei Webshops
  - 'b2c'                  : nur fuer Verbraucher-Geschaeft (UWG/BGB-Pflichten)
 Wenn der MC fuer ALLE Sites gilt, setze scope_requires=null.
 Antworte als JSON-Array — keine Erklaerung davor/danach, kein Markdown.
 Format:
 [{"control_id": "<wie input>", "doc_type": "<wie input>",
  "ui_only": true|false, "scope_requires": "biometric_processing|ai_decision_making|child_targeting|ecommerce|b2c|null",
  "rationale": "ein kurzer satz"}, ...]"""
 def fetch_pairs_to_audit(conn) -> list[dict]:
    """All text-MCs that haven't been v2-audited yet (ui_only IS NULL)."""
    with sqlite3.connect(SIDECAR_DB) as side:
        cols = [r[1] for r in side.execute("PRAGMA table_info(mc_classification)")]
        added = False
        if "ui_only" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN ui_only INTEGER")
            added = True
        if "scope_requires" not in cols:
            side.execute("ALTER TABLE mc_classification ADD COLUMN scope_requires TEXT")
            added = True
        if added:
            side.commit()
        already = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE ui_only IS NOT NULL"
        ):
            already.add((cid, dt or ""))
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT dc.control_id, dc.doc_type, dc.title, dc.check_question
                     FROM compliance.doc_check_controls dc""")
        all_rows = list(c.fetchall())
    # Audit only those already classified as text+fits in sidecar
    with sqlite3.connect(SIDECAR_DB) as side:
        eligible = set()
        for cid, dt in side.execute(
            "SELECT control_id, doc_type FROM mc_classification "
            "WHERE check_type = 'text' AND (fits_doc_type IS NULL OR fits_doc_type = 1)"
        ):
            eligible.add((cid, dt or ""))
    target = [r for r in all_rows
              if (r["control_id"], r["doc_type"] or "") in eligible
              and (r["control_id"], r["doc_type"] or "") not in already]
    return target
 def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Pruefe folgende MCs:\n\n" + json.dumps([
                {"control_id": m["control_id"], "doc_type": m["doc_type"],
                 "title": m["title"], "check_question": (m["check_question"] or "")[:300]}
                for m in batch
            ], ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)
 def store(rows: list[dict]) -> None:
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "UPDATE mc_classification SET ui_only = ?, scope_requires = ? "
            "WHERE control_id = ? AND doc_type = ?",
            [
                (
                    1 if r.get("ui_only") else 0,
                    # Missing values and the literal string "null" both mean "no scope".
                    None if str(r.get("scope_requires") or "").strip().lower() in ("", "null")
                    else str(r.get("scope_requires")).strip(),
                    r.get("control_id"),
                    r.get("doc_type") or "",
                )
                for r in rows
            ],
        )
        # MCs flagged ui_only become check_type='process' so they're not in doc_check
        c.executemany(
            "UPDATE mc_classification SET check_type='process' "
            "WHERE ui_only=1 AND control_id=? AND doc_type=?",
            [(r.get("control_id"), r.get("doc_type") or "") for r in rows
             if r.get("ui_only")],
        )
        c.commit()
 def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    pairs = fetch_pairs_to_audit(conn)
    if args.sample:
        for m in pairs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal: {len(pairs)}")
        return
    print(f"Audit-v2: {len(pairs)} MCs in Batches von {BATCH_SIZE}", flush=True)
    if not pairs:
        print("Alles geprueft.")
        return
    done = 0
    fail = 0
    t0 = time.time()
    for i in range(0, len(pairs), BATCH_SIZE):
        batch = pairs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store(out)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (len(pairs) - done) / max(rate, 0.01)
            print(f"  [{done:>4}/{len(pairs)}] {rate:.1f} MC/s  ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            fail += 1
            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", file=sys.stderr, flush=True)
            if fail >= 5:
                break
        time.sleep(0.5)
    print(f"\nFertig: {done}/{len(pairs)} ({fail} Fehlbatches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        ui = c.execute("SELECT COUNT(*) FROM mc_classification WHERE ui_only=1").fetchone()[0]
        scope = c.execute(
            "SELECT scope_requires, COUNT(*) FROM mc_classification "
            "WHERE scope_requires IS NOT NULL GROUP BY scope_requires"
        ).fetchall()
        print(f"\nui_only=1: {ui} MCs (jetzt 'process')")
        print("scope_requires Verteilung:")
        for s, n in scope:
            print(f"  {s}: {n}")
 if __name__ == "__main__":
    main()
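
Review note: `store()` writes whatever string the model returns into `scope_requires`, so a reply such as `"biometric_processing|b2c"` would end up in the sidecar as-is. A minimal sketch of a stricter normaliser follows, assuming the five scope values listed in the system prompt are the only valid ones; `ALLOWED_SCOPES` and `normalize_scope` are hypothetical names, and lower-casing the value is an added assumption.

```python
from __future__ import annotations

# Scope values taken from the system prompt of this script.
ALLOWED_SCOPES = {
    "biometric_processing",
    "ai_decision_making",
    "child_targeting",
    "ecommerce",
    "b2c",
}


def normalize_scope(value: object) -> str | None:
    """Return a known scope value, or None for anything else (hypothetical helper)."""
    if not isinstance(value, str):
        # JSON null, numbers and lists all mean "no usable scope".
        return None
    cleaned = value.strip().lower()
    # The literal string "null" and unknown or combined values are rejected.
    return cleaned if cleaned in ALLOWED_SCOPES else None
```

Rejecting unknown values, rather than storing them, keeps the `scope_requires` filter against `business_profile` restricted to a closed vocabulary.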
@@ -0,0 +1,222 @@
 """
 Classify doc_check_controls (1874 MCs) into check_type:
  - text    : MC asks about the DOCUMENT'S WORDING ("enthaelt die DSE..", "wird im Impressum genannt..")
  - process : MC asks about a TECHNICAL OR ORGANISATIONAL CONTROL ("ist sichergestellt..", "ist implementiert..")
  - review  : MC asks an ABSTRACT META-QUESTION that needs human judgment ("sind alle Verarbeitungen vollstaendig?")
 Output goes to sidecar SQLite at /data/mc_classification.db (no Postgres schema change
 per CLAUDE.md guardrails). Schema:
  CREATE TABLE mc_classification (
    control_id TEXT PRIMARY KEY,
    doc_type   TEXT,
    title      TEXT,
    check_type TEXT,    -- text|process|review
    confidence REAL,    -- 0..1
    rationale  TEXT,
    classified_at TEXT
  );
 Run from inside bp-compliance-backend container:
  docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type.py [--limit N] [--sample]
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sqlite3
 import sys
 import time
 from datetime import datetime, timezone
 from pathlib import Path
 import httpx
 import psycopg2
 from psycopg2.extras import RealDictCursor
 ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
 MODEL = "claude-sonnet-4-6"
 SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
 BATCH_SIZE = 25
 SLEEP_BETWEEN_BATCHES = 0.5   # sec — keep gentle for the parallel Haiku batch
 SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
 TEXT  — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments (DSE, Impressum, Cookie-Richtlinie etc.) steht.
        Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?" / "Wird im Impressum die USt-IdNr genannt?"
        Diese MCs koennen gegen den Dokument-Text gematched werden.
 PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME, die ausserhalb des Dokumenttexts liegt.
          Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" /
                    "Ist ein Loeschkonzept implementiert?" / "Wird das Consent-Tool monatlich getestet?"
          Diese MCs lassen sich NICHT aus dem Dokumenttext beantworten — sie brauchen Evidence/TOM-Nachweis.
 REVIEW — Der MC stellt eine abstrakte Vollstaendigkeits-/Konsistenzfrage, die menschliche Beurteilung braucht.
         Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?" / "Stimmen Cookie-Richtlinie und Banner ueberein?"
         Diese MCs sind Checklisten-Items fuer den DSB, kein automatischer Check moeglich.
 Antworte ausschliesslich als JSON-Array — keine Erklaerung davor/danach, kein Markdown-Codeblock. Format:
 [{"control_id": "<wie input>", "check_type": "text|process|review", "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
 def fetch_mcs(conn, limit: int | None = None, only_unclassified: bool = True) -> list[dict]:
    """Fetch MCs from Postgres, optionally skipping those already in the sidecar."""
    # Read classified ids from SQLite and filter in Python: this avoids the
    # Postgres temp table, whose setup errors used to be swallowed silently.
    classified: set[str] = set()
    if only_unclassified:
        with sqlite3.connect(SIDECAR_DB) as side:
            classified = {r[0] for r in side.execute("SELECT control_id FROM mc_classification")}
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls
                     ORDER BY doc_type, title""")
        rows = [r for r in c.fetchall() if r["control_id"] not in classified]
    return rows[:limit] if limit else rows
 def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
                [{"control_id": m["control_id"],
                  "doc_type": m["doc_type"],
                  "title": m["title"],
                  "check_question": (m["check_question"] or "")[:400]}
                 for m in batch],
                ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    # Strip code fences if Sonnet adds them
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)
 def ensure_sidecar() -> None:
    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executescript("""
            CREATE TABLE IF NOT EXISTS mc_classification (
                control_id    TEXT PRIMARY KEY,
                doc_type      TEXT,
                title         TEXT,
                check_type    TEXT,
                confidence    REAL,
                rationale     TEXT,
                classified_at TEXT
            );
            CREATE INDEX IF NOT EXISTS idx_doctype ON mc_classification(doc_type);
            CREATE INDEX IF NOT EXISTS idx_type    ON mc_classification(check_type);
        """)
 def store_results(rows: list[dict], lookup: dict[str, dict]) -> None:
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "INSERT OR REPLACE INTO mc_classification "
            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            [
                (
                    r.get("control_id"),
                    lookup.get(r.get("control_id"), {}).get("doc_type", ""),
                    lookup.get(r.get("control_id"), {}).get("title", ""),
                    (r.get("check_type") or "").lower(),
                    float(r.get("confidence") or 0),
                    (r.get("rationale") or "")[:500],
                    ts,
                )
                for r in rows
            ],
        )
        c.commit()
 def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--limit", type=int, default=None, help="limit number of MCs (for testing)")
    ap.add_argument("--sample", action="store_true", help="just print first 5 MCs and exit")
    ap.add_argument("--all", action="store_true", help="re-classify all (otherwise skips already classified)")
    args = ap.parse_args()
    ensure_sidecar()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    mcs = fetch_mcs(conn, limit=args.limit, only_unclassified=not args.all)
    if args.sample:
        for m in mcs[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        return
    print(f"Klassifiziere {len(mcs)} MCs in Batches von {BATCH_SIZE}", flush=True)
    if not mcs:
        print("Nichts zu tun.")
        return
    lookup = {m["control_id"]: m for m in mcs}
    total = len(mcs)
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, total, BATCH_SIZE):
        batch = mcs[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_results(out, lookup)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (total - done) / max(rate, 0.01)
            print(f"  [{done:>5}/{total}] {rate:.1f} MC/s  ETA {eta/60:.1f}min",
                  flush=True)
        except Exception as e:
            failed_batches += 1
            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print("  Zu viele Fehler — abbrechen.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
    # Summary
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
        ).fetchall()
        print("\n=== Verteilung nach doc_type x check_type ===")
        prev = None
        for r in rows:
            if r["doc_type"] != prev:
                print()
                print(f"[{r['doc_type']}]")
                prev = r["doc_type"]
            print(f"  {r['check_type']:<8} {r['n']}")
 if __name__ == "__main__":
    main()
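
Review note: `store_results` trusts the model output, so a row with an unknown `control_id` or a `check_type` outside text/process/review is written to the sidecar unchanged. A minimal sketch of a pre-store filter follows; `filter_valid_rows` and `VALID_CHECK_TYPES` are hypothetical names, not part of this commit.

```python
from __future__ import annotations

VALID_CHECK_TYPES = {"text", "process", "review"}


def filter_valid_rows(rows: list[dict], batch: list[dict]) -> list[dict]:
    """Keep only rows that refer to the batch and carry a known check_type."""
    expected = {m["control_id"] for m in batch}
    seen: set[str] = set()
    valid: list[dict] = []
    for r in rows:
        cid = r.get("control_id")
        check_type = str(r.get("check_type") or "").lower()
        # Skip ids the model invented, duplicates, and unknown classes.
        if cid not in expected or cid in seen or check_type not in VALID_CHECK_TYPES:
            continue
        seen.add(cid)
        valid.append({**r, "check_type": check_type})
    return valid
```

Called as `store_results(filter_valid_rows(out, batch), lookup)`, dropped rows stay unclassified and are picked up again on the next run, since `fetch_mcs` skips only ids already in the sidecar.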
@@ -0,0 +1,241 @@
 """
 v2: Re-classify only the (control_id, doc_type) PAIRS that aren't covered yet.
 v1 used PK=control_id, so cross-doc-type variants (same control assigned
 to e.g. AGB AND Widerruf with different check_questions) overwrote each
 other. v2 migrates to PK=(control_id, doc_type) and classifies only the
 ~262 missing pairs.
 Run from container:
  docker exec bp-compliance-backend python /app/scripts/classify_mc_check_type_v2.py [--sample]
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sqlite3
 import sys
 import time
 from datetime import datetime, timezone
 from pathlib import Path
 import httpx
 import psycopg2
 from psycopg2.extras import RealDictCursor
 ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
 MODEL = "claude-sonnet-4-6"
 SIDECAR_DB = os.getenv("MC_CLASS_DB", "/data/mc_classification.db")
 BATCH_SIZE = 25
 SLEEP_BETWEEN_BATCHES = 0.5
 SYSTEM_PROMPT = """Du klassifizierst Master-Controls (MCs) der DSGVO/TMG/TDDDG-Compliance-Pruefung in genau drei Klassen:
 TEXT  — Der MC prueft, ob ein bestimmter Inhalt im TEXT eines Dokuments steht.
        Beispiele: "Enthaelt die Datenschutzerklaerung die Aufsichtsbehoerde?"
        Diese MCs koennen gegen den Dokument-Text gematched werden.
 PROCESS — Der MC prueft eine technische oder organisatorische MASSNAHME ausserhalb des Dokumenttexts.
          Beispiele: "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?"
 REVIEW — Der MC stellt eine abstrakte Vollstaendigkeitsfrage, die menschliche Beurteilung braucht.
         Beispiele: "Sind ALLE Verarbeitungszwecke vollstaendig erfasst?"
 Wichtig: derselbe Control kann fuer verschiedene doc_types unterschiedlich klassifiziert sein:
 mit doc_type-spezifisch angepasster check_question kann ein "text"-Check fuer ein Dok zum
 "process"-Check fuer ein anderes werden.
 Antworte ausschliesslich als JSON-Array — kein Markdown. Format:
 [{"control_id": "<input>", "doc_type": "<input>", "check_type": "text|process|review",
  "confidence": 0.9, "rationale": "ein kurzer satz"}, ...]"""
 def migrate_schema() -> None:
    """Migrate sidecar from PK=control_id to PK=(control_id, doc_type)."""
    Path(SIDECAR_DB).parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(SIDECAR_DB) as c:
        # Check if v2 schema already in place (composite PK)
        cols = c.execute("PRAGMA table_info(mc_classification)").fetchall()
        if not cols:
            # First run — create fresh
            c.executescript("""
                CREATE TABLE mc_classification (
                    control_id    TEXT,
                    doc_type      TEXT,
                    title         TEXT,
                    check_type    TEXT,
                    confidence    REAL,
                    rationale     TEXT,
                    classified_at TEXT,
                    PRIMARY KEY (control_id, doc_type)
                );
                CREATE INDEX idx_doctype ON mc_classification(doc_type);
                CREATE INDEX idx_type    ON mc_classification(check_type);
            """)
            return
        # Check whether the existing table already has composite PK
        pk_cols = [r[1] for r in cols if r[5] > 0]
        if set(pk_cols) == {"control_id", "doc_type"}:
            print("Schema already v2 (composite PK). Skipping migration.")
            return
        print("Migrating sidecar schema to PK(control_id, doc_type)...")
        c.executescript("""
            CREATE TABLE mc_classification_v2 (
                control_id    TEXT,
                doc_type      TEXT,
                title         TEXT,
                check_type    TEXT,
                confidence    REAL,
                rationale     TEXT,
                classified_at TEXT,
                PRIMARY KEY (control_id, doc_type)
            );
            INSERT INTO mc_classification_v2
              (control_id, doc_type, title, check_type, confidence, rationale, classified_at)
            SELECT control_id, COALESCE(doc_type, ''), title, check_type, confidence, rationale, classified_at
            FROM mc_classification;
            DROP TABLE mc_classification;
            ALTER TABLE mc_classification_v2 RENAME TO mc_classification;
            CREATE INDEX idx_doctype ON mc_classification(doc_type);
            CREATE INDEX idx_type ON mc_classification(check_type);
        """)
        n = c.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
        print(f"Migrated {n} existing rows.")
 def fetch_unclassified_pairs(conn) -> list[dict]:
    """All (control_id, doc_type) pairs in PG that aren't yet in sidecar."""
    side_pairs: set[tuple[str, str]] = set()
    with sqlite3.connect(SIDECAR_DB) as side:
        for cid, dt in side.execute("SELECT control_id, doc_type FROM mc_classification"):
            side_pairs.add((cid, dt or ""))
    with conn.cursor(cursor_factory=RealDictCursor) as c:
        c.execute("""SELECT control_id, doc_type, title, check_question
                     FROM compliance.doc_check_controls""")
        all_rows = list(c.fetchall())
    missing = [r for r in all_rows if (r["control_id"], r["doc_type"] or "") not in side_pairs]
    return missing
 def call_claude(api_key: str, batch: list[dict]) -> list[dict]:
    payload = {
        "model": MODEL,
        "max_tokens": 4000,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": "Klassifiziere die folgenden MCs:\n\n" + json.dumps(
                [{"control_id": m["control_id"],
                  "doc_type": m["doc_type"],
                  "title": m["title"],
                  "check_question": (m["check_question"] or "")[:400]}
                 for m in batch],
                ensure_ascii=False, indent=2),
        }],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    r = httpx.post(ANTHROPIC_URL, json=payload, headers=headers, timeout=120.0)
    r.raise_for_status()
    txt = r.json()["content"][0]["text"].strip()
    if txt.startswith("```"):
        txt = txt.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        if txt.startswith("json"):
            txt = txt[4:].strip()
    return json.loads(txt)
 def store_results(rows: list[dict], lookup: dict[tuple[str, str], dict]) -> None:
    ts = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(SIDECAR_DB) as c:
        c.executemany(
            "INSERT OR REPLACE INTO mc_classification "
            "(control_id, doc_type, title, check_type, confidence, rationale, classified_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            [
                (
                    r.get("control_id"),
                    r.get("doc_type") or "",
                    lookup.get((r.get("control_id"), r.get("doc_type") or ""), {}).get("title", ""),
                    (r.get("check_type") or "").lower(),
                    float(r.get("confidence") or 0),
                    (r.get("rationale") or "")[:500],
                    ts,
                )
                for r in rows
            ],
        )
        c.commit()
 def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--sample", action="store_true")
    args = ap.parse_args()
    migrate_schema()
    api_key = os.environ["ANTHROPIC_API_KEY"]
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    missing = fetch_unclassified_pairs(conn)
    if args.sample:
        for m in missing[:5]:
            print(json.dumps(m, ensure_ascii=False, indent=2))
        print(f"\nTotal missing pairs: {len(missing)}")
        return
    print(f"Klassifiziere {len(missing)} fehlende (control_id, doc_type)-Paare in Batches von {BATCH_SIZE}", flush=True)
    if not missing:
        print("Alles klassifiziert. Nichts zu tun.")
        return
    lookup = {(m["control_id"], m["doc_type"] or ""): m for m in missing}
    total = len(missing)
    done = 0
    failed_batches = 0
    t0 = time.time()
    for i in range(0, total, BATCH_SIZE):
        batch = missing[i:i + BATCH_SIZE]
        try:
            out = call_claude(api_key, batch)
            store_results(out, lookup)
            done += len(out)
            elapsed = time.time() - t0
            rate = done / max(elapsed, 0.01)
            eta = (total - done) / max(rate, 0.01)
            print(f"  [{done:>4}/{total}] {rate:.1f} MC/s  ETA {eta/60:.1f}min", flush=True)
        except Exception as e:
            failed_batches += 1
            print(f"  FAIL batch {i//BATCH_SIZE + 1}: {e}", flush=True, file=sys.stderr)
            if failed_batches >= 5:
                print("  Zu viele Fehler — abbrechen.", file=sys.stderr)
                break
        time.sleep(SLEEP_BETWEEN_BATCHES)
    print(f"\nFertig: {done}/{total} klassifiziert ({failed_batches} Fehlbatches).")
    with sqlite3.connect(SIDECAR_DB) as c:
        c.row_factory = sqlite3.Row
        rows = c.execute(
            "SELECT doc_type, check_type, COUNT(*) n FROM mc_classification "
            "GROUP BY doc_type, check_type ORDER BY doc_type, check_type"
        ).fetchall()
        print("\n=== Final-Verteilung doc_type x check_type ===")
        prev = None
        for r in rows:
            if r["doc_type"] != prev:
                print(); print(f"[{r['doc_type']}]")
                prev = r["doc_type"]
            print(f"  {r['check_type']:<8} {r['n']}")
 if __name__ == "__main__":
    main()
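
Review note: the docstring says the v1 primary key caused cross-doc-type variants to overwrite each other. The effect can be reproduced in an in-memory SQLite database; this is a reduced sketch with only three columns, and `count_rows_after_inserts` is a hypothetical helper for illustration.

```python
import sqlite3


def count_rows_after_inserts(primary_key: str) -> int:
    """Insert one control under two doc_types and count the surviving rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mc_classification ("
        "control_id TEXT, doc_type TEXT, check_type TEXT, "
        f"PRIMARY KEY ({primary_key}))"
    )
    rows = [("MC-1", "agb", "text"), ("MC-1", "widerruf", "process")]
    # INSERT OR REPLACE deletes the conflicting row before inserting the new one.
    con.executemany("INSERT OR REPLACE INTO mc_classification VALUES (?, ?, ?)", rows)
    n = con.execute("SELECT COUNT(*) FROM mc_classification").fetchone()[0]
    con.close()
    return n
```

With `primary_key="control_id"` the second insert replaces the first, so one row remains; with `primary_key="control_id, doc_type"` both variants are kept.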
@@ -172,6 +172,11 @@ class DSIDiscoveryResult:
    # Schema: [{"kind": str, "url": str, "data": dict}, ...]
    # Backend uses these to build vendor records + run per-vendor checks.
    cmp_payloads: list[dict] = field(default_factory=list)
    # Reconstructed cookie-policy text from all captured CMP payloads
    # (CMP-library reconstruct + heuristic generic). Backend uses this as
    # the authoritative cookie-text so MC checks run on the real policy,
    # not the homepage navigation that DOM extraction returns.
    cmp_cookie_text: str = ""
 def _matches_dsi_keyword(text: str) -> tuple[bool, str]:
    """Check if text contains any DSI keyword. Returns (match, language)."""
@@ -551,8 +556,17 @@ async def discover_dsi_documents(
    result.cmp_payloads = [
        {"kind": kind, "data": data} for kind, data in cmp_capture.payloads
    ]
-    logger.info("DSI discovery complete: %d documents found in %s, %d CMP payloads",
-                result.total_found, result.languages_detected, len(result.cmp_payloads))
+    if cmp_capture.payloads:
+        try:
            result.cmp_cookie_text = cmp_capture.reconstruct_cookie_policy()
        except Exception as e:
            logger.warning("CMP reconstruct on discovery failed: %s", e)
    logger.info(
        "DSI discovery complete: %d documents found in %s, %d CMP payloads, "
        "cmp_cookie_text=%d words",
        result.total_found, result.languages_detected, len(result.cmp_payloads),
        len(result.cmp_cookie_text.split()) if result.cmp_cookie_text else 0,
    )
    return result
 # Nav elements, not real documents
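
Review note: the commit message says the backend prefers `cmp_cookie_text` over the DOM extraction when it is richer (1824 vs 600 words in the BMW run). A minimal sketch of such a choice follows, assuming "richer" means a higher word count; `pick_cookie_text` is a hypothetical helper and the backend's real rule may differ.

```python
def pick_cookie_text(cmp_cookie_text: str, dom_text: str) -> str:
    """Prefer the CMP-reconstructed text when it has more words (assumed rule)."""
    cmp_words = len(cmp_cookie_text.split())
    dom_words = len(dom_text.split())
    # An empty reconstruct has zero words, so the DOM text wins by default.
    return cmp_cookie_text if cmp_words > dom_words else dom_text
```

The word count mirrors the figure already logged by `discover_dsi_documents`, so the decision can be traced from the discovery log line.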