feat(compliance-check): MC classification + embedding + vendor redundancy + action recipes + Borlabs features

Major update based on the BMW test iterations (v1→v9):

Core compliance check
- Sonnet check_type classification: text/process/review for all 1874 MCs
  in compliance.doc_check_controls (script + sidecar /data/mc_classification.db).
  rag_document_checker filters on check_type='text' for doc_check.
  Also a fits_doc_type audit (v2) and a ui_only audit for DSA/e-commerce MCs
  filed under the wrong doc_type.
- scope_requires filter: biometric/ai_decision/child_targeting MCs are
  filtered per business_profile (e.g. FRT is skipped for BMW).
- Embedding match (BGE-M3) as phase 3 after the regex match:
  per-doc_type threshold override (impressum 0.50, dse/cookie 0.60),
  short-field rescue (15-word chunks) for mandatory fields in the Impressum.
  Title + check_question as embedding input for more context.
- Cookie text routing: consent-tester returns cmp_cookie_text from the
  CMP reconstruct; the backend prefers it over DOM extraction
  when it is richer (BMW: 1824 vs 600 words).
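
A minimal sketch of the phase-3 embedding match with a per-doc_type threshold, assuming vectors are plain Python lists. The three threshold values come from the list above; `embedding_match`, `DEFAULT_THRESHOLD` and the toy vectors are hypothetical names for illustration and are not taken from the commit.

```python
import math

# Thresholds quoted in the commit message.
THRESHOLDS = {"impressum": 0.50, "dse": 0.60, "cookie": 0.60}
DEFAULT_THRESHOLD = 0.60  # assumption: fallback for other doc types


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embedding_match(control_vec: list[float],
                    chunk_vecs: list[list[float]],
                    doc_type: str) -> tuple[bool, float]:
    """Return (matched, best_score) for one control against all chunks."""
    threshold = THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    best = max((cosine(control_vec, c) for c in chunk_vecs), default=0.0)
    return best >= threshold, best
```

With a best score of about 0.55, the same chunk would count as a match for an Impressum but not for a privacy policy, which is the effect the lower Impressum threshold is meant to have.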

Vendor redundancy + EU alternatives + cost saving
- vendor_redundancy.analyze(): functional categorisation of the CMP vendors,
  detection of multiple vendors per category, EU-alternative lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: tier inference from the cookie footprint (cookie count
  + premium-feature cookies + third-party share → starter/professional/
  enterprise/premier).
- Self-service advertising (Google/Meta/Pinterest/...) = 0 licence cost
  (media spend only, tracked separately). DSP platforms keep a narrow range.
- Tier-aware saving range: for enterprise/premier we use the
  upper 40-100% band of the list prices, not starter→premier.
- Multi-function tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll): one tool replaces several
  categories at once.

Cookie knowledge DB + functional classification
- cookie_knowledge_db: 50 curated top cookies (Google/Meta/Adobe/MS/...)
  with vendor, exact_purpose, data_collected, IAB TCF IDs, reid_risk,
  schrems_ii_status, CJEU rulings, EU alternative.
- cookie_function_classifier: functional role per cookie (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country inference from legal form
- cookie_link_validator: the country field is derived from the vendor name
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus a vendor lookup table.
  Reduces false-positive no_country flags for unambiguously EU vendors
  (Adform DK, Pinterest IE).
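
The legal-form rule above can be sketched as follows. The four suffix/country pairs and the two vendor examples come from the commit message; the regular expressions, the name `infer_country` and the shape of the lookup table are assumptions made for illustration, not the code in cookie_link_validator.

```python
import re
from typing import Optional

# Legal-form suffix -> ISO country code (pairs quoted in the commit message).
LEGAL_FORM_COUNTRY = [
    (re.compile(r"\bA/S$", re.IGNORECASE), "DK"),
    (re.compile(r"\bGmbH$", re.IGNORECASE), "DE"),
    (re.compile(r"\bInc\.?$", re.IGNORECASE), "US"),
    (re.compile(r"\bB\.V\.?$", re.IGNORECASE), "NL"),
]

# Hypothetical lookup table for vendors whose name carries no legal form.
VENDOR_COUNTRY = {"adform": "DK", "pinterest": "IE"}


def infer_country(vendor_name: str) -> Optional[str]:
    """Guess the vendor's country from its legal form, then from the table."""
    name = vendor_name.strip()
    for pattern, country in LEGAL_FORM_COUNTRY:
        if pattern.search(name):
            return country
    lowered = name.lower()
    for key, country in VENDOR_COUNTRY.items():
        if key in lowered:
            return country
    return None  # caller would keep the no_country flag
```

Checking the legal form first keeps the lookup table small; it only has to cover vendors whose registered name does not reveal the country.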

Action recipes + doc anchor locator
- finding_action_recipes: for each finding type (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) a structured instruction with what/why/fix_text/where/example,
  meant to be pasted 1:1 into customer documents.
- doc_anchor_locator: embedding-based (BGE-M3 cosine); finds the
  matching paragraph in the existing customer document for each finding.
  Per-run thread-local cache. Fallback: keyword match.
- Email rendering integrates recipe + anchor for each failed document check
  + vendor flag list with a collapsible action list.
- Score explanation per vendor row (3/5 subtitle + tooltip).

Migration pipeline (compliance check -> customer banner/documents)
- migration_to_banner.py: vendor list -> CookieBannerConfig with
  4 categories + review flags.
- migration_to_document.py: vendor list -> cookie policy + VVT register
  + privacy policy pre-fills.
- agent_migration_routes: 3 preview endpoints (banner-preview,
  document-preview, summary). cmp_vendors are persisted in the
  check_payloads table of /data/compliance_audits.db.

Borlabs-parity cookie banner features
- Consent history in the banner: window.bpShowConsentHistory() + localStorage.
- Content blocker: cookie-banner-content-blocker.ts, YouTube/Maps/video
  placeholders until consent is given.
- Google Consent Mode v2 extended: wait_for_update + region=EEA/CH/GB.
- Consent log export (CSV/JSON) via einwilligungen_export_routes.

Bug fixes
- canonical_control_routes: _jsonish helper for string-typed jsonb,
  similar-controls endpoint with _has_embedding_col() cache (no more 500).
- Control library frontend: defensive .map coercer in 2 detail views.
- Embedding service batching (batches of 32 instead of 165 in one call).
- KeyError 'control_id' in MC result aggregation (defensive .get).
- Master-controls click-through from /sdk/master-controls to
  /sdk/control-library?control=<id> with URL-param auto-open.
- Dockerfile: /data pre-chowned to appuser (audit DB write permission).
- Cookie text routing bug (cmp_reconstructed > DOM extraction).
- doc_type-aware MC filter (instead of all text MCs).
- Master contract dedup (60 BMW-internal entries = 1 Adobe contract).
- The A3 v2 audit reclassified 24 UI-language MCs as 'process'.
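
The embedding batching fix from the list above could look roughly like this. It is a sketch only: `embed_in_batches` and the `embed_fn` callback are hypothetical names, and the real embedding service client is not shown in this commit page.

```python
from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Call embed_fn on slices of at most batch_size texts and join the results."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        # Each call sends one bounded batch instead of the whole list
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors
```

For 165 inputs this produces five calls of 32 texts and a final call of 5, and the output order matches the input order.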

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Scripts (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW run results, v1 (broken) -> v9 (all fixes):
  DSE       7.5% -> 81-83%
  Impressum 4%   -> 100% (all 6 real MCs fulfilled)
  Cookie    0%   -> 79-83% (CMP text routing + embedding)
  Plus: 10 consolidation categories, estimated saving 200k-3M / year
  Plus: action recipes + doc anchors for every fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-18 18:30:08 +02:00
parent 52fb8b91e7
commit 662327e8b4
31 changed files with 5214 additions and 104 deletions
@@ -33,6 +33,7 @@ _ROUTER_MODULES = [
"vvt_routes",
"legal_document_routes",
"einwilligungen_routes",
"einwilligungen_export_routes",
"escalation_routes",
"consent_template_routes",
"notfallplan_routes",
@@ -159,6 +159,13 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
from .agent_doc_check_routes import CheckItem, DocCheckResult
from .agent_doc_check_report import build_html_report
# Reset anchor-locator cache per run (avoid cross-run leak)
try:
from compliance.services.doc_anchor_locator import reset_cache
reset_cache()
except Exception:
pass
# Step 1: Resolve texts (fetch from URL if needed) — 0-30%
_update(check_id, "Texte werden geladen...", 1)
doc_texts: dict[str, str] = {}
@@ -234,6 +241,20 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
# Filter out doc_types that don't apply to this business profile
skip_types = _get_skip_types(profile)
# Derive business_scope hints for the MC filter (O1 — Doc-type Scope-Flag).
# MCs that explicitly require a feature (e.g. 'biometric_processing',
# 'ai_decision_making', 'child_targeting') get dropped when the
# detected profile doesn't declare it.
business_scope: set[str] = set()
for svc in (getattr(profile, "detected_services", []) or []):
business_scope.add(str(svc).lower())
if (getattr(profile, "business_type", "") or "").lower() == "b2c":
business_scope.add("b2c")
if getattr(profile, "has_online_shop", False):
business_scope.add("ecommerce")
if getattr(profile, "is_regulated_profession", False):
business_scope.add("regulated_profession")
# Document checks: 40-80%
n_entries = max(1, len(doc_entries))
for i, entry in enumerate(doc_entries):
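
The hunk above only builds `business_scope`; the filter that consumes it lives in another module and is not shown here. A minimal sketch of how such a filter might work, assuming each MC is a dict with an optional `scope_requires` list and that every required feature must be declared (both are assumptions, as is the name `filter_by_scope`):

```python
def filter_by_scope(controls: list[dict], business_scope: set[str]) -> list[dict]:
    """Keep controls whose scope_requires is fully covered by the profile."""
    kept = []
    for mc in controls:
        required = {str(s).lower() for s in (mc.get("scope_requires") or [])}
        # Assumption: an MC without scope_requires always applies, and an MC
        # with requirements needs all of them declared in business_scope.
        if required <= business_scope:
            kept.append(mc)
    return kept
```

Under this reading, an MC that requires `biometric_processing` is dropped for a profile that only declares `b2c` and `ecommerce`.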
@@ -268,6 +289,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
result = await _check_single(
text, doc_type, label, url,
entry["word_count"], use_agent_flag,
business_scope=business_scope,
)
# Apply profile context filter
@@ -421,9 +443,42 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
len(cmp_vendors))
cmp_vendors = await validate_vendor_urls(cmp_vendors)
cmp_vendors = score_vendors(cmp_vendors)
# Enrich each vendor with per-cookie functional roles
try:
from compliance.services.cookie_function_classifier import (
annotate_vendor_cookies,
)
cmp_vendors = [annotate_vendor_cookies(v) for v in cmp_vendors]
except Exception as e:
logger.warning("Cookie function classification skipped: %s", e)
except Exception as e:
logger.warning("VVT vendor extraction skipped: %s", e)
# Vendor redundancy + EU alternatives + cost/savings (O4)
redundancy_report = None
try:
from compliance.services.vendor_redundancy import analyze as analyze_redundancy
from compliance.services.vendor_cost_estimator import infer_company_tier
if cmp_vendors:
# Derive the company tier from business_profile; it shifts the
# cost range so that, e.g. for DAX corporations, starter prices
# do not pull down the lower bound.
bp_dict = {
"type": getattr(profile, "business_type", ""),
"features": list(business_scope),
}
ctier = infer_company_tier(bp_dict)
redundancy_report = analyze_redundancy(cmp_vendors, company_tier=ctier)
logger.info(
"Redundanz: %d Kategorien mit Mehrfach-Anbietern, "
"Spar-Schaetzung %s pro Jahr (company_tier=%s)",
redundancy_report["summary"]["redundancy_count"],
redundancy_report["summary"]["estimated_saving_pct"],
ctier,
)
except Exception as e:
logger.warning("Vendor redundancy analysis skipped: %s", e)
summary_html = build_management_summary(results)
scanned_html = build_scanned_urls_html(doc_entries)
providers_html = build_provider_list_html(banner_result, vvt_entries)
@@ -468,11 +523,18 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
if scorecard else ""
)
report_html = build_html_report(results, None)
report_html = build_html_report(results, None, doc_texts)
profile_html = _build_profile_html(profile)
# O4: vendor redundancy / EU alternatives + cost-savings block,
# placed between VVT and doc report so that management
# sees the saving before going into the detailed check.
from .agent_doc_check_redundancy import build_redundancy_html
redundancy_html = build_redundancy_html(redundancy_report)
full_html = (
summary_html + scanned_html + profile_html + scorecard_html
+ providers_html + vvt_html + report_html
+ providers_html + vvt_html + redundancy_html + report_html
)
# Step 6: Send email — derive site name primarily from entered URL.
@@ -602,6 +664,7 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
payload = resp.json()
docs = payload.get("documents", [])
cmp_payloads = payload.get("cmp_payloads") or []
cmp_cookie_text = payload.get("cmp_cookie_text") or ""
if docs:
texts = []
for doc in docs:
@@ -609,6 +672,22 @@ async def _fetch_text(url: str, doc_type: str = "") -> tuple[str, list[dict]]:
if t and len(t) > 50:
texts.append(t)
merged = "\n\n".join(texts)
# For cookie/dse/social_media: when CMP reconstruction is
# substantially richer than DOM extraction, use it. This
# fixes the BMW case where DOM yields ~600 words of
# navigation but the ePaaS payload reconstructs to ~1800
# words of actual cookie policy.
if (doc_type in short_extract_types
and cmp_cookie_text
and len(cmp_cookie_text.split()) > len(merged.split())):
logger.info(
"Preferring CMP-reconstructed text for %s on %s "
"(%d words CMP vs %d words DOM)",
doc_type, url,
len(cmp_cookie_text.split()),
len(merged.split()),
)
merged = cmp_cookie_text
if merged and len(merged.split()) > 100:
if len(texts) > 1:
logger.info("Merged %d docs from %s (%d words)",
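
The selection rule in this hunk is a plain word-count comparison. A small sketch of it follows; `prefer_cmp_text` and the `min_ratio` parameter are hypothetical. With the default of 1.0 it behaves like the strict `>` in the hunk, while a larger value would enforce the "substantially richer" wording from the comment.

```python
def prefer_cmp_text(dom_text: str, cmp_text: str, min_ratio: float = 1.0) -> str:
    """Return the CMP-reconstructed text when it has more words than the DOM text."""
    dom_words = len(dom_text.split())
    cmp_words = len(cmp_text.split())
    # An empty CMP text never wins; otherwise compare word counts
    if cmp_text and cmp_words > dom_words * min_ratio:
        return cmp_text
    return dom_text
```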
@@ -727,6 +806,7 @@ async def _autodiscover_missing(
discovered: list[dict] = []
disc_payloads: list[dict] = []
disc_cookie_texts: list[str] = []
for base in crawl_bases:
try:
async with httpx.AsyncClient(timeout=180.0) as client:
@@ -742,8 +822,14 @@ async def _autodiscover_missing(
body = resp.json()
discovered.extend(body.get("documents", []) or [])
disc_payloads.extend(body.get("cmp_payloads") or [])
logger.info("auto-discovery on %s: %d docs",
base, len(body.get("documents", []) or []))
cmp_text = body.get("cmp_cookie_text") or ""
if cmp_text:
disc_cookie_texts.append(cmp_text)
logger.info("auto-discovery on %s: %d docs, %d CMP payloads, "
"cmp_cookie_text=%d words", base,
len(body.get("documents", []) or []),
len(body.get("cmp_payloads") or []),
len(cmp_text.split()))
except Exception as e:
logger.warning("auto-discovery failed for %s: %s", base, e)
@@ -772,6 +858,19 @@ async def _autodiscover_missing(
d = by_type.get(dt)
if d:
full = d.get("full_text") or d.get("text_preview") or ""
# For cookie: prefer the CMP-reconstructed text when it's
# substantially richer than the auto-discovered DOM extraction.
# BMW homepage CMP yields ~1800 words of authoritative policy;
# DOM extraction typically yields ~600 words of site chrome.
if dt == "cookie" and disc_cookie_texts:
cmp_merged = "\n\n".join(disc_cookie_texts)
if len(cmp_merged.split()) > len(full.split()):
logger.info(
"cookie: using CMP-reconstructed text (%d words) "
"instead of DOM (%d words)",
len(cmp_merged.split()), len(full.split()),
)
full = cmp_merged
if len(full.split()) >= 100:
new_entry["text"] = full
new_entry["url"] = d.get("url", "")
@@ -829,6 +928,7 @@ def _classify_discovered_doc(title: str, url: str) -> str | None:
async def _check_single(
text: str, doc_type: str, label: str, url: str,
word_count: int, use_agent: bool,
business_scope: set[str] | None = None,
):
"""Run regex + MC checks on a single document."""
from compliance.services.doc_checks.runner import check_document_completeness
@@ -862,6 +962,7 @@ async def _check_single(
# (top-10 FAILs) so cost stays bounded.
mc_results = await check_document_with_controls(
text, doc_type, label, max_controls=0, use_agent=use_agent,
business_scope=business_scope,
)
if mc_results:
for mc in mc_results:
@@ -374,11 +374,52 @@ def _render_vendor_row_full(v: dict) -> str:
)
score_color = ("#16a34a" if score >= 80 else
"#d97706" if score >= 50 else "#dc2626")
# Score explanation: what was scored, what is missing.
# Assumption: score = passed criteria / total criteria * 100.
# Typically 5 criteria for EXT: country, cookies, opt_out, privacy, scoring.
# For INTERNAL/GROUP: opt_out + privacy are not scored (3 criteria).
n_criteria = 3 if is_own else 5
n_failed = len(flags) if flags else 0
score_tooltip = (
f"{n_criteria - n_failed} von {n_criteria} Kriterien erfuellt"
+ (f" — fehlt: {', '.join(_flag_short(f) for f in flags[:3])}"
if flags else "")
)
# Inline action instructions per flag
actions_html = ""
if flags:
from compliance.services.finding_action_recipes import recipe_for
action_items = []
for f in flags:
rec = recipe_for(f)
if not rec:
continue
action_items.append(
f'<li style="margin-bottom:6px"><strong>{_flag_short(f)}:</strong> '
f'{rec.get("what", "")}<br/>'
f'<span style="color:#475569"><strong>Was tun:</strong> '
f'{rec.get("fix_text", "").splitlines()[0][:200]}</span><br/>'
f'<span style="color:#94a3b8;font-size:9px">Quelle: '
f'{rec.get("why", "")[:160]}</span></li>'
)
if action_items:
actions_html = (
f'<details style="margin-top:4px"><summary style="cursor:pointer;'
f'color:#dc2626;font-size:10px">Was muss ich tun? '
f'({len(action_items)} Action{"s" if len(action_items) != 1 else ""})</summary>'
f'<ul style="margin:4px 0 0 14px;padding:0;font-size:10px;color:#1e293b">'
+ "".join(action_items)
+ '</ul></details>'
)
flag_str = ""
if flags:
flag_str = (
f'<div style="font-size:10px;color:#94a3b8;margin-top:2px">'
f'{", ".join(flags[:4])}</div>'
f'{actions_html}'
)
return (
f'<tr style="border-top:1px solid #e2e8f0">'
@@ -391,11 +432,26 @@ def _render_vendor_row_full(v: dict) -> str:
f'<td style="padding:6px 8px;text-align:center">{opt_status}</td>'
f'<td style="padding:6px 8px;text-align:center">{privacy_status}</td>'
f'<td style="padding:6px 8px;text-align:right;font-weight:600;'
f'color:{score_color};font-size:11px">{score}%</td>'
f'color:{score_color};font-size:11px" title="{score_tooltip}">'
f'{score}%<div style="font-size:9px;font-weight:400;color:#94a3b8">'
f'{n_criteria - n_failed}/{n_criteria}</div></td>'
f'</tr>'
)
def _flag_short(f: str) -> str:
    """Readable German label for a flag token."""
    labels = {
        "no_cookies_listed": "Cookies fehlen",
        "no_country": "Sitzland fehlt",
        "no_privacy_url": "Privacy-Link fehlt",
        "broken_privacy_url": "Privacy-Link broken",
        "no_opt_out_url": "Opt-Out fehlt",
        "broken_opt_out": "Opt-Out broken",
    }
    return labels.get(f, f)
def _link_status_badge(
url: str | None,
ok: bool | None,
@@ -0,0 +1,141 @@
"""
Email renderer for the vendor redundancy + EU alternatives + cost/savings block.
Embedded in the email body below the VVT.
"""
from __future__ import annotations
def _fmt_eur(low: int, high: int) -> str:
    if not low and not high:
        return "im Listpreis bundled"
    if low == high:
        return f"~{low:,}".replace(",", ".")
    # Separate the two bounds with an en dash so the range stays readable
    return f"{low:,}–{high:,}".replace(",", ".")
def build_redundancy_html(report: dict | None) -> str:
if not report:
return ""
s = report.get("summary") or {}
redundancies = report.get("redundancies") or []
eu_alts = report.get("eu_alternatives") or []
multi = report.get("multi_function_tools") or []
cur = s.get("estimated_current_year_eur") or [0, 0]
sav = s.get("estimated_saving_year_eur") or [0, 0]
pct = s.get("estimated_saving_pct") or "n/a"
parts = [
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
'max-width:700px;margin:0 auto 16px;padding:14px 18px;'
'background:#fef3c7;border:1px solid #fcd34d;border-radius:8px">',
'<h3 style="margin:0 0 6px;font-size:14px;color:#92400e">'
'Optimierungspotenzial: Redundanzen + EU-Alternativen</h3>',
f'<p style="margin:0 0 10px;font-size:11px;color:#78350f">'
f'<strong>{s.get("redundancy_count", 0)}</strong> Kategorien mit '
f'mehreren Anbietern · <strong>{s.get("consolidation_potential", 0)}</strong> '
f'Anbieter konsolidierbar · '
f'<strong>{s.get("eu_alternative_count", 0)}</strong> EU-Alternativen verfuegbar</p>',
'<div style="background:#fff;border:1px solid #fcd34d;border-radius:6px;'
'padding:10px 12px;margin-bottom:10px">',
'<div style="font-size:10px;color:#94a3b8;margin-bottom:6px;text-transform:uppercase;letter-spacing:0.5px">'
'Diese Schaetzung umfasst NUR die als redundant erkannten Tools — '
'nicht den Gesamt-Stack der Website</div>',
f'<div style="font-size:11px;color:#78350f">'
f'Listpreis-Schaetzung der <strong>redundanten</strong> Tools '
f'(Mehrfach-Anbieter in derselben Funktions-Kategorie):'
f' <strong>{_fmt_eur(*cur)}/Jahr</strong></div>',
f'<div style="font-size:11px;color:#16a34a;margin-top:4px">'
f'Sparpotenzial durch Konsolidierung auf je 1 EU-Tool pro Kategorie:'
f' <strong>{_fmt_eur(*sav)}/Jahr</strong> ({pct})</div>',
'<div style="font-size:10px;color:#94a3b8;margin-top:8px;font-style:italic">'
'<strong>Wichtige Einschraenkungen:</strong><br/>'
'• Konzern-Konditionen liegen ueblicherweise 30–50% unter Listpreis — '
'realistisches Saving entsprechend €X·0,5 bis €X·0,7.<br/>'
'• Eintraege "<em>Eigene Marke — Tool</em>" (z.B. "BMW AG — Adobe Analytics") '
'gehoeren oft zu einem einzigen Master-Vertrag, nicht zu mehreren Lizenzen.<br/>'
'• Media-Spend (Google Ads, Meta Ads) ist NICHT enthalten — nur Tooling-Lizenzen.<br/>'
'• Quelle: Gartner/Forrester 2025 + oeffentliche Listpreise.'
'</div></div>',
]
if redundancies:
parts.append(
'<table style="width:100%;border-collapse:collapse;font-size:11px;'
'margin-bottom:10px">'
'<thead><tr style="background:#fde68a;color:#78350f;text-align:left">'
'<th style="padding:6px 8px">Kategorie</th>'
'<th style="padding:6px 8px">#</th>'
'<th style="padding:6px 8px">Anbieter</th>'
'<th style="padding:6px 8px">EU-Empfehlung</th>'
'<th style="padding:6px 8px;text-align:right">Saving / Jahr</th>'
'</tr></thead><tbody>'
)
for r in redundancies[:12]:
vendors_str = ", ".join(r.get("vendors", [])[:6])
if len(r.get("vendors", [])) > 6:
vendors_str += f" (+{len(r['vendors']) - 6} weitere)"
sav_r = r.get("estimated_saving_year_eur") or [0, 0]
parts.append(
f'<tr style="border-top:1px solid #fde68a;vertical-align:top">'
f'<td style="padding:5px 8px;color:#78350f;font-weight:600">{r["category_label"]}</td>'
f'<td style="padding:5px 8px;text-align:center">{r["count"]}</td>'
f'<td style="padding:5px 8px;color:#1e293b;font-size:10px">{vendors_str}</td>'
f'<td style="padding:5px 8px;color:#16a34a;font-size:10px">{r.get("suggested_eu_tool") or ""}</td>'
f'<td style="padding:5px 8px;text-align:right;color:#16a34a;font-weight:600">'
f'{_fmt_eur(*sav_r)}</td></tr>'
)
hint = r.get("consolidation_hint")
if hint:
parts.append(
f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px;font-style:italic">'
f'Hinweis: {hint}</td></tr>'
)
caveats = r.get("caveats") or []
if caveats:
parts.append(
f'<tr><td colspan="5" style="padding:0 8px 8px;color:#94a3b8;font-size:10px">'
f'<strong>Moegliche Gruende fuer Mehrfach-Einsatz:</strong> '
+ "; ".join(caveats) + '</td></tr>'
)
parts.append('</tbody></table>')
if multi:
parts.append(
'<div style="margin-top:8px"><strong style="font-size:11px;color:#78350f">'
'Multi-Funktions-Tools (1 Tool ersetzt mehrere Kategorien):</strong>'
'<ul style="margin:6px 0 0 18px;padding:0;font-size:11px;color:#78350f">'
)
for t in multi[:4]:
cats = ", ".join(t.get("replaces_categories", []))
parts.append(
f'<li style="margin-bottom:3px"><strong>{t["name"]}</strong>'
f' ({t["country"]}) — ersetzt <em>{cats}</em>'
f' ({t.get("potential_replacements", 0)} Anbieter heute)</li>'
)
parts.append('</ul></div>')
if eu_alts:
parts.append(
'<details style="margin-top:8px"><summary style="font-size:11px;color:#78350f;'
'cursor:pointer">EU-Alternativen pro Anbieter (Details)</summary>'
'<ul style="margin:6px 0 0 18px;padding:0;font-size:10px;color:#475569">'
)
for e in eu_alts[:20]:
first_alt = (e.get("alternatives") or [{}])[0]
parts.append(
f'<li style="margin-bottom:3px"><strong>{e["current_vendor"]}</strong>'
f'{first_alt.get("name", "")} ({first_alt.get("country", "")})'
f' <span style="color:#94a3b8">— {first_alt.get("notes", "")}</span></li>'
)
parts.append('</ul></details>')
parts.append('</div>')
return "".join(parts)
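
The number formatting in `_fmt_eur` relies on Python's `,` grouping option followed by a swap to German-style dots. A standalone sketch of that technique, assuming an en dash between the two bounds (the separator is not legible in this page, and `fmt_eur_range` is a hypothetical name):

```python
def fmt_eur_range(low: int, high: int) -> str:
    """Format a yearly cost range with dots as thousands separators."""
    if not low and not high:
        return "im Listpreis bundled"
    if low == high:
        # f"{5000:,}" gives "5,000"; replacing the comma yields "5.000"
        return f"~{low:,}".replace(",", ".")
    return f"{low:,}–{high:,}".replace(",", ".")


if __name__ == "__main__":
    print(fmt_eur_range(200_000, 3_000_000))
```

The replace step is safe here because the inputs are integers, so the formatted string never contains a decimal point that could be confused with a separator.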
@@ -7,8 +7,12 @@ including L1/L2 check hierarchy, progress bars, and actionable hints.
from __future__ import annotations
import logging
import re
from typing import TYPE_CHECKING
logger = logging.getLogger(__name__)
if TYPE_CHECKING:
from .agent_doc_check_routes import CheckItem, DocCheckResult
@@ -32,12 +36,93 @@ def _icon(passed: bool, skipped: bool = False) -> str:
return '<span style="color:#ef4444;font-weight:bold">&#10007;</span>'
def _hint_box(hint: str) -> str:
return (
def _first_sentence(text: str, max_chars: int = 300) -> str:
    """First complete sentence rather than the first line; robust against
    multi-line fix texts that start with bullet lists."""
    if not text:
        return ""
    # Look for a sentence terminator within the first max_chars characters
    snippet = text[:max_chars]
    m = re.search(r"^(.+?[\.\?\!])(?:\s|$)", snippet, re.DOTALL)
    if m:
        first = m.group(1).strip()
        # If the "sentence" is a variant header such as "Variante A:", keep
        # going; the real content only starts after it
        if re.fullmatch(r"(Variante [A-Z]\s*\([^\)]+\):?|Beispiel\s*\d*:?)",
                        first, re.IGNORECASE):
            rest = text[m.end():].lstrip()
            return _first_sentence(rest, max_chars)
        return first
    # No sentence terminator: take the first line, cut at max_chars
    line = (text.splitlines() or [""])[0]
    return line[:max_chars] + ("…" if len(line) > max_chars else "")
def _hint_box(hint: str, check_label: str = "", doc_text: str = "",
doc_id: str | None = None) -> str:
"""Hint block with enriched recipe + doc anchor where possible."""
base = (
f'<div style="font-size:11px;color:#dc2626;margin:2px 0 4px 20px;'
f'padding:4px 8px;background:#fef2f2;border-radius:4px;'
f'border-left:3px solid #fca5a5">{hint}</div>'
f'border-left:3px solid #fca5a5">{hint}'
)
# Add recipe + anchor when check_label is known
if check_label:
try:
from compliance.services.finding_action_recipes import recipe_for
from compliance.services.doc_anchor_locator import locate_anchor
rec = recipe_for(check_label)
if rec and rec.get("fix_text"):
first_sentence = _first_sentence(rec["fix_text"], 300)
full = rec["fix_text"]
# Instead of <details>, a simple inline block layout:
# more robust when mail clients render plain text
more = ""
if len(full) > len(first_sentence) + 10:
more = (
f'<div style="margin-top:4px;padding:6px 8px;background:#fff;'
f'border:1px solid #fcd5d5;border-radius:4px;font-size:10px;'
f'white-space:pre-wrap;color:#1e293b">'
f'<strong style="display:block;margin-bottom:3px;color:#475569">'
f'Vollstaendiger Textbaustein zum Einfuegen:</strong>'
f'{full}</div>'
)
base += (
f'<div style="margin-top:6px;padding-top:6px;border-top:1px solid #fecaca">'
f'<strong style="color:#7c3aed;font-size:10px">Konkrete Massnahme:</strong> '
f'<span style="color:#1e293b">{first_sentence}</span>'
f'{more}'
)
# Anchor via embedding locator (with doc_id cache)
if doc_text:
anchor = locate_anchor(check_label, doc_text, doc_id)
if anchor and anchor.get("anchor_phrase") and anchor.get("confidence") != "low":
conf_label = anchor.get("confidence", "")
conf_badge = (
f' <span style="color:#94a3b8;font-size:9px">'
f'(Match-Konfidenz {conf_label}, '
f'Score {anchor.get("score", "")})</span>'
)
base += (
f'<div style="margin-top:4px;color:#475569;font-size:10px">'
f'<strong>Einfuegen:</strong> {anchor["position_hint"]}'
f'{conf_badge}</div>'
)
elif rec.get("where"):
# No good anchor match: show the generic fallback
base += (
f'<div style="margin-top:4px;color:#475569;font-size:10px">'
f'<strong>Einfuegen:</strong> {rec["where"]} '
f'<span style="color:#94a3b8;font-size:9px">'
f'(kein eindeutiger Absatz im Dokument gefunden — '
f'Anweisung allgemein)</span></div>'
)
base += '</div>'
except Exception as e:
# Recipes are optional; the hint box must never crash
logger.debug("Hint-box enrichment failed: %s", e)
base += '</div>'
return base
def build_management_summary(results: list[DocCheckResult]) -> str:
@@ -158,8 +243,14 @@ def _check_to_action(doc_label: str, check_label: str, hint: str) -> str:
def build_html_report(
results: list[DocCheckResult],
cookie_result: dict | None,
doc_texts: dict[str, str] | None = None,
) -> str:
"""Build HTML email report styled like the frontend."""
"""Build HTML email report styled like the frontend.
`doc_texts` is the doc_type→text dict so hint-boxes can locate the
relevant paragraph in the original document for the insertion recommendation.
"""
doc_texts = doc_texts or {}
ok_count = sum(1 for r in results if r.completeness_pct == 100)
html = [
'<div style="font-family:-apple-system,BlinkMacSystemFont,sans-serif;'
@@ -170,7 +261,7 @@ def build_html_report(
]
for r in results:
_render_document(html, r)
_render_document(html, r, doc_texts.get(r.doc_type, ""))
if cookie_result:
_render_cookie_banner(html, cookie_result)
@@ -179,7 +270,7 @@ def build_html_report(
return "\n".join(html)
def _render_document(html: list[str], r: DocCheckResult) -> None:
def _render_document(html: list[str], r: DocCheckResult, doc_text: str = "") -> None:
pct = r.completeness_pct
cpct = r.correctness_pct
bar_color = "green" if pct >= 80 else "yellow" if pct >= 50 else "red"
@@ -244,7 +335,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
else:
html.append('<div style="padding:8px 16px 12px">')
for c in l1_checks:
_render_l1_check(html, c, l2_by_parent.get(c.id, []))
_render_l1_check(html, c, l2_by_parent.get(c.id, []), doc_text)
# Master-Control aggregation: with 1874 MCs evaluated per run,
# rendering every L2 check inline produces ~600 rows per doc and
@@ -289,6 +380,7 @@ def _render_document(html: list[str], r: DocCheckResult) -> None:
def _render_l1_check(
html: list[str], c: CheckItem, children: list[CheckItem],
doc_text: str = "",
) -> None:
l2_sub = [ch for ch in children if not ch.skipped]
l2_passed = sum(1 for ch in l2_sub if ch.passed)
@@ -301,16 +393,16 @@ def _render_l1_check(
if l2_sub:
html.append(f' <span style="color:#9ca3af;font-size:11px">({l2_passed}/{len(l2_sub)})</span>')
if not c.passed and c.hint:
html.append(_hint_box(c.hint))
html.append(_hint_box(c.hint, c.label, doc_text))
html.append('</div>')
for ch in children:
if ch.skipped:
continue
_render_l2_check(html, ch)
_render_l2_check(html, ch, doc_text)
def _render_l2_check(html: list[str], ch: CheckItem) -> None:
def _render_l2_check(html: list[str], ch: CheckItem, doc_text: str = "") -> None:
style = "color:#dc2626;font-weight:500" if not ch.passed else "color:#6b7280"
html.append(
f'<div style="padding:2px 0 2px 24px;border-left:2px solid #e5e7eb;margin-left:8px">'
@@ -324,7 +416,7 @@ def _render_l2_check(html: list[str], ch: CheckItem) -> None:
f'white-space:nowrap">"...{ch.matched_text[:80]}..."</div>'
)
if not ch.passed and ch.hint:
html.append(_hint_box(ch.hint))
html.append(_hint_box(ch.hint, ch.label, doc_text))
html.append('</div>')
@@ -1808,6 +1808,32 @@ async def list_categories():
# SIMILAR CONTROLS (Embedding-based dedup)
# =============================================================================
_EMBEDDING_COL_AVAILABLE: bool | None = None

def _has_embedding_col() -> bool:
    """Cache whether canonical_controls has the embedding column.

    Returns False on systems where pgvector + embedding backfill weren't
    set up. Saves the per-request 500 + log spam.
    """
    global _EMBEDDING_COL_AVAILABLE
    if _EMBEDDING_COL_AVAILABLE is not None:
        return _EMBEDDING_COL_AVAILABLE
    try:
        with SessionLocal() as db:
            r = db.execute(text(
                "SELECT 1 FROM information_schema.columns "
                "WHERE table_schema='compliance' "
                "AND table_name='canonical_controls' "
                "AND column_name='embedding'"
            )).fetchone()
            _EMBEDDING_COL_AVAILABLE = bool(r)
    except Exception:
        _EMBEDDING_COL_AVAILABLE = False
    return _EMBEDDING_COL_AVAILABLE
@router.get("/controls/{control_id}/similar")
async def find_similar_controls(
control_id: str,
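
The caching pattern behind `_has_embedding_col()` can be tried without a database. The sketch below is a variation, not the commit's code: it keeps the state in a dict instead of a module-level global, and `has_feature` and `probe` are hypothetical names.

```python
from typing import Callable

# Missing key = not probed yet; True/False = cached answer
_FEATURE_CACHE: dict[str, bool] = {}


def has_feature(name: str, probe: Callable[[], object]) -> bool:
    """Run probe() at most once per feature name and cache the outcome."""
    if name in _FEATURE_CACHE:
        return _FEATURE_CACHE[name]
    try:
        _FEATURE_CACHE[name] = bool(probe())
    except Exception:
        # A failing probe is cached as "not available", as in the hunk above
        _FEATURE_CACHE[name] = False
    return _FEATURE_CACHE[name]
```

One consequence worth keeping in mind: a transient failure on the first probe is also cached, so the feature stays disabled until the process restarts or the cache is cleared.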
@@ -1815,6 +1841,8 @@ async def find_similar_controls(
limit: int = Query(20, ge=1, le=100),
):
"""Find controls similar to the given one using embedding cosine similarity."""
if not _has_embedding_col():
return []
with SessionLocal() as db:
# Get the target control's embedding
target = db.execute(
@@ -1856,7 +1884,7 @@ async def find_similar_controls(
"title": r.title,
"severity": r.severity,
"release_state": r.release_state,
"tags": r.tags or [],
"tags": _jsonish(r.tags) or [],
"license_rule": r.license_rule,
"verification_method": r.verification_method,
"category": r.category,
@@ -1866,6 +1894,10 @@ async def find_similar_controls(
]
except Exception as e:
logger.warning("Embedding similarity query failed (no embedding column?): %s", e)
try:
db.rollback()
except Exception:
pass
return []
@@ -1946,6 +1978,22 @@ async def get_v1_matches_endpoint(control_id: str):
# INTERNAL HELPERS
# =============================================================================
def _jsonish(v):
    """Parse v as JSON if it's a string that looks like JSON, otherwise return as-is.

    Some canonical_controls rows were inserted with jsonb columns containing
    raw JSON strings (e.g. '["a","b"]' as a TEXT). The frontend expects real
    arrays; coerce here so .map() works.
    """
    if isinstance(v, str) and v and v[0] in "[{":
        try:
            import json as _j
            return _j.loads(v)
        except Exception:
            return v
    return v
def _control_row(r) -> dict:
return {
"id": str(r.id),
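
A standalone copy of the coercion helper shows what the `_jsonish(...) or []` calls in `_control_row` evaluate to. The function body follows the hunk above; the name `jsonish` and the sample values are made up for illustration.

```python
import json


def jsonish(v):
    """Return parsed JSON for strings starting with '[' or '{', else v unchanged."""
    if isinstance(v, str) and v and v[0] in "[{":
        try:
            return json.loads(v)
        except Exception:
            return v  # looks like JSON but is not valid: leave untouched
    return v


if __name__ == "__main__":
    print(jsonish('["a","b"]'))   # string-typed jsonb becomes a real list
    print(jsonish(None) or [])    # NULL column falls back to an empty list
    print(jsonish("[broken"))     # invalid JSON is passed through as a string
```

A string with leading whitespace, such as `' ["a"]'`, is not parsed because only the first character is inspected. The invalid-JSON case also still reaches the frontend as a string, so the defensive `.map` coercer mentioned in the commit message remains useful.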
@@ -1954,17 +2002,17 @@ def _control_row(r) -> dict:
"title": r.title,
"objective": r.objective,
"rationale": r.rationale,
-"scope": r.scope,
-"requirements": r.requirements,
-"test_procedure": r.test_procedure,
-"evidence": r.evidence,
+"scope": _jsonish(r.scope),
+"requirements": _jsonish(r.requirements),
+"test_procedure": _jsonish(r.test_procedure) or [],
+"evidence": _jsonish(r.evidence) or [],
"severity": r.severity,
"risk_score": float(r.risk_score) if r.risk_score is not None else None,
"implementation_effort": r.implementation_effort,
"evidence_confidence": float(r.evidence_confidence) if r.evidence_confidence is not None else None,
-"open_anchors": r.open_anchors,
+"open_anchors": _jsonish(r.open_anchors) or [],
"release_state": r.release_state,
-"tags": r.tags or [],
+"tags": _jsonish(r.tags) or [],
"license_rule": r.license_rule,
"source_original_text": r.source_original_text,
"source_citation": r.source_citation,
@@ -0,0 +1,181 @@
"""
Consent-log export (Borlabs parity + DPO audit requirement).
Auditors routinely request an extract of all granted/revoked consents
per tenant; until now the data protection officer (DSB) had to write
SQL by hand for this. These endpoints serve CSV + JSON that can be
downloaded directly from the browser.
Endpoints:
GET /einwilligungen/export/consents.csv
GET /einwilligungen/export/consents.json
GET /einwilligungen/export/history.csv (change history)
"""
from __future__ import annotations
import csv
import io
import json
import logging
from datetime import datetime, timezone
from fastapi import APIRouter, Depends, Header, Query
from fastapi.responses import Response
from sqlalchemy.orm import Session
from classroom_engine.database import get_db
from ..db.einwilligungen_models import (
EinwilligungenConsentDB,
EinwilligungenConsentHistoryDB,
)
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/einwilligungen/export", tags=["einwilligungen-export"])
def _get_tenant(x_tenant_id: str | None = Header(None, alias="X-Tenant-ID")) -> str:
if not x_tenant_id:
from .tenant_utils import get_tenant_id
return get_tenant_id()
return x_tenant_id
def _ts() -> str:
return datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
def _consent_rows(consents: list[EinwilligungenConsentDB]) -> list[dict]:
return [
{
"consent_id": str(c.id),
"user_id": c.user_id or "",
"data_point_id": c.data_point_id or "",
"granted": "yes" if c.granted else "no",
"purpose": c.purpose or "",
"consent_version": c.consent_version or "",
"ip_address": c.ip_address or "",
"user_agent": (c.user_agent or "")[:200],
"source": c.source or "",
"created_at": c.created_at.isoformat() if c.created_at else "",
"updated_at": c.updated_at.isoformat() if c.updated_at else "",
"revoked_at": c.revoked_at.isoformat() if getattr(c, "revoked_at", None) else "",
}
for c in consents
]
def _history_rows(entries: list[EinwilligungenConsentHistoryDB]) -> list[dict]:
return [
{
"id": str(e.id),
"consent_id": str(e.consent_id),
"action": e.action or "",
"consent_version": e.consent_version or "",
"ip_address": e.ip_address or "",
"user_agent": (e.user_agent or "")[:200],
"source": e.source or "",
"created_at": e.created_at.isoformat() if e.created_at else "",
}
for e in entries
]
def _csv_response(rows: list[dict], filename: str) -> Response:
if not rows:
return Response(content="", media_type="text/csv",
headers={"Content-Disposition": f"attachment; filename={filename}"})
buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
w.writeheader()
w.writerows(rows)
return Response(content=buf.getvalue(), media_type="text/csv; charset=utf-8",
headers={"Content-Disposition": f"attachment; filename={filename}"})
def _json_response(payload: dict, filename: str) -> Response:
body = json.dumps(payload, ensure_ascii=False, indent=2, default=str)
return Response(content=body, media_type="application/json; charset=utf-8",
headers={"Content-Disposition": f"attachment; filename={filename}"})
@router.get("/consents.csv")
async def export_consents_csv(
user_id: str | None = Query(None, description="Filter by single user"),
granted: bool | None = Query(None),
since: str | None = Query(None, description="ISO timestamp"),
tenant_id: str = Depends(_get_tenant),
db: Session = Depends(get_db),
) -> Response:
"""Download all consent records of this tenant as CSV (auditor-ready)."""
q = db.query(EinwilligungenConsentDB).filter(
EinwilligungenConsentDB.tenant_id == tenant_id,
)
if user_id:
q = q.filter(EinwilligungenConsentDB.user_id == user_id)
if granted is not None:
q = q.filter(EinwilligungenConsentDB.granted == granted)
if since:
try:
since_dt = datetime.fromisoformat(since.rstrip("Z"))
q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
except Exception:
pass
rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
return _csv_response(rows, f"consents_{tenant_id[:8]}_{_ts()}.csv")
@router.get("/consents.json")
async def export_consents_json(
user_id: str | None = Query(None),
granted: bool | None = Query(None),
since: str | None = Query(None),
tenant_id: str = Depends(_get_tenant),
db: Session = Depends(get_db),
) -> Response:
"""Same data as the CSV endpoint but JSON-shaped for further processing."""
q = db.query(EinwilligungenConsentDB).filter(
EinwilligungenConsentDB.tenant_id == tenant_id,
)
if user_id:
q = q.filter(EinwilligungenConsentDB.user_id == user_id)
if granted is not None:
q = q.filter(EinwilligungenConsentDB.granted == granted)
if since:
try:
since_dt = datetime.fromisoformat(since.rstrip("Z"))
q = q.filter(EinwilligungenConsentDB.created_at >= since_dt)
except Exception:
pass
rows = _consent_rows(q.order_by(EinwilligungenConsentDB.created_at.desc()).all())
payload = {
"tenant_id": tenant_id,
"exported_at": datetime.now(timezone.utc).isoformat(),
"filter": {"user_id": user_id, "granted": granted, "since": since},
"count": len(rows),
"consents": rows,
}
return _json_response(payload, f"consents_{tenant_id[:8]}_{_ts()}.json")
@router.get("/history.csv")
async def export_history_csv(
consent_id: str | None = Query(None, description="Limit to one consent"),
since: str | None = Query(None),
tenant_id: str = Depends(_get_tenant),
db: Session = Depends(get_db),
) -> Response:
"""Download the consent-change history (proof of consent under Art. 7(1) GDPR)."""
q = db.query(EinwilligungenConsentHistoryDB).filter(
EinwilligungenConsentHistoryDB.tenant_id == tenant_id,
)
if consent_id:
q = q.filter(EinwilligungenConsentHistoryDB.consent_id == consent_id)
if since:
try:
since_dt = datetime.fromisoformat(since.rstrip("Z"))
q = q.filter(EinwilligungenConsentHistoryDB.created_at >= since_dt)
except Exception:
pass
rows = _history_rows(q.order_by(EinwilligungenConsentHistoryDB.created_at.asc()).all())
return _csv_response(rows, f"consent-history_{tenant_id[:8]}_{_ts()}.csv")
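All three export endpoints parse `since` with `datetime.fromisoformat(since.rstrip("Z"))` and silently drop the filter when parsing fails, so a mistyped timestamp yields a full export. A stricter variant could look like the sketch below. The helper name `parse_since` is hypothetical, and treating naive timestamps as UTC is an assumption; whether `created_at` is stored timezone-aware is not visible in this diff.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_since(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp into an aware UTC datetime.

    Returns None when no filter was given and raises ValueError on
    malformed input, so the caller can reject the request instead of
    exporting every record.
    """
    if not value:
        return None
    text = value.strip()
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)  # raises ValueError if malformed
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are meant as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

In a handler, the `ValueError` could be converted into an HTTP 422 response (for example via FastAPI's `HTTPException`), which makes a bad `since` value visible to the auditor rather than widening the export without notice.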