Commit Graph

184 Commits

Author SHA1 Message Date
Benjamin Admin 313982c6f1 feat(profile+report): P17 — 4 Polish-Items
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Successful in 19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
A) Cookie-Policy-Architecture-Block Fallback auf DSE-Text wenn cookie via
   P15 deduped wurde. Erkennt jetzt auch single-doc Sites (Safetykon-Pattern).

B) Konkrete-Aufgaben-Liste: Per-Doc-Cap (3) entfernt + globaler Cap 10→20.
   Safetykon zeigt jetzt 7 statt 4 Aufgaben.

C) business_type-Klassifizierer: B2B-Service-Cluster aus P14 als Boost.
   Bei 2+ Service-Indikatoren (CE-Zertifizierung/Compliance/Auditierung)
   wird b2b_score angehoben. Safetykon: "B2C consulting" → "B2B (consulting)".

D) Vendor-Extract Fallback auf DSE-Text wenn cookie deduped + keine CMP-
   Payloads. LLM extrahiert dann Vendors aus dem DSE-Text. Safetykon: 0 → 1
   Vendor (Google Analytics aus dem DSE-Text erkannt).

Smoke-Test Safetykon: alle 4 Polish-Items wirken, kein Regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 12:22:05 +02:00
Benjamin Admin 479ce2225b feat(profile): P14+P15+P16 — B2B-Heuristik + Doc-URL-Dedup + Homepage-Profile
P14 — _detect_no_direct_sales erweitert um 3 Cluster:
  A) OEM-Konfigurator (BMW/Audi/Mercedes/VW/Porsche-Markennamen + Vertragshaendler-Pattern)
  B) B2B-Dienstleister (CE-Zertifizierung, Compliance-Beratung, Schulungen, Auditierung, TISAX, ISO-Normen, Arbeitssicherheit, ...)
  C) NGO/Verein/Public (Spendenkonto, Vereinsregister, gemeinnuetzig, ...)
Schwelle: pos >= 2 pro Cluster UND pos > neg. Bisher: nur OEM.

P15 — Doc-URL-Dedup im Worker: wenn mehrere Doc-Types DASSELBE Dokument
referenzieren (Safetykon-Pattern: User gibt /datenschutz fuer dse, cookie
UND widerruf), wird nur dem primaeren Doc-Type (Priority: dse > impressum
> cookie > widerruf > agb > nutzungsbedingungen) der Text gegeben. Andere
landen als "Nicht separat vorhanden — wird im Dokument 'X' mit-geprueft."
Eliminiert die 8+8 systematischen widerruf/cookie False Positives.

P16 — Profile-Detection auch Homepage-Text: Homepage-HTML wird mit kurzem
Fetch (8s timeout) gezogen, getrippt und zum profile_input gemerged. Vor-
her wirkte P14 nur wenn B2B-Indikatoren im DSE/Impressum-Pflichttext
standen — bei Safetykon stehen sie nur im Homepage-Menue.

Plus Bonus: TDM-Override-Submit-Button wird deaktiviert wenn Reason < 10
Zeichen — verhindert dass User wie heute in den Bug rein klickt.

Smoke-Test Safetykon (B2B Compliance-Dienstleister):
  dse                  geprueft (kein err)
  impressum            geprueft (kein err)
  cookie               "Nicht separat vorhanden — wird in DSE mit-geprueft"
  agb                  "Nicht anwendbar — kein Direkt-Kaufvertrag"
  widerruf             "Nicht anwendbar — kein Direkt-Kaufvertrag"
  nutzungsbedingungen  "Nicht anwendbar — kein Direkt-Kaufvertrag"
Vorher: 16 False Positives. Jetzt: 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:46:58 +02:00
Benjamin Admin b87c27d104 fix(llm-verify): P13 — Default-Modell auf qwen3:30b-a3b (statt qwen3.5:35b-a3b)
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / loc-budget (push) Successful in 21s
CI / go-lint (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 18s
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 40s
Bug: qwen3.5:35b-a3b liefert mit format='json' + Batch-Prompt leere
Strings zurueck ('LLM batch: empty response from model'). Im echten
Compliance-Check lief der LLM-Verifier deshalb wirkungslos —
False-Positive-Findings wie 'Vorstand nicht erkannt' (BMW: Klammer-
Liste) wurden nicht overturned.

Fix: Default auf qwen3:30b-a3b umgestellt. Verifiziert mit BMW-
Impressum-Text: representative_person wird mit Evidence 'Milan
Nedeljkovic, Vorsitzender' overturned=True markiert.

OLLAMA_VERIFY_MODEL Env-Var bleibt als Override-Moeglichkeit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:11:01 +02:00
Benjamin Admin 28a078ccb4 feat(compliance-check): P10 — Cookie-Policy-Architecture-Detection
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Neuer Service cookie_policy_architecture.detect_architecture(...) prueft
vier Diagnose-Punkte der Cookie-Policy einer Website:

  1. Layer-Trennung: single (BMW-Pattern: Banner + Info in EINER URL)
                   | separate (Best Practice: getrennte Layer)
  2. Versionierung: "Stand vom DD.MM.JJJJ" / "Version X.Y" / ...
  3. Dynamic content: CMP-Capture auf Doc-URL oder Marker-Texte
  4. Vendor-Count im Text: Indikator ob Liste statisch drinsteht

Risiko-Ampel:
  - gruen: separate + versioned + statisch
  - gelb : single+unversioned (BMW) ODER separate+unversioned
  - rot  : weder noch (Pflicht-Info fehlt)

Wire-in im Compliance-Check-Worker: nach Exec-Summary-Block wird der
Architecture-Block gerendert (build_architecture_html) mit konkreter
Empfehlung. Bei BMW-Pattern: "Snapshot der dynamischen Vendor-Tabelle
als versioniertes PDF im Archiv."

Hintergrund: BMW hat eine HTML-Seite die GLEICHZEITIG Banner-Re-Trigger
und Cookie-Richtlinie ist. Mindestanforderung nach §25 TDDDG + Art. 13
DSGVO erfuellt, aber bei einer Aufsichtsbehoerden-Pruefung kann nicht
belegt werden welche Vendor-Liste an einem bestimmten Stichtag aktiv
war. Das ist kein Verstoss aber best-practice-Luecke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:01:48 +02:00
Benjamin Admin 0d37822b7c fix(impressum): P9 — 7 False-Positive-Fixes in Pflichtangaben-Checks
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 16s
CI / go-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
#1 Name des Anbieters: \b Word-Boundary verhindert "ag" in "samstag",
   plus "aktiengesellschaft" als Volltreffer.
#2 Vertretungsberechtigte: Klammer-Liste-Pattern erkennt jetzt BMW-
   Format "Vorstand (Milan Nedeljkovic, Jochen Goller, ...)" plus
   "Vorsitzender des Aufsichtsrats: Name".
#3 V.i.S.d.P.: war schon INFO, OK.
#4 OS-Plattform/VSBG: bei no_direct_sales=True (OEM-Pattern) jetzt als
   "Nicht anwendbar" skipped statt 0/1 fail. Profile fliesst neu durch
   check_document_completeness -> runner.
#5 Zustaendige Kammer: IHK + Handwerkskammer + Tieraerztekammer in
   Pattern aufgenommen + severity LOW -> INFO (konditional).
#6 Stammkapital: war schon INFO, OK.
#7 Link-Disclaimer: neue Check-Eigenschaft "invert"=True. Anti-Pattern
   ist passed wenn NICHT gefunden, fail wenn gefunden. Vorher feuerte
   das Finding immer, jetzt nur wenn ein illegaler Disclaimer im Text
   ist.

Plus: L2-INFO-Checks (z.B. profession_chamber) zaehlen nicht mehr in
correctness-pct und erzeugen keine DSI-DETAIL-Findings. Konsistent
mit P8-Modell: INFO = "selbst pruefen", nicht "fail".

Verifiziert mit BMW-Impressum-Text — alle 7 Faelle korrekt klassifiziert:
  name=passed, representative_person=passed, profession_chamber=INFO,
  illegal_disclaimer=passed (kein Disclaimer im Text),
  dispute_resolution=skipped (no_direct_sales),
  editorial_visdp=INFO, share_capital=INFO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:52:03 +02:00
Benjamin Admin 575644c9c5 feat(audit): P8 — MC-Severity raus, Email nur harte Findings, MC-Audit als Checkliste
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 17s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m48s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 40s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Email-Hardening (mc_scorecard.top_fails):
  Neue _is_hard_finding-Heuristik filtert konditionale MCs ohne
  Negativ-Beleg aus den Top-Auffaelligkeiten. matched_text leer + Label
  enthaelt "falls/sofern/wenn/soweit/ggf." -> raus, landet nur noch im
  MC-Audit als "selbst pruefen". DATA-2066-A05 (kostenfreie Abschaltung
  Standortdaten) ist das prototypische Beispiel.

MC-Audit-Frontend (audit/[checkId]/page.tsx):
  Severity-Spalte (CRITICAL/HIGH/MEDIUM/LOW) entfernt — der MC-Audit
  ist eine Checkliste, keine Severity-Drohung. Stattdessen:
   - Spalte "Prioritaet" mit 3-Tier aus regulation-Mapping:
     Gesetz (DSGVO/ePrivacy/TDDDG/...) / Behoerden-Leitlinie
     (EDPB/DSK/EuGH/...) / Best-Practice (ISO/NIST/BSI)
   - 3-Status: erfuellt (✓) / nicht erfuellt (✗) / selbst pruefen (?)
     / nicht anwendbar (—). rowReviewStatus() leitet "selbst pruefen"
     aus matched_text-leer + konditionalem Label ab.
   - Filter umgebaut auf 5 Stati statt 4
   - Default-Filter "Nicht erfuellt" (vorher "Nur Fail")

Bonus: f.payload.risk_label TS-Cast im FindingsTab clean gemacht
(unknown -> string).

Effekt:
  - Email an die GF zeigt nur noch echte Belege ("DSB fehlt",
    "Gebuehr fuer Widerruf")
  - MC-Audit ist eine sachliche Pruefliste fuer den Compliance-Officer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:30:04 +02:00
Benjamin Admin 6c223c7c9b feat(compliance-check): exec-summary + voll-audit + TDM-respect + cookie-KB-extended + saving-scan-funnel
CI / detect-changes (push) Successful in 10s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 14s
CI / loc-budget (push) Failing after 15s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m43s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
P1 — Exec-Summary oben im Email-Report (4 KPIs + 2 CTAs, dunkler Gradient)
P3 — no_direct_sales-Flag fuer OEM-Konfigurator-Sites; AGB/Widerruf/AGB als
     "NICHT ANWENDBAR" (grau) statt "NICHT GEFUNDEN" (rot)
P5 — Voll-Audit Unification: alle Findings (MC + Pflichtangaben + Vendor +
     Redundanz) in /data/compliance_audits.db.unified_findings; neuer
     /api/compliance/agent/findings/<id> Endpoint + FindingsTab im Audit-UI
     mit Filter + CSV-Export
P7 — Crawl-Hardening: TDM-Reservation-Check (robots.txt / ai.txt / Header /
     Meta) vor jedem Run mit 24h-Cache; HeadlessChrome-UA (Firma noch nicht
     gegruendet — Switch via BREAKPILOT_BRANDED_UA env); per-Domain
     Rate-Limit 1 req/s + max 2 concurrent
P2 — Cookie-Knowledge-DB additiv erweitert (35 -> 74 Cookies): Adobe, Meta,
     Microsoft, LinkedIn, TikTok, HubSpot, Marketo, Salesforce, Hotjar,
     FullStory, Mouseflow, Intercom, Drift, Zendesk, Cloudflare, Stripe,
     OneTrust/Cookiebot/Usercentrics, Matomo, Pinterest, Snapchat, X/Twitter,
     YouTube, Vimeo, Klaviyo, Mailchimp, Mixpanel, Segment, Amplitude,
     Optimizely, Datadog; Wire-in in cookie_function_classifier liefert
     compliance_risk-Label (kritisch/hoch/mittel/gering) pro Vendor
A  — k-Anonymitaets-Helper (benchmark_k_anonymity) fuer P6-Vorbereitung
B  — Cross-Tenant-Domain-Assertion im /findings-Endpoint (expected_domain
     Query-Param -> 403 bei Mismatch)
C  — Saving-Scan-Funnel: /api/compliance/agent/saving-scan/start mit
     Validierung + 24h-Rate-Limit pro Domain + Lead-Persistenz in
     saving_scan_leads + Auto-Discovery via _run_compliance_check; 6 Tests
D  — Risk-Badge im Email-Vendor-Row

Rechtliche Leitplanken (Memory feedback_oem_data_legal.md): nur eigene
Knapp-Bewertungen + Source-Pointer, keine 1:1-Kopien fremder CMP-Texte.
TDM-Opt-Out-Respect nach § 44b UrhG. KEINE Schema-Aenderungen — alles in
Sidecar-SQLite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:48:34 +02:00
Benjamin Admin 662327e8b4 feat(compliance-check): MC-Classification + Embedding + Vendor-Redundanz + Action-Recipes + Borlabs-Features
CI / nodejs-build (push) Successful in 2m47s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / detect-changes (push) Successful in 10s
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 16s
CI / loc-budget (push) Failing after 17s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Massiv-Update auf Basis BMW-Test-Iterationen (v1→v9):

Core Compliance-Check
- Sonnet check_type Klassifikation: text/process/review fuer alle 1874 MCs
  in compliance.doc_check_controls (script + Sidecar /data/mc_classification.db).
  rag_document_checker filtert auf check_type='text' fuer doc_check.
  Plus fits_doc_type-Audit (v2) + ui_only-Audit fuer DSA/E-Commerce-MCs in
  falscher doc_type-Schublade.
- scope_requires-Filter: biometric/ai_decision/child_targeting MCs werden
  per business_profile gefiltert (FRT skipped fuer BMW etc.).
- Embedding-Match (BGE-M3) als Phase-3 nach Regex-Match:
  Per-doc_type-Threshold-Override (impressum 0.50, dse/cookie 0.60),
  Short-Field-Rescue (15-Wort-Chunks) fuer Pflichtfelder im Impressum.
  Title+check_question als Embedding-Input fuer mehr Kontext.
- Cookie-Text-Routing: consent-tester gibt cmp_cookie_text aus dem
  CMP-Reconstruct zurueck, Backend bevorzugt das gegen DOM-Extraction
  wenn richer (BMW 1824 vs 600 Worte).

Vendor-Redundanz + EU-Alternativen + Cost-Saving
- vendor_redundancy.analyze() — funktionale Kategorisierung der CMP-Vendors,
  Detektion von Mehrfach-Anbietern pro Kategorie, EU-Alternative-Lookup
  (Matomo, IONOS, HERE, Friendly Captcha, Smart AdServer, ...).
- vendor_cost_estimator: Tier-Inferenz aus Cookie-Footprint (Cookie-Anzahl
  + Premium-Feature-Cookies + Third-Party-Quote → starter/professional/
  enterprise/premier).
- Self-Service-Werbung (Google/Meta/Pinterest/...) = 0 Lizenz-Kosten
  (nur Media-Spend, separat). DSP-Plattformen behalten enge Range.
- Tier-aware Saving-Range: bei Enterprise/Premier nutzen wir den
  oberen 40-100%-Band der Listpreise, nicht starter→premier.
- Multi-Function-Tools (Matomo Pro, SAP CX, IONOS Cloud, Userlike, Smart
  AdServer, HERE Maps, Vimeo Pro, LamaPoll) — ein Tool ersetzt mehrere
  Kategorien gleichzeitig.

Cookie-Wissens-DB + Funktionale Klassifikation
- cookie_knowledge_db: 50 kuratierte Top-Cookies (Google/Meta/Adobe/MS/...)
  mit vendor, exact_purpose, data_collected, IAB-TCF-IDs, reid_risk,
  schrems_ii_status, EuGH-Urteile, EU-Alternative.
- cookie_function_classifier: pro Cookie funktionale Rolle (tracking_id,
  ad_pixel, session_id, ab_test, csrf, ...) + blocking_impact.

Country-Inferenz aus Rechtsform
- cookie_link_validator: Country-Field wird aus Vendor-Name abgeleitet
  (A/S=DK, GmbH=DE, Inc=US, B.V.=NL, ...) plus Vendor-Lookup-Table.
  Reduziert false-positive no_country-Flags bei eindeutig-EU-Vendors
  (Adform DK, Pinterest IE).

Action-Recipes + Doc-Anchor-Locator
- finding_action_recipes: pro Finding-Typ (no_cookies_listed, no_country,
  broken_opt_out, "Auftragsverarbeiter erwaehnen", "Art. 22 Profiling",
  ...) eine strukturierte Anweisung mit what/why/fix_text/where/example.
  Zum 1:1-Einfuegen in Kunden-Dokumente.
- doc_anchor_locator: Embedding-basiert (BGE-M3 cosine) — sucht den
  passenden Absatz im existierenden Kundendokument fuer jeden Finding.
  Per-Run Thread-Local-Cache. Fallback: keyword-Match.
- Email-Rendering integriert Recipe + Anchor pro Doc-Pruefungs-Fail
  + Vendor-Flag-Liste mit aufklappbarer Action-Liste.
- Score-Erklaerung pro Vendor-Zeile (3/5-Untertitel + Tooltip).

Migration-Pipeline (Compliance-Check -> Customer Banner/Documents)
- migration_to_banner.py: Vendor-Liste -> CookieBannerConfig mit
  4 Kategorien + Review-Flags.
- migration_to_document.py: Vendor-Liste -> Cookie-Policy + VVT-Register
  + Privacy-Policy-Pre-Fills.
- agent_migration_routes: 3 Preview-Endpoints (banner-preview,
  document-preview, summary). Persistierung der cmp_vendors in
  /data/compliance_audits.db check_payloads-Tabelle.

Borlabs-Parity Cookie-Banner-Features
- Consent-Historie im Banner: window.bpShowConsentHistory() + localStorage.
- Content-Blocker: cookie-banner-content-blocker.ts — YouTube/Maps/Video
  Placeholder bis Einwilligung.
- Google Consent Mode v2 erweitert: wait_for_update + region=EEA/CH/GB.
- Consent-Log Export (CSV/JSON) per einwilligungen_export_routes.

Bug-Fixes
- canonical_control_routes: _jsonish-Helper fuer string-typed jsonb,
  similar-controls-Endpoint mit _has_embedding_col()-Cache (kein 500 mehr).
- Control-Library Frontend: defensive .map-Coercer in 2 Detail-Views.
- Embedding-Service-Batching (32er Batches statt 165 in einem Call).
- KeyError 'control_id' in MC-Result-Aggregation (defensive .get).
- Master-Controls-Klick-Through von /sdk/master-controls auf
  /sdk/control-library?control=<id> mit URL-Param-Auto-Open.
- Dockerfile: /data pre-chowned auf appuser (Audit-DB-Schreibrecht).
- Cookie-Text-Routing-Bug (cmp_reconstructed > DOM-extraction).
- doc_type-aware MC-Filter (statt all-text-MCs).
- Master-Contract-Dedup (60 BMW-Internal-Eintraege = 1 Adobe-Vertrag).
- A3-v2-Audit hat 24 UI-Sprache-MCs als 'process' reklassifiziert.

Tests
- test_migration_mappers.py (9 Tests)
- test_migration_endpoints.py (4 Tests)

Skripte (one-shot)
- classify_mc_check_type.py (v1) + _v2 (PK=control_id,doc_type)
- audit_mc_doctype_fit.py (v1 fits) + _v2 (ui_only + scope_requires)

BMW-Run-Bilanz v1 (broken) -> v9 (alle Fixes):
  DSE     7,5% -> 81-83%
  Impressum 4%   -> 100% (6 echte MCs alle erfuellt)
  Cookie  0%    -> 79-83% (CMP-Text-Routing + Embedding)
  Plus: 10 Konsolidierungs-Kategorien, geschaetzte Saving 200k-3M / Jahr
  Plus: Action-Recipes + Doc-Anchors fuer jeden Fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:30:08 +02:00
Benjamin Admin df7d83134b feat(agent): migrate compliance-check results to banner + documents (M1-M5)
After a compliance-check run finishes, the user can now apply the
extracted vendor inventory directly to their own:

  - CookieBanner config (admin /sdk/einwilligungen)
  - Cookie-Policy / VVT-Register / Privacy-Policy templates
    (admin /sdk/document-generator)

Backend:
  - migration_to_banner.py: vendor list -> CookieBannerConfig with
    ESSENTIAL/PERFORMANCE/PERSONALIZATION/EXTERNAL_MEDIA buckets +
    review flags (broken opt-out URLs, missing expiry, no cookies listed)
  - migration_to_document.py: vendor list -> pre-fills for 3 doc
    templates, recipient-type aware (INTERNAL/GROUP/PROCESSOR/CONTROLLER)
  - agent_migration_routes.py: GET /banner-preview, /document-preview,
    /summary keyed on check_id
  - compliance_audit_log: new check_payloads table persists cmp_vendors +
    extracted_profile so the preview survives an app restart
  - tests: 9 mapper units + 4 endpoint integration tests

Frontend:
  - MigrationPanel.tsx: modal showing banner-config diff + document
    pre-fills, plus links into the existing editors
  - ComplianceCheckTab.tsx: replaces standalone audit link with the
    panel; net -3 lines, stays at the 500-cap

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 14:06:28 +02:00
Benjamin Admin 6ed30dae5b feat(agent): MC scorecard + audit drill-down + tenant trend (A1-A6)
Now that all 1874 MCs run per check (Task #30 cap removal), the report
was about to drown in noise. This commit adds the full aggregation /
persistence / drill-down stack so each MC is actionable, not just
counted.

A1 mc_scorecard.py (new):
  build_scorecard(checks)    -> per-regulation PASS/FAIL/SKIP + severity
  top_fails(checks, n)       -> N most severe failed MCs
  full_audit_records(...)    -> flat rows ready for sidecar SQLite

A2 Email rendering:
  agent_doc_check_scorecard.py (new) builds an HTML scorecard table
  (regulation × passed/failed/HIGH/MEDIUM/score) shown at the top of
  the email. agent_doc_check_report._render_document now collapses
  the 500-MC L2 forest into 'X/Y bestanden (Z Fail)' summary plus
  a top-10 fails block per doc — old verbose render is gone.

A3 compliance_audit_log.py (new) — sidecar SQLite at
  /data/compliance_audits.db (separate from compliance Postgres
  schema to comply with the no-new-migrations rule in CLAUDE.md):
    check_runs(check_id, ts, tenant_id, site_name, base_domain,
               doc_count, scorecard json, vvt_summary json)
    mc_results(check_id, doc_type, mc_id, label, passed, skipped,
               severity, regulation, matched_text, hint)
  Route persists every run after the email is sent.
  docker-compose.yml adds compliance-audit volume + env.

A4 backfill_mc_regulation_llm.py (new) — Qwen-tagged backfill for
  the 1636 MCs the regex pass couldn't classify. Batches of 25,
  format=json, output constrained to the canonical regulation list.
  Run manually: docker exec bp-compliance-backend python3 \
                 /app/scripts/backfill_mc_regulation_llm.py [--dry-run]

A5 Admin audit tab — GET /api/compliance/agent/audit/<check_id>
  proxied via /api/sdk/v1/agent/audit/<id>. New page
  /sdk/agent/audit/[checkId] renders scorecard + filterable MC table
  (status / doc_type / regulation, expandable rows with matched_text
  + hint). ComplianceCheckTab now shows 'Voll-Audit oeffnen' link.

A6 Trend per tenant — GET /api/compliance/agent/audit/tenant/<id>
  returns recent runs. Email scorecard shows per-regulation delta
  badges ('(+12%)', '(-3%)') compared with the previous run for the
  same tenant + base_domain. Lookup is one SQLite query.

Plumbing:
  rag_document_checker.py — SELECT now includes 'article'; MC results
    carry 'regulation' + 'article' through to CheckItem.
  agent_doc_check_routes.CheckItem schema gains regulation + article
    fields (defaults '') so old clients still parse.
  agent_compliance_check_routes — response gains 'check_id' so the
    frontend can build the audit link.
2026-05-17 13:45:58 +02:00
Benjamin Admin 6d29191e9b fix(vvt): score INTERNAL/GROUP without opt-out/privacy penalty
User feedback after BMW test:
- 60 'BMW AG — XYZ' rows were rendered as ✗ for Opt-Out/Privacy and
  scored 38-52%. That's misleading: BMW processing for itself doesn't
  need a separate opt-out URL (cookie-banner is the consent
  mechanism) or a separate privacy policy (main DSI covers it).
- Title 'Anbieter' was wrong for 60 of 90 rows (internal services).

Three orthogonal fixes:

1. score_vendors becomes recipient_type aware:
   - INTERNAL/GROUP_COMPANY: opt_out_url, privacy_policy_url, country
     are NOT required (the user's main DSI + cookie-banner cover them).
     What IS required: name, purpose, cookies disclosed with name +
     expiry. Cookies-disclosure weight raised to 50 (was 15) so the
     VVT-relevant data is the score driver.
   - 'necessary' category: opt-out still skipped (§25 Abs. 2 TDDDG).
   - External (PROCESSOR/CONTROLLER): existing strict scoring stays.

2. _link_status_badge accepts na_label and renders a neutral em-dash
   with explanation tooltip instead of red ✗ when the column doesn't
   apply to that row. _render_vendor_row_full passes na_label based on
   recipient_type:
     - INTERNAL/GROUP -> 'Nicht erforderlich (eigene Verarbeitung)'
     - necessary       -> 'Nicht erforderlich (§25 Abs. 2 TDDDG)'

3. Header + summary clarify the split:
   - h3 changed to 'Verarbeitungstaetigkeiten und Empfaenger aus der
     Cookie-Richtlinie' (was 'Drittanbieter aus Cookie-Richtlinie').
   - Top line: '90 Verarbeitungen erfasst — 60 eigene + 30 externe
     Empfaenger'.
   - Disclaimer below: explains the INTERNAL/GROUP exemption so the
     reader understands why those rows don't show ✗ for missing URLs.
   - Section labels enriched with the relevant DSGVO article:
     'Eigene Verarbeitungstaetigkeiten — fuer das VVT (Art. 30)',
     'Auftragsverarbeiter — AVV erforderlich (Art. 28)',
     'Joint Controller — Vereinbarung pruefen (Art. 26)'.

Expected BMW result after fix: ~85% of the 60 BMW-AG rows jump from
~52% to 90-100% (the real issue, fehlende Cookies-Disclosure, stays
flagged). The only true findings remaining are external links that
return 4xx (e.g. Criteo 403, Teads 404).
2026-05-17 13:15:40 +02:00
Benjamin Admin 8a44e67293 feat(compliance-check): unlock all 1874 MCs + close gap-table items
User: 'wir haben 1800 MCs erstellt um sie zu 10% zu nutzen — das ist
Schwachsinn'. Fixed all 6 gaps from the audit.

#1 max_controls=0 (was 20):
- agent_compliance_check_routes _check_single: passes max_controls=0 to
  check_document_with_controls -> ALL MCs evaluated per doc_type.
- 8 doc_types now use 1874 MCs instead of 160 (10x coverage).
- Regex matching is cheap (<1s per doc); LLM-enrich cap of 10 stays.

#2 LLM-verify fixed:
- llm_verify.py was getting 0/N parsed. Causes: qwen3 thinking-mode
  wrapped output in <think>...</think>, /api/generate doesn't enforce
  JSON, prompt didn't handle code-fence wrappers.
- Now uses /api/chat with format='json' (forces valid JSON).
- _parse_batch_response strips <think> tags, accepts {results:[...]}
  AND bare [...], adds richer regex-fallback parse, logs raw head on
  total parse failure for diagnosis.

#3 Loeschkonzept checklist (new):
- doc_checks/loeschkonzept_checks.py — 9 L1 + 7 L2 checks per DIN 66398
  + Art. 5(1)(e)/17/32 DSGVO: scope+responsibility, data categories,
  retention periods, legal basis refs (HGB/AO/BGB), deletion trigger,
  deletion process+technical+systems, deletion proof, exceptions +
  Art. 18 lock, review cycle, DSGVO references.
- runner.py registered for loeschkonzept/loeschung/loeschfristen.

#4 regulation backfill script:
- backend-compliance/scripts/backfill_mc_regulation.py — regex-detects
  DSGVO/TDDDG/TMG/BGB/HGB/AO/MStV/UWG/VSBG/PAngV/GwG/BDSG/EU-VO
  references in MC title+question+pass_criteria, UPDATEs regulation +
  article fields.
- Idempotent (only NULL rows), --dry-run flag, batched 200/UPDATE.
- Run inside container: docker exec bp-compliance-backend python3 \
    /app/scripts/backfill_mc_regulation.py

#5 MC alias-fallback:
- rag_document_checker._MC_ALIAS_FALLBACK maps doc_types without own
  MCs to a related set: nutzungsbedingungen->agb, social_media->dse,
  sub_processor/scc/tom_annex->avv, loeschfristen->loeschkonzept,
  eu_institution/dsb->dse.
- _load_controls retries with the alias when the primary query
  returns 0 rows.
- 14 additional doc_types now get MC coverage transparently.

#6 cross-domain auto-discovery:
- _autodiscover_missing builds a crawl plan: primary submitted base
  + up to 2 related domains sharing the owner SLD (e.g. BMW Group:
  bmw.de + bmwgroup.com + bmwgroup.jobs).
- Detection: regex over submitted texts for https?://...<owner>...
  hostnames distinct from the primary base.
- Each crawled base contributes documents + cmp_payloads to the
  discovery pool.

Net effect for BMW: 1874 MCs evaluated (90 from cookie alone, was
20), Loeschkonzept Pflichtangaben benoten-bar, LLM overturns false
regex FAILs, Joint-Controller policies on bmwgroup.jobs (Social
Media) jetzt entdeckbar. Same wins will apply to CRA-Compliance check.
2026-05-17 13:07:50 +02:00
Benjamin Admin fab1e35847 feat(vvt): recipient-type classification + 3-section VVT table
Per user request: BMW (and others) put their own services AND external
vendors in the same cookie-policy widget. The VVT-Tabelle now groups
them by Art. 30(1)(d) DSGVO recipient category so the DSB can act on
the right buckets:

  - INTERNAL      — owner processing for itself ('BMW AG — XYZ')
  - GROUP_COMPANY — same brand family, different legal entity ('BMW Bank')
  - PROCESSOR     — Auftragsverarbeiter, AVV-pflichtig (Adobe, Akamai)
  - CONTROLLER    — independent / joint controller (Meta Pixel, Google
                    Ads, LinkedIn — they run their own profiles)
  - AUTHORITY     — government bodies (rare in cookies)
  - OTHER         — fallback

New module vendor_classifier.py:
- owner_from_url(url) — derive site-owner token (bmw.de -> 'BMW',
  mercedes-benz.de -> 'Mercedes-Benz')
- classify(name, category, owner) — strict 5-tier heuristic:
  * INTERNAL: vendor name first-token is '<Owner>' / '<Owner> AG' /
    '<Owner> SE' / '<Owner> GmbH' / '<Owner> AG & Co. KG'
  * GROUP_COMPANY: starts with '<Owner> ' but isn't '<Owner> AG'
  * CONTROLLER: matches a known joint-controller list (Meta, Google
    Ads, YouTube, LinkedIn Insight, TikTok, Pinterest, Taboola,
    Outbrain, Criteo, Twitter, Reddit, ...)
  * PROCESSOR: legal-form suffix in name (GmbH, AG, Inc., A/S,
    B.V., S.A., Ltd., LLC, ...)
  * OTHER: anything else

vendor_extractor.extract_vendors_from_payloads now takes owner_name:
- Passes it through to classify() for every extracted vendor record
- The route derives owner_name via _company_name_from_url(doc_entries)
- LLM-extracted vendors are classified the same way (so V3 fallback
  also produces tagged records)

agent_doc_check_extras.build_vvt_table_html rewritten:
- Buckets vendors by recipient_type
- Renders one section per non-empty bucket, in canonical order
  (RECIPIENT_TYPE_SECTIONS), each with section header + count + bad
  count + nested table
- Within each section: sorted by compliance_score ascending
- Response JSON cmp_vendors includes recipient_type so the frontend
  can later import per-category into the VVT module

Expected BMW result: ~60 INTERNAL rows (BMW AG own services),
~25 PROCESSOR rows (Adobe, Adform, Akamai, AWS, ...), ~5 CONTROLLER
rows (Meta Pixel, Google, LinkedIn, Pinterest, Outbrain, Taboola).
2026-05-17 12:31:49 +02:00
Benjamin Admin 6c7d4c7552 fix(vvt): correct ePaaS schema mapping + category-aware scoring
The first BMW VVT table rendered all 24 providers at 20% score because
the ePaaS extractor was reading the wrong field names. Actual schema is
nested: providers[].processings[].persistences[], NOT providers[] alone.

Correct ePaaS schema (verified against bmw.com/epaas/.../de_DE.epaas.json):
  Provider:    {id, name, description, processings[]}
  Processing:  {id, name, description, categoryId, optOutLink,
                privacyPolicyLink, persistences[]}
  Persistence: {id, name, domain, type, expiry, description}

Two structural changes:

1. One row per processing (not provider). BMW has 26 providers but ~91
   processings spread across them (Adobe alone has ACMProcessing,
   AdobeAnalytics, AdobeCampaign, AdobeTargetAnalytics, AdobeTargetPers.).
   The cookie widget displays each processing separately — VVT now
   mirrors that. Display name format: 'Provider Name — Processing Name'.

2. Read optOutLink/privacyPolicyLink from PROCESSING (where they live),
   not provider. Persistences flatten to cookies[] with name + expiry +
   description.

Plus category mapping:
  advertising -> marketing
  strictlyNecessary -> necessary
  statistics -> statistics
  functional -> functional

Category-aware scoring (cookie_link_validator.score_vendors):
- 'necessary' (technisch erforderliche, §25 Abs. 2 TDDDG): no opt-out
  required, no country required. Score weight shifts to purpose +
  cookie disclosure (essential cookies must list names + expiry).
- All other categories: opt-out URL still mandatory; missing opt-out
  flags 'no_opt_out_url' and zeros that block of points.

Expected BMW result after this fix:
- ~91 rows (Adobe Analytics, Adform Retargeting, Akamai Infrastructure,
  AWS, ..., plus ~60 strictlyNecessary processings)
- Marketing rows with present opt-out → ~75-90%
- Necessary rows with cookie+expiry → ~85-95%
- Rows missing fields → still flagged
2026-05-17 11:19:31 +02:00
Benjamin Admin 873997c13b feat(vvt): V3 — LLM vendor extraction fallback for unknown CMPs
When the cookie text has no captured CMP payload (long-tail sites that
don't use ePaaS/OneTrust/Cookiebot/etc.) we now fall back to a Qwen → OVH
LLM cascade to extract a structured vendor list from the policy text.

New module backend/compliance/services/vendor_llm_extractor.py:
- extract_vendors_via_llm(cookie_text): runs Qwen first (local Ollama),
  then OVH if Qwen returns nothing usable.
- System prompt instructs the model to return STRICT JSON only:
  {vendors: [{name, country, purpose, category, opt_out_url,
   privacy_policy_url, persistence, cookies: [...]}]}
- Lenient JSON parser tolerates code-fences, prose wrappers, dict vs list.
- _normalize() caps array sizes (80 vendors, 30 cookies each), validates
  URLs (must be http(s)), trims fields to reasonable lengths.

Route integration (agent_compliance_check_routes.py):
- After named-CMP extract: if cmp_vendors is empty AND the cookie text
  has ≥500 words (otherwise it's likely navigation chrome), invoke the
  LLM extractor. Progress message 'Vendor-Liste per LLM extrahieren...'.
- Vendors then run through the same validate_vendor_urls + score_vendors
  pipeline → VVT table rendered identically regardless of source.

docker-compose.yml: backend-compliance gains OLLAMA_URL, CMP_LLM_MODEL,
OVH_LLM_URL/KEY/MODEL env vars (same names as consent-tester so the
configuration is unified).

This closes the 'every site eventually gets a VVT table' goal:
- Known CMP → V1/V2 structured extraction (fast, exact)
- Unknown CMP → V3 LLM extraction (slow, best-effort)
- No text at all → no vendors, but other compliance checks still run.
2026-05-17 09:55:42 +02:00
Benjamin Admin 9c0cc0f59f feat(vvt): V2 — vendor extractors for Cookiebot/Usercentrics/Didomi/TrustArc
Backend vendor_extractor.py gets 4 new per-CMP dispatchers, mirroring the
JSON schemas observed in each platform:

- Cookiebot: 'Categories[*].Cookies[*]' with Vendor/Host, expiry, purpose
- Usercentrics: 'services[*]' with cookieMaxAgeSeconds, processingCompanyCountry
- Didomi: 'app.vendors[*]' with country + policyUrl
- TrustArc: 'vendors[*]' + per-category 'Cookies' with provider

All 6 named CMPs (ePaaS, OneTrust, Cookiebot, Usercentrics, Didomi,
TrustArc) plus the generic-shape fallback are now mapped — every site
hitting Phase B of the cascade gets a structured vendor list, scored
opt-out links, and a VVT-Tabelle in the email.
2026-05-17 09:52:10 +02:00
Benjamin Admin ea4dbb223f feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1)
When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now
extract structured vendor records, probe their opt-out + privacy URLs,
score each vendor (0-100), and append a 'VVT-Vorschlag' table to the
compliance email — one row per vendor, sortable by compliance score.

consent-tester:
- DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers
- DSIDiscoveryResponse: new cmp_payloads field
- discover_dsi_documents sets cmp_payloads from cmp_capture
- cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning
  list[VendorRecord]

backend:
- _fetch_text() now returns (text, cmp_payloads) tuple
- doc_entries store cmp_payloads per doc (mostly cookie)
- _autodiscover_missing forwards homepage payloads to the cookie entry
- New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic
  schemas; dedupes vendors across multiple payloads
- cookie_link_validator.py extended with validate_vendor_urls(vendors)
  and score_vendors(vendors) — 0-100 score per vendor based on name,
  purpose, country, opt-out reachable, privacy URL reachable, cookies
  with names + expiry
- agent_doc_check_extras.build_vvt_table_html: renders the table
- Route appends VVT HTML after the provider list, before the
  document-by-document report
- Response JSON gains cmp_vendors for future frontend rendering

Example for BMW: ~30 ePaaS providers → table with Name | Kategorie |
Sitz | Cookies | Opt-Out (✓/✗) | Privacy (✓/✗) | Score. Sorted by
score ascending so the worst-compliant vendors are at the top.
2026-05-17 09:50:11 +02:00
Benjamin Admin c9c0fb5965 feat(cookie-check): enhanced patterns + active opt-out link validator
cookie_checks.py:
- cookie_names_listed: now also matches CMP placeholder notation
  (BMW: 'Adfpc###', 'CT###') and 'Diese Datenverarbeitung verwendet die
  folgenden Cookies oder ähnliche Technologien' as list-shape signal.
  Cryptic vendor names like 'audience', 'adformfrpid' are accepted via
  the surrounding markup, not by hard-coding each one.
- cookie_providers_named: new pattern 'Gesetzt von: <Firma>' (BMW/ePaaS
  per-cookie vendor naming) + recognition of full legal-form names
  (Adform A/S, BMW AG, Adobe Systems Software Ireland Limited).
- cookie_duration_values: now matches 'Ablauf: 1 Jahr' / 'Speicherdauer:
  30 Tage' (BMW format) in addition to the legacy '<n> <unit>'.

New L1 + L2 checks for controller in cookie-policy:
- cookie_controller (L1): the cookie policy must name Verantwortlich(er)
- cookie_controller_address (L2): PLZ + Ort or address keywords
- cookie_controller_contact_or_link (L2): email/phone OR link back to
  Datenschutzerklärung (the practical equivalent — BMW does this)

New L2 checks (parented under opt_out):
- cookie_optout_links: detects per-provider opt-out URLs in the text
- cookie_privacy_policy_links: per-provider privacy-policy URLs

New service: cookie_link_validator.py
- extract_links(text): pulls all https?://… URLs that follow 'Opt-Out
  Link:' / 'Link zur Privacy Policy:' (deduped)
- validate_links(links): probes every URL concurrently (HEAD first, GET
  fallback for 405/403). 10 parallel, 8s per request, 60s batch cap.
  Returns reachable=True/False + status + final_url.
- build_check_items(): renders 2 CheckItems (opt-out + privacy-policy),
  each pass if ALL links 2xx/3xx, fail with up-to-5 broken-link examples.

Hook in _check_single: doc_type=='cookie' triggers the validator after
regex+MC checks. Recomputes correctness with the new L2 items.

This addresses two concrete BMW observations:
1. BMW's per-cookie structure (Name + Zweck + Ablauf, Gesetzt von: …,
   Opt-Out Link: …) now recognised → 'Konkrete Cookie-Namen aufgelistet'
   and 'Konkrete Speicherdauern' should pass.
2. Defective opt-out URLs surface as compliance findings rather than
   silently passing — Art. 7(3) DSGVO requires a working withdrawal
   path per provider.
2026-05-17 09:38:32 +02:00
Benjamin Admin b090662524 fix(compliance-check): respect auto-discovery 'not found' verdict; DSB not canonical
Two related bugs in the BMW test result:

1. AGB rendered as 'MANGELHAFT 0/13' even though BMW has no public AGB:
   - Auto-discovery correctly returned 'not found' for AGB (no link on
     bmw.de matches AGB keywords).
   - But auto_fill_from_dsi then found the substring 'AGB' in a section
     of the DSI and pseudo-filled the AGB entry with a 264-word DSI
     fragment.
   - cross_search_documents would have done the same.
   - Both now skip entries where discovery_attempted=True AND
     auto_discovered=False — the 'not found' verdict stands.

2. DSB-Kontakt rendered as a separate 100% OK document with 7566 words
   = the entire DSI text:
   - GDPR practice: the DSB is named *inside* the DSI as an email or
     contact block (Art. 13(1)(b)), not as a stand-alone page.
   - cross_search_documents had been assigning the full DSI to the DSB
     row because it matched 'datenschutzbeauftragte' keywords.
   - DSB removed from _ALL_DOC_TYPES — no longer canonical, no longer
     padded as missing, no longer auto-discovered. The frontend row
     remains so a tenant with a separate DSB page can still submit one.

After this fix BMW should render:
- DSE: OK
- Impressum: LUECKENHAFT (unchanged — regex gaps to fix separately)
- Cookie-Richtlinie: OK
- Social Media: NICHT GEFUNDEN (bmw.de does not link to it)
- AGB: NICHT GEFUNDEN (correct — BMW has no public AGB)
- Nutzungsbedingungen: NICHT GEFUNDEN
- Widerruf: NICHT GEFUNDEN
2026-05-17 01:53:09 +02:00
Benjamin Admin bc21480a2a fix(compliance-check): always render 8 doc types + 4 BMW GT-gap fixes
Always-show-8 (user-requested):
- agent_compliance_check_routes.py: _pad_results_with_missing pads the
  results list to always include all 8 canonical doc_types in canonical
  order. Missing types get a placeholder DocCheckResult with error=
  'Nicht eingereicht' + scenario='missing'.
- agent_doc_check_report.py: NICHT EINGEREICHT status label (neutral),
  friendly grey body block instead of red error.
- ChecklistView.tsx: 'Nicht eingereicht' chip (neutral grey, not red
  'Fehler'); SCENARIO_LABELS adds missing entry + header chip counter.

Impressum-Regression fix (#18):
- _fetch_text(url, doc_type): cookie/dse/social_media -> max_documents=1
  (CMP capture authoritative, sub-pages dilute). Other types -> =3
  (Impressum needs Versicherungsvermittler, Aufsicht, Berufsrecht sub-
  pages). 15s networkidle bail keeps timing safe.

ODR/Verbraucherstreitbeilegung filter (#19):
- _apply_profile_filter: when profile.needs_odr=True (B2C), override the
  check's default B2B-oriented hint with action-oriented B2C guidance
  pointing at Art. 14 EU-VO 524/2013 + §36 VSBG. Previously the check
  contradicted itself: 'profile says B2C' + hint 'only relevant for B2C
  online vendors'.

Registergericht regex (#20):
- impressum_checks.py: accept colon/dot/dash between keyword and city
  (BMW writes 'registergericht: münchen hrb 42243'). Add 'sitz und
  registergericht: X' as separate pattern.

Industry detection (#21):
- business_profiler.py: 'automotive' keywords broadened (antriebs,
  motor, leasing, werkstatt, probefahrt, plus brand names BMW/Mercedes/
  Audi/VW/Porsche/Opel). 'it_services' keywords narrowed — software/
  cloud/hosting are mentioned in every privacy policy and were biasing
  the result toward IT for any tech-aware company.
2026-05-17 01:03:58 +02:00
Benjamin Admin e61e9d9e2a feat(agent): progress_pct + 6 BMW-Run Verbesserungen
Backend (agent_compliance_check_routes.py):
- progress_pct (0-100%) im Job-State, ueber alle Phasen verteilt
  (Laden 0-30, Profil 35-40, Pruefen 40-80, Banner 80-92, Report 95-100)
- Status-Texte vereinheitlicht ("Texte laden X/N", "Pruefen X/N")
- Firmenname fuer Email-Subject jetzt aus URL abgeleitet
  (bmw.de -> "BMW", mercedes-benz.de -> "Mercedes-Benz") statt
  unzuverlaessigem extracted_profile.companyName (matchte oft juris.de)
- E-Mail-Report enthaelt jetzt Banner+TCF-Vendor-Liste (build_provider_list_html)

Backend (agent_doc_check_extras.py — neu):
- build_scanned_urls_html: gepruefte URLs als Tabelle oben im Report
  (transparent fuer GF, welche Quellen wirklich gezogen wurden)
- Cross-Domain-Hinweis bei >1 netloc (BMW: bmw.de / bmwgroup.com /
  bmwgroup.jobs — Auffindbarkeit nach Art. 12 DSGVO)
- build_provider_list_html: Banner-Box + TCF-Vendor-Tabelle mit Spalten
  Name | Kategorie | Zweck | Drittland | Rechtsgrundlage

Backend (business_profiler.py):
- §34d-GewO Versicherungsvermittler-Hinweise zaehlen nicht mehr als
  "finance"-Industrie (BMW wurde dadurch falsch als B2B/finance erkannt)
- Neue Industry "automotive" (Fahrzeug/KFZ/Konfigurator/Modellpalette)
- B2B-Keywords: generische Begriffe wie "unternehmen", "beratung",
  "consulting" entfernt (matchten in jedem Konzerntext)
- B2C-Fallback: bei Verbraucher-Signalen ("widerruf", "kunde",
  redaktioneller Inhalt) tendiert auf b2c statt b2b

Frontend (ComplianceCheckTab.tsx):
- Progress-Balken mit Width-% und XX%-Anzeige rechts
- liest data.progress_pct aus Polling-Response

Consent-Tester (dsi_discovery.py):
- Cookie-Policy-Extraktion kritisch fixt: wait_for_function bis
  body.innerText > 500 chars (BMW SPA-Rendering brauchte mehr Zeit)
- _extract_text_robust: 3-Strategien-Extraktion (Selektoren -> Body-
  Cleanup -> P/LI/TD-Tags)
- _extract_text_from_iframes: liest OneTrust/Sourcepoint/Usercentrics
  Iframe-Inhalte (manche Cookie-Policies leben dort)

Adressiert alle Findings aus dem BMW-Ground-Truth-Vergleich.
2026-05-16 17:53:14 +02:00
Benjamin Admin bd2d6976d6 fix(cross-doc): also check entries with wrong text, not just empty ones
Cross-search now validates if existing text matches the expected
doc_type using keyword scoring. If text is present but doesn't match
(e.g. Nutzungsbedingungen in Widerruf row), searches other texts
and creates a finding explaining the mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 00:19:40 +02:00
Benjamin Admin 4e9043f26d feat(cross-doc): search all texts for all doc_types + misplacement finding
Cross-Document Intelligence: When a doc_type row is empty, searches
ALL other loaded documents for that content. If found (e.g. Widerruf
in AGB), extracts the section, runs the check, AND creates a finding:
"Widerrufsbelehrung in falschem Dokument gefunden — schwer auffindbar"

Keywords for: widerruf, cookie, social_media, impressum, agb, dsb.
Integrated as Step 1c in compliance check pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 23:19:39 +02:00
Benjamin Admin 33bf2b7c5a feat(service-detector): detect 118 services in legal texts (was 20)
Build + Deploy / build-admin-compliance (push) Successful in 2m5s
Build + Deploy / build-backend-compliance (push) Successful in 3m26s
Build + Deploy / build-ai-sdk (push) Successful in 56s
Build + Deploy / build-developer-portal (push) Successful in 1m29s
Build + Deploy / build-tts (push) Failing after 1m48s
Build + Deploy / build-document-crawler (push) Successful in 44s
Build + Deploy / build-dsms-gateway (push) Successful in 28s
Build + Deploy / build-dsms-node (push) Successful in 17s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m45s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 52s
CI / test-python-backend (push) Successful in 36s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 14s
New service_detector.py uses service_registry (88 entries) plus 30+
extra text patterns to detect services mentioned in DSI/legal texts.

Results on Spiegel: 31/32 services detected (97%, was 5/32 = 16%).
Includes metadata: name, category, country, EU adequacy status.

- Profiler now uses detect_services_in_text() instead of 20-entry list
- Profile extractor adds detected_services with full metadata
- Auto-generates scope hint for non-EU services (Drittlandtransfer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:00:15 +02:00
Benjamin Admin 5e317d2f0f fix: text extraction 50k char limit was root cause of all Spiegel FNs
Build + Deploy / build-admin-compliance (push) Successful in 18s
Build + Deploy / build-backend-compliance (push) Successful in 12s
Build + Deploy / build-ai-sdk (push) Successful in 10s
Build + Deploy / build-developer-portal (push) Successful in 10s
Build + Deploy / build-tts (push) Successful in 10s
Build + Deploy / build-document-crawler (push) Successful in 9s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 15s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m46s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 41s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 13s
Build + Deploy / trigger-orca (push) Successful in 2m13s
ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars.
Spiegel DSI has 107,720 chars (13,705 words) — only 47% was extracted.
DSB, Art. 77, Betroffenenrechte were all in the truncated portion.

Fixes:
1. Raise text limit from 50k to 200k chars in API response + discovery
2. click_button(): add iframe fallback for Sourcepoint/Quantcast
3. dsi_helpers: iterate ALL page.frames for consent buttons
4. Profiler: only check impressum (not full text) for regulated professions,
   and "rechtsanwalt" must be in first 500 chars (company description)
5. GT: save full Spiegel DSI text (13,705 words) as reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:22:38 +02:00
Benjamin Admin c702260ec1 fix: 5 regex bugs + text extraction scroll + GT update
Build + Deploy / build-admin-compliance (push) Successful in 13s
Build + Deploy / build-backend-compliance (push) Successful in 23s
Build + Deploy / build-ai-sdk (push) Successful in 13s
Build + Deploy / build-developer-portal (push) Successful in 14s
Build + Deploy / build-tts (push) Successful in 15s
Build + Deploy / build-document-crawler (push) Successful in 13s
Build + Deploy / build-dsms-gateway (push) Successful in 15s
Build + Deploy / build-dsms-node (push) Successful in 14s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m26s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 39s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
Root cause: Spiegel DSI text was truncated (lazy-loading) — the
rights/DSB/complaints sections at the bottom were never extracted.

Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel
   "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media
   rows from sections found in the DSI text

Ground Truth 06-spiegel.md fully rewritten with verified data from
live website — 3 L1 False Negatives identified (DSB, Beschwerderecht,
Betroffenenrechte all present on website but not in extracted text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 01:20:55 +02:00
Benjamin Admin 0b9150f16f feat(vendor-assessment): Pruefprotokoll + Frontend + Sidebar
Build + Deploy / build-admin-compliance (push) Successful in 2m16s
Build + Deploy / build-backend-compliance (push) Successful in 3m27s
Build + Deploy / build-ai-sdk (push) Successful in 58s
Build + Deploy / build-developer-portal (push) Successful in 1m13s
Build + Deploy / build-tts (push) Successful in 1m43s
Build + Deploy / build-document-crawler (push) Successful in 45s
Build + Deploy / build-dsms-gateway (push) Successful in 30s
Build + Deploy / build-dsms-node (push) Successful in 19s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m35s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 43s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m33s
Phase 4-5: Professional Pruefprotokoll report builder with styled HTML
output (Kopfdaten, Kategorie-Scores, L1/L2 Check-Hierarchie, Findings,
Freigabe-Block). Frontend at /sdk/vendor-assessment with 3-step flow:
DocumentUploader → AssessmentProgress → PruefprotokollView.

Sidebar: "Use-Case Audits" → "Vertragspruefung" renamed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 23:24:12 +02:00
Benjamin Admin 0326d5baab feat(vendor-assessment): AVV/SCC/TOM/Sub-Processor checklists + assessment service
Phase 1-3 of the Vendor Contract Assessment:

Backend checklists (Doc-Check L1/L2 engine compatible):
- avv_checks.py: 28 checks (11 L1 + 17 L2) for Art. 28(3) DSGVO
- scc_checks.py: 7 checks for EU SCC 2021 (modules, annexes, TIA)
- tom_annex_checks.py: 12 checks for Art. 32 (8 control objectives)
- sub_processor_checks.py: 7 checks for sub-processor list completeness

Assessment service:
- POST /vendor-compliance/assessments — async contract analysis
- GET /vendor-compliance/assessments/{id} — poll status
- Cross-check engine: detects missing SCC when AVV mentions third-country,
  missing TOM annex, missing sub-processor list

All checklists registered in runner.py CHECKLIST_MAP (27 doc_types total).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 23:14:54 +02:00
Benjamin Admin c867478791 feat(tcf-vendors): GVL cache + vendor extraction + VVT mapping
Build + Deploy / build-admin-compliance (push) Successful in 14s
Build + Deploy / build-backend-compliance (push) Successful in 16s
Build + Deploy / build-ai-sdk (push) Successful in 20s
Build + Deploy / build-developer-portal (push) Successful in 12s
Build + Deploy / build-tts (push) Successful in 15s
Build + Deploy / build-document-crawler (push) Successful in 13s
Build + Deploy / build-dsms-gateway (push) Successful in 13s
Build + Deploy / build-dsms-node (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m49s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 38s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 23s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m23s
Phase 1-2 of the closed quality loop:
- GVL cache (consent-tester/services/gvl_cache.py): downloads and caches
  IAB Global Vendor List with 24h TTL, resolves vendor IDs to names,
  purposes, policy URLs, retention, country
- Vendor extraction (consent_interceptor.py): extract_tcf_vendors()
  reads __tcfapi after accept phase, resolves via GVL
- Scan response: tcf_vendors field added to /scan endpoint
- VVT mapper (vendor_vvt_mapper.py): maps TCF vendors to VVT format
  with purpose labels, Rechtsgrundlage, Drittland detection
- Vendor cross-check (banner_cookie_cross_check.py): checks all TCF
  vendors against DSI text — missing vendors, undocumented transfers
- Compliance check integrates Step 3d: TCF vendors vs DSI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 18:18:50 +02:00
Benjamin Admin 7be34552bb feat(compliance-check): profile extraction + scenario classification
Build + Deploy / build-admin-compliance (push) Successful in 15s
Build + Deploy / build-backend-compliance (push) Successful in 21s
Build + Deploy / build-ai-sdk (push) Successful in 46s
Build + Deploy / build-developer-portal (push) Successful in 12s
Build + Deploy / build-tts (push) Successful in 13s
Build + Deploy / build-document-crawler (push) Successful in 11s
Build + Deploy / build-dsms-gateway (push) Successful in 11s
Build + Deploy / build-dsms-node (push) Successful in 14s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m46s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 47s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / trigger-orca (push) Successful in 2m29s
- New profile_extractor.py: extracts Company Profile fields (name,
  legal form, address, DPO, USt-IdNr) and Compliance Scope hints
  (Art. 9 data, third country, profiling) from document texts
- Scenario per document: regenerate (<30%), fix (30-95%), import (>95%)
- Widerruf for B2B: no longer skipped, instead all checks flagged as
  INFO with "not needed for B2B" hint
- Move _build_profile_html to report builder module
- DocCheckResult gets scenario field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 17:34:33 +02:00
Benjamin Admin be9cfdc2d4 feat(compliance-check): skip Widerruf for B2B, limit MCs, fix industry
Build + Deploy / build-admin-compliance (push) Successful in 2m1s
Build + Deploy / build-backend-compliance (push) Successful in 4m20s
Build + Deploy / build-ai-sdk (push) Successful in 53s
Build + Deploy / build-developer-portal (push) Successful in 2m6s
Build + Deploy / build-tts (push) Successful in 2m48s
Build + Deploy / build-document-crawler (push) Successful in 52s
Build + Deploy / build-dsms-gateway (push) Successful in 11s
Build + Deploy / build-dsms-node (push) Successful in 13s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m45s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m17s
- Skip Widerrufsbelehrung check entirely for B2B/B2G businesses
- Limit MC checks to top 20 per doc_type (by severity) to reduce noise
  (e.g. 75 impressum MCs → 20, avoiding 55 irrelevant FAILs)
- Add consulting/manufacturing industry keywords (arbeitssicherheit,
  brandschutz, werkzeugbau, etc.)
- Lower industry detection threshold from 2 to 1 keyword hit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 17:03:57 +02:00
Benjamin Admin b42e1cd091 feat(cmp): timezone→geo_country mapping + timezone parameter
Build + Deploy / build-admin-compliance (push) Successful in 2m10s
Build + Deploy / build-backend-compliance (push) Successful in 5m20s
Build + Deploy / build-ai-sdk (push) Successful in 57s
Build + Deploy / build-developer-portal (push) Successful in 1m15s
Build + Deploy / build-tts (push) Successful in 2m3s
Build + Deploy / build-document-crawler (push) Successful in 53s
Build + Deploy / build-dsms-gateway (push) Successful in 38s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m40s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 48s
CI / test-python-backend (push) Successful in 44s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 25s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m32s
Add _resolve_geo_from_timezone() with 35-country IANA timezone map.
Accept timezone field in ConsentCreate schema and pass through to service.
Populate geo_country automatically from browser timezone.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 14:43:13 +02:00
Benjamin Admin 4a7e09bbb0 fix(impressum): regex [A-Z] never matches on lowercased text
Build + Deploy / build-admin-compliance (push) Successful in 12s
Build + Deploy / build-backend-compliance (push) Successful in 14s
Build + Deploy / build-ai-sdk (push) Successful in 20s
Build + Deploy / build-developer-portal (push) Successful in 13s
Build + Deploy / build-tts (push) Successful in 12s
Build + Deploy / build-document-crawler (push) Successful in 14s
Build + Deploy / build-dsms-gateway (push) Successful in 13s
Build + Deploy / build-dsms-node (push) Successful in 18s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m39s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 46s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
All patterns matched against text_lower but used [A-Z] character class.
Changed to [a-zA-Z] so patterns like "geschäftsführung: dr. oliver"
are found. Also added "Pflicht"/"Detail" labels to the two progress
bars to clarify what 100% vs 8% means.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 14:02:25 +02:00
Benjamin Admin 74f00bbb0f feat(compliance-check): split shared URLs into sections per doc_type
Build + Deploy / build-admin-compliance (push) Successful in 2m4s
Build + Deploy / build-backend-compliance (push) Successful in 3m39s
Build + Deploy / build-ai-sdk (push) Successful in 50s
Build + Deploy / build-developer-portal (push) Successful in 1m12s
Build + Deploy / build-tts (push) Successful in 2m16s
Build + Deploy / build-document-crawler (push) Successful in 1m9s
Build + Deploy / build-dsms-gateway (push) Successful in 35s
Build + Deploy / build-dsms-node (push) Successful in 32s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m37s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 43s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m16s
When the same URL is used for multiple document types (e.g. /datenschutz
for DSI + Cookie + DSB), the section splitter now:
- Detects duplicate URLs and fetches text only once
- Splits text at classified headings (Cookie, Google Analytics, etc.)
- Assigns matching sections to each doc_type
- DSI always keeps the full text

Extracted to section_splitter.py (170 LOC) to keep routes under 500.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 12:49:57 +02:00
Benjamin Admin 407a9503e4 fix(profiler): fix B2G false positive + add consulting/manufacturing
Build + Deploy / build-admin-compliance (push) Successful in 2m27s
Build + Deploy / build-backend-compliance (push) Successful in 3m40s
Build + Deploy / build-ai-sdk (push) Successful in 1m0s
Build + Deploy / build-developer-portal (push) Successful in 1m16s
Build + Deploy / build-tts (push) Successful in 1m54s
Build + Deploy / build-document-crawler (push) Successful in 1m2s
Build + Deploy / build-dsms-gateway (push) Successful in 31s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m44s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 49s
CI / test-python-backend (push) Successful in 36s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m23s
- Remove generic B2G keywords (behörde, amt, öffentlich) that match in
  every DSI due to "Aufsichtsbehörde", "Amtsgericht", "veröffentlichen"
- Remove "server" from it_services (too generic, appears in every DSI)
- Add consulting, manufacturing, media industries
- Add B2B fallback for GmbH/AG without B2C signals
- Add 10 ground truth files for unified compliance check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 12:20:44 +02:00
Benjamin Admin ce77cde309 fix(compliance-check): batch LLM verification + increase poll timeout
Build + Deploy / build-admin-compliance (push) Successful in 1m52s
Build + Deploy / build-backend-compliance (push) Successful in 18s
Build + Deploy / build-ai-sdk (push) Successful in 11s
Build + Deploy / build-developer-portal (push) Successful in 11s
Build + Deploy / build-tts (push) Successful in 12s
Build + Deploy / build-document-crawler (push) Successful in 14s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m35s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 42s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / trigger-orca (push) Successful in 2m24s
- LLM verify now sends ALL failed checks in one batched call instead of
  one Ollama call per check (80+ calls → 1 per document)
- Increase frontend poll timeout from 6 min to 15 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 11:49:30 +02:00
Benjamin Admin b6ad958b69 feat(compliance-check): integrate banner cross-check + extract to module
Build + Deploy / build-admin-compliance (push) Successful in 1m57s
Build + Deploy / build-backend-compliance (push) Successful in 3m20s
Build + Deploy / build-ai-sdk (push) Successful in 48s
Build + Deploy / build-developer-portal (push) Successful in 1m6s
Build + Deploy / build-tts (push) Successful in 1m43s
Build + Deploy / build-document-crawler (push) Successful in 44s
Build + Deploy / build-dsms-gateway (push) Successful in 31s
Build + Deploy / build-dsms-node (push) Successful in 18s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m40s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 47s
CI / test-python-backend (push) Successful in 38s
CI / test-python-document-crawler (push) Successful in 28s
CI / test-python-dsms-gateway (push) Successful in 20s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m26s
Add automatic banner check (Step 3b) and banner-vs-cookie cross-check
(Step 3c) to unified compliance check. Extract cross-check logic to
banner_cookie_cross_check.py to keep routes under 500 LOC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 00:08:47 +02:00
Benjamin Admin 66d30568e2 feat(dsms): Stufe 1 — Gap-Analyse Report wird in DSMS archiviert
Build + Deploy / build-admin-compliance (push) Successful in 1m41s
Build + Deploy / build-backend-compliance (push) Successful in 14s
Build + Deploy / build-ai-sdk (push) Successful in 41s
Build + Deploy / build-developer-portal (push) Successful in 10s
Build + Deploy / build-tts (push) Successful in 10s
Build + Deploy / build-document-crawler (push) Successful in 10s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 14s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m31s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 48s
CI / test-python-backend (push) Failing after 1s
CI / test-python-document-crawler (push) Successful in 32s
CI / test-python-dsms-gateway (push) Successful in 25s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m23s
- Go DSMS Client (internal/dsms/client.go): Archive() + Verify()
- Python DSMS Client (compliance/services/dsms_client.py): archive_to_dsms() + verify_dsms()
- Gap-Analyse AnalyzeProject() archiviert Report-JSON nach DSMS
- Response enthält dsms_cid wenn Archivierung erfolgreich
- Frontend: Grünes "Revisionssicher archiviert" Badge mit CID im GapDashboard
- DSMS Proxy Route (/api/sdk/v1/dsms/[...path]) für Verify-Abfragen

Stufe 2 (Evidence Upload → DSMS) und Stufe 3 (Version Chains) folgen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 23:39:26 +02:00
Benjamin Admin 397de741c1 feat(cmp): Phase 2 — script blocking + cookie tracking
Migration 108: scripts_blocked, scripts_released, cookies_set JSONB columns.
Backend models/schema/service/serializer/routes extended.
Admin detail modal shows released scripts and set cookies with categories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 22:52:26 +02:00
Benjamin Admin 051890c370 feat(cmp): restore vendor-agnostic fields + module wiring
Build + Deploy / build-admin-compliance (push) Successful in 2m0s
Build + Deploy / build-backend-compliance (push) Successful in 14s
Build + Deploy / build-ai-sdk (push) Successful in 10s
Build + Deploy / build-developer-portal (push) Successful in 14s
Build + Deploy / build-tts (push) Successful in 11s
Build + Deploy / build-document-crawler (push) Successful in 11s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 13s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m55s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 30s
CI / test-python-dsms-gateway (push) Successful in 26s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m17s
Re-add 13 vendor-agnostic columns to banner models/serializers/service
(consent_method, banner_version, device_type, browser, os, etc.) that
were lost when another session overwrote the code. Keep vendor_consents
dict from the other session.

Add list_consents method back to BannerConsentService.

Wire CookieBanner, Loeschfristen and UseCases into Document Generator
contextBridge (CMP_NAME, analytics tools, retention months, feature flags).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 21:57:54 +02:00
Benjamin Admin 0d0e705117 feat: Unified Compliance-Check — 8 document types in one form
New 3-tab structure: Website-Scan, Compliance-Check, Banner-Check.

Compliance-Check Tab (replaces Dokumenten-Pruefung + Impressum-Check):
- 8 document rows: DSI, Impressum, Social Media, Cookie, AGB,
  Nutzungsbedingungen, Widerruf, DSB-Kontakt
- Each row: URL input + "Text laden" + file upload + manual text
- "Text laden" extracts via consent-tester, shows in editable textarea
- User verifies/corrects text before checking
- Empty fields = "not present" → own finding

Business Profiler (business_profiler.py):
- Detects B2B/B2C/B2G from all documents together
- Recognizes regulated professions, online shops, editorial content
- Context-aware: INFO checks become PASS/FAIL based on profile

Backend: /compliance-check + /extract-text endpoints
Frontend: ComplianceCheckTab.tsx + DocumentRow.tsx
API proxies: compliance-check/route.ts + extract-text/route.ts

Also: Impressum regex fixes (Telefon, AG, Geschaeftsfuehrung)
and INFO severity for context-dependent checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 20:56:10 +02:00
Benjamin Admin 0c25832b5c fix: Context-aware Impressum checks + 3 regex fixes
3 Regex fixes:
- Telefon: matches '0761 / 48 98 09 01' format (spaces around /)
- Registergericht: matches 'AG Freiburg' (not just 'Amtsgericht')
- Vertretung: matches 'Geschaeftsfuehrung:' (not just 'Geschaeftsfuehrer:')

6 checks changed from FAIL to INFO severity:
- V.i.S.d.P.: only relevant if website has editorial content
- Streitbeilegung: only relevant for B2C online shops
- Berufsrecht: only relevant for regulated professions
- Stammkapital: legally required but rarely enforced
- Aufsichtsbehoerde: only for licensed activities
- Berufshaftpflicht: only for mandatory insurance

INFO checks don't count towards completeness percentage.
They appear as hints, not findings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 15:23:19 +02:00
Benjamin Admin 36c6101b91 Merge feat/zeroclaw-compliance-agent into main
Brings all compliance doc-check features:
- 162 regex checks + 1874 Master Controls
- LLM-agnostic agent with tool calling
- Banner check (46 checks, 30 CMPs, stealth, Shadow DOM)
- Impressum check (24 checks)
- Deep consent verification (DataLayer, GCM, TCF)
- CMP E2E tests (39 tests)
- HTML email reports, FAQ, persistent history

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 11:44:20 +02:00
Benjamin Admin 289ec5f396 feat(cmp): vendor-agnostic consent data model — 13 new fields
Build + Deploy / build-admin-compliance (push) Successful in 2m28s
Build + Deploy / build-backend-compliance (push) Successful in 3m48s
Build + Deploy / build-ai-sdk (push) Failing after 45s
Build + Deploy / build-developer-portal (push) Successful in 1m28s
Build + Deploy / build-tts (push) Successful in 1m48s
Build + Deploy / build-document-crawler (push) Successful in 48s
Build + Deploy / build-dsms-gateway (push) Successful in 34s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 24s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m1s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 49s
CI / test-python-backend (push) Successful in 45s
CI / test-python-document-crawler (push) Successful in 31s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 18s
Extend banner consent records with consent_method, banner_version,
banner_config_hash, geo, page_url, referrer, device info, session_id
and consent_scope for full Art. 7 DSGVO proof with any tracking vendor.

Migration 107, backward-compatible (all fields nullable).
Admin detail modal shows tracking context, device info and technical data.
Fix pre-existing str|None → Optional[str] for Python 3.9 compat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 23:12:20 +02:00
Benjamin Admin 58f370f4ff feat: LLM-agnostic Compliance Agent with tool calling
New agent architecture for intelligent MC evaluation:

agent_tools.py (367 LOC):
- 5 tools in OpenAI function-calling format
- query_controls: async DB query for MCs by doc_type
- evaluate_controls_batch: deterministic keyword matching
- search_document: text search with context
- get_document_stats: word count, sections, language
- submit_results: finalize check results

compliance_agent.py (398 LOC):
- ComplianceAgent class with agent loop
- 3 LLM providers: Ollama, OpenAI-compatible (OVH), Anthropic
- Tool call dispatch + result collection
- System prompt for systematic compliance analysis
- run_compliance_check() convenience function

Hybrid mode:
- COMPLIANCE_USE_AGENT=false (default): deterministic regex
- COMPLIANCE_USE_AGENT=true: LLM agent with tool calling
- Agent fallback to regex if LLM unavailable

Works with Qwen 35B (Ollama), Qwen 120B (OVH vLLM), Claude.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 22:56:09 +02:00
Benjamin Admin bdbc30e47b feat(cmp): unified consent view — Website-Besucher + Login-Nutzer tabs
Merges two separate consent views into one unified page at /sdk/einwilligungen:
- Tab "Website-Besucher": device-based banner consents with site selector
- Tab "Login-Nutzer": user-based DSGVO consents (existing, unchanged)

Backend:
- New endpoint GET /admin/consents for paginated banner consent records
- Fix: categories JSON string parsing (was iterating chars instead of array)

CMP Dashboard:
- Dynamic site selector replacing hardcoded "preview-test-site"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 22:41:56 +02:00
Benjamin Admin 9cbbc6ee2f feat: LLM interpretation layer for failed MC checks
Deterministic pass/fail stays unchanged. After keyword checking,
ONE batched LLM call enriches the top 10 severity FAILs with
context-specific recommendations based on the actual document.

Example: If document uses Google Analytics but lacks transfer
mechanism → LLM generates: "Sie nutzen Google Analytics (USA).
Ergaenzen Sie einen Verweis auf das EU-US Data Privacy Framework
und pruefen Sie die DPF-Zertifizierung unter dataprivacyframework.gov."

- Pass/fail: deterministic (keyword matching, reproducible)
- Hint enrichment: LLM (contextual, one call for all fails)
- Temperature 0.3 for consistency
- Graceful fallback if Ollama unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 22:08:07 +02:00
Benjamin Admin 5ea83e9b33 feat: Deterministic MC checking — ALL controls, no LLM, reproducible
Replaced LLM-based MC verification with deterministic keyword matching:
- Extracts keywords from pass_criteria/fail_criteria
- Matches against document text via regex (case-insensitive)
- PASS if >= 60% of criteria keywords found AND no fail_criteria triggered
- Same text + same MCs = same result every time

Checks ALL MCs for the doc_type (max_controls=0):
- DSE: all 571 controls checked in <1 second
- Impressum: all 75 controls
- Cookie: all 381 controls

No LLM calls needed — purely deterministic keyword matching.
Bigram extraction for compound terms (e.g. "standardvertragsklauseln").
Stop word filtering for German legal text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 21:51:58 +02:00
Benjamin Admin 26b222d53d feat: Integrate 1.874 Master Controls into document checking
Rewritten rag_document_checker.py to use doc_check_controls table
instead of generic canonical_controls. Each MC has:
- check_question: binary YES/NO for LLM
- pass_criteria: JSONB list of concrete requirements
- fail_criteria: JSONB list of common mistakes

Flow: Regex checks (fast) → LLM verify FAILs → MC deep check (15 per doc)
MC results appear as additional L2 checks in the report.

Coverage: 571 DSE, 381 Cookie, 309 Loeschkonzept, 153 Widerruf,
147 DSFA, 125 AVV, 113 AGB, 75 Impressum = 1.874 total.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 21:06:03 +02:00
Benjamin Admin 82951785ec feat: Impressum checks expanded from 16 to 24 (GAP analysis)
8 new checks: Reglementierte Berufe, Grundkapital, Aufsichtsbehoerde,
Berufshaftpflicht, rechtswidrige Disclaimer, Kammer, Berufsbezeichnung,
berufsrechtliche Regelungen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 09:29:49 +02:00