
Benjamin Admin 873997c13b feat(vvt): V3 — LLM vendor extraction fallback for unknown CMPs

When the cookie text has no captured CMP payload (long-tail sites that
don't use ePaaS/OneTrust/Cookiebot/etc.) we now fall back to a Qwen → OVH
LLM cascade to extract a structured vendor list from the policy text.

New module backend/compliance/services/vendor_llm_extractor.py:
- extract_vendors_via_llm(cookie_text): runs Qwen first (local Ollama),
  then OVH if Qwen returns nothing usable.
- System prompt instructs the model to return STRICT JSON only:
  {vendors: [{name, country, purpose, category, opt_out_url,
   privacy_policy_url, persistence, cookies: [...]}]}
- Lenient JSON parser tolerates code-fences, prose wrappers, dict vs list.
- _normalize() caps array sizes (80 vendors, 30 cookies each), validates
  URLs (must be http(s)), trims fields to reasonable lengths.
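
The parsing and normalisation steps above could look roughly like the following. This is a minimal sketch, not the code in vendor_llm_extractor.py: the names parse_lenient, normalize, extract_with_cascade and MAX_FIELD_LEN are hypothetical, the field-length limit and the dropping of nameless vendors are assumptions, and the model backends are passed in as plain callables rather than real Ollama or OVH clients.

```python
import json
from typing import Any, Callable, Iterable

MAX_VENDORS = 80     # cap stated in the commit message
MAX_COOKIES = 30     # per-vendor cap stated in the commit message
MAX_FIELD_LEN = 300  # assumption: the commit only says "reasonable lengths"
TEXT_FIELDS = ("name", "country", "purpose", "category", "persistence")
URL_FIELDS = ("opt_out_url", "privacy_policy_url")


def parse_lenient(raw: str) -> list[dict[str, Any]]:
    """Find the first JSON value in `raw` that yields a non-empty vendor list.

    Scanning for '{' or '[' skips code fences and prose wrappers; both a
    {"vendors": [...]} dict and a bare list are accepted.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(raw):
        if char not in "{[":
            continue
        try:
            data, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            data = data.get("vendors")
        if isinstance(data, list):
            vendors = [item for item in data if isinstance(item, dict)]
            if vendors:
                return vendors
    return []


def _text(value: Any) -> str:
    """Coerce to a stripped string and trim it to MAX_FIELD_LEN."""
    return "" if value is None else str(value).strip()[:MAX_FIELD_LEN]


def _http_url(value: Any) -> str:
    """Keep only http(s) URLs; anything else becomes an empty string."""
    url = _text(value)
    return url if url.lower().startswith(("http://", "https://")) else ""


def normalize(vendors: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Apply the size caps, URL validation and field trimming."""
    cleaned = []
    for vendor in vendors[:MAX_VENDORS]:
        if not _text(vendor.get("name")):  # assumption: nameless entries are dropped
            continue
        entry: dict[str, Any] = {f: _text(vendor.get(f)) for f in TEXT_FIELDS}
        entry.update({f: _http_url(vendor.get(f)) for f in URL_FIELDS})
        cookies = vendor.get("cookies")
        entry["cookies"] = (
            [_text(c) for c in cookies[:MAX_COOKIES]] if isinstance(cookies, list) else []
        )
        cleaned.append(entry)
    return cleaned


def extract_with_cascade(
    cookie_text: str, backends: Iterable[Callable[[str], str]]
) -> list[dict[str, Any]]:
    """Try each backend in order and return the first usable vendor list."""
    for call_model in backends:
        try:
            vendors = normalize(parse_lenient(call_model(cookie_text)))
        except Exception:  # sketch only: any backend failure counts as "nothing usable"
            continue
        if vendors:
            return vendors
    return []
```

Passing the backends as an ordered list keeps the Qwen-then-OVH order a matter of configuration, and lets tests substitute stub callables for the real models.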

Route integration (agent_compliance_check_routes.py):
- After named-CMP extract: if cmp_vendors is empty AND the cookie text
  has ≥500 words (otherwise it's likely navigation chrome), invoke the
  LLM extractor. Progress message 'Vendor-Liste per LLM extrahieren...'.
- Vendors then run through the same validate_vendor_urls + score_vendors
  pipeline → VVT table rendered identically regardless of source.
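
As an illustration of the gating rule above, the decision could be sketched as follows. The function names should_use_llm and collect_vendors are hypothetical, counting words by whitespace split is an assumption, and the real route in agent_compliance_check_routes.py may be structured differently.

```python
from typing import Callable

MIN_WORDS = 500  # threshold stated in the commit message
PROGRESS_MESSAGE = "Vendor-Liste per LLM extrahieren..."  # string quoted in the commit


def should_use_llm(cmp_vendors: list[dict], cookie_text: str) -> bool:
    """True only when no CMP vendors were found and the text is long enough."""
    # Assumption: words are counted by splitting on whitespace.
    return not cmp_vendors and len(cookie_text.split()) >= MIN_WORDS


def collect_vendors(
    cmp_vendors: list[dict],
    cookie_text: str,
    llm_extract: Callable[[str], list[dict]],
    report: Callable[[str], None] = lambda message: None,
) -> tuple[list[dict], str]:
    """Return (vendors, source), where source is 'cmp', 'llm' or 'none'."""
    if cmp_vendors:
        return cmp_vendors, "cmp"
    if should_use_llm(cmp_vendors, cookie_text):
        report(PROGRESS_MESSAGE)
        return llm_extract(cookie_text), "llm"
    return [], "none"
```

Whatever the source, the returned list would then go through the same validation and scoring pipeline, which is what keeps the VVT table identical across paths.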

docker-compose.yml: backend-compliance gains OLLAMA_URL, CMP_LLM_MODEL,
OVH_LLM_URL/KEY/MODEL env vars (same names as consent-tester so the
configuration is unified).
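
A service could read these variables along the following lines. This is a hedged sketch: LlmSettings and load_llm_settings are hypothetical names, OVH_LLM_URL/KEY/MODEL is assumed to expand to OVH_LLM_URL, OVH_LLM_KEY and OVH_LLM_MODEL, CMP_LLM_MODEL is assumed to name the Qwen model served by Ollama, and no default values are implied by the commit.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LlmSettings:
    ollama_url: str
    cmp_llm_model: str
    ovh_llm_url: str
    ovh_llm_key: str
    ovh_llm_model: str

    @property
    def qwen_configured(self) -> bool:
        # Assumption: the local stage needs both the Ollama URL and a model name.
        return bool(self.ollama_url and self.cmp_llm_model)

    @property
    def ovh_configured(self) -> bool:
        # Assumption: the OVH stage needs URL, key and model.
        return bool(self.ovh_llm_url and self.ovh_llm_key and self.ovh_llm_model)


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LlmSettings:
    """Read the five variables; unset ones become empty strings."""
    source = os.environ if env is None else env
    return LlmSettings(
        ollama_url=source.get("OLLAMA_URL", "").strip(),
        cmp_llm_model=source.get("CMP_LLM_MODEL", "").strip(),
        ovh_llm_url=source.get("OVH_LLM_URL", "").strip(),
        ovh_llm_key=source.get("OVH_LLM_KEY", "").strip(),
        ovh_llm_model=source.get("OVH_LLM_MODEL", "").strip(),
    )
```

Accepting an optional mapping instead of reading os.environ directly makes the loader easy to exercise in tests without touching the process environment.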

This closes the 'every site eventually gets a VVT table' goal:
- Known CMP → V1/V2 structured extraction (fast, exact)
- Unknown CMP → V3 LLM extraction (slow, best-effort)
- No text at all → no vendors, but other compliance checks still run.

2026-05-17 09:55:42 +02:00

.claude

feat(iace): CRA / DIN EN 40000-1-2 cyber-resilience spur

2026-05-17 02:15:51 +02:00

.gitea/workflows

feat(iace): GT-Bremse coverage — 59 expert measures + 7 hazard patterns

2026-05-16 13:08:52 +02:00

.woodpecker

…

admin-compliance

feat(iace): CRA / DIN EN 40000-1-2 cyber-resilience spur

2026-05-17 02:15:51 +02:00

ai-compliance-sdk

feat(iace): CRA / DIN EN 40000-1-2 cyber-resilience spur

2026-05-17 02:15:51 +02:00

backend-compliance

feat(vvt): V3 — LLM vendor extraction fallback for unknown CMPs

2026-05-17 09:55:42 +02:00

breakpilot-compliance-sdk

docs: update service READMEs for refactor progress and stale phase references

2026-04-19 16:07:23 +02:00

compliance-tts-service

Add Edge TTS voices for TR, AR, UK, RU, PL, FR, ES

2026-04-26 23:56:05 +02:00

consent-sdk

refactor(consent-sdk,dsms-gateway): split ConsentManager, types, and main.py

2026-04-18 08:42:32 +02:00

consent-tester

feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1)

2026-05-17 09:50:11 +02:00

developer-portal

feat: Google Consent Mode v2 + Developer Portal cookie banner docs

2026-05-02 17:13:34 +02:00

docs-src

test+docs: IACE Phase 3/4 - missing tests + developer documentation

2026-05-10 09:49:29 +02:00

document-crawler

refactor: phase 0 guardrails + phase 1 step 2 (models.py split)

2026-04-07 13:18:29 +02:00

dsms-gateway

feat(dsms): Stage 2+3 - Evidence/TechFile → DSMS + Version Chains + Audit Timeline

2026-05-12 13:55:07 +02:00

dsms-node

refactor: phase 0 guardrails + phase 1 step 2 (models.py split)

2026-04-07 13:18:29 +02:00

scripts

feat(iace): GT-Bremse coverage — 59 expert measures + 7 hazard patterns

2026-05-16 13:08:52 +02:00

zeroclaw

docs(gt): BMW cross-domain finding — 3 domains, no AGB, Social Media on jobs portal

2026-05-15 13:21:27 +02:00

.env.example

…

.env.orca.example

chore: replace all Coolify references with Orca

2026-04-19 16:33:56 +02:00

.gitignore

docs: add root README, CONTRIBUTING, onboarding section, gitignore fixes

2026-04-19 16:09:28 +02:00

AGENTS.go.md

fix: resolve CI failures in Python tests and admin-compliance build

2026-04-19 16:41:39 +02:00

AGENTS.python.md

fix: resolve CI failures in Python tests and admin-compliance build

2026-04-19 16:41:39 +02:00

AGENTS.typescript.md

docs(agents): require build + lint + test locally before pushing [guardrail-change]

2026-04-19 16:38:21 +02:00

CONTRIBUTING.md

chore: replace all Coolify references with Orca

2026-04-19 16:33:56 +02:00

docker-compose.hetzner.yml

docs: replace all Coolify references with Orca across compliance repo

2026-04-17 10:39:45 +02:00

docker-compose.orca.yml

chore: replace all Coolify references with Orca

2026-04-19 16:33:56 +02:00

docker-compose.yml

feat(vvt): V3 — LLM vendor extraction fallback for unknown CMPs

2026-05-17 09:55:42 +02:00

mkdocs.yml

docs: add Pass 0b cost benchmark — v3 vs v4 vs backfill vs Mac Mini

2026-04-27 16:00:11 +02:00

README.md

chore: replace all Coolify references with Orca

2026-04-19 16:33:56 +02:00

REFACTOR_PLAYBOOK.md

docs: add root README, CONTRIBUTING, onboarding section, gitignore fixes

2026-04-19 16:09:28 +02:00

README.md

breakpilot-compliance

DSGVO/AI-Act compliance platform — 10 services, Go · Python · TypeScript

Overview

breakpilot-compliance is a multi-tenant compliance platform for the GDPR (DSGVO) and the EU AI Act. It provides an SDK for consent management, data subject requests (DSR), audit logging, iACE impact assessments, and document archival. It ships as 10 services, eight of them containerised, covering an admin dashboard, a developer portal, a Python/FastAPI backend, a Go AI compliance engine, TTS, and a decentralised document store on IPFS. Every push to main deploys all services automatically via Gitea Actions → Orca.

Architecture

Service	Tech	Port	Container
admin-compliance	Next.js 15	3007	bp-compliance-admin
backend-compliance	Python / FastAPI 0.123	8002	bp-compliance-backend
ai-compliance-sdk	Go 1.24 / Gin	8093	bp-compliance-ai-sdk
developer-portal	Next.js 15	3006	bp-compliance-developer-portal
breakpilot-compliance-sdk	TypeScript SDK (React/Vue/Angular/vanilla)	—	—
consent-sdk	JS/TS Consent SDK	—	—
compliance-tts-service	Python / Piper TTS	8095	bp-compliance-tts
document-crawler	Python / FastAPI	8098	bp-compliance-document-crawler
dsms-gateway	Python / FastAPI / IPFS	8082	bp-compliance-dsms-gateway
dsms-node	IPFS Kubo v0.24.0	—	bp-compliance-dsms-node

All containers share the external breakpilot-network Docker network and depend on breakpilot-core (Valkey, Vault, RAG service, Nginx reverse proxy).

Quick Start

Prerequisites: Docker, Go 1.24+, Python 3.12+, Node.js 20+

git clone ssh://git@gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-compliance.git
cd breakpilot-compliance

# Copy and populate secrets (never commit .env)
cp .env.example .env

# Start all services
docker compose up -d

For the Orca/Hetzner production target (x86_64), use the override:

docker compose -f docker-compose.yml -f docker-compose.hetzner.yml up -d

Development Workflow

Use feature branches off main. Supported prefixes: feat/, feature/, hotfix/.

git checkout main && git pull origin main
git checkout -b feat/my-change
# ... make changes ...
git push origin feat/my-change
# Open a PR → squash merge to main

Push to main triggers:

- Gitea Actions: lint → test → validate (see CI Pipeline below)
- Orca: automatic build + deploy (~3 min total)

Monitor status: https://gitea.meghsakha.com/Benjamin_Boenisch/breakpilot-compliance/actions

CI Pipeline

Defined in .gitea/workflows/ci.yaml.

Job	What it checks
`loc-budget`	All source files ≤ 500 LOC; soft target 300
`guardrail-integrity`	Commits touching guardrail files carry `[guardrail-change]`
`go-lint`	`golangci-lint` on `ai-compliance-sdk/`
`python-lint`	`ruff` + `mypy` on Python services
`nodejs-lint`	`tsc --noEmit` + ESLint on Next.js services
`test-go-ai-compliance`	`go test ./...` in `ai-compliance-sdk/`
`test-python-backend-compliance`	`pytest` in `backend-compliance/`
`test-python-document-crawler`	`pytest` in `document-crawler/`
`test-python-dsms-gateway`	`pytest test_main.py` in `dsms-gateway/`
`sbom-scan`	License + vulnerability scan via `syft` + `grype`
`validate-canonical-controls`	OpenAPI contract baseline diff
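
To illustrate the `guardrail-integrity` rule from the table, a check of this kind could be sketched as below. The list of guardrail paths is hypothetical (the README only names files under .claude/ explicitly), and the real CI job may obtain commit messages and changed files differently.

```python
from typing import Iterable

MARKER = "[guardrail-change]"
# Hypothetical guardrail locations, for illustration only.
GUARDRAIL_PATHS = (".claude/", ".gitea/workflows/", "scripts/check-loc.sh")


def commit_respects_guardrails(
    message: str,
    changed_files: Iterable[str],
    guardrail_paths: tuple[str, ...] = GUARDRAIL_PATHS,
) -> bool:
    """A commit passes if it carries the marker or touches no guardrail file."""
    touches_guardrail = any(path.startswith(guardrail_paths) for path in changed_files)
    return MARKER in message or not touches_guardrail
```

For example, a commit that edits `.claude/rules/loc-exceptions.txt` without the marker would fail this check, while the same commit with `[guardrail-change]` in its message would pass.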

File Budget

Limit	Value	How to check
Soft target	300 LOC	`bash scripts/check-loc.sh`
Hard cap	500 LOC	Same; also enforced by `PreToolUse` hook + git pre-commit + CI
Exceptions	`.claude/rules/loc-exceptions.txt`	Require written rationale + `[guardrail-change]` commit marker

The .claude/settings.json PreToolUse hook blocks Claude Code from writing or editing files that would exceed the hard cap. The git pre-commit hook re-checks. CI is the final gate.
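
The budget rule itself is simple enough to sketch in a few lines. The following is an illustrative Python equivalent, not the contents of scripts/check-loc.sh: the function names are hypothetical, and counting every physical line (including blanks and comments) is an assumption about how LOC is measured.

```python
from pathlib import Path
from typing import Mapping

SOFT_TARGET = 300  # soft target from the table above
HARD_CAP = 500     # hard cap from the table above


def count_loc(path: Path) -> int:
    """Count physical lines; assumption: blank and comment lines count too."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def classify(loc: int) -> str:
    """Map a line count to 'ok', 'over-target' or 'over-cap'."""
    if loc > HARD_CAP:
        return "over-cap"
    if loc > SOFT_TARGET:
        return "over-target"
    return "ok"


def budget_report(
    loc_by_path: Mapping[str, int], exceptions: frozenset[str] = frozenset()
) -> dict[str, str]:
    """Return only the files that exceed a limit, skipping listed exceptions."""
    return {
        path: classify(loc)
        for path, loc in loc_by_path.items()
        if path not in exceptions and classify(loc) != "ok"
    }
```

In this sketch only "over-cap" entries would need to block a commit; "over-target" entries are informational, matching the soft/hard distinction in the table.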

Links

Resource	URL
Admin dashboard	https://admin-dev.breakpilot.ai
Developer portal	https://developers-dev.breakpilot.ai
Backend API	https://api-dev.breakpilot.ai
AI SDK API	https://sdk-dev.breakpilot.ai
Gitea repo	https://gitea.meghsakha.com/Benjamin_Boenisch/breakpilot-compliance
Gitea Actions	https://gitea.meghsakha.com/Benjamin_Boenisch/breakpilot-compliance/actions

Languages

TypeScript 43.1%

Python 30.8%

Go 23.5%

Shell 1.2%

PLpgSQL 0.8%

Other 0.3%