feat(vvt): V3 — LLM vendor extraction fallback for unknown CMPs

When the cookie text has no captured CMP payload (long-tail sites that
don't use ePaaS/OneTrust/Cookiebot/etc.) we now fall back to a Qwen → OVH
LLM cascade to extract a structured vendor list from the policy text.
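
As a rough illustration of the cascade described above, here is a minimal
sketch. It is not the code in vendor_llm_extractor.py: the name
extract_with_cascade and the Backend callable type are hypothetical, and
the real Qwen and OVH clients are stood in for by plain callables.

```python
from typing import Callable, List, Optional

# Hypothetical type: a backend takes the cookie text and returns a list of
# vendor dicts, or None / [] when it produced nothing usable.
Backend = Callable[[str], Optional[list]]


def extract_with_cascade(cookie_text: str, backends: List[Backend]) -> list:
    """Try each backend in order and return the first non-empty vendor list."""
    for backend in backends:
        try:
            vendors = backend(cookie_text)
        except Exception:
            # Assumption: a failing backend (timeout, bad reply, ...) should
            # not abort the cascade, so we move on to the next one.
            continue
        if vendors:
            return vendors
    # Every backend failed or returned nothing usable.
    return []
```

Passing the backends as an ordered list keeps the "Qwen first, then OVH"
policy in one place and makes the fallback order easy to test with stubs.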

New module backend/compliance/services/vendor_llm_extractor.py:
- extract_vendors_via_llm(cookie_text): runs Qwen first (local Ollama),
  then OVH if Qwen returns nothing usable.
- System prompt instructs the model to return STRICT JSON only:
  {vendors: [{name, country, purpose, category, opt_out_url,
   privacy_policy_url, persistence, cookies: [...]}]}
- Lenient JSON parser tolerates code fences, prose wrappers, and either a
  top-level dict or a bare list.
- _normalize() caps array sizes (80 vendors, 30 cookies each), validates
  URLs (must be http(s)), trims fields to reasonable lengths.
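
The lenient parsing and normalization steps listed above could look
roughly like the sketch below. The caps of 80 vendors and 30 cookies come
from this commit message; everything else is an assumption for
illustration: the names parse_lenient and normalize, the 300-character
field limit, and the choice to drop vendors without a name.

```python
import json
import re
from urllib.parse import urlparse

MAX_VENDORS = 80      # cap named in the commit message
MAX_COOKIES = 30      # cap named in the commit message
MAX_FIELD_LEN = 300   # assumption: the commit only says "reasonable lengths"
TEXT_FIELDS = ("country", "purpose", "category", "persistence")

# Matches a fenced block (three backticks, optional "json" tag).
_FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL | re.IGNORECASE)


def parse_lenient(raw: str):
    """Return the first decodable JSON value in a model reply, or None."""
    match = _FENCE_RE.search(raw)
    text = match.group(1) if match else raw
    decoder = json.JSONDecoder()
    # Try every '{' or '[' as a start so leading/trailing prose is ignored.
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    return None


def _clean_url(value) -> str:
    """Keep only absolute http(s) URLs; anything else becomes ''."""
    if not isinstance(value, str):
        return ""
    url = value.strip()
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https") and parsed.netloc and len(url) <= MAX_FIELD_LEN:
        return url
    return ""


def normalize(value) -> list:
    """Accept {"vendors": [...]} or a bare list; return capped, cleaned dicts."""
    items = value.get("vendors") if isinstance(value, dict) else value
    if not isinstance(items, list):
        return []
    vendors = []
    for item in items:
        if len(vendors) >= MAX_VENDORS:
            break
        if not isinstance(item, dict):
            continue
        name = str(item.get("name") or "").strip()[:MAX_FIELD_LEN]
        if not name:
            continue  # assumption: a vendor without a name is not usable
        cookies = item.get("cookies")
        cookies = cookies if isinstance(cookies, list) else []
        vendor = {"name": name}
        for key in TEXT_FIELDS:
            vendor[key] = str(item.get(key) or "").strip()[:MAX_FIELD_LEN]
        vendor["opt_out_url"] = _clean_url(item.get("opt_out_url"))
        vendor["privacy_policy_url"] = _clean_url(item.get("privacy_policy_url"))
        vendor["cookies"] = [str(c)[:MAX_FIELD_LEN] for c in cookies[:MAX_COOKIES]]
        vendors.append(vendor)
    return vendors
```

Chaining the two as normalize(parse_lenient(reply)) yields an empty list
for any reply that cannot be decoded, which is the "nothing usable" signal
the cascade needs before it tries the next model.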

Route integration (agent_compliance_check_routes.py):
- After the named-CMP extraction: if cmp_vendors is empty AND the cookie
  text has ≥500 words (shorter text is likely just navigation chrome),
  invoke the LLM extractor. Progress message:
  'Vendor-Liste per LLM extrahieren...'.
- Vendors then run through the same validate_vendor_urls + score_vendors
  pipeline → VVT table rendered identically regardless of source.
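
The gating condition in the route can be pictured as a small predicate.
This is a sketch only: should_use_llm_fallback is a hypothetical name, and
counting words by whitespace splitting is an assumption, since the commit
does not say how the 500-word threshold is measured.

```python
MIN_WORDS_FOR_LLM = 500  # threshold named in the commit message


def should_use_llm_fallback(cmp_vendors: list, cookie_text: str) -> bool:
    """True when no CMP vendors were captured and the text is long enough."""
    if cmp_vendors:
        # A known CMP already produced vendors; the LLM is not needed.
        return False
    # Assumption: "words" means whitespace-separated tokens.
    return len(cookie_text.split()) >= MIN_WORDS_FOR_LLM
```

Keeping the check separate from the extractor call means the slow LLM path
is skipped cheaply for short pages.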

docker-compose.yml: backend-compliance gains OLLAMA_URL, CMP_LLM_MODEL,
OVH_LLM_URL/KEY/MODEL env vars (same names as consent-tester so the
configuration is unified).
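
On the Python side, the new variables might be read along these lines.
This is a hedged sketch, not the module's actual code: LlmCascadeConfig
and load_config are hypothetical names, the defaults mirror the compose
file, and treating an incomplete OVH configuration as "OVH disabled" is an
assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LlmCascadeConfig:
    ollama_url: str
    cmp_llm_model: str
    ovh_llm_url: str
    ovh_llm_key: str
    ovh_llm_model: str

    @property
    def ovh_enabled(self) -> bool:
        # Assumption: OVH is only usable when URL, key and model are all set.
        return bool(self.ovh_llm_url and self.ovh_llm_key and self.ovh_llm_model)


def load_config(env: Mapping[str, str] = os.environ) -> LlmCascadeConfig:
    """Read the cascade settings; `or` mirrors compose's ${VAR:-default}."""
    return LlmCascadeConfig(
        ollama_url=env.get("OLLAMA_URL") or "http://host.docker.internal:11434",
        cmp_llm_model=env.get("CMP_LLM_MODEL") or "qwen3:30b-a3b",
        ovh_llm_url=env.get("OVH_LLM_URL") or "",
        ovh_llm_key=env.get("OVH_LLM_KEY") or "",
        ovh_llm_model=env.get("OVH_LLM_MODEL") or "",
    )
```

With the compose defaults, the three OVH variables are empty, so under
this sketch only the local Ollama step would run until they are filled in.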

This closes the 'every site eventually gets a VVT table' goal:
- Known CMP → V1/V2 structured extraction (fast, exact)
- Unknown CMP → V3 LLM extraction (slow, best-effort)
- No text at all → no vendors, but other compliance checks still run.
Benjamin Admin
2026-05-17 09:55:42 +02:00
parent 9c0cc0f59f
commit 873997c13b
3 changed files with 237 additions and 7 deletions
+8
@@ -116,6 +116,14 @@ services:
       SMTP_FROM_NAME: ${SMTP_FROM_NAME:-BreakPilot Compliance}
       SMTP_FROM_ADDR: ${SMTP_FROM_ADDR:-compliance@breakpilot.app}
       RAG_SERVICE_URL: http://bp-core-rag-service:8097
+      # LLM cascade for V3 vendor extraction (unknown CMPs).
+      # Reuses the same env vars as the consent-tester so both can be
+      # configured in one place.
+      OLLAMA_URL: ${OLLAMA_URL:-http://host.docker.internal:11434}
+      CMP_LLM_MODEL: ${CMP_LLM_MODEL:-qwen3:30b-a3b}
+      OVH_LLM_URL: ${OVH_LLM_URL:-}
+      OVH_LLM_KEY: ${OVH_LLM_KEY:-}
+      OVH_LLM_MODEL: ${OVH_LLM_MODEL:-}
     extra_hosts:
       - "host.docker.internal:host-gateway"
     depends_on: