When the cookie text has no captured CMP payload (long-tail sites that
don't use ePaaS/OneTrust/Cookiebot/etc.) we now fall back to a Qwen → OVH
LLM cascade to extract a structured vendor list from the policy text.
New module backend/compliance/services/vendor_llm_extractor.py:
- extract_vendors_via_llm(cookie_text): runs Qwen first (local Ollama),
then OVH if Qwen returns nothing usable.
- System prompt instructs the model to return STRICT JSON only:
{vendors: [{name, country, purpose, category, opt_out_url,
privacy_policy_url, persistence, cookies: [...]}]}
- Lenient JSON parser tolerates code-fences, prose wrappers, dict vs list.
- _normalize() caps array sizes (80 vendors, 30 cookies each), validates
URLs (must be http(s)), trims fields to reasonable lengths.
Route integration (agent_compliance_check_routes.py):
- After named-CMP extract: if cmp_vendors is empty AND the cookie text
has ≥500 words (otherwise it's likely navigation chrome), invoke the
LLM extractor. Progress message 'Vendor-Liste per LLM extrahieren...'.
- Vendors then run through the same validate_vendor_urls + score_vendors
pipeline → VVT table rendered identically regardless of source.
docker-compose.yml: backend-compliance gains OLLAMA_URL, CMP_LLM_MODEL,
OVH_LLM_URL/KEY/MODEL env vars (same names as consent-tester so the
configuration is unified).
This closes the 'every site eventually gets a VVT table' goal:
- Known CMP → V1/V2 structured extraction (fast, exact)
- Unknown CMP → V3 LLM extraction (slow, best-effort)
- No text at all → no vendors, but other compliance checks still run.
New module consent-tester/services/cmp_llm_fallback.py:
- LLMCookieExtractor: single-endpoint adapter (Ollama OR OpenAI-compat)
- LLMCascade: tries Qwen (local Mac Mini Ollama) first; falls through to
OVH (managed 120B) when Qwen returns no usable strategy
- LLMCascade.from_env(): reads OLLAMA_URL/CMP_LLM_MODEL + OVH_LLM_URL/
OVH_LLM_KEY/OVH_LLM_MODEL from environment
- LLM returns JSON {strategy: url|selector|text, value: ...}
- Valkey-backed cache per netloc (cmp:hint:<netloc>, 7-day TTL) — next run
against the same domain skips the LLM entirely
dsi_discovery.py:
- Wired network_log collector (URL/status/content-type/size of every JSON
response on the page) — passed to LLM prompt as observation
- After Named CMP (Phase B) + Heuristic (Phase A) both fail AND DOM
< 300 words: invoke LLMCascade.analyze(...)
- _apply_llm_hint executes the LLM's strategy: refetch URL via Playwright
request context, query DOM selector, or use text directly
- Cache HIT path: apply cached hint, only fall back to LLM if cache is stale
docker-compose.yml:
- consent-tester gets env vars + cmp-data volume (for Phase E)
- All LLM endpoints configurable via env, sensible defaults
consent-tester/requirements.txt:
- redis>=5.0 (asyncio client, Valkey-compatible)
- httpx>=0.27
Allows switching between Haiku 4.5 and Sonnet 4.6 for Pass 0b
without rebuilding the backend container.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- LegalRAGClient: QDRANT_HOST+PORT → QDRANT_URL + QDRANT_API_KEY
- docker-compose: env vars updated for hosted Qdrant
- AllowedCollections: added bp_compliance_gdpr, bp_dsfa_templates, bp_dsfa_risks
- Migration scripts (bash + python) for data transfer
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- docker-compose.yml: alle 4 DATABASE_URL auf COMPLIANCE_DATABASE_URL (mit Fallback)
- .env.example: COMPLIANCE_DATABASE_URL Eintrag ergaenzt
- Rollback: ohne .env zeigt Fallback auf bp-core-postgres
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The DATABASE_URL was using postgresql+asyncpg:// with ?options= for search_path,
but database.py uses synchronous SQLAlchemy (create_engine) and asyncpg doesn't
support the 'options' keyword argument. The search_path is already set via an
event listener in database.py, so the options parameter is unnecessary.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- statistics.byStatus.in_progress could crash on empty object → optional chaining
- COURSE_CATEGORY_INFO[course.category] could return undefined → fallback to 'custom'
- Update LLM model to qwen3.5:35b-a3b in docker-compose.yml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch LegalRAGClient from empty bp_legal_corpus to bp_compliance_ce
collection (3,734 chunks across 14 regulations)
- Replace embedding-service (384-dim MiniLM) with Ollama bge-m3 (1024-dim)
- Add standalone RAG search endpoint: POST /sdk/v1/rag/search
- Add regulations list endpoint: GET /sdk/v1/rag/regulations
- Add QDRANT_HOST/PORT env vars to docker-compose.yml
- Update regulation ID mapping to match actual Qdrant payload schema
- Update determineRelevantRegulations for CE corpus regulation IDs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New standalone Python/FastAPI service for automatic compliance document
scanning, LLM-based classification, IPFS archival, and gap analysis.
Includes extractors (PDF, DOCX, XLSX, PPTX), keyword fallback classifier,
compliance matrix, and full REST API on port 8098.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>