feat(audit): VW-Cookie-Bug-Fix + P101/P102 Cookie-Library-Mismatch-Findings
CI / loc-budget (push) Failing after 19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / detect-changes (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 15s

VW-Bug B1: extract_vendors_via_llm hatte max_text_chars=12000 -> bei
VW-Cookie-Doc (60k chars, 100 Cookies in Tabelle) wurden 80% abgeschnitten,
LLM extrahierte nur 1 Vendor. Fix: max_text_chars=50000, num_predict
6000->16000 fuer mehr Vendor-Output, Ollama-Timeout 120s->420s.

P101 Aggregator-Script (backend-compliance/scripts/cookie_library_enrich.py)
geht alle compliance_check_snapshots durch und extrahiert (cookie_name,
declared_category, observed_sites). Erste Auswertung ueber 8 Snapshots:
101 unique Cookies, 47 in Library, 54 unbekannt, 18 Mismatches.

P102 Cookie-Klassifikations-Pruefung als Mail-Block. Vergleicht
Site-deklarierte Kategorie vs Library + Vendor-Doku. HIGH wenn Library
sagt 'marketing' aber Site als 'essential'/'statistics' deklariert
(faktische Drittland-/Werbe-Verarbeitung versteckt). MEDIUM sonst.
In agent_compliance_check_routes Mail-Komposition + Replay-Pipeline
eingebaut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-05-21 15:47:11 +02:00
parent 9c11b5463c
commit 94057b1536
5 changed files with 467 additions and 7 deletions
@@ -49,13 +49,19 @@ _SYSTEM_PROMPT = (
async def extract_vendors_via_llm(
cookie_text: str,
max_text_chars: int = 12000,
max_text_chars: int = 50000,
) -> list[dict]:
"""Run the Qwen → OVH cascade. Returns vendor records (possibly empty)."""
"""Run the Qwen → OVH cascade. Returns vendor records (possibly empty).
max_text_chars: VW-Cookie-Richtlinie hat ~60k chars mit ~100 Cookies in
der Tabelle. Bei 12k waren wir auf die ersten ~5 Cookies begrenzt und
haben nur 1 Vendor extrahiert. 50k deckt VW/BMW/Mercedes komplett ab
und passt in Qwen3-30b-a3b (128k Context) sowie OVH 120B.
"""
if not cookie_text or len(cookie_text) < 500:
return []
excerpt = cookie_text[:max_text_chars]
user_prompt = f"Cookie-Richtlinie-Text (gekuerzt):\n\n{excerpt}"
user_prompt = f"Cookie-Richtlinie-Text:\n\n{excerpt}"
# Stage 1: local Qwen
content = await _call_ollama(user_prompt)
@@ -82,10 +88,13 @@ async def _call_ollama(user_prompt: str) -> str:
{"role": "user", "content": user_prompt},
],
"stream": False, "format": "json",
"options": {"temperature": 0.05, "num_predict": 6000},
# 16k tokens fuer ~80 Vendors mit je 30 Cookies. War vorher 6k →
# output wurde mittendrin abgeschnitten, JSON unparseable → 0 Vendors.
"options": {"temperature": 0.05, "num_predict": 16000},
}
try:
async with httpx.AsyncClient(timeout=120.0) as client:
# Qwen 30b braucht fuer 16k output ~4-6min auf M4 Pro.
async with httpx.AsyncClient(timeout=420.0) as client:
resp = await client.post(f"{base.rstrip('/')}/api/chat", json=payload)
resp.raise_for_status()
return (resp.json().get("message") or {}).get("content", "")
@@ -109,7 +118,7 @@ async def _call_ovh(user_prompt: str) -> str:
{"role": "system", "content": _SYSTEM_PROMPT},
{"role": "user", "content": user_prompt},
],
"temperature": 0.05, "max_tokens": 6000,
"temperature": 0.05, "max_tokens": 16000,
"response_format": {"type": "json_object"},
}
try: