perf(audit): vendor_llm_extractor + mc_solution_generator use P31 LLM cascade

Both now call llm_cascade.call_with_cascade() instead of calling Qwen/OVH directly. This gives:

* Cache hit on identical inputs (Valkey, 7d TTL) → ~50ms instead of 4-6min when re-running the same cookie doc.
* Automatic tiered cascade: Qwen → OVH 120B → Anthropic Claude Haiku when a lower tier falls below the confidence threshold.
* Confidence scoring (JSON parse + items_per_input_size) decides whether to delegate further.

The fallback to the old _call_ollama/_call_ovh stays in place if the cascade call fails.

Expected effect on the 2nd VW run: ~10min instead of ~25min (cache hit on identical cookie doc + MC solutions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:40:11 +02:00
parent 64d8b0f1f9
commit cf6005a47c
2 changed files with 38 additions and 8 deletions
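
The behaviour described in the commit message (cache lookup, tiered escalation, confidence gate) can be sketched roughly as follows. This is not the actual `llm_cascade` implementation: `cascade`, `score_confidence`, `cache_key`, the in-memory `_CACHE` standing in for Valkey, and the "one item per 1000 input characters" scaling are all hypothetical and chosen only for illustration.

```python
import asyncio
import hashlib
import json
from typing import Awaitable, Callable

# Hypothetical in-memory stand-in for the Valkey cache (no TTL handling here).
_CACHE: dict[str, dict] = {}

# A tier is a name plus an async callable (system, user) -> raw model text.
Tier = tuple[str, Callable[[str, str], Awaitable[str]]]


def cache_key(system: str, user: str) -> str:
    """Identical inputs map to the same key, so a re-run can hit the cache."""
    return hashlib.sha256(f"{system}\x00{user}".encode("utf-8")).hexdigest()


def score_confidence(text: str, input_size: int) -> float:
    """Assumed heuristic: 0.0 if not valid JSON, else items per input size."""
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    items = len(data) if isinstance(data, (list, dict)) else 1
    # Assumption: one item per 1000 input characters counts as full confidence.
    expected = max(1.0, input_size / 1000)
    return min(1.0, items / expected)


async def cascade(
    system: str, user: str, tiers: list[Tier], min_confidence: float = 0.5
) -> dict:
    """Try tiers in order; stop at the first answer that clears the threshold."""
    key = cache_key(system, user)
    if key in _CACHE:
        return {**_CACHE[key], "cached": True}

    best = {"text": "", "confidence": -1.0, "tier": None}
    for name, call in tiers:
        try:
            text = await call(system, user)
        except Exception:
            continue  # a failing tier simply escalates to the next one
        conf = score_confidence(text, len(user))
        if conf > best["confidence"]:
            best = {"text": text, "confidence": conf, "tier": name}
        if conf >= min_confidence:
            break

    # Only accepted answers are cached, so weak results are retried next run.
    if best["confidence"] >= min_confidence:
        _CACHE[key] = best
    return {**best, "cached": False}
```

In this sketch a caller would pass the tiers in cost order (cheapest first) and still keep its own legacy fallback for the case where every tier fails, which mirrors the try/except fall-through in the hunk below.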
@@ -172,6 +172,21 @@ async def generate_solution(
        "Liefere die Loesung als JSON."
    )

+    # P31: tiered cascade (Qwen → OVH → Anthropic) with Valkey cache.
+    try:
+        from compliance.services.llm_cascade import call_with_cascade
+        res = await call_with_cascade(
+            system=_SYSTEM_PROMPT, user=prompt,
+            min_confidence=0.5, max_tokens=600,
+        )
+        parsed = _parse(res.get("text", ""))
+        if parsed:
+            _cache_put(cache_key, parsed)
+            return parsed
+    except Exception:
+        # fall through to legacy direct calls
+        pass
+
    content = await _call_ollama(prompt)
    parsed = _parse(content)
    if not parsed: