Files
breakpilot-compliance/backend-compliance/compliance/services/finding_plausibility_check.py
T
Benjamin Admin d6b8bf87c2
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / test-python-backend (push) Successful in 29s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 10s
CI / loc-budget (push) Successful in 13s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
fix: 4 Bugs gemeinsam — B22 PDF + B17 Walk-Fallback + company_name + Plausibility-Fallback
(1) B22 Cross-Domain (fix #59):
  Elli-Test fand AGB auf logpay.de NICHT obwohl URL in doc_entries
  korrekt. Vermutete Ursache: Discovery-Phase A drops/überschreibt
  Original-URL bei PDF-Fetch-Fail (word_count=0).
  Fix: _collect_audit_urls() iteriert über state.doc_entries +
  rejected_url + req.documents — Cross-Domain-Hosting ist
  unabhängig vom Text-Inhalt. Plus Trace-Logging für künftige
  Diagnose. Dedup per (doc_type, host_sld).

(2) B17 Audit-Walk-Fail-Fallback (fix #60):
  BMW v5 hatte audit_walk=None ohne Mail-Hinweis. Vermutlich
  180s-Timeout bei OneTrust-CMP-Banner-Tour.
  Fix: Timeout 180s → 300s. Plus: Bei Fail wird ein Hinweis-
  Stub mit error-Grund in state["audit_walk"] + HTML-Block
  geschrieben — Reviewer sieht den Fail statt silent-skip.

(3) company_name + origin_domain im Backend (fix #61):
  Frontend sendet seit ec03317 die zwei Felder — Backend ignorierte
  sie.
  Fix: ComplianceCheckRequest-Schema um company_name +
  origin_domain erweitert. phase_e_email priorisiert User-Input
  vor URL-Heuristik für site_name. Bei origin_domain ohne
  ableitbare doc_entries-domain wird der User-Input als domain
  übernommen.

(4) Plausibility-LLM Fallback-Modell (fix #62):
  qwen3:30b-a3b liefert auf großen DSEs (BMW 122 FAIL) gehäuft
  leere format='json'-Responses — Circuit-Breaker griff aber
  Phase blieb nutzlos.
  Fix: Default-Modell auf qwen2.5:7b umgestellt (4× kleiner,
  zuverlässiger bei format=json, ausreichendes Reasoning für
  PASS/MODIFY/DROP-Klassifikation). Plus Strategy-C eingeführt
  — Fallback-Modell (llama3.2:3b) wenn primary leer bleibt.
  BATCH_SIZE 4 → 3. ENV-Switches PLAUSIBILITY_LLM_MODEL +
  PLAUSIBILITY_FALLBACK_MODEL für Tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 16:39:33 +02:00

450 lines
18 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""LLM Plausibility Re-Evaluation for MC findings.
Why this exists:
MC-DB labels are historic compliance-officer questions ("Dokumentiert
die DSI alle Datenübermittlungen gemäß Art. 49 Abs. 1 Unterabs. 2
DS-GVO?"). When the deterministic regex+LLM-verify pipeline flags
them as FAIL, the question stays as the title. The reader sees
"we don't know" — unhelpful.
What this does:
AFTER the MC pipeline finished, run a second LLM pass over EVERY
remaining FAIL with the original doc-text. The LLM:
1. Reformulates the question as a STATEMENT-OF-TOPIC
("Drittland-Übermittlungen nach Art. 49 Abs. 1 Unterabs. 2 DS-GVO")
2. Suggests a plausible severity (or DROP if the finding is bogus)
3. Produces a CONCRETE recommendation ("Im Abschnitt 'Drittland'
der DSE Mechanismus pro Empfänger ergänzen")
What this does NOT do:
- Touch the MC-DB. Original label stays in c.label.
- Touch passed/skipped/regulation/matched_text — those are facts.
- Run for non-fails or already-handled checks.
Stamping schema on each Check (CheckItem dataclass):
llm_title: str — reformulated topic statement
llm_severity: str — suggested severity ("HIGH"|"MED"|"LOW"|"DROP")
llm_recommendation: str — concrete fix recommendation
llm_drop: bool — True if the LLM judged the finding not plausible
llm_plausibility: float — 0..1 confidence (optional)
The mail-render V2 reads these stamps and renders them next to the
original label (🤖 LLM-Plausibility box).
Config:
OLLAMA_URL default "http://host.docker.internal:11434"
PLAUSIBILITY_LLM_MODEL default "qwen3:30b-a3b"
PLAUSIBILITY_BATCH_SIZE default 8
PLAUSIBILITY_TIMEOUT_S default 60.0
"""
from __future__ import annotations
import hashlib
import json
import logging
import os
import httpx
logger = logging.getLogger(__name__)
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
# Default-Modell als ENV-Switch konfigurierbar. qwen3:30b-a3b ist
# bestes Reasoning, aber gibt bei großen DSEs gerne leere Responses
# unter format='json'. qwen2.5:7b ist 4× kleiner, deutlich
# zuverlässiger, leicht schwächeres Reasoning aber für die einfache
# Plausibility-Klassifikation (PASS/MODIFY/DROP) ausreichend.
MODEL = os.getenv("PLAUSIBILITY_LLM_MODEL", "qwen2.5:7b")
# Fallback-Modell wenn das primary trotz Retries nichts liefert
# (Strategy A → B → C → D-Schritte erschöpft). Default ist ein
# kleines, robustes Modell.
FALLBACK_MODEL = os.getenv("PLAUSIBILITY_FALLBACK_MODEL", "llama3.2:3b")
# Mit kleinerem Modell können größere Batches funktionieren — aber
# konservativ bleiben damit Single-Modell-Fail nicht ganz Phase killt.
BATCH_SIZE = int(os.getenv("PLAUSIBILITY_BATCH_SIZE", "3"))
TIMEOUT = float(os.getenv("PLAUSIBILITY_TIMEOUT_S", "45.0"))
# Reduced excerpt 4000 → 1500 chars (same reason).
DOC_EXCERPT_CHARS = int(os.getenv("PLAUSIBILITY_DOC_EXCERPT", "1500"))
# In-memory cache: (input_hash) -> result_dict. Survives one run.
_CACHE: dict[str, dict] = {}
def _checksum(check_id: str, label: str, hint: str,
doc_excerpt: str) -> str:
"""Stable hash of the LLM input — avoid re-asking on retries."""
h = hashlib.sha256()
h.update(check_id.encode())
h.update(b"\x00")
h.update(label.encode())
h.update(b"\x00")
h.update(hint.encode())
h.update(b"\x00")
h.update(doc_excerpt[:2000].encode())
return h.hexdigest()[:16]
_SYSTEM_PROMPT = (
"Du bist Compliance-Plausibilitäts-Auditor für deutsche "
"Datenschutz-Prüfberichte. Für jeden Finding-Eintrag bekommst du "
"die MC-Pflichtfrage, den LLM-Hinweis und einen Ausschnitt aus "
"dem geprüften Dokument.\n\n"
"REGELN — sehr wichtig:\n"
"1. Du gibst für JEDEN Finding-Eintrag im Input GENAU EINEN Output-"
"Eintrag zurück (keine ausgelassen, keine zusätzlichen).\n"
"2. Die ID muss BUCHSTABENGENAU vom Input übernommen werden — "
"nicht abgekürzt, nicht umformatiert (Beispiel: \"mc-DATA-3953-A04\" "
"bleibt \"mc-DATA-3953-A04\").\n"
"3. Reihenfolge der Output-Items entspricht der Input-Reihenfolge.\n\n"
"Pro Finding:\n"
"- title: TOPIC-STATEMENT (max 80 Zeichen, ohne Frageton, "
"nennt die Norm wenn sinnvoll). Beispiel: "
"Frage \"Dokumentiert die DSI Drittlandtransfers nach Art. 49?\" "
"→ title \"Drittlandtransfer-Doku Art. 49 DSGVO\".\n"
"- severity: HIGH (klar verletzt), MEDIUM (verletzt, weniger "
"kritisch), LOW (unsicher / manuelle Prüfung), DROP "
"(Auszug zeigt klar dass die Anforderung erfüllt ist).\n"
"- recommendation: KONKRETE Aktion (max 200 Zeichen), nennt "
"WAS und WO. Beispiel: \"Im Abschnitt 'Drittlandtransfer' "
"der DSE pro Empfänger einen Mechanismus nach Art. 49 ergänzen\".\n"
"- drop: true wenn severity=DROP, sonst false.\n\n"
"JSON-Schema (genauso antworten):\n"
"{\"findings\":["
"{\"id\":\"<exakte-id-vom-input>\",\"title\":\"...\","
"\"severity\":\"HIGH|MEDIUM|LOW|DROP\","
"\"recommendation\":\"...\",\"drop\":false}"
"]}\n\n"
"Beispiel-Antwort bei 2 Inputs mit IDs mc-A und mc-B:\n"
"{\"findings\":[{\"id\":\"mc-A\",\"title\":\"Norm X erfüllen\","
"\"severity\":\"MEDIUM\",\"recommendation\":\"In Abschnitt Y "
"ergänzen: Norm X erfüllt\",\"drop\":false},"
"{\"id\":\"mc-B\",\"title\":\"Norm Z geprüft\",\"severity\":\"DROP\","
"\"recommendation\":\"Bereits erfüllt — Hinweis im Doc Z3\","
"\"drop\":true}]}"
)
def _build_user_prompt(items: list[dict], doc_title: str,
doc_excerpt: str) -> str:
findings_block = "\n".join(
f'{i+1}. ID="{it["id"]}" | FRAGE: {it["label"]} | '
f'HINT: {it.get("hint", "")[:200]} | SEV_REGEX: {it.get("severity")}'
for i, it in enumerate(items)
)
return (
f"DOKUMENT: {doc_title}\n\n"
f"DOKUMENT-AUSZUG (max {DOC_EXCERPT_CHARS} Zeichen):\n"
f"{doc_excerpt[:DOC_EXCERPT_CHARS]}\n\n"
f"FINDINGS ZU BEWERTEN:\n{findings_block}"
)
async def _post_llm(body: dict) -> str:
"""One LLM call. Returns content string or empty on failure.
Catches network errors so the caller can decide fallback strategy."""
try:
async with httpx.AsyncClient(timeout=TIMEOUT) as c:
r = await c.post(f"{OLLAMA_URL}/api/chat", json=body)
r.raise_for_status()
return (r.json().get("message") or {}).get("content", "") or ""
except Exception as e:
logger.warning("plausibility LLM call failed: %s", e)
return ""
def _try_extract_json(content: str) -> dict | None:
"""Extract a JSON object from free-form LLM output. Handles
markdown-fenced and prose-wrapped responses."""
if not content:
return None
s = content.strip()
# Strip ```json … ``` fences
if s.startswith("```"):
s = s.strip("`")
if s.lower().startswith("json"):
s = s[4:]
s = s.strip()
# Heuristic: cut from first { to last }
first = s.find("{")
last = s.rfind("}")
if first >= 0 and last > first:
s = s[first:last + 1]
try:
return json.loads(s)
except Exception:
return None
async def _ask_llm_batch(items: list[dict], doc_title: str,
doc_excerpt: str) -> dict[str, dict]:
"""Send a batch of up to BATCH_SIZE findings to the LLM.
Resilience strategy (P125 fix for empty-response bug):
A. primary MODEL + format='json' (strict)
B. primary MODEL + format='' (loose), parse JSON manuell
C. FALLBACK_MODEL + format='json' (kleineres robusteres Modell)
D. If batch >2: split + recurse
E. Else: give up, return {} (callers stamp llm_skipped=true)
"""
user_prompt = _build_user_prompt(items, doc_title, doc_excerpt)
def _body(model: str) -> dict:
return {
"model": model,
"messages": [
{"role": "system", "content": _SYSTEM_PROMPT},
{"role": "user", "content": user_prompt},
],
"stream": False,
"options": {"temperature": 0.0, "seed": 42, "num_predict": 1500},
}
out: dict[str, dict] = {}
input_ids = [it["id"] for it in items]
try:
# Strategy A: primary + format='json'
content = await _post_llm({**_body(MODEL), "format": "json"})
if not content:
# Strategy B: primary + format-free
logger.info(
"plausibility A→empty, trying B (format-free) batch=%d",
len(items),
)
content = await _post_llm(_body(MODEL))
if not content and FALLBACK_MODEL and FALLBACK_MODEL != MODEL:
# Strategy C: fallback-model + format='json'
logger.info(
"plausibility A+B empty, trying C (fallback=%s) batch=%d",
FALLBACK_MODEL, len(items),
)
content = await _post_llm(
{**_body(FALLBACK_MODEL), "format": "json"},
)
if not content:
# Strategy C: split + recurse
if len(items) > 2:
half = len(items) // 2
logger.info(
"plausibility A+B empty → split %d%dx2",
len(items), half,
)
first = await _ask_llm_batch(
items[:half], doc_title, doc_excerpt,
)
second = await _ask_llm_batch(
items[half:], doc_title, doc_excerpt,
)
out.update(first)
out.update(second)
return out
# Strategy D: give up
logger.warning(
"plausibility gave up after A+B for batch=%d", len(items),
)
return out
data = _try_extract_json(content)
if data is None:
logger.warning(
"plausibility LLM JSON parse failed (after fallback); "
"raw=%s", content[:300],
)
return out
llm_findings = data.get("findings") or []
if not llm_findings:
logger.warning(
"plausibility LLM returned 0 findings for %d input "
"items; raw=%s", len(items), content[:300],
)
return out
# Phase 1: exact ID match
id_set = set(input_ids)
for entry in llm_findings:
fid = (entry.get("id") or "").strip()
if fid in id_set and fid not in out:
out[fid] = _entry_to_stamp(entry)
# Phase 2: position fallback — for any input item still
# unmapped, use the LLM finding at the same index if it's
# otherwise unclaimed.
if len(out) < len(input_ids):
claimed_indices: set[int] = set()
for idx, entry in enumerate(llm_findings):
fid = (entry.get("id") or "").strip()
if fid in out:
claimed_indices.add(idx)
for idx, input_id in enumerate(input_ids):
if input_id in out:
continue
if idx < len(llm_findings) and idx not in claimed_indices:
out[input_id] = _entry_to_stamp(llm_findings[idx])
claimed_indices.add(idx)
# Phase 3: fuzzy match by ID-tail
if len(out) < len(input_ids):
unmapped_ids = [i for i in input_ids if i not in out]
used_entries: set[int] = set()
for idx, entry in enumerate(llm_findings):
fid = (entry.get("id") or "").strip().lower()
if not fid or any(entry == out.get(i) for i in unmapped_ids):
continue
if idx in used_entries:
continue
for inp in unmapped_ids:
if inp in out:
continue
if inp[-8:].lower() in fid or fid in inp.lower():
out[inp] = _entry_to_stamp(entry)
used_entries.add(idx)
break
if not out:
logger.warning(
"plausibility could not map any of %d input IDs; "
"raw=%s", len(input_ids), content[:300],
)
else:
logger.info(
"plausibility mapped %d/%d findings", len(out),
len(input_ids),
)
except Exception as e:
logger.warning("plausibility batch failed: %s", e)
return out
def _entry_to_stamp(entry: dict) -> dict:
return {
"llm_title": (entry.get("title") or "")[:200],
"llm_severity": (entry.get("severity") or "").upper(),
"llm_recommendation": (entry.get("recommendation") or "")[:400],
"llm_drop": bool(entry.get("drop", False)),
}
async def verify_plausibility(results, doc_texts: dict[str, str]) -> None:
"""Stamp llm_* fields onto every FAIL CheckItem in results.
Args:
results: list of DocCheckResult, each with .checks (list of CheckItem)
and .doc_type
doc_texts: doc_type -> source text excerpt for context
"""
if not results:
return
# Gather candidate fails per doc_type so the prompt can scope the
# excerpt correctly.
by_doc: dict[str, list] = {}
by_doc_meta: dict[str, str] = {}
for r in results:
dt = getattr(r, "doc_type", "")
label = getattr(r, "label", "") or dt
for c in getattr(r, "checks", []) or []:
if getattr(c, "passed", True) or getattr(c, "skipped", False):
continue
# MC checks only — skip the structural P-* placement findings
cid = (getattr(c, "id", "") or "").lower()
if not cid.startswith("mc-"):
continue
by_doc.setdefault(dt, []).append(c)
by_doc_meta[dt] = label
if not by_doc:
return
total = sum(len(v) for v in by_doc.values())
logger.info("plausibility-check: %d findings across %d docs",
total, len(by_doc))
# Circuit-Breaker gegen Ollama-Total-Down: nach N consecutive
# batches mit 0 stamped → ganze Phase abbrechen (statt 200 calls
# warten). Wert konservativ: 6 consecutive empties = qwen3 ist
# offensichtlich nicht in der Lage zu antworten.
consecutive_empty_budget = int(
os.getenv("PLAUSIBILITY_EMPTY_BUDGET", "6"),
)
consecutive_empty = 0
breaker_tripped = False
for dt, checks in by_doc.items():
if breaker_tripped:
break
doc_title = by_doc_meta.get(dt) or dt
doc_text = doc_texts.get(dt) or ""
if not doc_text:
# Fall back to DSE excerpt when the doc has no own text
doc_text = doc_texts.get("dse") or ""
for i in range(0, len(checks), BATCH_SIZE):
if breaker_tripped:
break
batch = checks[i:i + BATCH_SIZE]
items = []
for c in batch:
items.append({
"id": getattr(c, "id", ""),
"label": getattr(c, "label", ""),
"hint": getattr(c, "hint", "") or "",
"severity": getattr(c, "severity", ""),
})
# Cache lookup per item — skip those already cached.
uncached_items: list[dict] = []
for it in items:
key = _checksum(it["id"], it["label"], it["hint"], doc_text)
if key in _CACHE:
continue
uncached_items.append(it)
if not uncached_items:
cache_results = {it["id"]: _CACHE[_checksum(
it["id"], it["label"], it["hint"], doc_text,
)] for it in items}
else:
cache_results = await _ask_llm_batch(
uncached_items, doc_title, doc_text,
)
for it in uncached_items:
rid = it["id"]
if rid in cache_results:
key = _checksum(
it["id"], it["label"], it["hint"], doc_text,
)
_CACHE[key] = cache_results[rid]
# add cached ones too
for it in items:
if it["id"] not in cache_results:
key = _checksum(
it["id"], it["label"], it["hint"], doc_text,
)
if key in _CACHE:
cache_results[it["id"]] = _CACHE[key]
# Stamp onto each CheckItem
stamped = 0
for c in batch:
cid = getattr(c, "id", "")
if cid in cache_results:
res = cache_results[cid]
try:
c.llm_title = res.get("llm_title", "") or ""
sev = res.get("llm_severity", "") or ""
c.llm_severity = sev if sev in (
"HIGH", "MEDIUM", "LOW", "DROP") else ""
c.llm_recommendation = res.get(
"llm_recommendation", "") or ""
c.llm_drop = bool(res.get("llm_drop", False))
stamped += 1
except Exception:
pass
# Circuit-Breaker: stamped=0 zählt als consecutive_empty.
# Ausnahme: wenn ALLE items aus dem _CACHE kamen, ist 0 OK
# (kein neuer LLM-Call gemacht).
if uncached_items and stamped == 0:
consecutive_empty += 1
if consecutive_empty >= consecutive_empty_budget:
logger.warning(
"plausibility circuit-breaker tripped after "
"%d consecutive empty batches — aborting phase",
consecutive_empty,
)
breaker_tripped = True
elif stamped > 0:
consecutive_empty = 0
logger.info("plausibility-check %s: batch %d%d stamped",
dt, len(batch), stamped)