feat: RAG-based document verification against 144K Control Library

New module: rag_document_checker.py
- Searches RAG (Qdrant) for controls relevant to document type
- Filters by regulation (DSGVO Art. 13, TDDDG §25, BGB §355, etc.)
- LLM (Qwen 3.5:35b) verifies each control against document text
- Returns fulfilled/missing with evidence text + severity
- Supports: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept
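The regulation filter described above could look roughly like the following. This is a minimal sketch, not the module's actual code: the mapping table, the `regulation` payload key, and the function name are all assumptions made for illustration, and the Qdrant search itself is left out.

```python
import re

# Hypothetical mapping from document type to regulation references.
# The commit names these regulations but does not show this table.
REGULATIONS_BY_DOC_TYPE = {
    "dsi": ["DSGVO Art. 13"],
    "cookie": ["TDDDG §25"],
    "widerruf": ["BGB §355"],
}


def _norm(ref: str) -> str:
    """Drop whitespace and lowercase so 'Art.13' and 'Art. 13' compare equal."""
    return re.sub(r"\s+", "", ref).lower()


def filter_controls_by_regulation(hits: list, doc_type: str) -> list:
    """Keep only search hits whose regulation matches the document type.

    `hits` is assumed to be a list of payload dicts with an optional
    'regulation' key; hits without that key are dropped.
    """
    wanted = [_norm(r) for r in REGULATIONS_BY_DOC_TYPE.get(doc_type.lower(), [])]
    return [
        hit for hit in hits
        if any(_norm(hit.get("regulation", "")).startswith(w) for w in wanted)
    ]
```

Normalizing whitespace before comparing keeps the filter tolerant of the spelling variants that legal citations tend to have.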

Integration in doc-check endpoint:
- Regex checklist runs first (fast, deterministic)
- RAG checks run after (semantic, catches what regex misses)
- Both results combined in single response
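The merge step can be sketched in isolation as below. The field names (`id`, `label`, `passed`, `severity`, `matched_text`, `checks`, `findings_count`) follow the diff in this commit; the dataclasses are simplified stand-ins for the real models, and `merge_rag_checks` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class CheckItem:
    # Simplified stand-in for the real CheckItem model.
    id: str
    label: str
    passed: bool
    severity: str
    matched_text: str = ""


@dataclass
class DocCheckResult:
    # Simplified stand-in; only the fields touched by the merge are modelled.
    checks: list = field(default_factory=list)
    findings_count: int = 0


def merge_rag_checks(result: DocCheckResult, rag_checks) -> DocCheckResult:
    """Append RAG checks to the regex result and count failed ones as findings."""
    for rc in rag_checks or []:  # tolerate None or an empty list
        result.checks.append(CheckItem(
            id=rc["id"], label=rc["label"], passed=rc["passed"],
            severity=rc["severity"], matched_text=rc.get("matched_text", ""),
        ))
        if not rc["passed"]:
            result.findings_count += 1
    return result
```

Because the RAG checks are appended to the regex result rather than replacing it, a failure in the RAG path leaves the deterministic checklist untouched.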

The LLM prompt requests a JSON reply: {fulfilled, evidence, issue, severity}
Think tags are stripped and the JSON object is extracted from the response.
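One way to do this post-processing is sketched below. It assumes the model wraps its reasoning in `<think>...</think>` and that the first `{` after stripping starts the JSON object; the function name and the empty-string defaults are illustrative, not taken from the module.

```python
import json
import re
from typing import Optional

# Assumption: reasoning is wrapped in <think>...</think> tags.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def parse_llm_verdict(raw: str) -> Optional[dict]:
    """Strip think blocks, then decode the first JSON object in the reply.

    Returns None when no JSON object can be decoded.
    """
    cleaned = _THINK_RE.sub("", raw)
    start = cleaned.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        obj, _ = json.JSONDecoder().raw_decode(cleaned[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    return {
        # Only a real JSON `true` counts; the string "false" must not pass.
        "fulfilled": obj.get("fulfilled") is True,
        "evidence": str(obj.get("evidence", "")),
        "issue": str(obj.get("issue", "")),
        "severity": str(obj.get("severity", "")),
    }
```

Using `raw_decode` instead of a greedy regex avoids over-matching when the model adds prose containing braces after the JSON object.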

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-06 13:19:15 +02:00
parent 13c5880f51
commit 090da0f71b
2 changed files with 196 additions and 0 deletions
@@ -198,6 +198,24 @@ async def _check_single_document(entry: DocCheckEntry) -> list[DocCheckResult]:
    # Main document check (full text against primary type)
    main_result = _run_checklist(doc_text, entry.doc_type, entry.label, entry.url, word_count)
    # RAG-based deep check (semantic verification against Control Library)
    try:
        from compliance.services.rag_document_checker import check_document_with_rag
        rag_checks = await check_document_with_rag(
            doc_text, entry.doc_type, entry.label, entry.url,
        )
        if rag_checks:
            for rc in rag_checks:
                main_result.checks.append(CheckItem(
                    id=rc["id"], label=rc["label"], passed=rc["passed"],
                    severity=rc["severity"], matched_text=rc.get("matched_text", ""),
                ))
                if not rc["passed"]:
                    main_result.findings_count += 1
    except Exception as e:
        logger.warning("RAG check failed for %s: %s", entry.label, e)
    all_results.append(main_result)
    # Sub-section checks (auto-detected from headings)