feat: deterministic MC checking for ALL controls, no LLM, reproducible

Replaced LLM-based MC verification with deterministic keyword matching:
- Extracts keywords from pass_criteria/fail_criteria
- Matches against document text via regex (case-insensitive)
- PASS if >= 60% of pass_criteria keywords are found AND no fail_criteria is triggered
- Same text + same MCs = same result every time
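
The matching rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names `keyword_found` and `check_control` are hypothetical, and it assumes that a fail criterion counts as "triggered" as soon as any of its keywords occurs in the text, and that a control without pass keywords cannot pass.

```python
import re

PASS_THRESHOLD = 0.6  # from the commit message: >= 60% of keywords must match


def keyword_found(keyword, text):
    """Case-insensitive literal search; re.escape stops the keyword
    from being interpreted as a regex pattern."""
    return re.search(re.escape(keyword), text, flags=re.IGNORECASE) is not None


def check_control(text, pass_keywords, fail_keywords):
    """Return "PASS" or "FAIL" for one control (hypothetical helper)."""
    # Assumption: any fail keyword present in the text triggers a FAIL.
    if any(keyword_found(k, text) for k in fail_keywords):
        return "FAIL"
    # Assumption: a control with nothing to match cannot pass.
    if not pass_keywords:
        return "FAIL"
    hits = sum(keyword_found(k, text) for k in pass_keywords)
    return "PASS" if hits / len(pass_keywords) >= PASS_THRESHOLD else "FAIL"
```

Because the function only reads its arguments and uses no randomness or external calls, the same text and the same keywords always produce the same verdict.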

Checks ALL MCs for the doc_type (max_controls=0):
- DSE: all 571 controls checked in <1 second
- Impressum: all 75 controls
- Cookie: all 381 controls
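
A sketch of how `max_controls=0` could be read as "no cap". The helper `limit_controls` is hypothetical and only illustrates the convention; the actual handling inside `check_document_with_controls` is not shown in this commit page.

```python
def limit_controls(controls, max_controls=0):
    """Return all controls when max_controls is 0 (or negative),
    otherwise only the first max_controls entries (hypothetical helper)."""
    if max_controls <= 0:
        return list(controls)
    return list(controls[:max_controls])
```

With this convention, the previous call with `max_controls=15` would have checked at most 15 controls, while `max_controls=0` checks every control for the doc_type.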

No LLM calls needed: the check is purely deterministic keyword matching.
Bigram extraction for compound terms (e.g. "standardvertragsklauseln").
Stop word filtering for German legal text.
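
Keyword extraction with stop word filtering and bigrams might look like the sketch below. The stop word set is a tiny illustrative sample, not the list used in the commit, and `extract_keywords` is a hypothetical name. It assumes bigrams are built only from words that are adjacent in the original criteria text, so that each bigram can still match the document literally.

```python
import re
from typing import List

# Tiny illustrative sample; a real German legal stop list would be much longer.
STOP_WORDS = {
    "der", "die", "das", "den", "dem", "des", "und", "oder", "ist",
    "wird", "werden", "ein", "eine", "in", "zu", "von", "mit", "auf",
}


def _is_content(token):
    # Assumption: very short tokens carry no meaning for matching.
    return token not in STOP_WORDS and len(token) > 2


def extract_keywords(criteria: str) -> List[str]:
    """Return unigrams plus adjacent-word bigrams, in a stable order."""
    tokens = re.findall(r"[a-zäöüß]+", criteria.lower())
    unigrams = [t for t in tokens if _is_content(t)]
    # Only pair words that were neighbours in the original text.
    bigrams = [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if _is_content(a) and _is_content(b)
    ]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(unigrams + bigrams))
```

Keeping the output order stable (instead of returning a set) is one way to make the downstream match ratio reproducible across runs.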

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-10 21:51:58 +02:00
parent 9a9a11b248
commit 5ea83e9b33
2 changed files with 207 additions and 158 deletions
@@ -288,7 +288,7 @@ async def _check_single_document(entry: DocCheckEntry) -> list[DocCheckResult]:
 try:
     from compliance.services.rag_document_checker import check_document_with_controls
     mc_results = await check_document_with_controls(
-        doc_text, entry.doc_type, entry.label, max_controls=15,
+        doc_text, entry.doc_type, entry.label, max_controls=0,
     )
     if mc_results:
         # Add MC results as additional checks to the main result