38a347a82a
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Successful in 9s
CI / validate-canonical-controls (push) Successful in 12s
CI / loc-budget (push) Successful in 24s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m11s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 24s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
AGB v2 (decision_method routing, 71%FP->~0) + DSE v3 (4-layer, recovered from container) + Architektur-Tab into /sdk/agent live path. Incl CI robustness (detect-changes.sh + PR-head checkout) + security (hardcoded Qdrant key removed, gitleaks allowlist). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
52 lines
2.2 KiB
Python
52 lines
2.2 KiB
Python
"""CONTENT-Pruefer / decision_method=EMBEDDING.
|
|
|
|
Ist die Pflicht SEMANTISCH im Text vorhanden? Max-Cosinus (Doc-Chunks x Control-
|
|
Paraphrasen) >= per-Control-Schwelle. Deterministisch (festes Embedding-Modell)
|
|
und gecacht. Rettet Recall-FP (Klausel da, anders formuliert).
|
|
|
|
Faellt der Embedding-Service aus, liefert der Checker present=None (unklar) — der
|
|
Aufrufer behaelt dann das Keyword-Ergebnis (kein Hang, kein Crash).
|
|
(Validiert an AGB: 17 Items, per-Item-Schwelle, 0 Fehl-Rescue.)
|
|
"""
|
|
from __future__ import annotations
|
|
|
|
import asyncio
|
|
import logging
|
|
|
|
from .base import CheckResult, ControlSpec, DocContext, VerificationMethod
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
# Paraphrasen-Vektoren je Control einmal einbetten + cachen.
|
|
_PARA_CACHE: dict[str, list] = {}
|
|
|
|
|
|
class EmbeddingChecker:
|
|
verification_method = VerificationMethod.CONTENT
|
|
|
|
async def check(self, ctrl: ControlSpec, doc: DocContext) -> CheckResult:
|
|
text = doc.text or ""
|
|
paras = ctrl.paraphrases or []
|
|
thr = ctrl.embed_threshold if ctrl.embed_threshold is not None else 0.60
|
|
if not paras or len(text) < 100:
|
|
return CheckResult(present=None, source="embedding")
|
|
try:
|
|
from compliance.services.mc_embedding_matcher import (
|
|
DIM, _chunk_text, _cosine, _embed_texts,
|
|
)
|
|
if ctrl.control_id not in _PARA_CACHE:
|
|
pv = await _embed_texts(paras)
|
|
_PARA_CACHE[ctrl.control_id] = [v for v in pv if v and len(v) == DIM]
|
|
pvecs = _PARA_CACHE[ctrl.control_id]
|
|
chunks = _chunk_text(text)
|
|
cvecs = [v for v in await asyncio.wait_for(
|
|
_embed_texts(chunks), timeout=90.0) if v and len(v) == DIM]
|
|
except (Exception, asyncio.TimeoutError) as e:
|
|
logger.info("embedding checker inaktiv %s: %s", ctrl.control_id, str(e)[:80])
|
|
return CheckResult(present=None, source="embedding")
|
|
if not pvecs or not cvecs:
|
|
return CheckResult(present=None, source="embedding")
|
|
best = max((_cosine(p, c) for p in pvecs for c in cvecs), default=0.0)
|
|
return CheckResult(present=best >= thr, confidence=round(best, 3),
|
|
source="embedding")
|