feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs
Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
- Generic tokens removed (documentation, procedure, process)
- L2 sub-topics added (108K + 64K controls)
- Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -460,12 +460,50 @@ WICHTIGE REGELN:
 
 7. MERGE-KEY: Erzeuge im JSON-Output ein zusaetzliches Feld "merge_key" mit
 dem Format: "action_type:normalized_object:control_phase"
 
+WICHTIG: Waehle normalized_object NUR aus dieser Liste kanonischer Tokens:
+SECURITY: multi_factor_auth, password_policy, credentials, session_management,
+privileged_access, access_control, encryption, transport_encryption,
+key_management, certificate_management, network_security, network_segmentation,
+firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
+compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
+physical_security, secure_development, api_security, input_validation,
+container_security, logging_configuration
+DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
+data_subject_rights, data_retention, data_transfer, data_breach_notification,
+dpia, data_processing_agreement, privacy_by_design, data_processing_register,
+data_classification, cookie_consent, video_surveillance
+GOVERNANCE: policy, procedure, process, training, awareness, incident,
+risk_management, third_party_management, change_management, documentation,
+records_management, compliance_reporting, asset_management,
+human_resources_security
+REGULATORY: supervisory_authority, certification, product_safety, ai_system,
+financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
+telecommunications, medical_device, payment_services, critical_infrastructure,
+supply_chain_due_diligence, sustainability_reporting
+
+Wenn KEIN Token passt: "OTHER:kurzbeschreibung" (z.B. "OTHER:battery_recycling")
+
+ABGRENZUNGEN (haeufige Fehler vermeiden!):
+- monitoring = NUR kontinuierliche Echtzeit-Ueberwachung von Systemen
+- audit_logging = Protokollierung, Audit Trail, Nachvollziehbarkeit
+- compliance_audit = externe Pruefungen, Zertifizierungsaudits
+- training = Schulungen DURCHFUEHREN (nicht "ueberwachen")
+- procedure = Verfahren DEFINIEREN (nicht Incident-Behandlung)
+- incident = Sicherheitsvorfaelle BEHANDELN
+- alerting = Meldepflichten und Benachrichtigungen
+- personal_data = DSGVO-Verarbeitungsgrundsaetze (nicht Zertifizierung!)
+- certification = Zertifizierung/Konformitaet (nicht Datenschutz)
+
 Beispiele:
-- "implement:api_rate_limiting:implementation"
-- "define:access_control_policy:definition"
-- "monitor:third_party_vulnerabilities:monitoring"
-- "test:authentication_mechanism:testing"
+- "implement:multi_factor_auth:implementation"
+- "define:access_control:definition"
+- "monitor:network_security:monitoring"
+- "test:vulnerability:testing"
+- "report:supervisory_authority:reporting"
+- "implement:audit_logging:implementation" (NICHT monitoring!)
+- "define:incident:definition" (Incident-Verfahren, NICHT procedure!)
+- "train:training:operation" (Schulung, NICHT monitoring!)
 
 8. APPLICABILITY + SCANNER: Bestimme fuer jedes Control:
 - applicability: Unter welchen Bedingungen gilt dieses Control?
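The prompt rule in this hunk asks the model for a three-part `merge_key` whose middle field is either a canonical token or an `OTHER:<short_description>` escape. A minimal sketch of that contract follows, assuming a four-token subset of the list; `CANONICAL_SUBSET`, `build_merge_key`, and `object_is_allowed` are hypothetical names for illustration and do not appear in the commit.

```python
# Hypothetical helpers; the names and the token subset are illustrative only.
CANONICAL_SUBSET = {"multi_factor_auth", "access_control", "audit_logging", "incident"}


def build_merge_key(action_type: str, normalized_object: str, control_phase: str) -> str:
    """Join the three fields in the order the prompt specifies."""
    return f"{action_type}:{normalized_object}:{control_phase}"


def object_is_allowed(normalized_object: str) -> bool:
    """Accept a canonical token or a non-empty "OTHER:<short_description>"."""
    if normalized_object.startswith("OTHER:"):
        return len(normalized_object) > len("OTHER:")
    return normalized_object in CANONICAL_SUBSET


print(build_merge_key("implement", "multi_factor_auth", "implementation"))
# implement:multi_factor_auth:implementation
print(object_is_allowed("OTHER:battery_recycling"))  # True
print(object_is_allowed("api_rate_limiting"))  # False (one of the removed example objects)
```

Restricting the object field to a closed vocabulary is what lets keys from different controls collide on purpose, so free-form objects such as `api_rate_limiting` would need either a canonical token or the `OTHER:` form.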
@@ -2472,6 +2510,81 @@ def _ensure_list(val) -> list:
     return []
 
 
+# Canonical object tokens from object_ontology (loaded once)
+_CANONICAL_OBJECTS: set[str] | None = None
+
+
+def _load_canonical_objects() -> set[str]:
+    """Load canonical tokens from DB, fallback to hardcoded set."""
+    global _CANONICAL_OBJECTS
+    if _CANONICAL_OBJECTS is not None:
+        return _CANONICAL_OBJECTS
+    try:
+        from db.session import get_engine
+        from sqlalchemy import text
+        engine = get_engine()
+        with engine.connect() as c:
+            rows = c.execute(text(
+                "SELECT canonical_token FROM compliance.object_ontology"
+            )).fetchall()
+            _CANONICAL_OBJECTS = {r[0] for r in rows}
+    except Exception:
+        _CANONICAL_OBJECTS = set()
+    if not _CANONICAL_OBJECTS:
+        _CANONICAL_OBJECTS = {
+            "multi_factor_auth", "password_policy", "credentials",
+            "session_management", "privileged_access", "access_control",
+            "encryption", "transport_encryption", "key_management",
+            "certificate_management", "network_security",
+            "network_segmentation", "firewall", "vpn", "remote_access",
+            "monitoring", "audit_logging", "siem", "alerting",
+            "compliance_audit", "vulnerability", "patch_management",
+            "backup", "disaster_recovery", "personal_data",
+            "sensitive_data", "consent", "data_subject_rights",
+            "data_retention", "data_transfer", "data_breach_notification",
+            "dpia", "data_processing_agreement", "privacy_by_design",
+            "policy", "procedure", "process", "training", "awareness",
+            "incident", "risk_management", "third_party_management",
+            "change_management", "documentation", "supervisory_authority",
+            "certification", "product_safety", "ai_system", "aml",
+            "critical_infrastructure", "medical_device",
+        }
+    return _CANONICAL_OBJECTS
+
+
+def _validate_merge_key(merge_key: str) -> str:
+    """Validate merge_key object against canonical ontology.
+
+    Returns the merge_key (possibly corrected). Logs warnings for
+    unknown objects so they can be tracked.
+    """
+    parts = merge_key.split(":", 2)
+    if len(parts) < 2:
+        return merge_key
+
+    action, obj = parts[0], parts[1]
+    phase = parts[2] if len(parts) > 2 else "implementation"
+
+    # Accept OTHER prefix (LLM signaling unknown object); split(":", 2) leaves obj == "OTHER"
+    if obj == "OTHER" or obj.startswith("OTHER:"):
+        return merge_key
+
+    # Check against canonical ontology
+    canonical = _load_canonical_objects()
+    if obj in canonical:
+        return merge_key
+
+    # Try normalize_object() as fallback
+    from services.control_dedup import normalize_object
+    normed = normalize_object(obj)
+    if normed in canonical:
+        return f"{action}:{normed}:{phase}"
+
+    # Unknown object: log and keep as-is (will be clustered by embedding)
+    logger.debug("merge_key unknown object: %s (normed: %s)", obj, normed)
+    return merge_key
+
+
 # ---------------------------------------------------------------------------
 # Decomposition Pass
 # ---------------------------------------------------------------------------
@@ -3025,10 +3138,10 @@ class DecompositionPass:
                 evidence_type=parsed.get("evidence_type", ""),
                 provides_context=_ensure_list(parsed.get("provides_context", [])),
             )
-            # Store merge_key from LLM output in metadata
+            # Store merge_key from LLM output in metadata, with validation
             llm_merge_key = parsed.get("merge_key", "")
             if llm_merge_key:
-                atomic.merge_group_hint = llm_merge_key
+                atomic.merge_group_hint = _validate_merge_key(llm_merge_key)
 
             atomic.parent_control_uuid = obl["parent_uuid"]
             atomic.obligation_candidate_id = obl["candidate_id"]
@@ -0,0 +1,84 @@
+"""Shared embedding + sub-clustering utilities for the control pipeline."""
+
+import logging
+import os
+from collections import defaultdict
+
+import httpx
+import numpy as np
+from sklearn.cluster import MiniBatchKMeans
+
+logger = logging.getLogger(__name__)
+
+EMBEDDING_URL = os.getenv(
+    "EMBEDDING_SERVICE_URL", "http://embedding-service:8087"
+)
+
+
+def embed_texts(texts: list[str]) -> np.ndarray | None:
+    """Embed texts via the embedding-service in batches of 64."""
+    try:
+        result = np.zeros((len(texts), 1024), dtype=np.float32)
+        batch_size = 64
+        for i in range(0, len(texts), batch_size):
+            batch = texts[i : i + batch_size]
+            for attempt in range(3):
+                try:
+                    with httpx.Client(
+                        timeout=httpx.Timeout(60.0, connect=10.0)
+                    ) as client:
+                        resp = client.post(
+                            f"{EMBEDDING_URL}/embed", json={"texts": batch}
+                        )
+                        resp.raise_for_status()
+                        embs = resp.json().get("embeddings", [])
+                        end = min(i + len(embs), len(texts))
+                        result[i:end] = np.array(embs, dtype=np.float32)
+                        break
+                except Exception as e:
+                    if attempt == 2:
+                        logger.error("Embed batch %d failed: %s", i, e)
+                    import time
+                    time.sleep(2)
+        return result
+    except Exception as e:
+        logger.error("Embedding failed: %s", e)
+        return None
+
+
+def subcluster_controls(
+    controls: list[dict], target_size: int = 50
+) -> list[list[dict]]:
+    """Sub-cluster controls by embedding similarity.
+
+    Returns a list of clusters. Falls back to naive chunking
+    if embedding fails.
+    """
+    if len(controls) <= target_size:
+        return [controls]
+
+    texts = [c.get("title", "") or c.get("control_id", "") for c in controls]
+    embeddings = embed_texts(texts)
+    if embeddings is None:
+        return [
+            controls[i : i + target_size]
+            for i in range(0, len(controls), target_size)
+        ]
+
+    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+    norms[norms == 0] = 1
+    normalized = embeddings / norms
+
+    k = max(2, min(len(controls) // target_size, 30))
+    kmeans = MiniBatchKMeans(
+        n_clusters=k,
+        batch_size=min(100, len(controls)),
+        max_iter=50,
+        random_state=42,
+    )
+    labels = kmeans.fit_predict(normalized)
+
+    clusters: dict[int, list[dict]] = defaultdict(list)
+    for i, ctrl in enumerate(controls):
+        clusters[int(labels[i])].append(ctrl)
+    return list(clusters.values())
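The sub-clustering hunk picks the cluster count from `target_size` and clamps it to the range 2 to 30, and it falls back to fixed-size slices when embedding fails. A small sketch of just those two pieces, with no network or scikit-learn dependency, may make the resulting sizes easier to see; `choose_k` and `naive_chunks` are hypothetical names that restate expressions from the diff rather than functions defined in it.

```python
def choose_k(n_controls: int, target_size: int = 50) -> int:
    """Cluster count as computed in the diff: n // target_size, clamped to [2, 30]."""
    return max(2, min(n_controls // target_size, 30))


def naive_chunks(items: list, target_size: int = 50) -> list[list]:
    """Fallback when embedding fails: fixed-size slices in input order."""
    return [items[i : i + target_size] for i in range(0, len(items), target_size)]


for n in (60, 500, 5000):
    print(n, choose_k(n))
# 60 2
# 500 10
# 5000 30

print([len(chunk) for chunk in naive_chunks(list(range(120)))])
# [50, 50, 20]
```

With 5000 controls the cap of 30 gives an average of roughly 167 controls per cluster, so `target_size` acts as a hint for the k-means path and as a hard bound only on the fallback path.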