feat(compliance-check): unlock all 1874 MCs + close gap-table items
User: 'we created 1800 MCs only to use 10% of them, that is nonsense'.
Fixed all 6 gaps from the audit.

#1 max_controls=0 (was 20):
- agent_compliance_check_routes _check_single: passes max_controls=0 to
  check_document_with_controls -> ALL MCs evaluated per doc_type.
- 8 doc_types now use 1874 MCs instead of 160 (roughly 10x coverage).
- Regex matching is cheap (<1s per doc); LLM-enrich cap of 10 stays.

#2 LLM-verify fixed:
- llm_verify.py was getting 0/N parsed. Causes: qwen3 thinking-mode
  wrapped output in <think>...</think>, /api/generate doesn't enforce
  JSON, prompt didn't handle code-fence wrappers.
- Now uses /api/chat with format='json' (forces valid JSON).
- _parse_batch_response strips <think> tags, accepts {results:[...]}
  AND bare [...], adds richer regex-fallback parse, logs raw head on
  total parse failure for diagnosis.

#3 Loeschkonzept (deletion concept) checklist (new):
- doc_checks/loeschkonzept_checks.py: 9 L1 + 7 L2 checks per DIN 66398
  + Art. 5(1)(e)/17/32 DSGVO: scope+responsibility, data categories,
  retention periods, legal basis refs (HGB/AO/BGB), deletion trigger,
  deletion process+technical+systems, deletion proof, exceptions +
  Art. 18 lock, review cycle, DSGVO references.
- runner.py registered for loeschkonzept/loeschung/loeschfristen.

#4 regulation backfill script:
- backend-compliance/scripts/backfill_mc_regulation.py: regex-detects
  DSGVO/TDDDG/TMG/BGB/HGB/AO/MStV/UWG/VSBG/PAngV/GwG/BDSG/EU-VO
  references in MC title+question+pass_criteria, UPDATEs regulation +
  article fields.
- Idempotent (only NULL rows), --dry-run flag, batched 200/UPDATE.
- Run inside container:
    docker exec bp-compliance-backend python3 \
      /app/scripts/backfill_mc_regulation.py

#5 MC alias-fallback:
- rag_document_checker._MC_ALIAS_FALLBACK maps doc_types without own
  MCs to a related set: nutzungsbedingungen->agb, social_media->dse,
  sub_processor/scc/tom_annex->avv, loeschfristen->loeschkonzept,
  eu_institution/dsb->dse.
- _load_controls retries with the alias when the primary query
  returns 0 rows.
- 14 additional doc_types now get MC coverage transparently.

#6 cross-domain auto-discovery:
- _autodiscover_missing builds a crawl plan: primary submitted base +
  up to 2 related domains sharing the owner SLD (e.g. BMW Group:
  bmw.de + bmwgroup.com + bmwgroup.jobs).
- Detection: regex over submitted texts for https?://...<owner>...
  hostnames distinct from the primary base.
- Each crawled base contributes documents + cmp_payloads to the
  discovery pool.

Net effect for BMW: 1874 MCs evaluated (90 from cookie alone, was 20),
Loeschkonzept mandatory disclosures can now be graded, LLM overturns
false regex FAILs, Joint-Controller policies on bmwgroup.jobs (Social
Media) are now discoverable.

Same wins will apply to CRA-Compliance check.
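The tolerant parsing described in #2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in llm_verify.py: the function name `parse_batch_response`, the regexes, and the "list of dicts" result shape are all hypothetical.

```python
import json
import re

# Hypothetical sketch; NOT the project's _parse_batch_response.
FENCE = "`" * 3  # a markdown code fence, built indirectly
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
_FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", re.MULTILINE)


def parse_batch_response(raw: str) -> list[dict]:
    """Return the result objects found in a model reply, or [] on failure."""
    # 1. Drop reasoning wrappers emitted by thinking-mode models.
    text = _THINK_RE.sub("", raw).strip()
    # 2. Drop a surrounding markdown code fence, if any.
    text = _FENCE_RE.sub("", text).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # 3. Fallback: take the outermost [...] span and try again.
        match = re.search(r"\[.*\]", text, re.DOTALL)
        if match is None:
            return []
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            return []
    # 4. Accept both {"results": [...]} and a bare [...].
    if isinstance(data, dict):
        data = data.get("results", [])
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]
```

Returning an empty list on total failure keeps the caller's 0/N accounting simple; logging the head of the raw reply at that point, as the commit describes, would go in the two `return []` branches of the fallback.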
@@ -241,8 +241,35 @@ def _map_doc_type(doc_type: str) -> str:
     return _DOC_TYPE_MAP.get(doc_type, doc_type)
 
 
+# Doc-types that have no own MCs but can borrow from a related set.
+# (DB currently covers: dse, cookie, loeschkonzept, widerruf, dsfa,
+# avv, agb, impressum; total 1874 MCs across these.)
+_MC_ALIAS_FALLBACK = {
+    "nutzungsbedingungen": "agb",      # T&C overlap
+    "terms": "agb",
+    "terms_of_use": "agb",
+    "social_media": "dse",             # Joint-controller / Art. 26 is in DSE area
+    "joint_controller": "dse",
+    "sub_processor": "avv",            # Subprocessor list = AVV annex
+    "sub_processor_list": "avv",
+    "scc": "avv",                      # SCC = AVV contract clauses
+    "standardvertragsklauseln": "avv",
+    "tom_annex": "avv",                # TOM annex usually ships as an AVV annex
+    "tom": "avv",
+    "dpa": "avv",
+    "loeschung": "loeschkonzept",
+    "loeschfristen": "loeschkonzept",
+    "eu_institution": "dse",           # EU institution = DSE under VO 2018/1725
+    "dsb": "dse",                      # DSB info is part of the DSE
+}
+
+
 async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
-    """Load all doc_check_controls for a doc_type from PostgreSQL."""
+    """Load all doc_check_controls for a doc_type from PostgreSQL.
+
+    Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
+    type (e.g. 'nutzungsbedingungen' -> 'agb').
+    """
     try:
         import asyncpg
         db = db_url or os.getenv(
@@ -264,6 +291,10 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
             query += f" LIMIT {limit}"
 
         rows = await conn.fetch(query, doc_type)
+        if not rows and doc_type in _MC_ALIAS_FALLBACK:
+            fallback = _MC_ALIAS_FALLBACK[doc_type]
+            logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
+            rows = await conn.fetch(query, fallback)
         return [dict(r) for r in rows]
     except Exception as e:
         logger.warning("MC query failed: %s", e)
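The retry-with-alias behaviour in the hunk above can be tried without a database. The following is a minimal sketch, assuming a hypothetical in-memory `FAKE_DB` and `fetch` in place of the asyncpg connection; `load_controls` is an illustrative stand-in for `_load_controls`, not the project's function.

```python
import asyncio

# Two entries copied from _MC_ALIAS_FALLBACK in the diff above.
MC_ALIAS_FALLBACK = {
    "nutzungsbedingungen": "agb",
    "loeschfristen": "loeschkonzept",
}

# Hypothetical stand-in for the controls table: only 'agb' has rows here.
FAKE_DB = {"agb": [{"id": "MC-AGB-1"}]}


async def fetch(doc_type: str) -> list[dict]:
    """Pretend database query (assumption: one lookup per doc_type)."""
    return list(FAKE_DB.get(doc_type, []))


async def load_controls(doc_type: str) -> list[dict]:
    """Query the primary doc_type; retry once with its alias if empty."""
    rows = await fetch(doc_type)
    if not rows and doc_type in MC_ALIAS_FALLBACK:
        rows = await fetch(MC_ALIAS_FALLBACK[doc_type])
    return rows


if __name__ == "__main__":
    # 'nutzungsbedingungen' has no rows of its own, so it borrows from 'agb'.
    print(asyncio.run(load_controls("nutzungsbedingungen")))
```

Note that the retry is a single hop: if the alias target has no rows either (as with 'loeschfristen' in this sketch), the result stays empty rather than chaining through further aliases.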