feat(compliance-check): unlock all 1874 MCs + close gap-table items

User: 'wir haben 1800 MCs erstellt um sie zu 10% zu nutzen — das ist
Schwachsinn'. Fixed all 6 gaps from the audit.

#1 max_controls=0 (was 20):
- agent_compliance_check_routes _check_single: passes max_controls=0 to
  check_document_with_controls -> ALL MCs evaluated per doc_type.
- 8 doc_types now use 1874 MCs instead of 160 (10x coverage).
- Regex matching is cheap (<1s per doc); LLM-enrich cap of 10 stays.

#2 LLM-verify fixed:
- llm_verify.py was getting 0/N parsed. Causes: qwen3 thinking-mode
  wrapped output in <think>...</think>, /api/generate doesn't enforce
  JSON, prompt didn't handle code-fence wrappers.
- Now uses /api/chat with format='json' (forces valid JSON).
- _parse_batch_response strips <think> tags, accepts {results:[...]}
  AND bare [...], adds richer regex-fallback parse, logs raw head on
  total parse failure for diagnosis.

#3 Loeschkonzept checklist (new):
- doc_checks/loeschkonzept_checks.py — 9 L1 + 7 L2 checks per DIN 66398
  + Art. 5(1)(e)/17/32 DSGVO: scope+responsibility, data categories,
  retention periods, legal basis refs (HGB/AO/BGB), deletion trigger,
  deletion process+technical+systems, deletion proof, exceptions +
  Art. 18 lock, review cycle, DSGVO references.
- runner.py registered for loeschkonzept/loeschung/loeschfristen.

#4 regulation backfill script:
- backend-compliance/scripts/backfill_mc_regulation.py — regex-detects
  DSGVO/TDDDG/TMG/BGB/HGB/AO/MStV/UWG/VSBG/PAngV/GwG/BDSG/EU-VO
  references in MC title+question+pass_criteria, UPDATEs regulation +
  article fields.
- Idempotent (only NULL rows), --dry-run flag, batched 200/UPDATE.
- Run inside container: docker exec bp-compliance-backend python3 \
    /app/scripts/backfill_mc_regulation.py

#5 MC alias-fallback:
- rag_document_checker._MC_ALIAS_FALLBACK maps doc_types without own
  MCs to a related set: nutzungsbedingungen->agb, social_media->dse,
  sub_processor/scc/tom_annex->avv, loeschfristen->loeschkonzept,
  eu_institution/dsb->dse.
- _load_controls retries with the alias when the primary query
  returns 0 rows.
- 14 additional doc_types now get MC coverage transparently.

#6 cross-domain auto-discovery:
- _autodiscover_missing builds a crawl plan: primary submitted base
  + up to 2 related domains sharing the owner SLD (e.g. BMW Group:
  bmw.de + bmwgroup.com + bmwgroup.jobs).
- Detection: regex over submitted texts for https?://...<owner>...
  hostnames distinct from the primary base.
- Each crawled base contributes documents + cmp_payloads to the
  discovery pool.

Net effect for BMW: 1874 MCs evaluated (90 from cookie alone, was
20), Loeschkonzept Pflichtangaben benoten-bar, LLM overturns false
regex FAILs, Joint-Controller policies on bmwgroup.jobs (Social
Media) jetzt entdeckbar. Same wins will apply to CRA-Compliance check.
This commit is contained in:
Benjamin Admin
2026-05-17 13:07:50 +02:00
parent fab1e35847
commit 8a44e67293
6 changed files with 565 additions and 70 deletions
@@ -241,8 +241,35 @@ def _map_doc_type(doc_type: str) -> str:
return _DOC_TYPE_MAP.get(doc_type, doc_type)
# Doc-types that have no own MCs but can borrow from a related set.
# (DB currently covers: dse, cookie, loeschkonzept, widerruf, dsfa,
# avv, agb, impressum — total 1874 MCs across these.)
_MC_ALIAS_FALLBACK = {
"nutzungsbedingungen": "agb", # T&C overlap
"terms": "agb",
"terms_of_use": "agb",
"social_media": "dse", # Joint-controller / Art. 26 is in DSE area
"joint_controller": "dse",
"sub_processor": "avv", # Subprocessor list = AVV annex
"sub_processor_list": "avv",
"scc": "avv", # SCC = AVV-Vertragsklauseln
"standardvertragsklauseln": "avv",
"tom_annex": "avv", # TOM-Annex meist als AVV-Anlage
"tom": "avv",
"dpa": "avv",
"loeschung": "loeschkonzept",
"loeschfristen": "loeschkonzept",
"eu_institution": "dse", # EU institution = DSE under VO 2018/1725
"dsb": "dse", # DSB info ist Teil der DSE
}
async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
"""Load all doc_check_controls for a doc_type from PostgreSQL."""
"""Load all doc_check_controls for a doc_type from PostgreSQL.
Falls back via _MC_ALIAS_FALLBACK when no MCs exist for the requested
type (e.g. 'nutzungsbedingungen' -> 'agb').
"""
try:
import asyncpg
db = db_url or os.getenv(
@@ -264,6 +291,10 @@ async def _load_controls(doc_type: str, db_url: str, limit: int) -> list[dict]:
query += f" LIMIT {limit}"
rows = await conn.fetch(query, doc_type)
if not rows and doc_type in _MC_ALIAS_FALLBACK:
fallback = _MC_ALIAS_FALLBACK[doc_type]
logger.info("No MCs for %s -> falling back to %s", doc_type, fallback)
rows = await conn.fetch(query, fallback)
return [dict(r) for r in rows]
except Exception as e:
logger.warning("MC query failed: %s", e)