feat(audit-pipeline): P72 MC-Scope-Classifier + P80 Snapshot/Replay-Foundation [migration-approved]

P72  MC-Scope-Classifier: determine the ACTUAL target document for each MC
     (cookie_richtlinie/dse/banner_implementation/cmp_audit/tom/avv/jc/
      impressum/agb/widerruf/process/accounting/other).
     - Migration 145: scope_doc_type column + index on canonical_controls
     - Backfill script with regex heuristic (12 rules, sorted by priority)
     - First 11k-sample distribution: 76% other (heuristic v1 is too strict;
       v2 needs looser patterns for DSE/TOM)
     - Goal: before the MC scorecard filters, every MC knows which document
       it addresses. Until now, eHealth/HGB MCs ended up in the cookie audit.
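
A priority-sorted regex heuristic of the kind described for the backfill script could look roughly like the sketch below. The rule list, the priorities, and the helper names `classify_scope` and `scope_distribution` are hypothetical illustrations, not the 12 rules from the actual script; only the `scope_doc_type` values come from the list above.

```python
import re
from collections import Counter
from typing import Iterable, List, Pattern, Tuple

# Hypothetical rules: (priority, scope_doc_type, pattern). Lower number wins.
# These patterns are illustrative only, not the rules from the backfill script.
RULES: List[Tuple[int, str, Pattern[str]]] = [
    (10, "cookie_richtlinie", re.compile(r"\bcookie[- ]?(richtlinie|policy)\b", re.I)),
    (20, "banner_implementation", re.compile(r"\b(consent|cookie)[- ]?banner\b", re.I)),
    (30, "avv", re.compile(r"\b(auftragsverarbeitung|avv)\b", re.I)),
    (40, "tom", re.compile(r"\btechnische und organisatorische\b", re.I)),
    (50, "dse", re.compile(r"\b(datenschutzerkl(ae|ä)rung|privacy policy)\b", re.I)),
    (60, "impressum", re.compile(r"\bimpressum\b", re.I)),
]
SORTED_RULES = sorted(RULES, key=lambda rule: rule[0])


def classify_scope(text: str) -> str:
    """Return the scope_doc_type of the first matching rule in priority order."""
    for _priority, doc_type, pattern in SORTED_RULES:
        if pattern.search(text):
            return doc_type
    # No rule matched: this is the bucket that reached 76% with heuristic v1.
    return "other"


def scope_distribution(texts: Iterable[str]) -> Counter:
    """Count how many texts fall into each scope_doc_type."""
    return Counter(classify_scope(t) for t in texts)
```

Because the rules are evaluated in priority order rather than in order of appearance in the text, a control that mentions both a privacy policy and a cookie policy is assigned to `cookie_richtlinie` in this sketch. Loosening a pattern for v2 would then mean editing one entry in `RULES` and re-running `scope_distribution` on the sample.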

P80  Snapshot + Replay-Foundation: persist raw data so that the
     audit pipeline can be rebuilt without a new crawl.
     - Migration 146: compliance_check_snapshots table (JSONB per
       doc_entries/banner_result/profile/cmp_vendors/scan_context)
     - services.check_snapshot.save_snapshot/load_snapshot/list
     - Endpoints GET /snapshots, GET /snapshots/{id}
     - Hook in _run_compliance_check: after the mail is sent, the snapshot
       is saved automatically via a separate SessionLocal (background-task safe)
     - Replay endpoint follows in the next PR (needs refactoring
       of _run_compliance_check into crawl_phase + interpret_phase)
     - Effect: test cycle 7min -> 5sec for pure logic changes
       (P73/P79/P81+ benefit directly). Snapshots also serve as a
       regression test corpus (P81 Golden-Truth-Library).
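
The snapshot/replay idea can be sketched in a few lines. This is a minimal, self-contained illustration under stated assumptions: it uses the standard-library `sqlite3` with JSON text instead of PostgreSQL JSONB and SQLAlchemy, stores only two of the five payloads, and the functions `interpret_phase` and `replay` as well as the keys `reject_button` and `type` are hypothetical. The real `save_snapshot`/`load_snapshot` signatures in `services.check_snapshot` may differ.

```python
import json
import sqlite3
import uuid
from typing import Any, Dict, List, Optional

# Simplified stand-in for the table from migration 146 (JSON text, not JSONB).
SCHEMA = """
CREATE TABLE IF NOT EXISTS compliance_check_snapshots (
    id TEXT PRIMARY KEY,
    check_id TEXT NOT NULL,
    site_label TEXT,
    doc_entries TEXT NOT NULL,
    banner_result TEXT,
    replay_count INTEGER NOT NULL DEFAULT 0
)
"""


def save_snapshot(conn: sqlite3.Connection, check_id: str,
                  doc_entries: List[Dict[str, Any]],
                  banner_result: Optional[Dict[str, Any]],
                  site_label: str = "") -> str:
    """Persist the raw crawl output and return the new snapshot id."""
    snapshot_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO compliance_check_snapshots "
        "(id, check_id, site_label, doc_entries, banner_result) "
        "VALUES (?, ?, ?, ?, ?)",
        (snapshot_id, check_id, site_label,
         json.dumps(doc_entries), json.dumps(banner_result)),
    )
    conn.commit()
    return snapshot_id


def load_snapshot(conn: sqlite3.Connection,
                  snapshot_id: str) -> Optional[Dict[str, Any]]:
    """Return the stored raw data, or None if the id is unknown."""
    row = conn.execute(
        "SELECT check_id, site_label, doc_entries, banner_result "
        "FROM compliance_check_snapshots WHERE id = ?",
        (snapshot_id,),
    ).fetchone()
    if row is None:
        return None
    return {"check_id": row[0], "site_label": row[1],
            "doc_entries": json.loads(row[2]),
            "banner_result": json.loads(row[3])}


def interpret_phase(snapshot: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical pure-logic step: works on stored data and never crawls."""
    banner = snapshot["banner_result"] or {}
    return {"doc_count": len(snapshot["doc_entries"]),
            "has_reject_button": bool(banner.get("reject_button"))}


def replay(conn: sqlite3.Connection,
           snapshot_id: str) -> Optional[Dict[str, Any]]:
    """Re-run the interpretation on a snapshot and bump replay_count."""
    snapshot = load_snapshot(conn, snapshot_id)
    if snapshot is None:
        return None
    conn.execute(
        "UPDATE compliance_check_snapshots "
        "SET replay_count = replay_count + 1 WHERE id = ?",
        (snapshot_id,),
    )
    conn.commit()
    return interpret_phase(snapshot)
```

Keeping `interpret_phase` free of network access is what would make the short test cycle possible: a logic change only needs `replay` on an existing snapshot, and the stored snapshots can double as fixed inputs for regression tests.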

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-21 08:53:31 +02:00
parent 603381a67f
commit cde670617e
5 changed files with 554 additions and 0 deletions
@@ -155,6 +155,53 @@ async def get_compliance_check_status(check_id: str):
)
# ── P80: Snapshot + Replay ───────────────────────────────────────────


@router.get("/snapshots")
async def list_snapshots(domain: str = "", limit: int = 20):
    """P80: list recent snapshots, optionally filtered by site_domain."""
    from database import SessionLocal
    from compliance.services.check_snapshot import list_snapshots_for_domain

    db = SessionLocal()
    try:
        if domain:
            return {"snapshots": list_snapshots_for_domain(db, domain, limit)}
        from sqlalchemy import text
        rows = db.execute(
            text("""
                SELECT id, check_id, site_domain, site_label, created_at,
                       replay_count, notes
                FROM compliance.compliance_check_snapshots
                ORDER BY created_at DESC
                LIMIT :lim
            """),
            {"lim": limit},
        ).fetchall()
        return {"snapshots": [
            {"id": str(r[0]), "check_id": r[1], "site_domain": r[2],
             "site_label": r[3], "created_at": str(r[4]),
             "replay_count": r[5], "notes": r[6]}
            for r in rows
        ]}
    finally:
        db.close()


@router.get("/snapshots/{snapshot_id}")
async def get_snapshot(snapshot_id: str):
    """P80: load full snapshot raw data."""
    from fastapi import HTTPException
    from database import SessionLocal
    from compliance.services.check_snapshot import load_snapshot

    db = SessionLocal()
    try:
        snap = load_snapshot(db, snapshot_id)
        if not snap:
            # Assuming a FastAPI router: a returned (body, 404) tuple would be
            # serialized with status 200, so raise to send a real 404.
            raise HTTPException(status_code=404, detail="snapshot not found")
        return snap
    finally:
        db.close()


async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
    """Background task: check all documents with business-profile context."""
    try:
@@ -1028,6 +1075,29 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
        _compliance_check_jobs[check_id]["progress"] = "Fertig"
        _compliance_check_jobs[check_id]["progress_pct"] = 100

        # P80: persist raw scan data so we can replay audit pipeline
        # without re-crawling (7min -> 5sec test cycle).
        try:
            from database import SessionLocal
            from compliance.services.check_snapshot import save_snapshot
            snap_db = SessionLocal()
            try:
                save_snapshot(
                    snap_db,
                    check_id=check_id,
                    doc_entries=doc_entries,
                    banner_result=banner_result,
                    profile=profile,
                    cmp_vendors=cmp_vendors,
                    scan_context=None,  # P79 will fill this
                    site_label=site_name,
                    notes=f"recipient={req.recipient}",
                )
            finally:
                snap_db.close()
        except Exception as snap_err:
            logger.warning("P80 snapshot save skipped: %s", snap_err)

        # Persist to sidecar SQLite audit log — enables /audit endpoints
        # (A5 admin tab) and trend view (A6). Best-effort; failures here
        # do not affect the user-facing response.