feat(audit-pipeline): P72 MC-Scope-Classifier + P80 Snapshot/Replay-Foundation [migration-approved]
CI / detect-changes (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 14s
CI / loc-budget (push) Failing after 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped

P72  MC-Scope-Classifier: assign each MC the document type it ACTUALLY addresses
     (cookie_richtlinie/dse/banner_implementation/cmp_audit/tom/avv/jc/
      impressum/agb/widerruf/process/accounting/other).
     - Migration 145: scope_doc_type column + index on canonical_controls
     - Backfill script with a regex heuristic (12 rules, sorted by priority)
     - First 11k-sample distribution: 76% other (heuristic v1 is too strict;
       v2 needs looser patterns for DSE/TOM)
     - Goal: before the MC scorecard filters, every MC knows which document
       it addresses. Until now, eHealth/HGB MCs ended up in the cookie audit.
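A minimal sketch of a priority-sorted regex heuristic like the one the backfill script describes. The three patterns, the `RULES` list and the function name `classify_scope` are hypothetical; the actual script has 12 rules that are not shown in this commit. Rules are checked in order, the first match wins, and `other` is the fallback.

```python
import re

# Hypothetical rules: (scope_doc_type, pattern), listed in priority order.
# These patterns are illustrative only and are NOT the ones from the backfill script.
RULES = [
    ("banner_implementation", re.compile(r"pre-?ticked|checkbox|banner", re.I)),
    ("cookie_richtlinie", re.compile(r"cookie", re.I)),
    ("tom", re.compile(r"backup|encrypt", re.I)),
]


def classify_scope(control_text: str) -> str:
    """Return the scope_doc_type of the first matching rule, else 'other'."""
    for doc_type, pattern in RULES:
        if pattern.search(control_text):
            return doc_type
    return "other"
```

Because the banner rule is listed first, a text that mentions both a banner and cookies is classified as `banner_implementation`. Texts that match no rule fall through to `other`, which is consistent with overly strict patterns inflating the share of `other`.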

P80  Snapshot + Replay-Foundation: persist the raw data so that the
     audit pipeline can be rebuilt without a new crawl.
     - Migration 146: compliance_check_snapshots table (one JSONB column each
       for doc_entries/banner_result/profile/cmp_vendors/scan_context)
     - services.check_snapshot.save_snapshot/load_snapshot/list
     - Endpoints GET /snapshots, GET /snapshots/{id}
     - Hook in _run_compliance_check: after the mail is sent, the snapshot is
       saved automatically via a separate SessionLocal (background-task safe)
     - Replay endpoint follows in the next PR (requires refactoring
       _run_compliance_check into crawl_phase + interpret_phase)
     - Effect: test cycle 7min -> 5sec for pure logic changes
       (P73/P79/P81+ benefit directly). Snapshots also serve as a
       regression-test corpus (P81 Golden-Truth-Library).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-21 08:53:31 +02:00
parent 603381a67f
commit cde670617e
5 changed files with 554 additions and 0 deletions
@@ -0,0 +1,52 @@
-- P72: scope_doc_type for canonical_controls
--
-- Makes it possible to distinguish which document type an MC actually
-- addresses. Until now every MC ended up in every doc audit, which causes
-- noise (e.g. "electronic health data transmission" shows up in the
-- cookie-policy audit of a car manufacturer).
--
-- Values:
--   cookie_richtlinie     : mandatory content of the cookie policy per DSK-OH 2024
--   dse                   : mandatory content of the privacy policy, Art. 13/14 GDPR
--   banner_implementation : banner UI requirements (not text),
--                           e.g. "no pre-ticked checkboxes"
--   cmp_audit             : consent management platform audit trail,
--                           e.g. "store every consent with a timestamp"
--   tom                   : technical and organisational measures,
--                           e.g. "encrypted backups"
--   avv                   : content of the data processing agreement
--   jc                    : joint controller agreement, Art. 26 GDPR
--   impressum             : §5 DDG (formerly §5 TMG) / §18 MStV
--   agb                   : general terms and conditions
--   widerruf              : withdrawal notice
--   process               : process requirement (not text-based,
--                           cannot be met by inserting text)
--   accounting            : invoicing (UStG, HGB), not compliance
--   other                 : fits none of the categories (fallback)
--
-- NULL = not yet classified (the backfill script sets the value).
DO $$
BEGIN
    -- Only touch the table if it already exists in the compliance schema.
    IF EXISTS (
        SELECT 1 FROM information_schema.tables
        WHERE table_name = 'canonical_controls'
          AND table_schema = 'compliance'
    ) THEN
        ALTER TABLE compliance.canonical_controls
            ADD COLUMN IF NOT EXISTS scope_doc_type VARCHAR(40) DEFAULT NULL
            CHECK (scope_doc_type IS NULL OR scope_doc_type IN (
                'cookie_richtlinie', 'dse', 'banner_implementation',
                'cmp_audit', 'tom', 'avv', 'jc',
                'impressum', 'agb', 'widerruf',
                'process', 'accounting', 'other'
            ));
        CREATE INDEX IF NOT EXISTS idx_cc_scope_doc_type
            ON compliance.canonical_controls(scope_doc_type);
        COMMENT ON COLUMN compliance.canonical_controls.scope_doc_type IS
            'P72: addressed doc type. NULL = not classified. Show findings only '
            'for the matching doc type, otherwise noise.';
    END IF;
END $$;
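A small sketch of how findings could be filtered by scope_doc_type once the column is populated. The `Control` type, the function `controls_for_doc` and the `include_unclassified` flag are hypothetical. Treating NULL (here `None`) as "show everywhere" is an assumption that keeps the previous behaviour for controls the backfill has not classified yet; the commit does not state how the scorecard handles NULL.

```python
from typing import Iterable, List, Optional, TypedDict

# Mirrors the values allowed by the CHECK constraint in migration 145.
ALLOWED_DOC_TYPES = {
    "cookie_richtlinie", "dse", "banner_implementation",
    "cmp_audit", "tom", "avv", "jc",
    "impressum", "agb", "widerruf",
    "process", "accounting", "other",
}


class Control(TypedDict):
    control_id: str
    scope_doc_type: Optional[str]  # None mirrors SQL NULL (not yet classified)


def controls_for_doc(
    controls: Iterable[Control],
    doc_type: str,
    include_unclassified: bool = True,
) -> List[Control]:
    """Keep controls that address doc_type; optionally keep unclassified ones."""
    if doc_type not in ALLOWED_DOC_TYPES:
        raise ValueError(f"unknown doc type: {doc_type}")
    selected = []
    for control in controls:
        scope = control["scope_doc_type"]
        if scope == doc_type or (scope is None and include_unclassified):
            selected.append(control)
    return selected
```

Passing `include_unclassified=False` gives the stricter behaviour, where only explicitly classified controls show up in a document's audit.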
@@ -0,0 +1,40 @@
-- P80: compliance-check snapshots for replay mode
--
-- Persists the raw data of a scan (DSE text, banner HTML, cookies,
-- CMP vendors, profile) so that the audit pipeline can re-run only the
-- interpretation logic (MC scorecard, mail render) without a new crawl.
-- Test cycle 7min -> 5-10sec for pure logic changes.
DO $$
BEGIN
    CREATE TABLE IF NOT EXISTS compliance.compliance_check_snapshots (
        id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        check_id        VARCHAR(36) NOT NULL,
        site_domain     VARCHAR(255) NOT NULL,
        site_label      VARCHAR(255),
        -- Raw data as JSONB (everything that does NOT change between runs)
        doc_entries     JSONB NOT NULL,  -- [{doc_type, url, full_text, cmp_payloads, ...}]
        banner_result   JSONB,           -- {phases, cookies_detailed, cmp_vendors, ...}
        profile         JSONB,           -- {business_type, industry, no_direct_sales, ...}
        scan_context    JSONB,           -- P79: user pre-scan fields
        cmp_vendors     JSONB,           -- vendor list (post-Phase G)
        -- Meta
        created_at      TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
        replay_count    INTEGER NOT NULL DEFAULT 0,
        last_replay_at  TIMESTAMP WITH TIME ZONE,
        notes           TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_snapshots_check_id
        ON compliance.compliance_check_snapshots(check_id);
    CREATE INDEX IF NOT EXISTS idx_snapshots_domain
        ON compliance.compliance_check_snapshots(site_domain);
    CREATE INDEX IF NOT EXISTS idx_snapshots_created
        ON compliance.compliance_check_snapshots(created_at DESC);
    COMMENT ON TABLE compliance.compliance_check_snapshots IS
        'P80 replay mode: persisted raw data of a scan. Allows the audit '
        'pipeline to be re-run without a new browser crawl.';
END $$;
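A self-contained sketch of the save / load round trip against a table shaped like the one above. It deliberately swaps in SQLite from the Python standard library for PostgreSQL, so JSONB becomes JSON text and the UUID is generated in Python. The functions `open_store`, `save_snapshot` and `load_for_replay` are hypothetical and do not reflect the signatures in services.check_snapshot. Incrementing replay_count on every load is also an assumption about how the meta columns are meant to be used.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Any, List


def open_store() -> sqlite3.Connection:
    """Create an in-memory table with a subset of the snapshot columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE snapshots (
               id TEXT PRIMARY KEY,
               check_id TEXT NOT NULL,
               site_domain TEXT NOT NULL,
               doc_entries TEXT NOT NULL,
               replay_count INTEGER NOT NULL DEFAULT 0,
               last_replay_at TEXT
           )"""
    )
    return conn


def save_snapshot(conn: sqlite3.Connection, check_id: str,
                  site_domain: str, doc_entries: List[Any]) -> str:
    """Serialize the raw data to JSON and insert one snapshot row."""
    snapshot_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO snapshots (id, check_id, site_domain, doc_entries) "
        "VALUES (?, ?, ?, ?)",
        (snapshot_id, check_id, site_domain, json.dumps(doc_entries)),
    )
    return snapshot_id


def load_for_replay(conn: sqlite3.Connection, snapshot_id: str) -> List[Any]:
    """Return the stored doc_entries and record that a replay happened."""
    row = conn.execute(
        "SELECT doc_entries FROM snapshots WHERE id = ?", (snapshot_id,)
    ).fetchone()
    if row is None:
        raise KeyError(snapshot_id)
    conn.execute(
        "UPDATE snapshots SET replay_count = replay_count + 1, "
        "last_replay_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), snapshot_id),
    )
    return json.loads(row[0])
```

Keeping the raw data as JSON means the interpretation code receives the same structure on replay as it did right after the crawl, which is what makes the snapshots usable as a regression corpus.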