Benjamin Admin 0bad74a3bd docs: session handover — Block F complete, pipeline done, G-pre1 analysis

Session 03-05.05.2026:
- Block F1-F5 complete (DB migration of hardcoded dicts)
- Control Generation: 1,599 controls + 11,522 obligations + 1,147 atomics
- Production sync: 2,625 controls + 11,522 obligations synced
- G-pre1 analysis: 183k objects → 144k after normalize (needs hierarchical clustering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-05 18:02:10 +02:00

Session Instructions: G-pre1 Object Normalization

Date: 2026-05-05 For: next Claude session Repo: breakpilot-core (~/Projekte/breakpilot-core)

NEXT STEP: G-pre1 (Hierarchical Topic Clustering)

Analysis Result (2026-05-05)

Unique raw objects:          183,058
After normalize_object():    144,151 (only a 21% reduction)
Singletons:                  144,117 (99.98% are unique!)
Groups with 2+ members:      34

Finding: The problem is NOT "identical objects under different names" but "144k fine-grained objects that need to be rolled up into higher-level topics."
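
The four figures above can be derived from a plain list of raw object strings. The sketch below shows one way to do that counting; it is not the content of gpre1_analyze.py, and `toy_normalize` is a hypothetical stand-in for the real normalize_object().

```python
from collections import Counter
from typing import Callable, Iterable


def toy_normalize(obj: str) -> str:
    """Hypothetical stand-in for normalize_object(): lowercase, collapse whitespace."""
    return " ".join(obj.lower().split())


def group_stats(raw_objects: Iterable[str], normalize: Callable[[str], str]) -> dict:
    """Count unique raw objects, normalized forms, singletons and multi-member groups."""
    unique_raw = set(raw_objects)
    # How many distinct raw objects collapse onto each normalized form.
    members = Counter(normalize(obj) for obj in unique_raw)
    singletons = sum(1 for n in members.values() if n == 1)
    return {
        "unique_raw": len(unique_raw),
        "after_normalize": len(members),
        "singletons": singletons,
        "groups_2plus": len(members) - singletons,
    }


stats = group_stats(
    ["Password Policy", "password  policy", "MFA token", "Session timeout"],
    toy_normalize,
)
print(stats)
```

In this toy input only the two spellings of "password policy" merge, which mirrors the real result: normalization alone leaves almost every object as a singleton.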

New Approach: Hierarchical Topic Clustering

Instead of 1:1 synonym matching, we need:

Define a topic hierarchy (e.g. "Authentication & Access" → password, mfa, session, rbac)
Embedding-based assignment of each object to a topic
Qdrant-based (no full distance matrix needed in RAM)
Possibly sampling + Mini-Batch K-Means instead of DBSCAN
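
The embedding-based assignment step could look roughly like the following: each object vector is compared against one reference vector per topic and assigned to the most similar one. This is a minimal pure-Python sketch under stated assumptions. The 3-dimensional vectors are invented toy values, "Logging & Monitoring" is a hypothetical second topic, and in the real pipeline the nearest-topic search would presumably run inside Qdrant rather than in a Python loop.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topic(obj_vec: Vector, topic_vecs: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the best-matching topic and its similarity score."""
    return max(
        ((topic, cosine(obj_vec, vec)) for topic, vec in topic_vecs.items()),
        key=lambda pair: pair[1],
    )


# Toy 3-dimensional vectors (assumption); the real embeddings have 1024 dimensions.
topics = {
    "Authentication & Access": [0.9, 0.1, 0.0],
    "Logging & Monitoring": [0.0, 0.2, 0.9],  # hypothetical topic
}
topic, score = assign_topic([0.8, 0.3, 0.1], topics)
print(topic, round(score, 3))
```

A similarity threshold on `score` would be a natural addition, so that objects matching no topic well are set aside for review instead of being forced into the nearest one.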

Memory Problem

144k × 144k distance matrix = ~83 GB RAM → not feasible
Alternative: Qdrant nearest-neighbor search per object (O(n) queries instead of an O(n²) matrix)
Or: Mini-Batch K-Means with k=20,000 on the 144k × 1024 matrix (~600 MB, feasible)
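
The memory estimates above can be checked with a few lines of arithmetic. The sketch assumes dense matrices of 4-byte float32 values, an assumption that reproduces both quoted figures. The centroid line is an extra data point, not a number from the analysis.

```python
def matrix_bytes(rows: int, cols: int, bytes_per_value: int = 4) -> int:
    """Size of a dense rows x cols matrix; 4 bytes per value assumes float32."""
    return rows * cols * bytes_per_value


N_OBJECTS = 144_151  # objects after normalize_object()
EMBED_DIM = 1024     # embedding width from the estimate above

distance_gb = matrix_bytes(N_OBJECTS, N_OBJECTS) / 1e9
embedding_mb = matrix_bytes(N_OBJECTS, EMBED_DIM) / 1e6
centroid_mb = matrix_bytes(20_000, EMBED_DIM) / 1e6  # k=20,000 cluster centers

print(f"pairwise distances: {distance_gb:.1f} GB")
print(f"embeddings:         {embedding_mb:.0f} MB")
print(f"k-means centroids:  {centroid_mb:.0f} MB")
```

The pairwise matrix grows quadratically with the object count, while the embedding matrix and the centroids grow linearly, which is why the clustering route fits into RAM and the full distance matrix does not.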

Analysis Script Available

control-pipeline/scripts/gpre1_analyze.py (local only, not committed)

SESSION 2026-05-03 TO 2026-05-05: DONE

Block F (Hardcoded Knowledge → DB): COMPLETE ✅

F1: regulation_registry (223 entries)
F2: action_types (34) + action_synonyms (368)
F3: object_synonyms (320)
F4: LLM enrichment (+468 new synonyms via Ollama)
F5: Validation (8 tests) + dicts retained as fallback
454 pipeline tests pass, 0 regressions
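
The "dicts retained as fallback" arrangement from F5 can be pictured as a DB-first lookup that falls back to the hardcoded dict. The following is a hypothetical sketch of that pattern, not the pipeline's actual code: the function name, the dict contents and the in-memory `db_rows` mapping are all invented for illustration.

```python
from typing import Mapping, Optional

# Hypothetical hardcoded dict kept in the code as a fallback.
FALLBACK_OBJECT_SYNONYMS = {"passwort": "password", "2fa": "mfa"}


def resolve_synonym(
    term: str,
    db_synonyms: Optional[Mapping[str, str]],
    fallback: Mapping[str, str] = FALLBACK_OBJECT_SYNONYMS,
) -> str:
    """Prefer the DB-loaded mapping; use the dict if the DB is unavailable or has no entry."""
    key = term.strip().lower()
    if db_synonyms is not None and key in db_synonyms:
        return db_synonyms[key]
    # Terms unknown to both sources pass through unchanged (lowercased).
    return fallback.get(key, key)


db_rows = {"kennwort": "password"}  # stand-in for rows loaded from object_synonyms

print(resolve_synonym("Kennwort", db_rows))  # found in the DB mapping
print(resolve_synonym("2FA", db_rows))       # not in the DB, found in the fallback dict
print(resolve_synonym("2fa", None))          # DB unavailable, fallback dict only
```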

Control Generation Pipeline: COMPLETE ✅

1,599 rich controls generated from E-block chunks (~$17 Anthropic)
11,522 obligations extracted (Pass 0a, ~$4 Anthropic)
1,147 atomic controls composed (Pass 0b, ~$4.60 Anthropic)
Total cost: ~$25.60

Production Sync: COMPLETE ✅

2,625 new controls synced to production (ON CONFLICT DO NOTHING)
11,522 obligations synced to production
Production: 294,027 controls total (previously 291,402)
Backups on the MacBook: compressed (30 MB) + plain SQL (1.3 GB)
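
ON CONFLICT DO NOTHING makes the sync idempotent: rows whose key already exists on the target are skipped rather than overwritten, so a repeated run adds nothing. The sketch below demonstrates that behavior with an in-memory SQLite database as a stand-in for the production database. The `controls` table and its columns are hypothetical, and SQLite 3.24 or newer is assumed for the upsert syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE controls (control_id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO controls VALUES ('C-1', 'existing control')")


def sync_controls(conn: sqlite3.Connection, rows: list) -> int:
    """Insert rows, skipping existing keys; return how many rows were actually added."""
    before = conn.execute("SELECT COUNT(*) FROM controls").fetchone()[0]
    conn.executemany(
        "INSERT INTO controls (control_id, title) VALUES (?, ?) "
        "ON CONFLICT (control_id) DO NOTHING",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM controls").fetchone()[0]
    return after - before


batch = [("C-1", "would overwrite"), ("C-2", "new control")]
added_first = sync_controls(conn, batch)
added_again = sync_controls(conn, batch)  # re-running the sync adds nothing
print(added_first, added_again)
```

Comparing the row count before and after, as the totals above do, is a simple way to confirm how many rows a run really added.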

Infrastructure

Vault CPU fix committed (marker file + idempotent checks)
Pass 0a endpoint registered in the Core control pipeline
61 new regulation_ids inserted into regulation_registry
Containers bp-core-vault, bp-lehrer-opensearch, fewo-finance-agent stopped (CPU saver)

DB Tables (all blocks)

Table	Rows	Migration
compliance.regulation_registry	223	002_regulation_registry.sql
compliance.action_types	34	003_action_object_ontology.sql
compliance.action_synonyms	368	003_action_object_ontology.sql
compliance.object_synonyms	320	003_action_object_ontology.sql

STOPPED CONTAINERS (restart when needed)

ssh macmini "/usr/local/bin/docker start bp-core-vault bp-lehrer-opensearch"
# fewo-finance-agent: third-party container, do not start

Vault: Start it only after the fix (marker file) has been deployed, otherwise it goes into a CPU loop.

TESTS

# Pipeline (454 tests)
PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v

API Access (IMPORTANT)

Control pipeline: reachable only via docker exec (port 8098 is blocked by document-crawler)

ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline curl -sf http://127.0.0.1:8098/..."

Compliance backend: points at the PRODUCTION DB (not local!)
Pass 0a endpoint: /v1/canonical/generate/run-pass0a (on the Core pipeline, port 8098)
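
Because the pipeline is reachable only through docker exec, a small helper that assembles the ssh command can save retyping it. The sketch below only builds the argument list and does not run anything. The helper name is hypothetical, and the HTTP method and payload that the Pass 0a endpoint expects are not specified here, so the path is used purely as an example.

```python
import shlex
from typing import List

SSH_HOST = "macmini"
DOCKER = "/usr/local/bin/docker"
CONTAINER = "bp-core-control-pipeline"
BASE_URL = "http://127.0.0.1:8098"


def pipeline_curl_command(path: str) -> List[str]:
    """Build the ssh argv that runs curl inside the pipeline container."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    # The remote side receives one shell string, so the URL is quoted defensively.
    remote = " ".join(
        [DOCKER, "exec", CONTAINER, "curl", "-sf", shlex.quote(BASE_URL + path)]
    )
    return ["ssh", SSH_HOST, remote]


cmd = pipeline_curl_command("/v1/canonical/generate/run-pass0a")
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)` when the call should actually be made.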
