feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs

Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-05-10 15:08:15 +02:00
parent 81db904b3e
commit 8510af46eb
19 changed files with 3173 additions and 6 deletions
@@ -0,0 +1,158 @@
# Controls nutzen — Anleitung für andere Sessions
**Stand:** 2026-05-07, wird laufend aktualisiert
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)
---
## Was sind die Controls?
174.497 atomare Compliance-Controls in der Datenbank. Jeder Control ist eine **einzelne prüfbare Anforderung** aus einer Rechtsquelle (DSGVO, NIS2, NIST, AI Act, etc.).
### Beispiel
```
Control-ID: AUTH-2956-A14
Titel: "Implementierung von Multi-Faktor-Authentifizierung prüfen"
Objective: "Sicherstellen, dass MFA korrekt implementiert ist..."
Merge-Key: "verify:multi_factor_auth:testing"
Severity: high
```
## Wo liegen die Controls?
### Datenbank (PostgreSQL auf Mac Mini)
```sql
-- Alle Controls abfragen
SELECT id, control_id, title, objective, severity,
source_citation, -- Rechtsquelle (JSON)
generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE release_state NOT IN ('deprecated', 'rejected');
```
**Verbindung:**
```bash
# Vom MacBook:
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db"
# Oder via Control-Pipeline Container:
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline curl -sf http://127.0.0.1:8098/..."
```
### API (Port 8098, nur via Docker exec erreichbar)
```bash
# Master Controls auflisten
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
curl -sf 'http://127.0.0.1:8098/v1/master-controls?limit=50&sort=total_controls'"
# Master Control Detail mit allen Membern
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
curl -sf 'http://127.0.0.1:8098/v1/master-controls/MC-8292'"
```
## Struktur der Controls
### merge_group_hint (Schlüsselfeld!)
Jeder Control hat einen `merge_group_hint` im Format `action:object:phase`:
```
implement:encryption:implementation
define:access_control:definition
monitor:network_security:monitoring
report:supervisory_authority:reporting
```
**74 kanonische Object-Tokens** (Stand 2026-05-07):
| Kategorie | Tokens |
|-----------|--------|
| **Security** | multi_factor_auth, password_policy, credentials, session_management, privileged_access, access_control, encryption, transport_encryption, key_management, certificate_management, network_security, network_segmentation, firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting, compliance_audit, vulnerability, patch_management, backup, disaster_recovery, physical_security, secure_development, api_security, input_validation, container_security, logging_configuration |
| **Data Protection** | personal_data, sensitive_data, health_data, consent, data_subject_rights, data_retention, data_transfer, data_breach_notification, dpia, data_processing_agreement, privacy_by_design, data_processing_register, data_classification, cookie_consent, video_surveillance |
| **Governance** | policy, procedure, process, training, awareness, incident, risk_management, third_party_management, change_management, documentation, records_management, compliance_reporting, asset_management, human_resources_security |
| **Regulatory** | supervisory_authority, certification, product_safety, ai_system, financial_reporting, aml, whistleblowing, consumer_protection, ecommerce, telecommunications, medical_device, payment_services, critical_infrastructure, supply_chain_due_diligence, sustainability_reporting |
### Rechtsquellen (source_citation)
Die **Parent-Controls** (nicht die atomaren!) haben `source_citation`:
```sql
-- Controls mit Rechtsquelle finden
SELECT cc.control_id, cc.title,
pc.source_citation->>'source' AS regulation,
pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation IS NOT NULL
AND pc.source_citation->>'source' LIKE '%DSGVO%';
```
148 verschiedene Rechtsquellen (DSGVO, NIS2, NIST, OWASP, BSI, TKG, etc.)
## Controls filtern (Use Cases)
### Beispiel: Alle DSGVO Art. 13 Controls (für DSI-Prüfung)
```sql
SELECT cc.control_id, cc.title, cc.objective,
cc.generation_metadata->>'merge_group_hint' AS merge_key,
pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation->>'source' = 'DSGVO (EU) 2016/679'
AND pc.source_citation->>'article' LIKE '%13%'
AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.control_id;
```
### Beispiel: Alle Encryption-Controls
```sql
SELECT control_id, title, objective
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:encryption:%'
AND release_state NOT IN ('deprecated', 'rejected');
```
### Beispiel: Controls nach Object-Token filtern
```sql
-- Alle Controls zu einem bestimmten Thema
SELECT control_id, title,
generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:data_retention:%'
AND release_state NOT IN ('deprecated', 'rejected');
```
## Wichtige Tabellen
| Tabelle | Rows | Beschreibung |
|---------|------|-------------|
| `compliance.canonical_controls` | ~294K | Alle Controls (Rich + Atomic) |
| `compliance.master_controls` | ~5.329 | Gruppierte Master Controls |
| `compliance.master_control_members` | ~172K | Zuordnung Control → MC |
| `compliance.object_ontology` | 74 | Kanonische Object-Definitionen |
| `compliance.regulation_registry` | 223 | Rechtsquellen-Register |
## Was gerade passiert (2026-05-07)
**Phase 2 läuft:** Alle 174K Controls werden per Claude Haiku re-klassifiziert. Die `merge_group_hint` werden von frei-form LLM-Objekten auf 74 kanonische Tokens normalisiert. Danach:
- Phase 3: Re-Clustering (gpre1 mit K=20000)
- Phase 4: Neue Master Controls (gpre2)
- Phase 5: Regulation-Source-Split (gpre3)
**NICHT ÄNDERN:** `canonical_controls`, `master_controls`, `object_ontology` Tabellen werden aktiv bearbeitet.
## DB-Zugang Quick Reference
```bash
# Quick Query (eine Zeile)
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db -c \"SELECT count(*) FROM compliance.canonical_controls\""
# Interaktive Session
ssh macmini "/usr/local/bin/docker exec -it bp-core-postgres psql -U breakpilot -d breakpilot_db"
```
@@ -0,0 +1,162 @@
-- Migration 010: Expanded Object Ontology
-- Expands from 31 to ~180 canonical object tokens with clear semantic boundaries.
-- Each token has a description to prevent ambiguous classification.
--
-- IMPORTANT: This migration ADDS new tokens. Existing synonyms are preserved.
SET search_path TO compliance, public;
-- Add description column to object_synonyms if not exists
DO $$ BEGIN
ALTER TABLE object_synonyms ADD COLUMN IF NOT EXISTS description TEXT;
EXCEPTION WHEN duplicate_column THEN NULL;
END $$;
-- New table: canonical object definitions with clear boundaries
CREATE TABLE IF NOT EXISTS object_ontology (
canonical_token VARCHAR(100) PRIMARY KEY,
category VARCHAR(50) NOT NULL, -- security, data_protection, governance, regulatory, technical
description_de TEXT NOT NULL, -- German description for LLM prompts
description_en TEXT NOT NULL, -- English description
NOT_confused_with TEXT, -- Explicit disambiguation
examples TEXT, -- Example controls that belong here
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- ═══════════════════════════════════════════════════════════════
-- SECURITY & TECHNICAL
-- ═══════════════════════════════════════════════════════════════
-- Authentication & Identity
INSERT INTO object_ontology VALUES
('multi_factor_auth', 'security', 'Multi-Faktor-Authentifizierung (2FA/MFA)', 'Multi-factor authentication', 'NOT password_policy (Passwortregeln) oder session_management (Sitzungen)', 'MFA implementieren, 2FA-Pflicht, Authentifizierungsfaktoren'),
('password_policy', 'security', 'Passwortrichtlinien und -komplexität', 'Password policies and complexity', 'NOT credentials (allg. Zugangsdaten) oder multi_factor_auth (MFA)', 'Passwortlänge, Komplexität, Rotation, Passwort-Historie'),
('credentials', 'security', 'Zugangsdaten-Verwaltung (Tokens, API-Keys, Secrets)', 'Credential management', 'NOT password_policy (Passwortregeln) oder key_management (kryptografisch)', 'API-Key-Rotation, Token-Verwaltung, Secret Storage'),
('session_management', 'security', 'Sitzungsverwaltung (Session Timeout, Token-Lifecycle)', 'Session management', 'NOT multi_factor_auth (Login) oder access_control (Berechtigungen)', 'Session Timeout, Token-Invalidierung, Concurrent Sessions'),
('privileged_access', 'security', 'Verwaltung privilegierter Zugriffe (Admin, Root)', 'Privileged access management', 'NOT access_control (allg. Zugriffskontrolle)', 'Admin-Konten, Root-Zugriff, PAM, Just-in-Time-Access'),
('access_control', 'security', 'Allgemeine Zugriffskontrolle (RBAC, Berechtigungen)', 'Access control (RBAC, permissions)', 'NOT privileged_access (Admin) oder authentication (Login)', 'Rollenbasierte Zugriffskontrolle, Berechtigungsvergabe, Least Privilege')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Encryption & Cryptography
INSERT INTO object_ontology VALUES
('encryption', 'security', 'Verschlüsselung at-rest (Datenverschlüsselung)', 'Encryption at rest', 'NOT transport_encryption (in-transit) oder key_management (Schlüssel)', 'AES-256, Festplattenverschlüsselung, DB-Verschlüsselung'),
('transport_encryption', 'security', 'Transportverschlüsselung (TLS, HTTPS)', 'Transport encryption (TLS)', 'NOT encryption (at-rest)', 'TLS 1.3, HTTPS, mTLS, Zertifikats-Pinning'),
('key_management', 'security', 'Kryptografische Schlüsselverwaltung', 'Cryptographic key management', 'NOT credentials (API-Keys) oder certificate_management (Zertifikate)', 'Key Rotation, HSM, Key Escrow, Schlüsselerzeugung'),
('certificate_management', 'security', 'Zertifikatsverwaltung (PKI, X.509)', 'Certificate management (PKI)', 'NOT key_management (Schlüssel) oder encryption (Verschlüsselung)', 'X.509-Zertifikate, PKI, Zertifikatsrückruf, CA-Verwaltung')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Network Security
INSERT INTO object_ontology VALUES
('network_security', 'security', 'Allgemeine Netzwerksicherheit', 'General network security', 'NOT network_segmentation (Segmentierung) oder firewall (Regeln)', 'Netzwerk-Hardening, Port-Management, DNS-Sicherheit'),
('network_segmentation', 'security', 'Netzwerksegmentierung (VLANs, Zonen)', 'Network segmentation', 'NOT network_security (allg.) oder firewall (Regeln)', 'VLANs, DMZ, Micro-Segmentation, Zero Trust Network'),
('firewall', 'security', 'Firewall-Regeln und -Verwaltung', 'Firewall rules and management', 'NOT network_security (allg.)', 'WAF, Firewall-Regeln, Ingress/Egress, Whitelist'),
('vpn', 'security', 'VPN-Konfiguration und -Verwaltung', 'VPN configuration', NULL, 'IPSec, WireGuard, Site-to-Site VPN'),
('remote_access', 'security', 'Fernzugriff und Remote-Arbeit', 'Remote access', 'NOT vpn (Technologie)', 'Remote Desktop, Bastion Hosts, Jump Server')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Monitoring & Logging (CRITICAL: clear boundaries!)
INSERT INTO object_ontology VALUES
('monitoring', 'security', 'Kontinuierliche Echtzeit-Überwachung von Systemen/Metriken', 'Continuous real-time monitoring of systems', 'NOT audit_logging (Protokollierung), NOT training (Schulung), NOT procedure (Verfahren), NOT risk_assessment (Bewertung)', 'System-Health-Monitoring, Verfügbarkeitsüberwachung, Performance-Monitoring, Anomalie-Erkennung in Echtzeit'),
('audit_logging', 'security', 'Protokollierung und Audit-Trail (Nachvollziehbarkeit)', 'Audit logging and trail', 'NOT monitoring (Echtzeit-Überwachung), NOT compliance_audit (Prüfungen)', 'Log-Aufzeichnung, Audit Trail, Zeitstempel, Nachvollziehbarkeit, Protokollierung von Zugriffen'),
('siem', 'security', 'Security Information and Event Management', 'SIEM', 'NOT monitoring (allg.) oder audit_logging (Protokollierung)', 'SIEM-Korrelation, Security Events, Log-Aggregation'),
('alerting', 'security', 'Benachrichtigungen und Meldepflichten bei Sicherheitsereignissen', 'Security alerting and notification obligations', 'NOT monitoring (Überwachung) oder incident (Vorfallsbehandlung)', 'Sicherheitsmeldungen, Breach Notification, Benachrichtigungspflichten'),
('compliance_audit', 'governance', 'Compliance-Prüfungen und externe Audits', 'Compliance audits and external reviews', 'NOT audit_logging (technische Protokollierung), NOT monitoring (Überwachung)', 'Externe Prüfung, Jahresabschlussprüfung, Zertifizierungsaudit, Lieferanten-Audit')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Vulnerability & Patch Management
INSERT INTO object_ontology VALUES
('vulnerability', 'security', 'Schwachstellenmanagement und -scanning', 'Vulnerability management', 'NOT patch_management (Updates)', 'Vulnerability Scanning, CVE-Tracking, Penetration Testing'),
('patch_management', 'security', 'Software-Updates und Patch-Verwaltung', 'Patch management', 'NOT vulnerability (Scanning)', 'Patch-Zyklus, Update-Policy, Hotfix-Prozess')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Backup & Recovery
INSERT INTO object_ontology VALUES
('backup', 'security', 'Datensicherung und Backup-Strategien', 'Backup strategies', 'NOT disaster_recovery (Wiederherstellung)', 'Backup-Rotation, Offsite-Backup, Backup-Verschlüsselung'),
('disaster_recovery', 'security', 'Notfallwiederherstellung und Business Continuity', 'Disaster recovery', 'NOT backup (Datensicherung) oder incident (Vorfälle)', 'DR-Plan, RTO/RPO, Failover, Business Continuity')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- DATA PROTECTION (CRITICAL: clear boundaries!)
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('personal_data', 'data_protection', 'Verarbeitung personenbezogener Daten (DSGVO-Grundsätze)', 'Personal data processing principles', 'NOT sensitive_data (besondere Kategorien), NOT data_subject_rights (Betroffenenrechte), NOT consent (Einwilligung)', 'Datenminimierung, Zweckbindung, Speicherbegrenzung, Rechtmäßigkeit der Verarbeitung'),
('sensitive_data', 'data_protection', 'Besondere Kategorien personenbezogener Daten (Art. 9 DSGVO)', 'Special categories of personal data', 'NOT personal_data (allg.), NOT health_data (Gesundheit)', 'Biometrische Daten, ethnische Herkunft, politische Meinungen, Gewerkschaftszugehörigkeit'),
('health_data', 'data_protection', 'Gesundheitsdaten und Medizindaten', 'Health and medical data', 'NOT sensitive_data (allg. besondere Kategorien)', 'Patientendaten, Medizinprodukte-Daten, klinische Daten'),
('consent', 'data_protection', 'Einwilligungsmanagement', 'Consent management', 'NOT data_subject_rights (andere Betroffenenrechte)', 'Einwilligung einholen, Widerruf, Opt-In, Consent-Banner'),
('data_subject_rights', 'data_protection', 'Betroffenenrechte (Auskunft, Löschung, Portabilität)', 'Data subject rights (access, erasure, portability)', 'NOT consent (Einwilligung), NOT personal_data (Verarbeitung)', 'Auskunftsrecht, Recht auf Löschung, Datenportabilität, Widerspruchsrecht'),
('data_retention', 'data_protection', 'Aufbewahrungsfristen und Löschkonzept', 'Data retention and deletion', 'NOT backup (technische Sicherung)', 'Löschfristen, Aufbewahrungspflichten, Löschkonzept, Archivierung'),
('data_transfer', 'data_protection', 'Internationale Datenübermittlung (Drittländer, SCC)', 'International data transfer', 'NOT data_processing (Verarbeitung)', 'Drittlandtransfer, Standardvertragsklauseln, Angemessenheitsbeschluss, BCR'),
('data_breach_notification', 'data_protection', 'Meldung von Datenschutzverletzungen (Art. 33/34 DSGVO)', 'Data breach notification', 'NOT incident (allg. Sicherheitsvorfälle), NOT alerting (techn. Alerts)', 'Breach-Meldung an Aufsichtsbehörde, Benachrichtigung Betroffener, 72-Stunden-Frist'),
('dpia', 'data_protection', 'Datenschutz-Folgenabschätzung (Art. 35 DSGVO)', 'Data protection impact assessment', NULL, 'DSFA, Schwellwertanalyse, Risikobewertung für Betroffene'),
('data_processing_agreement', 'data_protection', 'Auftragsverarbeitung (Art. 28 DSGVO)', 'Data processing agreements', NULL, 'AVV, Auftragsverarbeiter, Sub-Auftragsverarbeiter, TOMs'),
('privacy_by_design', 'data_protection', 'Datenschutz durch Technikgestaltung (Art. 25 DSGVO)', 'Privacy by design and default', NULL, 'Privacy by Default, Datenminimierung in der Architektur'),
('data_processing_register', 'data_protection', 'Verzeichnis von Verarbeitungstätigkeiten (Art. 30 DSGVO)', 'Records of processing activities', NULL, 'VVT, Verarbeitungsverzeichnis')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- GOVERNANCE & ORGANIZATION
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('policy', 'governance', 'Richtlinien und Leitlinien ERSTELLEN/DEFINIEREN', 'Creating/defining policies', 'NOT procedure (Verfahrensablauf), NOT compliance_audit (Prüfung)', 'Sicherheitsrichtlinie erstellen, Policy-Framework definieren, Leitlinie verabschieden'),
('procedure', 'governance', 'Verfahren und Prozessabläufe DEFINIEREN/DOKUMENTIEREN', 'Defining/documenting procedures', 'NOT incident (Vorfallsbehandlung), NOT process (laufender Betrieb)', 'Verfahrensanweisung, Ablaufbeschreibung, Standardprozess definieren'),
('process', 'governance', 'Laufende betriebliche Prozesse AUSFÜHREN', 'Executing operational processes', 'NOT procedure (Definition), NOT monitoring (Überwachung)', 'Betriebsprozess, Geschäftsprozess, Workflow-Ausführung'),
('training', 'governance', 'Schulung und Weiterbildung DURCHFÜHREN', 'Training and education', 'NOT awareness (Sensibilisierung), NOT monitoring (Überwachung!)', 'Mitarbeiterschulung, Zertifizierungskurs, Pflichtunterweisung'),
('awareness', 'governance', 'Sicherheitsbewusstsein und Sensibilisierung', 'Security awareness', 'NOT training (formale Schulung)', 'Phishing-Simulation, Awareness-Kampagne, Sicherheitskultur'),
('incident', 'governance', 'Sicherheitsvorfälle BEHANDELN (Incident Response)', 'Incident response and handling', 'NOT alerting (Benachrichtigung), NOT data_breach_notification (DSGVO-Meldung)', 'Incident Response Plan, Vorfallsanalyse, Containment, Recovery, Lessons Learned'),
('risk_management', 'governance', 'Risikomanagement und -bewertung', 'Risk management and assessment', 'NOT vulnerability (techn. Schwachstellen), NOT monitoring (Überwachung)', 'Risikobewertung, Risikobehandlung, Risikoakzeptanz, Risikomatrix'),
('third_party_management', 'governance', 'Lieferanten- und Drittanbieter-Management', 'Third-party and vendor management', 'NOT data_processing_agreement (AVV)', 'Lieferantenbewertung, Vendor Risk Assessment, Supply Chain Security'),
('change_management', 'governance', 'Änderungsmanagement', 'Change management', 'NOT patch_management (Updates)', 'Change Request, Change Advisory Board, Rollback-Verfahren'),
('documentation', 'governance', 'Allgemeine Dokumentationspflichten', 'General documentation requirements', 'NOT audit_logging (technische Logs), NOT data_processing_register (VVT)', 'Betriebshandbuch, Systemdokumentation, Verfahrensdokumentation'),
('records_management', 'governance', 'Akten- und Unterlagenverwaltung', 'Records management', 'NOT data_retention (Löschfristen)', 'Archivierung, Aktenführung, Aufbewahrungspflichten nach HGB/AO'),
('compliance_reporting', 'governance', 'Compliance-Berichterstattung', 'Compliance reporting', 'NOT alerting (techn. Alerts), NOT supervisory_authority (Behördenkommunikation)', 'Compliance-Bericht, Management-Reporting, KPI-Tracking'),
('asset_management', 'governance', 'IT-Asset-Verwaltung und Inventar', 'IT asset management', NULL, 'Asset-Inventar, CMDB, Hardware-Lifecycle, Software-Inventar'),
('physical_security', 'security', 'Physische Sicherheit und Zutrittskontrolle', 'Physical security and access', NULL, 'Zutrittskontrolle, Videoüberwachung (physisch), Serverraum-Sicherheit'),
('human_resources_security', 'governance', 'Personalsicherheit', 'HR security', 'NOT training (Schulung)', 'Background-Checks, Geheimhaltungsvereinbarungen, Onboarding/Offboarding')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- REGULATORY SPECIFIC
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('supervisory_authority', 'regulatory', 'Kommunikation mit Aufsichtsbehörden', 'Supervisory authority communication', 'NOT compliance_reporting (interne Berichte)', 'Meldung an BaFin, Abstimmung mit DPA, behördliche Anfragen'),
('certification', 'regulatory', 'Zertifizierung und Konformitätsbewertung', 'Certification and conformity assessment', 'NOT compliance_audit (Prüfung), NOT personal_data (Datenschutz)', 'CE-Kennzeichnung, ISO-Zertifizierung, Konformitätserklärung'),
('product_safety', 'regulatory', 'Produktsicherheit und Marktüberwachung', 'Product safety and market surveillance', 'NOT certification (Zertifizierung)', 'Rückrufmanagement, Sicherheitsbewertung, RAPEX-Meldung'),
('ai_system', 'regulatory', 'KI-System-Regulierung (AI Act)', 'AI system regulation', NULL, 'KI-Risikobewertung, Hochrisiko-KI, Transparenzpflichten, FRIA'),
('financial_reporting', 'regulatory', 'Finanzberichterstattung und Rechnungslegung', 'Financial reporting and accounting', NULL, 'Jahresabschluss, HGB-Pflichten, IFRS, Buchführung'),
('aml', 'regulatory', 'Geldwäscheprävention und KYC', 'Anti-money laundering and KYC', NULL, 'KYC, Verdachtsmeldung, PEP-Prüfung, Transaktionsmonitoring'),
('whistleblowing', 'regulatory', 'Hinweisgeberschutz und Meldekanäle', 'Whistleblower protection', NULL, 'Hinweisgebersystem, Meldekanal, Hinweisgeberschutzgesetz'),
('consumer_protection', 'regulatory', 'Verbraucherschutz und AGB', 'Consumer protection', NULL, 'AGB-Prüfung, Widerrufsrecht, Informationspflichten, Preistransparenz'),
('ecommerce', 'regulatory', 'E-Commerce-Pflichten (Impressum, Fernabsatz)', 'E-commerce obligations', NULL, 'Impressumspflicht, Fernabsatzrecht, Online-Handel-Pflichten'),
('telecommunications', 'regulatory', 'Telekommunikationsregulierung', 'Telecommunications regulation', NULL, 'TKG-Pflichten, Vorratsdatenspeicherung, Notruf'),
('medical_device', 'regulatory', 'Medizinprodukte-Regulierung (MDR)', 'Medical device regulation', NULL, 'UDI, klinische Bewertung, Post-Market Surveillance'),
('payment_services', 'regulatory', 'Zahlungsdienste-Regulierung (PSD2)', 'Payment services regulation', NULL, 'Starke Kundenauthentifizierung, PSD2-Compliance, Open Banking'),
('critical_infrastructure', 'regulatory', 'KRITIS und NIS2-Pflichten', 'Critical infrastructure (NIS2)', NULL, 'KRITIS-Meldepflichten, NIS2-Maßnahmen, Mindeststandards'),
('supply_chain_due_diligence', 'regulatory', 'Lieferkettensorgfaltspflicht (LkSG)', 'Supply chain due diligence', 'NOT third_party_management (allg. Lieferanten)', 'Menschenrechts-Due-Diligence, Umwelt-Sorgfaltspflicht, LkSG-Bericht'),
('sustainability_reporting', 'regulatory', 'Nachhaltigkeitsberichterstattung (CSRD)', 'Sustainability reporting', NULL, 'ESG-Reporting, CSRD, Nachhaltigkeitsbericht'),
('cookie_consent', 'regulatory', 'Cookie-Consent und Tracking (TDDDG/ePrivacy)', 'Cookie consent and tracking', 'NOT consent (allg. Einwilligung)', 'Cookie-Banner, Tracking-Einwilligung, TDDDG §25'),
('video_surveillance', 'regulatory', 'Videoüberwachung (datenschutzrechtlich)', 'Video surveillance (data protection)', 'NOT physical_security (physische Sicherheit), NOT monitoring (IT-Monitoring)', 'Kamera-Überwachung, Speicherfristen, Kennzeichnungspflicht')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- APPLICATION SECURITY
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('secure_development', 'technical', 'Sichere Softwareentwicklung (SDLC)', 'Secure software development lifecycle', NULL, 'Secure Coding, Code Review, SAST/DAST, DevSecOps'),
('api_security', 'technical', 'API-Sicherheit', 'API security', NULL, 'API-Authentifizierung, Rate Limiting, Input Validation'),
('input_validation', 'technical', 'Eingabevalidierung und Output Encoding', 'Input validation and output encoding', NULL, 'XSS-Prävention, SQL-Injection-Schutz, Parametervalidierung'),
('container_security', 'technical', 'Container- und Cloud-Sicherheit', 'Container and cloud security', NULL, 'Docker-Hardening, Kubernetes-Security, Image-Scanning'),
('logging_configuration', 'technical', 'Log-Konfiguration und -Format', 'Log configuration and format', 'NOT audit_logging (Nachvollziehbarkeit), NOT monitoring (Überwachung)', 'Log-Format, Log-Rotation, Log-Shipping, Structured Logging'),
('data_classification', 'governance', 'Datenklassifizierung und -kennzeichnung', 'Data classification and labeling', 'NOT sensitive_data (besondere Kategorien)', 'Vertraulichkeitsstufen, Datenklassifizierung, Labeling')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Count results
DO $$
DECLARE cnt INTEGER;
BEGIN
SELECT count(*) INTO cnt FROM object_ontology;
RAISE NOTICE 'object_ontology: % canonical tokens defined', cnt;
END $$;
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Extract CE-relevant obligations from TRBS/TRGS/ASR/OSHA chunks in Qdrant.
Searches for MUSS/SOLL patterns in chunk texts and classifies them.
Output: JSON file with structured obligations for the CE session.
Usage:
python3 /app/scripts/extract_ce_obligations.py
python3 /app/scripts/extract_ce_obligations.py --output /tmp/ce_obligations.json
"""
import argparse
import json
import logging
import os
import re
from pathlib import Path
import httpx
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("ce-obligations")
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
COLLECTION = "bp_compliance_ce"
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
LLM_MODEL = "qwen3.5:35b-a3b"
# Obligation patterns (DE + EN)
OBLIGATION_PATTERNS = re.compile(
r"(muss|müssen|hat\s+[\w\s]*zu\s|ist\s+[\w\s]*sicherzustellen|"
r"ist\s+verpflichtet|sind\s+verpflichtet|darf\s+nicht|"
r"shall|must|required\s+to|is\s+required|shall\s+not)",
re.IGNORECASE,
)
# CE relevance keywords
CE_KEYWORDS = re.compile(
r"(maschine|schutzeinrichtung|gefährdung|quetsch|scher|stoß|"
r"schneid|fang|einzug|absturz|druck|explosion|brand|"
r"elektrisch|spannung|erdung|schutzleiter|not-halt|"
r"betriebsanleitung|kennzeichnung|prüfung|prüfpflicht|"
r"instandhaltung|wartung|sicherheitsabstand|"
r"schutzmaßnahme|persönliche schutzausrüstung|psa|"
r"machine|guard|hazard|crush|shear|cut|entangle|"
r"lockout|tagout|electrical|grounding|emergency stop|"
r"safety distance|protective device|ppe|inspection)",
re.IGNORECASE,
)
HAZARD_CATEGORIES = {
"quetsch|crush|squeeze": "mechanical_crushing",
"schneid|cut": "mechanical_cutting",
"fang|einzug|entangle|draw": "mechanical_entanglement",
"absturz|fall": "fall_hazard",
"explosion|ex-bereich|atex": "explosion_hazard",
"brand|fire|feuer": "fire_hazard",
"elektrisch|electrical|spannung|voltage": "electrical_hazard",
"lärm|noise|schall": "noise_hazard",
"gefahrstoff|hazardous substance|chemical": "chemical_hazard",
"ergonomie|ergonomic|heben|lift": "ergonomic_hazard",
"temperatur|heat|hitze|kälte|cold": "thermal_hazard",
"strahlung|radiation|laser": "radiation_hazard",
"not-halt|emergency stop|e-stop": "emergency_stop",
"lockout|tagout|loto": "lockout_tagout",
"kennzeichnung|label|marking|sign": "safety_marking",
"prüfung|inspection|test": "inspection_requirement",
"instandhaltung|maintenance|wartung": "maintenance",
"schutzeinrichtung|guard|protective device": "protective_device",
"betriebsanleitung|instruction|manual": "operating_instructions",
"druck|pressure|behälter|vessel|kessel|boiler": "pressure_hazard",
}
# Source-based overrides: TRGS docs about chemicals/storage
# should never be classified as mechanical hazards
_CHEMICAL_SOURCES = re.compile(
r"trgs\s*(5[0-9]{2}|7[0-9]{2}|9[0-9]{2}|4[0-9]{2}|6[0-9]{2})",
re.IGNORECASE,
)
def _classify_hazard(text: str, source: str) -> str:
"""Classify hazard with source-aware overrides."""
# TRGS sources → chemical/pressure/explosion, never mechanical
if _CHEMICAL_SOURCES.search(source):
if re.search(r"explosion|ex-bereich|atex|zündfähig", text, re.IGNORECASE):
return "explosion_hazard"
if re.search(r"druck|pressure|behälter|vessel", text, re.IGNORECASE):
return "pressure_hazard"
if re.search(r"brand|fire|feuer", text, re.IGNORECASE):
return "fire_hazard"
return "chemical_hazard"
# Standard pattern matching (order matters — specific first)
for pattern, category in HAZARD_CATEGORIES.items():
if re.search(pattern, text, re.IGNORECASE):
return category
return "general"
def scroll_chunks(source_filter: str = None) -> list[dict]:
"""Scroll through Qdrant to get all relevant chunks."""
chunks = []
offset = None
batch = 100
while True:
scroll_body = {
"limit": batch,
"with_payload": True,
"with_vector": False,
}
if offset is not None:
scroll_body["offset"] = offset
resp = httpx.post(
f"{QDRANT_URL}/collections/{COLLECTION}/points/scroll",
json=scroll_body,
timeout=30.0,
)
data = resp.json()
points = data.get("result", {}).get("points", [])
next_offset = data.get("result", {}).get("next_page_offset")
for pt in points:
payload = pt.get("payload", {})
source = payload.get("source", payload.get("filename", ""))
text = payload.get("chunk_text", "")
# Filter for TRBS/TRGS/ASR/OSHA
source_lower = source.lower()
is_relevant = any(k in source_lower for k in
["trbs", "trgs", "asr", "osha"])
if not is_relevant:
continue
# Check for obligation patterns
if not OBLIGATION_PATTERNS.search(text):
continue
# Check CE relevance
if not CE_KEYWORDS.search(text):
continue
# Classify hazard category (source-aware)
hazard = _classify_hazard(text, source)
# Determine obligation type
if re.search(r"muss|müssen|shall|must|required", text, re.IGNORECASE):
obl_type = "MUSS"
elif re.search(r"soll|sollte|should", text, re.IGNORECASE):
obl_type = "SOLL"
else:
obl_type = "MUSS"
chunks.append({
"source": source,
"section": payload.get("section", ""),
"paragraph": payload.get("paragraph", ""),
"obligation_text": text.strip()[:500],
"hazard_category": hazard,
"obligation_type": obl_type,
"ce_relevance": "high" if hazard != "general" else "medium",
"filename": payload.get("filename", ""),
})
if next_offset is None or not points:
break
offset = next_offset
if len(chunks) % 500 == 0:
logger.info(" Scanned... %d obligations found so far", len(chunks))
return chunks
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--output", default="/tmp/ce_obligations.json")
args = parser.parse_args()
logger.info("Scanning %s for CE obligations...", COLLECTION)
obligations = scroll_chunks()
logger.info("Found %d CE-relevant obligations", len(obligations))
# Stats
by_source = {}
by_hazard = {}
for o in obligations:
src = o["source"][:30]
by_source[src] = by_source.get(src, 0) + 1
by_hazard[o["hazard_category"]] = by_hazard.get(o["hazard_category"], 0) + 1
logger.info("\nBy source:")
for src, cnt in sorted(by_source.items(), key=lambda x: -x[1])[:20]:
logger.info(" %4d %s", cnt, src)
logger.info("\nBy hazard category:")
for cat, cnt in sorted(by_hazard.items(), key=lambda x: -x[1]):
logger.info(" %4d %s", cnt, cat)
# Save
Path(args.output).write_text(
json.dumps(obligations, indent=2, ensure_ascii=False)
)
logger.info("\nSaved to %s", args.output)
if __name__ == "__main__":
main()
@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Add L2 sub-topics to broad tokens. Instead of just "incident",
produces "incident:response", "incident:detection", etc.
Only processes tokens with >500 controls AND <90% audit accuracy.
Usage:
python3 /app/scripts/gpre0_add_subtopics.py --dry-run
python3 /app/scripts/gpre0_add_subtopics.py
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-subtopics")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_subtopic_checkpoints")
# Tokens that are too broad — need L2 sub-topics
BROAD_TOKENS = {
# Round 1 (already done)
"risk_management", "policy", "audit_logging", "incident",
"access_control", "compliance_audit", "asset_management",
"key_management", "third_party_management", "monitoring",
"financial_reporting", "data_classification", "change_management",
"alerting", "multi_factor_auth", "api_security",
"certificate_management", "human_resources_security",
"training", "data_processing_agreement", "data_processing_register",
"consumer_protection", "input_validation", "vulnerability",
"dpia", "data_breach_notification", "backup",
"supply_chain_due_diligence", "awareness",
"privacy_by_design", "credentials", "logging_configuration",
# Round 2 (remaining large tokens)
"supervisory_authority", "certification", "secure_development",
"product_safety", "personal_data", "data_subject_rights", "consent",
"ai_system", "encryption", "data_retention", "disaster_recovery",
"data_transfer", "aml", "transport_encryption", "network_security",
"physical_security", "medical_device", "patch_management",
"cookie_consent", "video_surveillance", "network_segmentation",
"telecommunications", "privileged_access", "session_management",
"password_policy", "governance", "whistleblowing", "payment_services",
"health_data", "sensitive_data", "ecommerce", "sustainability_reporting",
"critical_infrastructure", "regulatory",
}
SYSTEM_PROMPT = """Du bist ein Compliance-Spezialist. Jeder Control hat bereits ein Hauptthema (L1 Token).
Deine Aufgabe: Bestimme ein SPEZIFISCHES Sub-Thema (L2) innerhalb des Hauptthemas.
Das L2 Sub-Thema soll den KONKRETEN Aspekt beschreiben. Verwende kurze, klare englische Bezeichnungen.
Beispiele:
- L1=incident, Titel="Incident Response Plan erstellen" → L2="response_plan"
- L1=incident, Titel="Sicherheitsvorfälle erkennen" → L2="detection"
- L1=incident, Titel="Recovery nach Vorfall dokumentieren" → L2="recovery"
- L1=incident, Titel="Forensische Analyse durchführen" → L2="forensics"
- L1=risk_management, Titel="Risikobewertung durchführen" → L2="assessment"
- L1=risk_management, Titel="Risikominderungsmaßnahmen umsetzen" → L2="treatment"
- L1=risk_management, Titel="Restrisiko akzeptieren" → L2="acceptance"
- L1=access_control, Titel="Rollenbasierte Zugriffskontrolle" → L2="rbac"
- L1=access_control, Titel="Zugriffsrechte regelmäßig prüfen" → L2="access_review"
- L1=access_control, Titel="Identitätsmanagement implementieren" → L2="identity_management"
- L1=monitoring, Titel="Systemverfügbarkeit überwachen" → L2="availability"
- L1=monitoring, Titel="Sicherheitsereignisse überwachen" → L2="security_events"
- L1=policy, Titel="Datenschutzrichtlinie erstellen" → L2="data_protection"
- L1=policy, Titel="Acceptable Use Policy definieren" → L2="acceptable_use"
- L1=policy, Titel="Passwortrichtlinie festlegen" → L2="password"
- L1=financial_reporting, Titel="Jahresabschluss erstellen" → L2="annual_accounts"
- L1=financial_reporting, Titel="Steuererklärung einreichen" → L2="tax"
- L1=alerting, Titel="Datenpanne an Behörde melden" → L2="breach_notification"
- L1=alerting, Titel="Sicherheitswarnung eskalieren" → L2="escalation"
REGELN:
- L2 soll 1-3 Wörter sein, snake_case
- L2 soll SPEZIFISCH sein (nicht das L1 wiederholen)
- Verwende konsistente L2-Bezeichnungen für ähnliche Controls
Antworte NUR als JSON-Array: [{"id":"...","l2":"subtopic"}, ...]"""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude for L2 sub-topic assignment."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'L1="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:80]}"'
)
prompt = "Bestimme L2 Sub-Topics:\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
return [], usage
except httpx.TimeoutException:
logger.error("TIMEOUT — skipping")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s")
time.sleep(60)
else:
logger.error("API error %d", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Failed: %s", e)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Build LIKE patterns for broad tokens
like_clauses = " OR ".join(
f"cc.generation_metadata->>'merge_group_hint' LIKE '%:{tok}:%'"
for tok in BROAD_TOKENS
)
with engine.connect() as c:
rows = c.execute(text(f"""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.release_state NOT IN ('deprecated', 'rejected')
AND ({like_clauses})
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
obj = parts[1] if len(parts) > 1 else ""
if obj in BROAD_TOKENS:
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint, "current_object": obj,
})
logger.info("Found %d controls in broad tokens to add L2 sub-topics", len(controls))
# Process
total_tagged = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections = []
l2_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(0, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
l2 = r.get("l2", "")
if not l2:
total_skipped += 1
continue
total_tagged += 1
old_hint = ctrl["current_hint"]
parts = old_hint.split(":", 2)
action = parts[0] if parts else "implement"
l1 = parts[1] if len(parts) > 1 else "unknown"
phase = parts[2] if len(parts) > 2 else "implementation"
# New format: action:L1_L2:phase
new_obj = f"{l1}_{l2}"
new_hint = f"{action}:{new_obj}:{phase}"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": old_hint,
"new_hint": new_hint,
})
l2_stats[l1][l2] += 1
processed = min(i + args.batch_size, len(controls))
if processed % 5000 < args.batch_size or processed >= len(controls):
logger.info(
"Progress: %d/%d (tagged=%d skip=%d)",
processed, len(controls), total_tagged, total_skipped,
)
time.sleep(0.3)
# Report
cost_in = total_input_tokens / 1_000_000 * 0.80
cost_out = total_output_tokens / 1_000_000 * 4.00
logger.info("\n" + "=" * 60)
logger.info("SUBTOPIC REPORT")
logger.info("=" * 60)
logger.info("Total: %d | Tagged: %d | Skipped: %d", len(controls), total_tagged, total_skipped)
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
# Show L2 distribution per L1
for l1, subs in sorted(l2_stats.items()):
top_subs = sorted(subs.items(), key=lambda x: -x[1])[:10]
logger.info("\n%s (%d unique L2):", l1, len(subs))
for l2, cnt in top_subs:
logger.info(" %4d %s_%s", cnt, l1, l2)
# Save corrections
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file = CHECKPOINT_DIR / "corrections_subtopics.json"
corr_file.write_text(json.dumps(corrections))
logger.info("\nSaved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("DRY RUN — not updating DB")
return
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints updated.", len(corrections))
if __name__ == "__main__":
main()
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Apply saved corrections from JSON file to DB (crash recovery)."""
import argparse
import json
import logging
import os
from pathlib import Path
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("apply-corrections")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("file", help="Path to corrections JSON file")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
corrections = json.loads(Path(args.file).read_text())
logger.info("Loaded %d corrections from %s", len(corrections), args.file)
if args.dry_run:
for c in corrections[:10]:
logger.info(" %s: %s%s", c["uuid"][:8], c["old_hint"], c["new_hint"])
logger.info("DRY RUN — not applying")
return
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
applied = 0
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
applied += 1
logger.info("Applied %d corrections.", applied)
if __name__ == "__main__":
main()
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Fix bad L2 subtopics: stakeholder_*, escalation fragments, *_approval*, *_documentation."""
import json
import logging
import os
import time
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("fix-subtopics")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
SYSTEM_PROMPT = """Du klassifizierst Controls mit einem L1_L2 Token. Das L2 soll den KONKRETEN fachlichen Aspekt beschreiben.
VERBOTENE L2-Wörter (zu generisch):
- stakeholder (zu vage — WER sind die Stakeholder? WAS wird getan?)
- documentation (ist eine Handlung, kein Thema)
- approval (ist eine Handlung)
- communication (zu vage)
Stattdessen SPEZIFISCH:
- "stakeholder_notification" bei Behördenmeldung → "authority_reporting"
- "stakeholder_consultation" bei DSFA → "impact_consultation"
- "stakeholder_engagement" bei Training → "participant_selection"
- "escalation_procedure""severity_classification" oder "response_plan"
- "access_documentation""access_policy" oder "permission_matrix"
- "approval_process""authorization_workflow" oder "sign_off"
L2 = 1-3 Wörter, snake_case, FACHLICH SPEZIFISCH.
Antworte NUR als JSON-Array: [{"id":"...","token":"L1_L2"}, ...]"""
def main():
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.release_state NOT IN ('deprecated', 'rejected')
AND cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND (
cc.generation_metadata->>'merge_group_hint' LIKE '%stakeholder%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%_escalation_%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%_approval_%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%response_time%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%machine_re%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%management_app%'
)
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint,
"current_object": parts[1] if len(parts) > 1 else "",
})
logger.info("Found %d controls with bad subtopics to fix", len(controls))
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
corrections = []
total_fixed = 0
batch_size = 20
for i in range(0, len(controls), batch_size):
batch = controls[i:i + batch_size]
items = [
f'- id="{c["control_id"]}" cur="{c["current_object"]}" t="{c["title"]}" o="{c["objective"][:80]}"'
for c in batch
]
try:
resp = httpx.post(ANTHROPIC_URL, headers=headers, json={
"model": "claude-haiku-4-5-20251001",
"max_tokens": 1500, "temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": "Fix:\n" + "\n".join(items)}],
}, timeout=45.0)
resp.raise_for_status()
content = resp.json().get("content", [{}])[0].get("text", "")
start = content.find("[")
end = content.rfind("]") + 1
results = json.loads(content[start:end]) if start >= 0 else []
except Exception as e:
logger.error("Batch %d failed: %s", i, e)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token or new_token == ctrl["current_object"]:
continue
if "stakeholder" in new_token or "approval" in new_token:
continue # Still bad
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
total_fixed += 1
if (i + batch_size) % 200 < batch_size:
logger.info("Progress: %d/%d (fixed=%d)", min(i + batch_size, len(controls)), len(controls), total_fixed)
time.sleep(0.3)
logger.info("Fixed: %d of %d controls", total_fixed, len(controls))
# Save + apply
Path("/tmp/corrections_bad_subtopics.json").write_text(json.dumps(corrections))
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done.")
if __name__ == "__main__":
main()
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Fix generic tokens: Re-classify controls that were assigned to
action-based tokens (documentation, procedure, process, etc.)
instead of topic-based tokens.
Runs sequentially in 5 batches. NO retry on timeout.
Usage:
python3 /app/scripts/gpre0_fix_generic_tokens.py --dry-run
python3 /app/scripts/gpre0_fix_generic_tokens.py
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-fix-generic")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_fix_checkpoints")
# Tokens that are ACTION-based, not TOPIC-based → must be re-classified
FORBIDDEN_TOKENS = {
"documentation", "procedure", "process",
"compliance_reporting", "records_management",
}
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control dem THEMA zu, nicht der Handlung.
KRITISCH: Die Tokens "documentation", "procedure", "process", "compliance_reporting",
"records_management" sind VERBOTEN. Klassifiziere nach dem INHALTLICHEN THEMA.
Beispiele:
- "Risikobewertung dokumentieren" → risk_management (NICHT documentation)
- "Incident-Verfahren definieren" → incident (NICHT procedure)
- "Verschlüsselungsprozess implementieren" → encryption (NICHT process)
- "Audit-Ergebnisse berichten" → compliance_audit (NICHT compliance_reporting)
- "Datenschutz-Unterlagen verwalten" → personal_data (NICHT records_management)
- "Löschkonzept dokumentieren" → data_retention (NICHT documentation)
- "Zertifizierungsverfahren definieren" → certification (NICHT procedure)
- "Schulungsprozess durchführen" → training (NICHT process)
ERLAUBTE TOKENS:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy, training, awareness, incident, risk_management,
third_party_management, change_management, asset_management,
human_resources_security
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]"""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude. NO retry on timeout."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'cur="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:100]}"'
)
prompt = "Klassifiziere nach THEMA (nicht Handlung):\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
return [], usage
except httpx.TimeoutException:
logger.error("TIMEOUT — skipping batch")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s")
time.sleep(60)
else:
logger.error("API error %d", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Failed: %s", e)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load only controls with forbidden tokens
forbidden_pattern = "|".join(
f":{tok}:" for tok in FORBIDDEN_TOKENS
)
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.release_state NOT IN ('deprecated', 'rejected')
AND (
cc.generation_metadata->>'merge_group_hint' LIKE '%:documentation:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:procedure:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:process:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:compliance_reporting:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:records_management:%'
)
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint,
"current_object": parts[1] if len(parts) > 1 else hint,
})
logger.info("Found %d controls with forbidden tokens to re-classify", len(controls))
# Process
total_fixed = 0
total_kept = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections = []
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(0, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token or new_token in FORBIDDEN_TOKENS:
total_kept += 1
continue
old_obj = ctrl["current_object"]
if new_token != old_obj:
total_fixed += 1
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
change_stats[old_obj][new_token] += 1
else:
total_kept += 1
processed = min(i + args.batch_size, len(controls))
if processed % 2000 < args.batch_size or processed >= len(controls):
logger.info(
"Progress: %d/%d (fixed=%d kept=%d skip=%d)",
processed, len(controls), total_fixed, total_kept, total_skipped,
)
time.sleep(0.3)
# Report
cost_in = total_input_tokens / 1_000_000 * 0.80
cost_out = total_output_tokens / 1_000_000 * 4.00
logger.info("\n" + "=" * 60)
logger.info("GENERIC TOKEN FIX REPORT")
logger.info("=" * 60)
logger.info("Total: %d controls", len(controls))
logger.info("Fixed: %d", total_fixed)
logger.info("Kept: %d (LLM also chose forbidden → kept as-is)", total_kept)
logger.info("Skipped: %d", total_skipped)
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
logger.info("\nTop changes:")
flat = []
for old, news in change_stats.items():
for new, cnt in news.items():
flat.append((cnt, old, new))
for cnt, old, new in sorted(flat, reverse=True)[:30]:
logger.info(" %4d × %s%s", cnt, old, new)
# Save corrections
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file = CHECKPOINT_DIR / "corrections_generic_fix.json"
corr_file.write_text(json.dumps(corrections))
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("DRY RUN — not updating DB")
return
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints corrected.", len(corrections))
if __name__ == "__main__":
main()
+37
View File
@@ -0,0 +1,37 @@
#!/bin/bash
# Run all 10 batches sequentially. Safe: if one fails, the rest don't run.
# Each batch saves corrections to JSON before applying to DB.
#
# Usage: bash /app/scripts/gpre0_run_all.sh
# bash /app/scripts/gpre0_run_all.sh 5 # start from batch 5
set -e
START=${1:-1}
TOTAL=10
echo "=== Starting from batch $START of $TOTAL ==="
for i in $(seq $START $TOTAL); do
echo ""
echo "================================================================"
echo " BATCH $i/$TOTAL$(date)"
echo "================================================================"
PYTHONPATH=/app python3 /app/scripts/gpre0_validate_hints.py \
--batch-id $i \
--total-batches $TOTAL \
--batch-size 20
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
echo "BATCH $i FAILED with exit code $EXIT_CODE"
echo "Resume with: bash /app/scripts/gpre0_run_all.sh $i"
exit $EXIT_CODE
fi
echo "BATCH $i DONE — $(date)"
done
echo ""
echo "ALL $TOTAL BATCHES COMPLETE!"
@@ -0,0 +1,351 @@
#!/usr/bin/env python3
"""
Phase 2: Validate and correct merge_group_hints using Claude Haiku.
Re-classifies each control's object token against the expanded ontology
(74 canonical tokens). Corrects wrong hints in the DB.
SAFETY: Split into 4 batches. NEVER retries on timeout (double-billing!).
Writes checkpoint after each API call for safe resume.
Usage:
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1 --dry-run
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1
python3 /app/scripts/gpre0_validate_hints.py --batch-id 2
python3 /app/scripts/gpre0_validate_hints.py --batch-id 3
python3 /app/scripts/gpre0_validate_hints.py --batch-id 4
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-validate")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_checkpoints")
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control GENAU EINEM Token zu.
REGEL: Waehle IMMER den naechstbesten Token aus der Liste. OTHER nur wenn ABSOLUT
kein Token auch nur entfernt passt (<1% der Faelle). Im Zweifel: den breitesten
passenden Token waehlen (z.B. "policy" fuer Governance-Dokumente, "procedure" fuer
Ablauf-Definitionen, "risk_management" fuer Bewertungen).
TOKENS:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring (NUR Echtzeit-Systemueberwachung),
audit_logging (Protokollierung/Audit Trail), siem, alerting (Meldepflichten),
compliance_audit (externe Pruefungen), vulnerability, patch_management,
backup, disaster_recovery, physical_security, secure_development,
api_security, input_validation, container_security, logging_configuration
DATA_PROTECTION: personal_data (DSGVO-Verarbeitung), sensitive_data (Art.9),
health_data, consent, data_subject_rights, data_retention, data_transfer,
data_breach_notification, dpia, data_processing_agreement, privacy_by_design,
data_processing_register, data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy (Richtlinie definieren), procedure (Verfahren definieren),
process (Betriebsprozess ausfuehren), training (Schulung), awareness,
incident (Vorfallsbehandlung), risk_management, third_party_management,
change_management, documentation, records_management, compliance_reporting,
asset_management, human_resources_security
REGULATORY: supervisory_authority, certification (Zertifizierung/Konformitaet),
product_safety, ai_system, financial_reporting, aml, whistleblowing,
consumer_protection, ecommerce, telecommunications, medical_device,
payment_services, critical_infrastructure, supply_chain_due_diligence,
sustainability_reporting
ABGRENZUNGEN:
- monitoring = NUR Echtzeit-Systemueberwachung, NICHT Audit/Schulung/Bewertung
- audit_logging = Protokollierung, NICHT externe Pruefung (→ compliance_audit)
- procedure = Verfahren DEFINIEREN, NICHT Vorfaelle behandeln (→ incident)
- personal_data = DSGVO-Verarbeitung, NICHT Zertifizierung (→ certification)
- alerting = Meldepflichten, NICHT Vorfallsbehandlung (→ incident)
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]
KEIN weiterer Text. Nur das Array."""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude. NO RETRY on timeout (double-billing risk!)."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'cur="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:100]}"'
)
prompt = "Klassifiziere:\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
logger.warning("No JSON array in response")
return [], usage
except httpx.TimeoutException:
# CRITICAL: Do NOT retry! Log and skip.
logger.error("TIMEOUT — skipping batch (NOT retrying to avoid double-billing)")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s then skipping")
time.sleep(60)
else:
logger.error("API error %d — skipping batch", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Request failed — skipping: %s", e)
return [], {}
def load_checkpoint(batch_id: int) -> int:
"""Load last processed index for this batch."""
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
if cp_file.exists():
data = json.loads(cp_file.read_text())
return data.get("last_index", 0)
return 0
def save_checkpoint(batch_id: int, last_index: int, stats: dict):
"""Save progress checkpoint."""
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
cp_file.write_text(json.dumps({
"batch_id": batch_id,
"last_index": last_index,
**stats,
}))
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-id", type=int, required=True)
parser.add_argument("--total-batches", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--resume", action="store_true",
help="Resume from checkpoint")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load ALL control IDs ordered deterministically, then select quarter
with engine.connect() as c:
all_ids = c.execute(text("""
SELECT cc.id
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.generation_metadata->>'merge_group_hint' != ''
AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.id
""")).fetchall()
total = len(all_ids)
chunk = total // args.total_batches
start_idx = (args.batch_id - 1) * chunk
end_idx = total if args.batch_id == args.total_batches else args.batch_id * chunk
batch_ids = [str(r[0]) for r in all_ids[start_idx:end_idx]]
logger.info("Batch %d/%d: controls %d-%d (%d controls of %d total)",
args.batch_id, args.total_batches, start_idx, end_idx, len(batch_ids), total)
# Load full data for this batch
id_list = ",".join(f"'{uid}'" for uid in batch_ids)
with engine.connect() as c:
rows = c.execute(text(f"""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.id IN ({id_list})
ORDER BY cc.id
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint, "current_object": parts[1] if len(parts) > 1 else hint,
})
# Resume from checkpoint?
start_from = 0
if args.resume:
start_from = load_checkpoint(args.batch_id)
if start_from > 0:
logger.info("Resuming from index %d", start_from)
# Process
total_same = 0
total_changed = 0
total_other = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections: list[dict] = []
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(start_from, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
save_checkpoint(args.batch_id, i + args.batch_size, {
"same": total_same, "changed": total_changed,
"other": total_other, "skipped": total_skipped,
})
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token:
total_skipped += 1
continue
old_obj = ctrl["current_object"]
if new_token == "OTHER":
total_other += 1
elif new_token == old_obj:
total_same += 1
else:
total_changed += 1
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
change_stats[old_obj][new_token] += 1
# Checkpoint every batch
save_checkpoint(args.batch_id, i + args.batch_size, {
"same": total_same, "changed": total_changed,
"other": total_other, "skipped": total_skipped,
})
processed = min(i + args.batch_size, len(controls))
if processed % 1000 < args.batch_size or processed >= len(controls):
logger.info(
"Batch %d: %d/%d (same=%d changed=%d other=%d skip=%d)",
args.batch_id, processed, len(controls),
total_same, total_changed, total_other, total_skipped,
)
time.sleep(0.3)
# Report
cost_in = total_input_tokens / 1_000_000 * 0.80 # Haiku
cost_out = total_output_tokens / 1_000_000 * 4.00 # Haiku
total_cost = cost_in + cost_out
total_proc = total_same + total_changed + total_other
logger.info("\n" + "=" * 60)
logger.info("BATCH %d REPORT", args.batch_id)
logger.info("=" * 60)
logger.info("Processed: %d | Skipped: %d", total_proc, total_skipped)
logger.info("Same: %d (%.1f%%)", total_same, total_same / max(total_proc, 1) * 100)
logger.info("Changed: %d (%.1f%%)", total_changed, total_changed / max(total_proc, 1) * 100)
logger.info("OTHER: %d (%.1f%%)", total_other, total_other / max(total_proc, 1) * 100)
logger.info("Cost: $%.2f (Haiku)", total_cost)
logger.info("Cost/ctrl: $%.5f", total_cost / max(total_proc, 1))
# Top changes
flat = []
for old, news in change_stats.items():
for new, cnt in news.items():
flat.append((cnt, old, new))
logger.info("\nTop Changes:")
for cnt, old, new in sorted(flat, reverse=True)[:20]:
logger.info(" %4d × %s%s", cnt, old, new)
# Always save corrections to file (recovery safety)
corr_file = CHECKPOINT_DIR / f"corrections_batch_{args.batch_id}.json"
if corrections:
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file.write_text(json.dumps(corrections))
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("\nDRY RUN — not updating DB")
return
# Apply corrections in single transaction
if corrections:
logger.info("\nApplying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints corrected.", len(corrections))
else:
logger.info("No corrections needed.")
if __name__ == "__main__":
main()
+214
View File
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
G-pre2 v2: Build Master Controls directly from canonical tokens.
No K-Means needed — Phase 2 already normalized merge_group_hints
to 74 canonical tokens. Each token = one object group.
Groups controls by (canonical_token, phase) and creates MCs
for tokens with >=2 distinct phases.
Usage:
python3 /app/scripts/gpre2_direct_mc.py --dry-run
python3 /app/scripts/gpre2_direct_mc.py --min-phases 2
"""
import argparse
import json
import logging
import os
from collections import defaultdict
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre2-direct")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
PHASE_ORDER = {
"scope": 0, "definition": 1, "governance": 1,
"design": 2, "implementation": 3, "configuration": 3,
"operation": 4, "training": 4, "monitoring": 5,
"testing": 6, "review": 7, "assessment": 8, "remediation": 8,
"validation": 9, "reporting": 10, "evidence": 11,
}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--min-phases", type=int, default=2)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Step 1: Load all controls with merge_group_hint
logger.info("Loading controls...")
with engine.connect() as c:
rows = c.execute(text("""
SELECT id, control_id,
generation_metadata->>'merge_group_hint' AS hint
FROM canonical_controls
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
AND generation_metadata->>'merge_group_hint' != ''
AND release_state NOT IN ('deprecated', 'rejected')
""")).fetchall()
logger.info("Loaded %d controls", len(rows))
# Step 2: Group by (object_token, phase)
token_phases: dict[str, dict[str, list]] = defaultdict(
lambda: defaultdict(list)
)
for uuid, control_id, hint in rows:
parts = hint.split(":", 2)
if len(parts) < 2:
continue
action = parts[0]
obj = parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
token_phases[obj][phase].append((str(uuid), control_id, action))
logger.info("Found %d unique object tokens", len(token_phases))
# Step 3: Create Master Controls
master_controls = []
master_members = []
for token, phases in token_phases.items():
if len(phases) < args.min_phases:
continue
sorted_phases = sorted(
phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99)
)
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
total = sum(phase_counts.values())
master_controls.append({
"canonical_name": token,
"phases_covered": json.dumps(sorted_phases),
"phase_control_count": json.dumps(phase_counts),
"total_controls": total,
})
for phase, controls in phases.items():
for ctrl_uuid, ctrl_id, action in controls:
master_members.append({
"canonical_name": token,
"control_uuid": ctrl_uuid,
"phase": phase,
"action": action,
})
logger.info(
"Created %d Master Controls with %d members (min %d phases)",
len(master_controls), len(master_members), args.min_phases,
)
# Stats
if master_controls:
counts = [mc["total_controls"] for mc in master_controls]
phases_per = [
len(json.loads(mc["phases_covered"])) for mc in master_controls
]
logger.info(" Avg controls/MC: %.1f", sum(counts) / len(counts))
logger.info(" Max controls/MC: %d", max(counts))
logger.info(" Avg phases/MC: %.1f", sum(phases_per) / len(phases_per))
logger.info(" Max phases/MC: %d", max(phases_per))
# Size distribution
logger.info("\n Size distribution:")
logger.info(" ≤10: %d", sum(1 for c in counts if c <= 10))
logger.info(" 11-50: %d", sum(1 for c in counts if 11 <= c <= 50))
logger.info(" 51-200: %d", sum(1 for c in counts if 51 <= c <= 200))
logger.info(" 201-500: %d", sum(1 for c in counts if 201 <= c <= 500))
logger.info(" 501-2K: %d", sum(1 for c in counts if 501 <= c <= 2000))
logger.info(" >2K: %d", sum(1 for c in counts if c > 2000))
# Top 15
top = sorted(master_controls, key=lambda x: -x["total_controls"])[:15]
logger.info("\n Top 15 Master Controls:")
for mc in top:
logger.info(
" %6d %s (%d phases)",
mc["total_controls"],
mc["canonical_name"],
len(json.loads(mc["phases_covered"])),
)
if args.dry_run:
logger.info("\nDRY RUN — not writing to DB")
return
# Step 4: Write to DB
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
c.execute(text("DELETE FROM master_control_members"))
c.execute(text("DELETE FROM master_controls"))
# Get next object_group_id
max_gid = c.execute(
text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
).scalar()
next_gid = max_gid + 1
mc_uuids = {}
for mc in master_controls:
gid = next_gid
next_gid += 1
mc_id = f"MC-{gid}"
c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:mcid, :gid, :name,
CAST(:phases AS jsonb),
CAST(:pcounts AS jsonb), :total)
"""), {
"mcid": mc_id, "gid": gid,
"name": mc["canonical_name"],
"phases": mc["phases_covered"],
"pcounts": mc["phase_control_count"],
"total": mc["total_controls"],
})
mc_uuid = c.execute(text(
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
), {"mcid": mc_id}).scalar()
mc_uuids[mc["canonical_name"]] = str(mc_uuid)
# Insert members
mem_count = 0
for mem in master_members:
mc_uuid = mc_uuids.get(mem["canonical_name"])
if not mc_uuid:
continue
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid),
:phase, :action)
"""), {
"mc": mc_uuid,
"ctrl": mem["control_uuid"],
"phase": mem["phase"],
"action": mem["action"],
})
mem_count += 1
logger.info("Wrote %d MCs + %d members to DB", len(master_controls), mem_count)
if __name__ == "__main__":
main()
@@ -0,0 +1,298 @@
#!/usr/bin/env python3
"""
G-pre3: Split large Master Controls by regulation source.
For each MC with >200 controls:
1. Load member controls with parent's source_citation->>'source'
2. Group by regulation source
3. Sources with >= MIN_SOURCE_SIZE → new sub-MC
4. Small sources → merge into "mixed" bucket
5. UNKNOWN (no source_citation) → sub-cluster by embedding if >MAX_MC
6. Delete original large MC, create new sub-MCs
Usage:
python3 /app/scripts/gpre3_regulation_split.py --dry-run
python3 /app/scripts/gpre3_regulation_split.py --min-source 15 --max-mc 100
"""
import argparse
import json
import logging
import os
import re
from collections import defaultdict
from sqlalchemy import create_engine, text
from services.embedding_utils import subcluster_controls
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre3")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
# ── Source key normalization ────────────────────────────────────────
# fmt: off
_SOURCE_SHORT: dict[str, str] = {
"DSGVO (EU) 2016/679": "dsgvo", "NIS2-Richtlinie (EU) 2022/2555": "nis2",
"KI-Verordnung (EU) 2024/1689": "ai_act", "Cyber Resilience Act (CRA)": "cra",
"Digital Services Act (DSA)": "dsa", "Digital Markets Act (DMA)": "dma",
"Digital Operational Resilience Act": "dora", "Data Governance Act (DGA)": "dga",
"Data Act": "data_act", "Maschinenverordnung (EU) 2023/1230": "machinery_reg",
"Medizinprodukteverordnung (EU) 2017/745 (MDR)": "mdr",
"European Health Data Space": "ehds", "European Accessibility Act": "eaa",
"EU Cybersecurity Act": "eu_csa", "EU Blue Guide 2022": "eu_blue_guide",
"EU-US Data Privacy Framework": "eu_us_dpf", "Markets in Crypto-Assets (MiCA)": "mica",
"Standardvertragsklauseln (SCC)": "scc", "ePrivacy-Richtlinie": "eprivacy",
"Batterieverordnung (EU) 2023/1542": "battery_reg",
"Bundesdatenschutzgesetz (BDSG)": "bdsg",
"BSI-Gesetz (BSIG 2025, NIS2-Umsetzung)": "bsig",
"BSI-Kritisverordnung (BSI-KritisV)": "bsi_kritisv",
"Geldwaeschegesetz (GwG)": "gwg", "Hinweisgeberschutzgesetz (HinSchG)": "hinschg",
"Lieferkettensorgfaltspflichtengesetz (LkSG)": "lksg",
"KRITIS-Dachgesetz (KRITISDachG)": "kritisdachg",
"NIST SP 800-53 Rev. 5": "nist_800_53", "NIST Cybersecurity Framework 2.0": "nist_csf",
"NIST Privacy Framework 1.0": "nist_privacy",
"NIST SP 800-207 (Zero Trust)": "nist_zero_trust",
"NIST SP 800-218 (SSDF)": "nist_ssdf", "NIST SP 800-63-3": "nist_800_63",
"NIST AI Risk Management Framework": "nist_ai_rmf",
"NISTIR 8259A IoT Security": "nist_iot",
"OWASP Top 10 (2021)": "owasp_top10", "OWASP API Security Top 10 (2023)": "owasp_api",
"OWASP ASVS 4.0": "owasp_asvs", "OWASP SAMM 2.0": "owasp_samm",
"OWASP MASVS 2.0": "owasp_masvs", "OWASP Mobile Top 10": "owasp_mobile",
"ENISA": "enisa", "TDDDG": "tdddg", "TKG": "tkg", "TMG": "tmg",
"BGB": "bgb", "UWG": "uwg", "UrhG": "urhg",
"BAIT (BaFin 2024)": "bait", "VAIT (BaFin 2022)": "vait",
"AML-Verordnung": "aml_reg", "Zahlungsdiensterichtlinie 2": "psd2",
"Telekommunikationsgesetz Oesterreich": "at_tkg",
"Österreichisches Datenschutzgesetz (DSG)": "at_dsg",
"Allgemeines Gleichbehandlungsgesetz (AGG)": "agg",
"Aktiengesetz (AktG)": "aktg", "Handelsgesetzbuch (HGB)": "hgb",
"GmbH-Gesetz (GmbHG)": "gmbhg", "Insolvenzordnung (InsO)": "inso",
"Gewerbeordnung (GewO)": "gewo", "Abgabenordnung (AO)": "ao",
}
# fmt: on
def source_to_key(source: str) -> str:
"""Normalize regulation source name to a short slug key."""
if source in _SOURCE_SHORT:
return _SOURCE_SHORT[source]
s = source.lower()
s = re.sub(r"\(.*?\)", "", s)
s = re.sub(r"[^a-z0-9äöüß]+", "_", s)
s = re.sub(r"_+", "_", s).strip("_")
return s[:40] if s else "unknown"
# ── Main ───────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--min-source", type=int, default=15,
help="Min controls per source for own sub-MC")
parser.add_argument("--max-mc", type=int, default=100,
help="Max controls per sub-MC before sub-clustering")
parser.add_argument("--threshold", type=int, default=200,
help="Only split MCs with more than N controls")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Step 1: Find large master controls
with engine.connect() as c:
large_mcs = c.execute(text("""
SELECT mc.id, mc.master_control_id, mc.object_group_id,
mc.canonical_name, mc.total_controls
FROM master_controls mc
WHERE mc.total_controls > :threshold
ORDER BY mc.total_controls DESC
"""), {"threshold": args.threshold}).fetchall()
logger.info("Found %d MCs with >%d controls", len(large_mcs), args.threshold)
if not large_mcs:
return
# Step 2: Build split plans
all_splits = []
for mc_uuid, mc_id, og_id, canonical, total in large_mcs:
plan = _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args)
all_splits.append(plan)
total_new = sum(len(sp["sub_groups"]) for sp in all_splits)
total_covered = sum(
sum(len(sg["controls"]) for sg in sp["sub_groups"]) for sp in all_splits
)
logger.info("SUMMARY: %d large MCs → %d sub-MCs (%d controls)", len(all_splits), total_new, total_covered)
if args.dry_run:
logger.info("DRY RUN — not writing to DB")
return
_write_splits(engine, all_splits)
def _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args) -> dict:
"""Build a regulation-source split plan for one large MC."""
logger.info("\n━━━ %s: %s (%d controls) ━━━", mc_id, canonical, total)
with engine.connect() as c:
members = c.execute(text("""
SELECT mcm.control_uuid, mcm.phase, mcm.action,
cc.control_id, cc.title,
COALESCE(pc.source_citation->>'source', 'UNKNOWN') AS src
FROM master_control_members mcm
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
LEFT JOIN canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
"""), {"mc_uuid": str(mc_uuid)}).fetchall()
by_source: dict[str, list[dict]] = defaultdict(list)
for ctrl_uuid, phase, action, cid, title, src in members:
by_source[src].append({
"control_uuid": str(ctrl_uuid), "phase": phase,
"action": action, "control_id": cid, "title": title,
})
sorted_sources = sorted(by_source.items(), key=lambda x: -len(x[1]))
for src, ctrls in sorted_sources[:8]:
logger.info(" %4d %s", len(ctrls), src)
if len(sorted_sources) > 8:
logger.info(" ... +%d more sources", len(sorted_sources) - 8)
plan = {"mc_uuid": str(mc_uuid), "mc_id": mc_id, "og_id": og_id,
"canonical": canonical, "total": total, "sub_groups": []}
own_mc_sources = []
mixed_controls = []
for src, ctrls in sorted_sources:
if src == "UNKNOWN":
continue
if len(ctrls) >= args.min_source:
own_mc_sources.append((src, ctrls))
else:
mixed_controls.extend(ctrls)
unknown_controls = by_source.get("UNKNOWN", [])
# (a) Named regulation sub-MCs
for src, ctrls in own_mc_sources:
key = source_to_key(src)
name = f"{canonical}_{key}"
_add_subgroups(plan, name, src, ctrls, args.max_mc)
# (b) Mixed small-source bucket
if mixed_controls:
_add_subgroups(plan, f"{canonical}_mixed", "mixed", mixed_controls, args.max_mc)
# (c) UNKNOWN bucket
if unknown_controls:
_add_subgroups(plan, f"{canonical}_general", "general", unknown_controls, args.max_mc)
logger.info("%d sub-groups:", len(plan["sub_groups"]))
for sg in sorted(plan["sub_groups"], key=lambda x: -len(x["controls"])):
logger.info(" %4d %s", len(sg["controls"]), sg["name"])
return plan
def _add_subgroups(plan: dict, name: str, source: str,
controls: list[dict], max_mc: int):
"""Add controls as one or more sub-groups to the plan."""
if len(controls) <= max_mc:
plan["sub_groups"].append({"name": name, "source": source, "controls": controls})
else:
clusters = subcluster_controls(controls, max_mc)
for i, cluster in enumerate(clusters):
sub_name = f"{name}_{i+1}" if len(clusters) > 1 else name
plan["sub_groups"].append({"name": sub_name, "source": source, "controls": cluster})
def _write_splits(engine, splits: list[dict]):
"""Apply split plan: delete old MCs, create new object_groups + MCs."""
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
max_gid = c.execute(
text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
).scalar()
next_gid = max_gid + 1
total_mc = 0
total_mem = 0
for sp in splits:
c.execute(text(
"DELETE FROM master_control_members "
"WHERE master_control_uuid = CAST(:u AS uuid)"
), {"u": sp["mc_uuid"]})
c.execute(text(
"DELETE FROM master_controls WHERE id = CAST(:u AS uuid)"
), {"u": sp["mc_uuid"]})
logger.info("Deleted %s (%s)", sp["mc_id"], sp["canonical"])
for sg in sp["sub_groups"]:
if not sg["controls"]:
continue
gid = next_gid
next_gid += 1
members_list = list({ctrl["control_id"] for ctrl in sg["controls"]})
c.execute(text("""
INSERT INTO object_groups
(group_id, canonical_name, member_count, members, top_controls_count)
VALUES (:gid, :name, :cnt, CAST(:members AS jsonb), 0)
"""), {"gid": gid, "name": sg["name"], "cnt": len(members_list),
"members": json.dumps(members_list)})
by_phase: dict[str, list[dict]] = defaultdict(list)
for ctrl in sg["controls"]:
by_phase[ctrl["phase"]].append(ctrl)
sorted_phases = sorted(by_phase.keys())
phase_counts = {p: len(v) for p, v in by_phase.items()}
mc_id = f"MC-{gid}"
c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:mcid, :gid, :name,
CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
"""), {"mcid": mc_id, "gid": gid, "name": sg["name"],
"phases": json.dumps(sorted_phases),
"pcounts": json.dumps(phase_counts),
"total": sum(phase_counts.values())})
mc_uuid = c.execute(text(
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
), {"mcid": mc_id}).scalar()
for ctrl in sg["controls"]:
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
"""), {"mc": str(mc_uuid), "ctrl": ctrl["control_uuid"],
"phase": ctrl["phase"], "action": ctrl["action"]})
total_mem += 1
total_mc += 1
logger.info("Created %d new MCs with %d members", total_mc, total_mem)
with engine.connect() as c:
stats = c.execute(text("""
SELECT count(*), count(CASE WHEN total_controls > 200 THEN 1 END),
AVG(total_controls)::int
FROM compliance.master_controls
""")).fetchone()
logger.info("Final: %d MCs, %d still >200, avg %d controls/MC", stats[0], stats[1], stats[2])
if __name__ == "__main__":
main()
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Phase 0: Quality Audit for Master Control Assignments.
Uses Claude Sonnet to validate whether controls are correctly assigned
to their Master Controls. Samples controls from large and small MCs.
Usage:
python3 /app/scripts/gpre_quality_audit.py
python3 /app/scripts/gpre_quality_audit.py --large-sample 50 --small-sample 10
python3 /app/scripts/gpre_quality_audit.py --mc MC-8292 # single MC
"""
import argparse
import json
import logging
import os
import random
import time
from collections import defaultdict
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("quality-audit")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = os.getenv("AUDIT_MODEL", "claude-sonnet-4-20250514")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
SYSTEM_PROMPT = """Du bist ein Compliance-Experte der prüft ob Controls korrekt zu Master Controls zugeordnet sind.
Für jeden Control beantworte:
1. MATCH: Gehört dieser Control thematisch zum Master Control Topic?
2. CONFIDENCE: Wie sicher bist du? (0.0-1.0)
3. REASON: Kurze Begründung (max 1 Satz)
4. SUGGESTED_TOPIC: Falls MATCH=false, welches Topic wäre korrekt?
Wichtige Unterscheidungen:
- "monitoring" = kontinuierliche Überwachung, Alerting, Log-Analyse
- "training" = Schulung, Awareness, Lernmaterialien
- "personal_data" = personenbezogene Daten, DSGVO-Betroffenenrechte
- "procedure" = Verfahren, Prozesse (aber NICHT wenn es spezifisch um Incidents geht)
- "incident" = Sicherheitsvorfälle, Breach Notification, Recovery
- "policy" = Richtlinien, Regelwerke, Governance-Dokumente
- "encryption" = Verschlüsselung, Kryptografie, Key Management
- "audit_logging" = Protokollierung, Audit Trail, Nachvollziehbarkeit
Antworte NUR als JSON-Array, ein Objekt pro Control."""
def call_claude(controls_batch: list[dict], mc_topic: str) -> list[dict]:
"""Send a batch of controls to Claude for validation."""
items = []
for c in controls_batch:
items.append(
f"- Control '{c['control_id']}': "
f"Titel=\"{c['title']}\", "
f"Objective=\"{c['objective'][:150]}...\", "
f"Phase={c['phase']}, Action={c['action']}"
)
prompt = (
f"Master Control Topic: \"{mc_topic}\"\n\n"
f"Prüfe diese {len(controls_batch)} Controls:\n\n"
+ "\n".join(items)
+ "\n\nAntwort als JSON-Array mit Feldern: "
"control_id, match (bool), confidence (float), reason (str), "
"suggested_topic (str, nur wenn match=false)."
)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 2048,
"temperature": 0.1,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
for attempt in range(3):
try:
resp = httpx.post(
ANTHROPIC_URL,
headers=headers,
json=payload,
timeout=60.0,
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
# Parse JSON from response
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
results = json.loads(content[start:end])
return results, usage
logger.warning("No JSON array in response: %s", content[:200])
return [], usage
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
wait = 30 * (attempt + 1)
logger.warning("Rate limited, waiting %ds...", wait)
time.sleep(wait)
else:
logger.error("API error: %s", e)
return [], {}
except Exception as e:
logger.error("Request failed (attempt %d): %s", attempt + 1, e)
if attempt < 2:
time.sleep(5)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--large-sample", type=int, default=50,
help="Controls to sample per large MC")
parser.add_argument("--small-sample", type=int, default=10,
help="Controls to sample per small MC")
parser.add_argument("--small-mc-count", type=int, default=50,
help="Number of small MCs to audit")
parser.add_argument("--mc", type=str, default=None,
help="Audit a single MC by ID (e.g., MC-8292)")
parser.add_argument("--batch-size", type=int, default=10,
help="Controls per API call")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load MCs to audit
with engine.connect() as c:
if args.mc:
mcs = c.execute(text("""
SELECT id, master_control_id, canonical_name, total_controls
FROM master_controls WHERE master_control_id = :mc
"""), {"mc": args.mc}).fetchall()
else:
# Large MCs (>200) + random small MCs
large = c.execute(text("""
SELECT id, master_control_id, canonical_name, total_controls
FROM master_controls WHERE total_controls > 200
ORDER BY total_controls DESC
""")).fetchall()
small = c.execute(text("""
SELECT id, master_control_id, canonical_name, total_controls
FROM master_controls WHERE total_controls BETWEEN 10 AND 200
ORDER BY RANDOM() LIMIT :cnt
"""), {"cnt": args.small_mc_count}).fetchall()
mcs = list(large) + list(small)
logger.info("Auditing %d Master Controls", len(mcs))
# Results tracking
total_checked = 0
total_match = 0
total_mismatch = 0
total_input_tokens = 0
total_output_tokens = 0
mc_results: dict[str, dict] = {}
all_mismatches: list[dict] = []
for mc_uuid, mc_id, canonical, total in mcs:
is_large = total > 200
sample_size = args.large_sample if is_large else args.small_sample
# Sample controls
with engine.connect() as c:
controls = c.execute(text("""
SELECT mcm.control_uuid, mcm.phase, mcm.action,
cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective
FROM master_control_members mcm
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
WHERE mcm.master_control_uuid = CAST(:mc AS uuid)
ORDER BY RANDOM()
LIMIT :n
"""), {"mc": str(mc_uuid), "n": sample_size}).fetchall()
if not controls:
continue
control_dicts = [
{"control_uuid": str(r[0]), "phase": r[1], "action": r[2],
"control_id": r[3], "title": r[4] or "", "objective": r[5] or ""}
for r in controls
]
logger.info("\n%s: %s (%d total, sampling %d)",
mc_id, canonical, total, len(control_dicts))
mc_match = 0
mc_mismatch = 0
# Process in batches
for i in range(0, len(control_dicts), args.batch_size):
batch = control_dicts[i:i + args.batch_size]
results, usage = call_claude(batch, canonical)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
for r in results:
if r.get("match", True):
mc_match += 1
total_match += 1
else:
mc_mismatch += 1
total_mismatch += 1
mismatch = {
"mc_id": mc_id,
"mc_topic": canonical,
"control_id": r.get("control_id", "?"),
"confidence": r.get("confidence", 0),
"reason": r.get("reason", ""),
"suggested_topic": r.get("suggested_topic", ""),
}
all_mismatches.append(mismatch)
total_checked += len(results)
# Rate limit
time.sleep(1)
accuracy = mc_match / (mc_match + mc_mismatch) if (mc_match + mc_mismatch) > 0 else 1.0
mc_results[mc_id] = {
"canonical": canonical, "total": total,
"checked": mc_match + mc_mismatch,
"match": mc_match, "mismatch": mc_mismatch,
"accuracy": accuracy,
}
logger.info("%d/%d correct (%.1f%%)",
mc_match, mc_match + mc_mismatch, accuracy * 100)
# Final report
_print_report(mc_results, all_mismatches, total_checked, total_match,
total_mismatch, total_input_tokens, total_output_tokens)
def _print_report(mc_results, mismatches, checked, match, mismatch,
input_tok, output_tok):
"""Print the quality audit report."""
logger.info("\n" + "=" * 70)
logger.info("QUALITY AUDIT REPORT")
logger.info("=" * 70)
logger.info("Total controls checked: %d", checked)
logger.info("Correct assignments: %d (%.1f%%)",
match, match / max(checked, 1) * 100)
logger.info("Wrong assignments: %d (%.1f%%)",
mismatch, mismatch / max(checked, 1) * 100)
# Cost estimate
cost_input = input_tok / 1_000_000 * 3.0 # Sonnet input: $3/MTok
cost_output = output_tok / 1_000_000 * 15.0 # Sonnet output: $15/MTok
logger.info("\nAPI Usage: %d input + %d output tokens",
input_tok, output_tok)
logger.info("Estimated cost: $%.2f", cost_input + cost_output)
# Per-MC breakdown (worst first)
logger.info("\n--- Per-MC Accuracy (worst first) ---")
sorted_mcs = sorted(mc_results.values(), key=lambda x: x["accuracy"])
for mc in sorted_mcs:
flag = "" if mc["accuracy"] < 0.9 else "⚠️" if mc["accuracy"] < 0.95 else ""
logger.info(" %s %s (%s): %d/%d = %.1f%% [total: %d]",
flag, mc["canonical"][:30].ljust(30),
"large" if mc["total"] > 200 else "small",
mc["match"], mc["checked"],
mc["accuracy"] * 100, mc["total"])
# Top mismatches
if mismatches:
logger.info("\n--- Mismatches (all %d) ---", len(mismatches))
for m in sorted(mismatches, key=lambda x: -x.get("confidence", 0)):
logger.info(" %s in %s (%s) → should be '%s': %s",
m["control_id"], m["mc_id"], m["mc_topic"],
m["suggested_topic"], m["reason"])
# Size-class breakdown
large_mcs = [m for m in mc_results.values() if m["total"] > 200]
small_mcs = [m for m in mc_results.values() if m["total"] <= 200]
if large_mcs:
lg_acc = sum(m["match"] for m in large_mcs) / max(sum(m["checked"] for m in large_mcs), 1)
logger.info("\nLarge MCs (>200): %.1f%% accuracy (%d MCs)",
lg_acc * 100, len(large_mcs))
if small_mcs:
sm_acc = sum(m["match"] for m in small_mcs) / max(sum(m["checked"] for m in small_mcs), 1)
logger.info("Small MCs (≤200): %.1f%% accuracy (%d MCs)",
sm_acc * 100, len(small_mcs))
if __name__ == "__main__":
main()
+119 -6
View File
@@ -460,12 +460,50 @@ WICHTIGE REGELN:
7. MERGE-KEY: Erzeuge im JSON-Output ein zusaetzliches Feld "merge_key" mit
dem Format: "action_type:normalized_object:control_phase"
WICHTIG: Waehle normalized_object NUR aus dieser Liste kanonischer Tokens:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy, procedure, process, training, awareness, incident,
risk_management, third_party_management, change_management, documentation,
records_management, compliance_reporting, asset_management,
human_resources_security
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting
Wenn KEIN Token passt: "OTHER:kurzbeschreibung" (z.B. "OTHER:battery_recycling")
ABGRENZUNGEN (haeufige Fehler vermeiden!):
- monitoring = NUR kontinuierliche Echtzeit-Ueberwachung von Systemen
- audit_logging = Protokollierung, Audit Trail, Nachvollziehbarkeit
- compliance_audit = externe Pruefungen, Zertifizierungsaudits
- training = Schulungen DURCHFUEHREN (nicht "ueberwachen")
- procedure = Verfahren DEFINIEREN (nicht Incident-Behandlung)
- incident = Sicherheitsvorfaelle BEHANDELN
- alerting = Meldepflichten und Benachrichtigungen
- personal_data = DSGVO-Verarbeitungsgrundsaetze (nicht Zertifizierung!)
- certification = Zertifizierung/Konformitaet (nicht Datenschutz)
Beispiele:
- "implement:api_rate_limiting:implementation"
- "define:access_control_policy:definition"
- "monitor:third_party_vulnerabilities:monitoring"
- "test:authentication_mechanism:testing"
- "implement:multi_factor_auth:implementation"
- "define:access_control:definition"
- "monitor:network_security:monitoring"
- "test:vulnerability:testing"
- "report:supervisory_authority:reporting"
- "implement:audit_logging:implementation" (NICHT monitoring!)
- "define:incident:definition" (Incident-Verfahren, NICHT procedure!)
- "train:training:operation" (Schulung, NICHT monitoring!)
8. APPLICABILITY + SCANNER: Bestimme fuer jedes Control:
- applicability: Unter welchen Bedingungen gilt dieses Control?
@@ -2472,6 +2510,81 @@ def _ensure_list(val) -> list:
return []
# Canonical object tokens from object_ontology (loaded once)
_CANONICAL_OBJECTS: set[str] | None = None
def _load_canonical_objects() -> set[str]:
"""Load canonical tokens from DB, fallback to hardcoded set."""
global _CANONICAL_OBJECTS
if _CANONICAL_OBJECTS is not None:
return _CANONICAL_OBJECTS
try:
from db.session import get_engine
from sqlalchemy import text
engine = get_engine()
with engine.connect() as c:
rows = c.execute(text(
"SELECT canonical_token FROM compliance.object_ontology"
)).fetchall()
_CANONICAL_OBJECTS = {r[0] for r in rows}
except Exception:
_CANONICAL_OBJECTS = set()
if not _CANONICAL_OBJECTS:
_CANONICAL_OBJECTS = {
"multi_factor_auth", "password_policy", "credentials",
"session_management", "privileged_access", "access_control",
"encryption", "transport_encryption", "key_management",
"certificate_management", "network_security",
"network_segmentation", "firewall", "vpn", "remote_access",
"monitoring", "audit_logging", "siem", "alerting",
"compliance_audit", "vulnerability", "patch_management",
"backup", "disaster_recovery", "personal_data",
"sensitive_data", "consent", "data_subject_rights",
"data_retention", "data_transfer", "data_breach_notification",
"dpia", "data_processing_agreement", "privacy_by_design",
"policy", "procedure", "process", "training", "awareness",
"incident", "risk_management", "third_party_management",
"change_management", "documentation", "supervisory_authority",
"certification", "product_safety", "ai_system", "aml",
"critical_infrastructure", "medical_device",
}
return _CANONICAL_OBJECTS
def _validate_merge_key(merge_key: str) -> str:
"""Validate merge_key object against canonical ontology.
Returns the merge_key (possibly corrected). Logs warnings for
unknown objects so they can be tracked.
"""
parts = merge_key.split(":", 2)
if len(parts) < 2:
return merge_key
action, obj = parts[0], parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
# Accept OTHER: prefix (LLM signaling unknown object)
if obj.startswith("OTHER:"):
return merge_key
# Check against canonical ontology
canonical = _load_canonical_objects()
if obj in canonical:
return merge_key
# Try normalize_object() as fallback
from services.control_dedup import normalize_object
normed = normalize_object(obj)
if normed in canonical:
return f"{action}:{normed}:{phase}"
# Unknown object — log and keep as-is (will be clustered by embedding)
logger.debug("merge_key unknown object: %s (normed: %s)", obj, normed)
return merge_key
# ---------------------------------------------------------------------------
# Decomposition Pass
# ---------------------------------------------------------------------------
@@ -3025,10 +3138,10 @@ class DecompositionPass:
evidence_type=parsed.get("evidence_type", ""),
provides_context=_ensure_list(parsed.get("provides_context", [])),
)
# Store merge_key from LLM output in metadata
# Store merge_key from LLM output in metadata — with validation
llm_merge_key = parsed.get("merge_key", "")
if llm_merge_key:
atomic.merge_group_hint = llm_merge_key
atomic.merge_group_hint = _validate_merge_key(llm_merge_key)
atomic.parent_control_uuid = obl["parent_uuid"]
atomic.obligation_candidate_id = obl["candidate_id"]
@@ -0,0 +1,84 @@
"""Shared embedding + sub-clustering utilities for the control pipeline."""
import logging
import os
from collections import defaultdict
import httpx
import numpy as np
from sklearn.cluster import MiniBatchKMeans
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv(
"EMBEDDING_SERVICE_URL", "http://embedding-service:8087"
)
def embed_texts(texts: list[str]) -> np.ndarray | None:
"""Embed texts via the embedding-service in batches of 64."""
try:
result = np.zeros((len(texts), 1024), dtype=np.float32)
batch_size = 64
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
for attempt in range(3):
try:
with httpx.Client(
timeout=httpx.Timeout(60.0, connect=10.0)
) as client:
resp = client.post(
f"{EMBEDDING_URL}/embed", json={"texts": batch}
)
resp.raise_for_status()
embs = resp.json().get("embeddings", [])
end = min(i + len(embs), len(texts))
result[i:end] = np.array(embs, dtype=np.float32)
break
except Exception as e:
if attempt == 2:
logger.error("Embed batch %d failed: %s", i, e)
import time
time.sleep(2)
return result
except Exception as e:
logger.error("Embedding failed: %s", e)
return None
def subcluster_controls(
controls: list[dict], target_size: int = 50
) -> list[list[dict]]:
"""Sub-cluster controls by embedding similarity.
Returns a list of clusters. Falls back to naive chunking
if embedding fails.
"""
if len(controls) <= target_size:
return [controls]
texts = [c.get("title", "") or c.get("control_id", "") for c in controls]
embeddings = embed_texts(texts)
if embeddings is None:
return [
controls[i : i + target_size]
for i in range(0, len(controls), target_size)
]
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized = embeddings / norms
k = max(2, min(len(controls) // target_size, 30))
kmeans = MiniBatchKMeans(
n_clusters=k,
batch_size=min(100, len(controls)),
max_iter=50,
random_state=42,
)
labels = kmeans.fit_predict(normalized)
clusters: dict[int, list[dict]] = defaultdict(list)
for i, ctrl in enumerate(controls):
clusters[int(labels[i])].append(ctrl)
return list(clusters.values())
+79
View File
@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""Crawl OSHA Technical Manual — all chapters as HTML."""
import json
import logging
import time
from pathlib import Path
from playwright.sync_api import sync_playwright
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("osha-crawl")
OUTPUT_DIR = Path(__file__).parent / "otm_chapters"
BASE = "https://www.osha.gov"
def main():
OUTPUT_DIR.mkdir(exist_ok=True)
registry = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
# Step 1: Get all chapter URLs
page.goto(f"{BASE}/otm", timeout=30000)
time.sleep(5)
links = page.query_selector_all('a[href*="/otm/"]')
chapters = []
seen = set()
for l in links:
href = l.get_attribute("href") or ""
text = (l.inner_text() or "").strip()
if href and "chapter" in href and href not in seen and text:
seen.add(href)
chapters.append({"url": href, "title": text})
logger.info("Found %d chapters", len(chapters))
# Step 2: Download each chapter
for i, ch in enumerate(chapters):
url = ch["url"] if ch["url"].startswith("http") else BASE + ch["url"]
slug = ch["url"].replace("/otm/", "").replace("/", "_")
outfile = OUTPUT_DIR / f"{slug}.html"
logger.info("[%d/%d] %s", i + 1, len(chapters), ch["title"][:60])
if outfile.exists():
logger.info(" Already exists, skipping")
ch["local_path"] = str(outfile)
registry.append(ch)
continue
try:
page.goto(url, timeout=30000)
time.sleep(3)
content = page.content()
outfile.write_text(content)
ch["local_path"] = str(outfile)
logger.info(" Saved: %s (%.1f KB)", outfile.name, len(content) / 1024)
except Exception as e:
logger.error(" Failed: %s", e)
ch["local_path"] = None
registry.append(ch)
time.sleep(1)
browser.close()
reg_file = Path(__file__).parent / "otm_registry.json"
reg_file.write_text(json.dumps(registry, indent=2, ensure_ascii=False))
ok = sum(1 for r in registry if r.get("local_path"))
logger.info("Done: %d/%d chapters saved", ok, len(registry))
if __name__ == "__main__":
main()
+2
View File
@@ -3,9 +3,11 @@ from fastapi import APIRouter
from api.collections import router as collections_router
from api.documents import router as documents_router
from api.search import router as search_router
from api.tenant_documents import router as tenant_documents_router
router = APIRouter()
router.include_router(collections_router, tags=["Collections"])
router.include_router(documents_router, tags=["Documents"])
router.include_router(tenant_documents_router, tags=["Tenant Documents"])
router.include_router(search_router, tags=["Search"])
+289
View File
@@ -0,0 +1,289 @@
"""
Tenant-isolated document upload, listing, and deletion.
Each tenant gets their own Qdrant collection (bp_docs_tenant_{short_id}).
Documents are stored in MinIO under tenant-specific paths.
No data crosses tenant boundaries.
Endpoints:
POST /api/v1/tenant/documents - Upload + process PDF
GET /api/v1/tenant/documents - List tenant's documents
DELETE /api/v1/tenant/documents/{doc_id} - Delete document + vectors
GET /api/v1/tenant/documents/{doc_id}/status - Processing status
"""
import json
import logging
import uuid
from typing import Optional
from fastapi import APIRouter, File, Form, HTTPException, Header, Request, UploadFile
from pydantic import BaseModel
from api.auth import optional_jwt_auth
from embedding_client import embedding_client
from html_utils import decode_html_bytes, looks_like_html, strip_html
from minio_client_wrapper import minio_wrapper
from qdrant_client_wrapper import qdrant_wrapper
logger = logging.getLogger("rag-service.api.tenant-documents")
router = APIRouter(prefix="/api/v1/tenant/documents")
VECTOR_DIM = 1024 # bge-m3 dimension
MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
ALLOWED_TYPES = {"application/pdf", "text/html", "text/plain"}
PDF_MAGIC = b"%PDF"
def _collection_name(tenant_id: str) -> str:
"""Derive tenant-specific Qdrant collection name."""
short = tenant_id.replace("-", "")[:12]
return f"bp_docs_tenant_{short}"
def _storage_path(tenant_id: str, document_id: str, filename: str) -> str:
"""Derive tenant-isolated storage path."""
short = tenant_id.replace("-", "")[:12]
return f"tenant_docs/{short}/{document_id}/{filename}"
def _extract_tenant_id(
request: Request,
x_tenant_id: Optional[str] = Header(None),
) -> str:
"""Extract tenant ID from header. Required for all tenant endpoints."""
tid = x_tenant_id or request.headers.get("x-tenant-id", "")
if not tid:
raise HTTPException(status_code=400, detail="X-Tenant-ID header required")
return tid
# ── Response models ────────────────────────────────────────────────
class DocumentResponse(BaseModel):
id: str
filename: str
file_size: int
status: str
chunk_count: int
collection: str
created_at: Optional[str] = None
class DocumentListResponse(BaseModel):
documents: list[DocumentResponse]
total: int
# ── Endpoints ──────────────────────────────────────────────────────
@router.post("", response_model=DocumentResponse)
async def upload_tenant_document(
request: Request,
file: UploadFile = File(...),
x_tenant_id: Optional[str] = Header(None),
chunk_size: int = Form(default=512),
chunk_overlap: int = Form(default=50),
metadata_json: Optional[str] = Form(default=None),
):
"""Upload a document, process it, and index in tenant-specific collection."""
optional_jwt_auth(request)
tenant_id = _extract_tenant_id(request, x_tenant_id)
# Read + validate
file_bytes = await file.read()
if len(file_bytes) == 0:
raise HTTPException(status_code=400, detail="Empty file")
if len(file_bytes) > MAX_FILE_SIZE:
raise HTTPException(status_code=413, detail=f"File too large (max {MAX_FILE_SIZE // 1024 // 1024} MB)")
filename = file.filename or "document.pdf"
content_type = file.content_type or "application/octet-stream"
# PDF magic bytes check
if filename.lower().endswith(".pdf") and not file_bytes[:4].startswith(PDF_MAGIC):
raise HTTPException(status_code=400, detail="File claims to be PDF but magic bytes don't match")
document_id = str(uuid.uuid4())
collection = _collection_name(tenant_id)
object_name = _storage_path(tenant_id, document_id, filename)
# Ensure collection exists
await qdrant_wrapper.create_collection(collection, VECTOR_DIM)
# Store in MinIO
try:
await minio_wrapper.upload_document(
object_name=object_name,
data=file_bytes,
content_type=content_type,
metadata={"document_id": document_id, "tenant_id": tenant_id},
)
except Exception as exc:
logger.error("MinIO upload failed for tenant %s: %s", tenant_id, exc)
raise HTTPException(status_code=500, detail="Storage failed")
# Extract text
try:
text = await _extract_text(file_bytes, filename, content_type)
except Exception as exc:
logger.error("Text extraction failed: %s", exc)
raise HTTPException(status_code=500, detail=f"Text extraction failed: {exc}")
if not text or not text.strip():
raise HTTPException(status_code=400, detail="No text could be extracted")
# Chunk
chunk_result = await embedding_client.chunk_text(
text=text, strategy="recursive",
chunk_size=chunk_size, overlap=chunk_overlap,
)
chunks = chunk_result.chunks
chunks_meta = chunk_result.chunks_with_metadata
if not chunks:
raise HTTPException(status_code=400, detail="Chunking produced zero chunks")
# Embed
embeddings = await embedding_client.generate_embeddings(chunks)
# Parse extra metadata
extra_metadata = {}
if metadata_json:
try:
extra_metadata = json.loads(metadata_json)
except json.JSONDecodeError:
pass
# Build payloads with tenant isolation
_STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
payloads = []
for i, chunk in enumerate(chunks):
payload = {
"document_id": document_id,
"tenant_id": tenant_id,
"filename": filename,
"chunk_index": i,
"chunk_text": chunk,
**extra_metadata,
}
if i < len(chunks_meta):
for field in _STRUCT_FIELDS:
value = chunks_meta[i].get(field)
if value is not None and value != "":
payload[field] = value
payloads.append(payload)
# Index in tenant collection
indexed = await qdrant_wrapper.index_documents(
collection=collection, vectors=embeddings, payloads=payloads,
)
logger.info(
"Tenant %s: uploaded %s (%d chunks, %d vectors) to %s",
tenant_id[:8], filename, len(chunks), indexed, collection,
)
return DocumentResponse(
id=document_id, filename=filename,
file_size=len(file_bytes), status="indexed",
chunk_count=len(chunks), collection=collection,
)
@router.get("", response_model=DocumentListResponse)
async def list_tenant_documents(
request: Request,
x_tenant_id: Optional[str] = Header(None),
):
"""List all documents for this tenant."""
optional_jwt_auth(request)
tenant_id = _extract_tenant_id(request, x_tenant_id)
collection = _collection_name(tenant_id)
try:
# Get unique document_ids from Qdrant
docs = await qdrant_wrapper.get_unique_documents(collection)
except Exception:
# Collection doesn't exist yet → no documents
docs = []
return DocumentListResponse(documents=docs, total=len(docs))
@router.delete("/{doc_id}")
async def delete_tenant_document(
doc_id: str,
request: Request,
x_tenant_id: Optional[str] = Header(None),
):
"""Delete a document and all its vectors from tenant collection."""
optional_jwt_auth(request)
tenant_id = _extract_tenant_id(request, x_tenant_id)
collection = _collection_name(tenant_id)
errors = []
# Delete vectors from Qdrant
try:
await qdrant_wrapper.delete_by_filter(
collection=collection,
filter_conditions={"document_id": doc_id},
)
except Exception as exc:
errors.append(f"Qdrant: {exc}")
# Delete file from MinIO
try:
prefix = f"tenant_docs/{tenant_id.replace('-', '')[:12]}/{doc_id}/"
await minio_wrapper.delete_by_prefix(prefix)
except Exception as exc:
errors.append(f"MinIO: {exc}")
if errors:
logger.warning("Partial delete for %s/%s: %s", tenant_id[:8], doc_id[:8], errors)
return {"deleted": True, "warnings": errors}
logger.info("Tenant %s: deleted document %s", tenant_id[:8], doc_id[:8])
return {"deleted": True, "document_id": doc_id}
@router.get("/{doc_id}/status")
async def document_status(
doc_id: str,
request: Request,
x_tenant_id: Optional[str] = Header(None),
):
"""Get processing status for a document."""
optional_jwt_auth(request)
tenant_id = _extract_tenant_id(request, x_tenant_id)
collection = _collection_name(tenant_id)
try:
count = await qdrant_wrapper.count_by_filter(
collection=collection,
filter_conditions={"document_id": doc_id},
)
status = "indexed" if count > 0 else "not_found"
except Exception:
count = 0
status = "not_found"
return {"document_id": doc_id, "status": status, "chunk_count": count}
# ── Helpers ────────────────────────────────────────────────────────
async def _extract_text(file_bytes: bytes, filename: str, content_type: str) -> str:
"""Extract text from PDF, HTML, or plain text."""
if content_type == "application/pdf" or filename.lower().endswith(".pdf"):
return await embedding_client.extract_pdf(file_bytes)
if filename.lower().endswith((".html", ".htm")):
text = decode_html_bytes(file_bytes)
return strip_html(text)
text = file_bytes.decode("utf-8", errors="replace")
if looks_like_html(text):
return strip_html(text)
return text
+10
View File
@@ -122,6 +122,16 @@ class MinioClientWrapper:
logger.error("Failed to delete '%s': %s", object_name, exc)
raise
async def delete_by_prefix(self, prefix: str) -> int:
"""Remove all objects under a prefix."""
objects = self.client.list_objects(settings.MINIO_BUCKET, prefix=prefix, recursive=True)
count = 0
for obj in objects:
self.client.remove_object(settings.MINIO_BUCKET, obj.object_name)
count += 1
logger.info("Deleted %d objects with prefix '%s'", count, prefix)
return count
# ------------------------------------------------------------------
# Presigned URL
# ------------------------------------------------------------------
+68
View File
@@ -235,6 +235,74 @@ class QdrantClientWrapper:
logger.info("Deleted points from '%s' with filter %s", collection, filter_conditions)
return True
# ------------------------------------------------------------------
# Tenant document helpers
# ------------------------------------------------------------------
async def get_unique_documents(self, collection: str) -> list[dict]:
"""Get unique documents from a collection by scrolling and grouping."""
try:
self.client.get_collection(collection)
except Exception:
return []
docs: dict[str, dict] = {}
offset = None
while True:
result = self.client.scroll(
collection_name=collection,
scroll_filter=None,
limit=100,
offset=offset,
with_payload=True,
with_vectors=False,
)
points, next_offset = result
for pt in points:
payload = pt.payload or {}
doc_id = payload.get("document_id", "")
if doc_id and doc_id not in docs:
docs[doc_id] = {
"id": doc_id,
"filename": payload.get("filename", ""),
"file_size": payload.get("file_size", 0),
"status": "indexed",
"chunk_count": 0,
"collection": collection,
}
if doc_id:
docs[doc_id]["chunk_count"] += 1
if next_offset is None:
break
offset = next_offset
return list(docs.values())
async def count_by_filter(
self, collection: str, filter_conditions: dict[str, Any]
) -> int:
"""Count points matching filter."""
try:
self.client.get_collection(collection)
except Exception:
return 0
must_conditions = []
for key, value in filter_conditions.items():
must_conditions.append(
qmodels.FieldCondition(
key=key, match=qmodels.MatchValue(value=value)
)
)
result = self.client.count(
collection_name=collection,
count_filter=qmodels.Filter(must=must_conditions),
exact=True,
)
return result.count
# ------------------------------------------------------------------
# Info
# ------------------------------------------------------------------