Compare commits
3 Commits: 572052285c...0bb9726ddd

| SHA1 |
|---|
| 0bb9726ddd |
| 8510af46eb |
| 81db904b3e |
@@ -0,0 +1,158 @@
# Using the Controls: Guide for Other Sessions

**As of:** 2026-05-07, updated continuously
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)

---

## What are the controls?

The database holds 174,497 atomic compliance controls. Each control is a **single verifiable requirement** taken from a legal source (GDPR, NIS2, NIST, AI Act, etc.).

### Example

```
Control-ID: AUTH-2956-A14
Title: "Implementierung von Multi-Faktor-Authentifizierung prüfen"
Objective: "Sicherstellen, dass MFA korrekt implementiert ist..."
Merge-Key: "verify:multi_factor_auth:testing"
Severity: high
```

## Where are the controls stored?

### Database (PostgreSQL on the Mac Mini)

```sql
-- Query all controls
SELECT id, control_id, title, objective, severity,
       source_citation,  -- legal source (JSON)
       generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE release_state NOT IN ('deprecated', 'rejected');
```

**Connection:**
```bash
# From the MacBook:
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db"

# Or via the control-pipeline container:
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline curl -sf http://127.0.0.1:8098/..."
```

### API (port 8098, reachable only via docker exec)

```bash
# List master controls
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
  curl -sf 'http://127.0.0.1:8098/v1/master-controls?limit=50&sort=total_controls'"

# Master control detail with all members
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
  curl -sf 'http://127.0.0.1:8098/v1/master-controls/MC-8292'"
```

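When the two API calls are scripted, the URLs can be assembled with proper query encoding instead of by hand. The following is a minimal sketch: the helper names are hypothetical, and only the two endpoints and the `limit`/`sort` parameters shown in this guide are assumed to exist.

```python
from urllib.parse import quote, urlencode

# Base URL from this guide; it is only reachable from inside the container.
API_BASE = "http://127.0.0.1:8098"


def master_controls_url(limit: int = 50, sort: str = "total_controls") -> str:
    """Build the list URL with encoded query parameters."""
    query = urlencode({"limit": limit, "sort": sort})
    return f"{API_BASE}/v1/master-controls?{query}"


def master_control_detail_url(mc_id: str) -> str:
    """Build the detail URL for one master control, e.g. 'MC-8292'."""
    return f"{API_BASE}/v1/master-controls/{quote(mc_id, safe='')}"


print(master_controls_url())
print(master_control_detail_url("MC-8292"))
```

The resulting string still has to be passed to `curl` inside the `bp-core-control-pipeline` container, as in the commands in this section.
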
## Structure of the controls

### merge_group_hint (key field!)

Every control has a `merge_group_hint` in the format `action:object:phase`:

```
implement:encryption:implementation
define:access_control:definition
monitor:network_security:monitoring
report:supervisory_authority:reporting
```

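Consumers of the `action:object:phase` format usually need the three parts separately. The sketch below shows one way to split and validate a hint; `MergeKey` and `parse_merge_group_hint` are hypothetical names, not part of the repository.

```python
from typing import NamedTuple


class MergeKey(NamedTuple):
    action: str
    object_token: str
    phase: str


def parse_merge_group_hint(hint: str) -> MergeKey:
    """Split an 'action:object:phase' hint into its three parts."""
    parts = hint.strip().split(":")
    # Reject hints with the wrong number of parts or with empty parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'action:object:phase', got {hint!r}")
    return MergeKey(*parts)


key = parse_merge_group_hint("implement:encryption:implementation")
print(key.object_token)  # encryption
```

Rejecting malformed hints early keeps free-form values from silently ending up in the wrong group.
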
**74 canonical object tokens** (as of 2026-05-07):

| Category | Tokens |
|-----------|--------|
| **Security** | multi_factor_auth, password_policy, credentials, session_management, privileged_access, access_control, encryption, transport_encryption, key_management, certificate_management, network_security, network_segmentation, firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting, compliance_audit, vulnerability, patch_management, backup, disaster_recovery, physical_security, secure_development, api_security, input_validation, container_security, logging_configuration |
| **Data Protection** | personal_data, sensitive_data, health_data, consent, data_subject_rights, data_retention, data_transfer, data_breach_notification, dpia, data_processing_agreement, privacy_by_design, data_processing_register, data_classification, cookie_consent, video_surveillance |
| **Governance** | policy, procedure, process, training, awareness, incident, risk_management, third_party_management, change_management, documentation, records_management, compliance_reporting, asset_management, human_resources_security |
| **Regulatory** | supervisory_authority, certification, product_safety, ai_system, financial_reporting, aml, whistleblowing, consumer_protection, ecommerce, telecommunications, medical_device, payment_services, critical_infrastructure, supply_chain_due_diligence, sustainability_reporting |

### Legal sources (source_citation)

Only the **parent controls** (not the atomic ones!) carry a `source_citation`:

```sql
-- Find controls with a legal source
SELECT cc.control_id, cc.title,
       pc.source_citation->>'source' AS regulation,
       pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation IS NOT NULL
  AND pc.source_citation->>'source' LIKE '%DSGVO%';
```

There are 148 distinct legal sources (GDPR, NIS2, NIST, OWASP, BSI, TKG, etc.).

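The self-join on `parent_control_uuid` can also be done in memory once rows are loaded. This sketch assumes each row is a plain dict keyed by the column names used in the queries; the helper name and the sample rows (including the `"Art. 13"` value format) are made up for illustration.

```python
from typing import Optional


def legal_source(control: dict, controls_by_id: dict) -> Optional[dict]:
    """Return the parent's source_citation for a control, or None."""
    parent = controls_by_id.get(control.get("parent_control_uuid"))
    return parent.get("source_citation") if parent else None


# Made-up rows; only the column names come from the queries in this guide.
parent = {
    "id": "parent-1",
    "parent_control_uuid": None,
    "source_citation": {"source": "DSGVO (EU) 2016/679", "article": "Art. 13"},
}
atomic = {"id": "atomic-1", "parent_control_uuid": "parent-1", "source_citation": None}
controls_by_id = {row["id"]: row for row in (parent, atomic)}

print(legal_source(atomic, controls_by_id)["source"])  # DSGVO (EU) 2016/679
print(legal_source(parent, controls_by_id))            # None
```

As in the SQL, a control without a parent yields no citation, because the lookup goes through the parent row only.
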
## Filtering controls (use cases)

### Example: all GDPR Art. 13 controls (for the DSI review)

```sql
SELECT cc.control_id, cc.title, cc.objective,
       cc.generation_metadata->>'merge_group_hint' AS merge_key,
       pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation->>'source' = 'DSGVO (EU) 2016/679'
  AND pc.source_citation->>'article' LIKE '%13%'
  AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.control_id;
```

### Example: all encryption controls

```sql
SELECT control_id, title, objective
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:encryption:%'
  AND release_state NOT IN ('deprecated', 'rejected');
```

### Example: filtering controls by object token

```sql
-- All controls on a given topic
SELECT control_id, title,
       generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:data_retention:%'
  AND release_state NOT IN ('deprecated', 'rejected');
```

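The `LIKE` filters in these examples embed the token directly in the pattern. Because `_` is a single-character wildcard in `LIKE`, a token such as `data_retention` matches slightly more than intended. The sketch below builds an escaped pattern for use as a bound parameter. The function name is hypothetical, and the `%s` placeholder assumes a psycopg-style driver, which the guide does not specify.

```python
def object_token_pattern(token: str) -> str:
    """Build a LIKE pattern that matches '<action>:<token>:<phase>' hints."""
    # Escape the LIKE metacharacters; backslash is PostgreSQL's default escape.
    escaped = token.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return f"%:{escaped}:%"


# Same query as in the examples, with the pattern passed as a parameter.
QUERY = """
SELECT control_id, title,
       generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE %s
  AND release_state NOT IN ('deprecated', 'rejected');
"""

print(object_token_pattern("data_retention"))  # %:data\_retention:%
```

Anchoring the token between two colons is what keeps `encryption` from also matching `transport_encryption`.
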
## Important tables

| Table | Rows | Description |
|---------|------|-------------|
| `compliance.canonical_controls` | ~294K | All controls (rich + atomic) |
| `compliance.master_controls` | ~5,329 | Grouped master controls |
| `compliance.master_control_members` | ~172K | Mapping control → MC |
| `compliance.object_ontology` | 74 | Canonical object definitions |
| `compliance.regulation_registry` | 223 | Registry of legal sources |

## What is happening right now (2026-05-07)

**Phase 2 is running:** All 174K controls are being re-classified with Claude Haiku. The `merge_group_hint` values are normalized from free-form LLM objects to the 74 canonical tokens. After that:
- Phase 3: Re-clustering (gpre1 with K=20000)
- Phase 4: New master controls (gpre2)
- Phase 5: Regulation-source split (gpre3)

**DO NOT MODIFY:** The `canonical_controls`, `master_controls`, and `object_ontology` tables are being actively processed.

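While Phase 2 is in progress, some hints already use canonical tokens and others still carry free-form objects. The classification itself is done by the LLM. The sketch below only shows a read-only check of which hints are already normalized. The token set is a three-item illustrative subset (the full list lives in `compliance.object_ontology`), and the function names and the free-form sample are hypothetical.

```python
# Illustrative subset; the full list of 74 tokens lives in compliance.object_ontology.
CANONICAL_TOKENS = frozenset({"encryption", "access_control", "data_retention"})


def is_normalized(hint: str, canonical: frozenset = CANONICAL_TOKENS) -> bool:
    """True if the hint has three parts and its object is a canonical token."""
    parts = hint.split(":")
    return len(parts) == 3 and parts[1] in canonical


def split_by_status(hints: list[str]) -> tuple[list[str], list[str]]:
    """Partition hints into (already normalized, still free-form)."""
    done = [h for h in hints if is_normalized(h)]
    todo = [h for h in hints if not is_normalized(h)]
    return done, todo


done, todo = split_by_status([
    "implement:encryption:implementation",
    "implement:disk_crypto_settings:implementation",  # hypothetical free-form object
])
print(len(done), len(todo))  # 1 1
```
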
## DB access quick reference

```bash
# Quick query (one line)
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db -c \"SELECT count(*) FROM compliance.canonical_controls\""

# Interactive session (ssh -t allocates the TTY that docker exec -it needs)
ssh -t macmini "/usr/local/bin/docker exec -it bp-core-postgres psql -U breakpilot -d breakpilot_db"
```
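
The nested quoting in the one-line query is easy to get wrong once the SQL itself contains quotes. A minimal sketch of building the same command from Python with `shlex.quote` follows; the function name is hypothetical, while the host alias, container name, and credentials are the ones used in this guide.

```python
import shlex

DOCKER = "/usr/local/bin/docker"


def remote_psql_command(sql: str) -> list[str]:
    """Build an ssh argv that runs one SQL statement via psql in the container."""
    remote_args = [
        DOCKER, "exec", "bp-core-postgres",
        "psql", "-U", "breakpilot", "-d", "breakpilot_db", "-c", sql,
    ]
    # ssh hands a single string to the remote shell, so quote each argument.
    remote = " ".join(shlex.quote(arg) for arg in remote_args)
    return ["ssh", "macmini", remote]


cmd = remote_psql_command("SELECT count(*) FROM compliance.canonical_controls")
# subprocess.run(cmd, check=True) would execute it; printing keeps the sketch side-effect free.
print(cmd[2])
```

Passing the list to `subprocess.run` without `shell=True` avoids a second layer of local quoting.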
@@ -0,0 +1,162 @@
-- Migration 010: Expanded Object Ontology
-- Expands from 31 to 74 canonical object tokens with clear semantic boundaries.
-- Each token has a description to prevent ambiguous classification.
--
-- IMPORTANT: This migration ADDS new tokens. Existing synonyms are preserved.

SET search_path TO compliance, public;

-- Add description column to object_synonyms if not exists
DO $$ BEGIN
    ALTER TABLE object_synonyms ADD COLUMN IF NOT EXISTS description TEXT;
EXCEPTION WHEN duplicate_column THEN NULL;
END $$;

-- New table: canonical object definitions with clear boundaries
CREATE TABLE IF NOT EXISTS object_ontology (
    canonical_token VARCHAR(100) PRIMARY KEY,
    category VARCHAR(50) NOT NULL, -- security, data_protection, governance, regulatory, technical
    description_de TEXT NOT NULL, -- German description for LLM prompts
    description_en TEXT NOT NULL, -- English description
    NOT_confused_with TEXT, -- Explicit disambiguation
    examples TEXT, -- Example controls that belong here
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- ═══════════════════════════════════════════════════════════════
-- SECURITY & TECHNICAL
-- ═══════════════════════════════════════════════════════════════

-- Authentication & Identity
INSERT INTO object_ontology VALUES
    ('multi_factor_auth', 'security', 'Multi-Faktor-Authentifizierung (2FA/MFA)', 'Multi-factor authentication', 'NOT password_policy (Passwortregeln) oder session_management (Sitzungen)', 'MFA implementieren, 2FA-Pflicht, Authentifizierungsfaktoren'),
    ('password_policy', 'security', 'Passwortrichtlinien und -komplexität', 'Password policies and complexity', 'NOT credentials (allg. Zugangsdaten) oder multi_factor_auth (MFA)', 'Passwortlänge, Komplexität, Rotation, Passwort-Historie'),
    ('credentials', 'security', 'Zugangsdaten-Verwaltung (Tokens, API-Keys, Secrets)', 'Credential management', 'NOT password_policy (Passwortregeln) oder key_management (kryptografisch)', 'API-Key-Rotation, Token-Verwaltung, Secret Storage'),
    ('session_management', 'security', 'Sitzungsverwaltung (Session Timeout, Token-Lifecycle)', 'Session management', 'NOT multi_factor_auth (Login) oder access_control (Berechtigungen)', 'Session Timeout, Token-Invalidierung, Concurrent Sessions'),
    ('privileged_access', 'security', 'Verwaltung privilegierter Zugriffe (Admin, Root)', 'Privileged access management', 'NOT access_control (allg. Zugriffskontrolle)', 'Admin-Konten, Root-Zugriff, PAM, Just-in-Time-Access'),
    ('access_control', 'security', 'Allgemeine Zugriffskontrolle (RBAC, Berechtigungen)', 'Access control (RBAC, permissions)', 'NOT privileged_access (Admin) oder authentication (Login)', 'Rollenbasierte Zugriffskontrolle, Berechtigungsvergabe, Least Privilege')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Encryption & Cryptography
INSERT INTO object_ontology VALUES
    ('encryption', 'security', 'Verschlüsselung at-rest (Datenverschlüsselung)', 'Encryption at rest', 'NOT transport_encryption (in-transit) oder key_management (Schlüssel)', 'AES-256, Festplattenverschlüsselung, DB-Verschlüsselung'),
    ('transport_encryption', 'security', 'Transportverschlüsselung (TLS, HTTPS)', 'Transport encryption (TLS)', 'NOT encryption (at-rest)', 'TLS 1.3, HTTPS, mTLS, Zertifikats-Pinning'),
    ('key_management', 'security', 'Kryptografische Schlüsselverwaltung', 'Cryptographic key management', 'NOT credentials (API-Keys) oder certificate_management (Zertifikate)', 'Key Rotation, HSM, Key Escrow, Schlüsselerzeugung'),
    ('certificate_management', 'security', 'Zertifikatsverwaltung (PKI, X.509)', 'Certificate management (PKI)', 'NOT key_management (Schlüssel) oder encryption (Verschlüsselung)', 'X.509-Zertifikate, PKI, Zertifikatsrückruf, CA-Verwaltung')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Network Security
INSERT INTO object_ontology VALUES
    ('network_security', 'security', 'Allgemeine Netzwerksicherheit', 'General network security', 'NOT network_segmentation (Segmentierung) oder firewall (Regeln)', 'Netzwerk-Hardening, Port-Management, DNS-Sicherheit'),
    ('network_segmentation', 'security', 'Netzwerksegmentierung (VLANs, Zonen)', 'Network segmentation', 'NOT network_security (allg.) oder firewall (Regeln)', 'VLANs, DMZ, Micro-Segmentation, Zero Trust Network'),
    ('firewall', 'security', 'Firewall-Regeln und -Verwaltung', 'Firewall rules and management', 'NOT network_security (allg.)', 'WAF, Firewall-Regeln, Ingress/Egress, Whitelist'),
    ('vpn', 'security', 'VPN-Konfiguration und -Verwaltung', 'VPN configuration', NULL, 'IPSec, WireGuard, Site-to-Site VPN'),
    ('remote_access', 'security', 'Fernzugriff und Remote-Arbeit', 'Remote access', 'NOT vpn (Technologie)', 'Remote Desktop, Bastion Hosts, Jump Server')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Monitoring & Logging (CRITICAL: clear boundaries!)
INSERT INTO object_ontology VALUES
    ('monitoring', 'security', 'Kontinuierliche Echtzeit-Überwachung von Systemen/Metriken', 'Continuous real-time monitoring of systems', 'NOT audit_logging (Protokollierung), NOT training (Schulung), NOT procedure (Verfahren), NOT risk_assessment (Bewertung)', 'System-Health-Monitoring, Verfügbarkeitsüberwachung, Performance-Monitoring, Anomalie-Erkennung in Echtzeit'),
    ('audit_logging', 'security', 'Protokollierung und Audit-Trail (Nachvollziehbarkeit)', 'Audit logging and trail', 'NOT monitoring (Echtzeit-Überwachung), NOT compliance_audit (Prüfungen)', 'Log-Aufzeichnung, Audit Trail, Zeitstempel, Nachvollziehbarkeit, Protokollierung von Zugriffen'),
    ('siem', 'security', 'Security Information and Event Management', 'SIEM', 'NOT monitoring (allg.) oder audit_logging (Protokollierung)', 'SIEM-Korrelation, Security Events, Log-Aggregation'),
    ('alerting', 'security', 'Benachrichtigungen und Meldepflichten bei Sicherheitsereignissen', 'Security alerting and notification obligations', 'NOT monitoring (Überwachung) oder incident (Vorfallsbehandlung)', 'Sicherheitsmeldungen, Breach Notification, Benachrichtigungspflichten'),
    ('compliance_audit', 'governance', 'Compliance-Prüfungen und externe Audits', 'Compliance audits and external reviews', 'NOT audit_logging (technische Protokollierung), NOT monitoring (Überwachung)', 'Externe Prüfung, Jahresabschlussprüfung, Zertifizierungsaudit, Lieferanten-Audit')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Vulnerability & Patch Management
INSERT INTO object_ontology VALUES
    ('vulnerability', 'security', 'Schwachstellenmanagement und -scanning', 'Vulnerability management', 'NOT patch_management (Updates)', 'Vulnerability Scanning, CVE-Tracking, Penetration Testing'),
    ('patch_management', 'security', 'Software-Updates und Patch-Verwaltung', 'Patch management', 'NOT vulnerability (Scanning)', 'Patch-Zyklus, Update-Policy, Hotfix-Prozess')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Backup & Recovery
INSERT INTO object_ontology VALUES
    ('backup', 'security', 'Datensicherung und Backup-Strategien', 'Backup strategies', 'NOT disaster_recovery (Wiederherstellung)', 'Backup-Rotation, Offsite-Backup, Backup-Verschlüsselung'),
    ('disaster_recovery', 'security', 'Notfallwiederherstellung und Business Continuity', 'Disaster recovery', 'NOT backup (Datensicherung) oder incident (Vorfälle)', 'DR-Plan, RTO/RPO, Failover, Business Continuity')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- DATA PROTECTION (CRITICAL: clear boundaries!)
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
    ('personal_data', 'data_protection', 'Verarbeitung personenbezogener Daten (DSGVO-Grundsätze)', 'Personal data processing principles', 'NOT sensitive_data (besondere Kategorien), NOT data_subject_rights (Betroffenenrechte), NOT consent (Einwilligung)', 'Datenminimierung, Zweckbindung, Speicherbegrenzung, Rechtmäßigkeit der Verarbeitung'),
    ('sensitive_data', 'data_protection', 'Besondere Kategorien personenbezogener Daten (Art. 9 DSGVO)', 'Special categories of personal data', 'NOT personal_data (allg.), NOT health_data (Gesundheit)', 'Biometrische Daten, ethnische Herkunft, politische Meinungen, Gewerkschaftszugehörigkeit'),
    ('health_data', 'data_protection', 'Gesundheitsdaten und Medizindaten', 'Health and medical data', 'NOT sensitive_data (allg. besondere Kategorien)', 'Patientendaten, Medizinprodukte-Daten, klinische Daten'),
    ('consent', 'data_protection', 'Einwilligungsmanagement', 'Consent management', 'NOT data_subject_rights (andere Betroffenenrechte)', 'Einwilligung einholen, Widerruf, Opt-In, Consent-Banner'),
    ('data_subject_rights', 'data_protection', 'Betroffenenrechte (Auskunft, Löschung, Portabilität)', 'Data subject rights (access, erasure, portability)', 'NOT consent (Einwilligung), NOT personal_data (Verarbeitung)', 'Auskunftsrecht, Recht auf Löschung, Datenportabilität, Widerspruchsrecht'),
    ('data_retention', 'data_protection', 'Aufbewahrungsfristen und Löschkonzept', 'Data retention and deletion', 'NOT backup (technische Sicherung)', 'Löschfristen, Aufbewahrungspflichten, Löschkonzept, Archivierung'),
    ('data_transfer', 'data_protection', 'Internationale Datenübermittlung (Drittländer, SCC)', 'International data transfer', 'NOT data_processing (Verarbeitung)', 'Drittlandtransfer, Standardvertragsklauseln, Angemessenheitsbeschluss, BCR'),
    ('data_breach_notification', 'data_protection', 'Meldung von Datenschutzverletzungen (Art. 33/34 DSGVO)', 'Data breach notification', 'NOT incident (allg. Sicherheitsvorfälle), NOT alerting (techn. Alerts)', 'Breach-Meldung an Aufsichtsbehörde, Benachrichtigung Betroffener, 72-Stunden-Frist'),
    ('dpia', 'data_protection', 'Datenschutz-Folgenabschätzung (Art. 35 DSGVO)', 'Data protection impact assessment', NULL, 'DSFA, Schwellwertanalyse, Risikobewertung für Betroffene'),
    ('data_processing_agreement', 'data_protection', 'Auftragsverarbeitung (Art. 28 DSGVO)', 'Data processing agreements', NULL, 'AVV, Auftragsverarbeiter, Sub-Auftragsverarbeiter, TOMs'),
    ('privacy_by_design', 'data_protection', 'Datenschutz durch Technikgestaltung (Art. 25 DSGVO)', 'Privacy by design and default', NULL, 'Privacy by Default, Datenminimierung in der Architektur'),
    ('data_processing_register', 'data_protection', 'Verzeichnis von Verarbeitungstätigkeiten (Art. 30 DSGVO)', 'Records of processing activities', NULL, 'VVT, Verarbeitungsverzeichnis')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- GOVERNANCE & ORGANIZATION
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
    ('policy', 'governance', 'Richtlinien und Leitlinien ERSTELLEN/DEFINIEREN', 'Creating/defining policies', 'NOT procedure (Verfahrensablauf), NOT compliance_audit (Prüfung)', 'Sicherheitsrichtlinie erstellen, Policy-Framework definieren, Leitlinie verabschieden'),
    ('procedure', 'governance', 'Verfahren und Prozessabläufe DEFINIEREN/DOKUMENTIEREN', 'Defining/documenting procedures', 'NOT incident (Vorfallsbehandlung), NOT process (laufender Betrieb)', 'Verfahrensanweisung, Ablaufbeschreibung, Standardprozess definieren'),
    ('process', 'governance', 'Laufende betriebliche Prozesse AUSFÜHREN', 'Executing operational processes', 'NOT procedure (Definition), NOT monitoring (Überwachung)', 'Betriebsprozess, Geschäftsprozess, Workflow-Ausführung'),
    ('training', 'governance', 'Schulung und Weiterbildung DURCHFÜHREN', 'Training and education', 'NOT awareness (Sensibilisierung), NOT monitoring (Überwachung!)', 'Mitarbeiterschulung, Zertifizierungskurs, Pflichtunterweisung'),
    ('awareness', 'governance', 'Sicherheitsbewusstsein und Sensibilisierung', 'Security awareness', 'NOT training (formale Schulung)', 'Phishing-Simulation, Awareness-Kampagne, Sicherheitskultur'),
    ('incident', 'governance', 'Sicherheitsvorfälle BEHANDELN (Incident Response)', 'Incident response and handling', 'NOT alerting (Benachrichtigung), NOT data_breach_notification (DSGVO-Meldung)', 'Incident Response Plan, Vorfallsanalyse, Containment, Recovery, Lessons Learned'),
    ('risk_management', 'governance', 'Risikomanagement und -bewertung', 'Risk management and assessment', 'NOT vulnerability (techn. Schwachstellen), NOT monitoring (Überwachung)', 'Risikobewertung, Risikobehandlung, Risikoakzeptanz, Risikomatrix'),
    ('third_party_management', 'governance', 'Lieferanten- und Drittanbieter-Management', 'Third-party and vendor management', 'NOT data_processing_agreement (AVV)', 'Lieferantenbewertung, Vendor Risk Assessment, Supply Chain Security'),
    ('change_management', 'governance', 'Änderungsmanagement', 'Change management', 'NOT patch_management (Updates)', 'Change Request, Change Advisory Board, Rollback-Verfahren'),
    ('documentation', 'governance', 'Allgemeine Dokumentationspflichten', 'General documentation requirements', 'NOT audit_logging (technische Logs), NOT data_processing_register (VVT)', 'Betriebshandbuch, Systemdokumentation, Verfahrensdokumentation'),
    ('records_management', 'governance', 'Akten- und Unterlagenverwaltung', 'Records management', 'NOT data_retention (Löschfristen)', 'Archivierung, Aktenführung, Aufbewahrungspflichten nach HGB/AO'),
    ('compliance_reporting', 'governance', 'Compliance-Berichterstattung', 'Compliance reporting', 'NOT alerting (techn. Alerts), NOT supervisory_authority (Behördenkommunikation)', 'Compliance-Bericht, Management-Reporting, KPI-Tracking'),
    ('asset_management', 'governance', 'IT-Asset-Verwaltung und Inventar', 'IT asset management', NULL, 'Asset-Inventar, CMDB, Hardware-Lifecycle, Software-Inventar'),
    ('physical_security', 'security', 'Physische Sicherheit und Zutrittskontrolle', 'Physical security and access', NULL, 'Zutrittskontrolle, Videoüberwachung (physisch), Serverraum-Sicherheit'),
    ('human_resources_security', 'governance', 'Personalsicherheit', 'HR security', 'NOT training (Schulung)', 'Background-Checks, Geheimhaltungsvereinbarungen, Onboarding/Offboarding')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- REGULATORY SPECIFIC
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
    ('supervisory_authority', 'regulatory', 'Kommunikation mit Aufsichtsbehörden', 'Supervisory authority communication', 'NOT compliance_reporting (interne Berichte)', 'Meldung an BaFin, Abstimmung mit DPA, behördliche Anfragen'),
    ('certification', 'regulatory', 'Zertifizierung und Konformitätsbewertung', 'Certification and conformity assessment', 'NOT compliance_audit (Prüfung), NOT personal_data (Datenschutz)', 'CE-Kennzeichnung, ISO-Zertifizierung, Konformitätserklärung'),
    ('product_safety', 'regulatory', 'Produktsicherheit und Marktüberwachung', 'Product safety and market surveillance', 'NOT certification (Zertifizierung)', 'Rückrufmanagement, Sicherheitsbewertung, RAPEX-Meldung'),
    ('ai_system', 'regulatory', 'KI-System-Regulierung (AI Act)', 'AI system regulation', NULL, 'KI-Risikobewertung, Hochrisiko-KI, Transparenzpflichten, FRIA'),
    ('financial_reporting', 'regulatory', 'Finanzberichterstattung und Rechnungslegung', 'Financial reporting and accounting', NULL, 'Jahresabschluss, HGB-Pflichten, IFRS, Buchführung'),
    ('aml', 'regulatory', 'Geldwäscheprävention und KYC', 'Anti-money laundering and KYC', NULL, 'KYC, Verdachtsmeldung, PEP-Prüfung, Transaktionsmonitoring'),
    ('whistleblowing', 'regulatory', 'Hinweisgeberschutz und Meldekanäle', 'Whistleblower protection', NULL, 'Hinweisgebersystem, Meldekanal, Hinweisgeberschutzgesetz'),
    ('consumer_protection', 'regulatory', 'Verbraucherschutz und AGB', 'Consumer protection', NULL, 'AGB-Prüfung, Widerrufsrecht, Informationspflichten, Preistransparenz'),
    ('ecommerce', 'regulatory', 'E-Commerce-Pflichten (Impressum, Fernabsatz)', 'E-commerce obligations', NULL, 'Impressumspflicht, Fernabsatzrecht, Online-Handel-Pflichten'),
    ('telecommunications', 'regulatory', 'Telekommunikationsregulierung', 'Telecommunications regulation', NULL, 'TKG-Pflichten, Vorratsdatenspeicherung, Notruf'),
    ('medical_device', 'regulatory', 'Medizinprodukte-Regulierung (MDR)', 'Medical device regulation', NULL, 'UDI, klinische Bewertung, Post-Market Surveillance'),
    ('payment_services', 'regulatory', 'Zahlungsdienste-Regulierung (PSD2)', 'Payment services regulation', NULL, 'Starke Kundenauthentifizierung, PSD2-Compliance, Open Banking'),
    ('critical_infrastructure', 'regulatory', 'KRITIS und NIS2-Pflichten', 'Critical infrastructure (NIS2)', NULL, 'KRITIS-Meldepflichten, NIS2-Maßnahmen, Mindeststandards'),
    ('supply_chain_due_diligence', 'regulatory', 'Lieferkettensorgfaltspflicht (LkSG)', 'Supply chain due diligence', 'NOT third_party_management (allg. Lieferanten)', 'Menschenrechts-Due-Diligence, Umwelt-Sorgfaltspflicht, LkSG-Bericht'),
    ('sustainability_reporting', 'regulatory', 'Nachhaltigkeitsberichterstattung (CSRD)', 'Sustainability reporting', NULL, 'ESG-Reporting, CSRD, Nachhaltigkeitsbericht'),
    ('cookie_consent', 'regulatory', 'Cookie-Consent und Tracking (TDDDG/ePrivacy)', 'Cookie consent and tracking', 'NOT consent (allg. Einwilligung)', 'Cookie-Banner, Tracking-Einwilligung, TDDDG §25'),
    ('video_surveillance', 'regulatory', 'Videoüberwachung (datenschutzrechtlich)', 'Video surveillance (data protection)', 'NOT physical_security (physische Sicherheit), NOT monitoring (IT-Monitoring)', 'Kamera-Überwachung, Speicherfristen, Kennzeichnungspflicht')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- APPLICATION SECURITY
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
    ('secure_development', 'technical', 'Sichere Softwareentwicklung (SDLC)', 'Secure software development lifecycle', NULL, 'Secure Coding, Code Review, SAST/DAST, DevSecOps'),
    ('api_security', 'technical', 'API-Sicherheit', 'API security', NULL, 'API-Authentifizierung, Rate Limiting, Input Validation'),
    ('input_validation', 'technical', 'Eingabevalidierung und Output Encoding', 'Input validation and output encoding', NULL, 'XSS-Prävention, SQL-Injection-Schutz, Parametervalidierung'),
    ('container_security', 'technical', 'Container- und Cloud-Sicherheit', 'Container and cloud security', NULL, 'Docker-Hardening, Kubernetes-Security, Image-Scanning'),
    ('logging_configuration', 'technical', 'Log-Konfiguration und -Format', 'Log configuration and format', 'NOT audit_logging (Nachvollziehbarkeit), NOT monitoring (Überwachung)', 'Log-Format, Log-Rotation, Log-Shipping, Structured Logging'),
    ('data_classification', 'governance', 'Datenklassifizierung und -kennzeichnung', 'Data classification and labeling', 'NOT sensitive_data (besondere Kategorien)', 'Vertraulichkeitsstufen, Datenklassifizierung, Labeling')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Count results
DO $$
DECLARE cnt INTEGER;
BEGIN
    SELECT count(*) INTO cnt FROM object_ontology;
    RAISE NOTICE 'object_ontology: % canonical tokens defined', cnt;
END $$;
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Extract CE-relevant obligations from TRBS/TRGS/ASR/OSHA chunks in Qdrant.

Searches for MUSS/SOLL patterns in chunk texts and classifies them.
Output: JSON file with structured obligations for the CE session.

Usage:
    python3 /app/scripts/extract_ce_obligations.py
    python3 /app/scripts/extract_ce_obligations.py --output /tmp/ce_obligations.json
"""

import argparse
import json
import logging
import os
import re
from pathlib import Path

import httpx

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("ce-obligations")

QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
COLLECTION = "bp_compliance_ce"
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
LLM_MODEL = "qwen3.5:35b-a3b"

# Obligation patterns (DE + EN)
OBLIGATION_PATTERNS = re.compile(
    r"(muss|müssen|hat\s+[\w\s]*zu\s|ist\s+[\w\s]*sicherzustellen|"
    r"ist\s+verpflichtet|sind\s+verpflichtet|darf\s+nicht|"
    r"shall|must|required\s+to|is\s+required|shall\s+not)",
    re.IGNORECASE,
)

# CE relevance keywords
CE_KEYWORDS = re.compile(
    r"(maschine|schutzeinrichtung|gefährdung|quetsch|scher|stoß|"
    r"schneid|fang|einzug|absturz|druck|explosion|brand|"
    r"elektrisch|spannung|erdung|schutzleiter|not-halt|"
    r"betriebsanleitung|kennzeichnung|prüfung|prüfpflicht|"
    r"instandhaltung|wartung|sicherheitsabstand|"
    r"schutzmaßnahme|persönliche schutzausrüstung|psa|"
    r"machine|guard|hazard|crush|shear|cut|entangle|"
    r"lockout|tagout|electrical|grounding|emergency stop|"
    r"safety distance|protective device|ppe|inspection)",
    re.IGNORECASE,
)

HAZARD_CATEGORIES = {
    "quetsch|crush|squeeze": "mechanical_crushing",
    "schneid|cut": "mechanical_cutting",
    "fang|einzug|entangle|draw": "mechanical_entanglement",
    "absturz|fall": "fall_hazard",
    "explosion|ex-bereich|atex": "explosion_hazard",
    "brand|fire|feuer": "fire_hazard",
    "elektrisch|electrical|spannung|voltage": "electrical_hazard",
    "lärm|noise|schall": "noise_hazard",
    "gefahrstoff|hazardous substance|chemical": "chemical_hazard",
    "ergonomie|ergonomic|heben|lift": "ergonomic_hazard",
    "temperatur|heat|hitze|kälte|cold": "thermal_hazard",
    "strahlung|radiation|laser": "radiation_hazard",
    "not-halt|emergency stop|e-stop": "emergency_stop",
    "lockout|tagout|loto": "lockout_tagout",
    "kennzeichnung|label|marking|sign": "safety_marking",
    "prüfung|inspection|test": "inspection_requirement",
    "instandhaltung|maintenance|wartung": "maintenance",
    "schutzeinrichtung|guard|protective device": "protective_device",
    "betriebsanleitung|instruction|manual": "operating_instructions",
    "druck|pressure|behälter|vessel|kessel|boiler": "pressure_hazard",
}

# Source-based overrides: TRGS docs about chemicals/storage
|
||||
# should never be classified as mechanical hazards
|
||||
_CHEMICAL_SOURCES = re.compile(
|
||||
r"trgs\s*(5[0-9]{2}|7[0-9]{2}|9[0-9]{2}|4[0-9]{2}|6[0-9]{2})",
|
||||
re.IGNORECASE,
|
||||
)


def _classify_hazard(text: str, source: str) -> str:
    """Classify hazard with source-aware overrides."""
    # TRGS sources → chemical/pressure/explosion, never mechanical
    if _CHEMICAL_SOURCES.search(source):
        if re.search(r"explosion|ex-bereich|atex|zündfähig", text, re.IGNORECASE):
            return "explosion_hazard"
        if re.search(r"druck|pressure|behälter|vessel", text, re.IGNORECASE):
            return "pressure_hazard"
        if re.search(r"brand|fire|feuer", text, re.IGNORECASE):
            return "fire_hazard"
        return "chemical_hazard"

    # Standard pattern matching (order matters: specific first)
    for pattern, category in HAZARD_CATEGORIES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return category
    return "general"


def scroll_chunks(source_filter: str | None = None) -> list[dict]:
    """Scroll through Qdrant to get all relevant chunks."""
    chunks = []
    offset = None
    batch = 100

    while True:
        scroll_body = {
            "limit": batch,
            "with_payload": True,
            "with_vector": False,
        }
        if offset is not None:
            scroll_body["offset"] = offset

        resp = httpx.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/scroll",
            json=scroll_body,
            timeout=30.0,
        )
        # Fail loudly on HTTP errors instead of parsing an error body
        resp.raise_for_status()
        data = resp.json()
        points = data.get("result", {}).get("points", [])
        next_offset = data.get("result", {}).get("next_page_offset")

        for pt in points:
            payload = pt.get("payload", {})
            source = payload.get("source", payload.get("filename", ""))
            text = payload.get("chunk_text", "")

            # Filter for TRBS/TRGS/ASR/OSHA
            source_lower = source.lower()
            is_relevant = any(k in source_lower for k in
                              ["trbs", "trgs", "asr", "osha"])
            if not is_relevant:
                continue
            # Optional caller-supplied filter (case-insensitive substring)
            if source_filter and source_filter.lower() not in source_lower:
                continue

            # Check for obligation patterns
            if not OBLIGATION_PATTERNS.search(text):
                continue

            # Check CE relevance
            if not CE_KEYWORDS.search(text):
                continue

            # Classify hazard category (source-aware)
            hazard = _classify_hazard(text, source)

            # Determine obligation type
            if re.search(r"muss|müssen|shall|must|required", text, re.IGNORECASE):
                obl_type = "MUSS"
            elif re.search(r"soll|sollte|should", text, re.IGNORECASE):
                obl_type = "SOLL"
            else:
                obl_type = "MUSS"

            chunks.append({
                "source": source,
                "section": payload.get("section", ""),
                "paragraph": payload.get("paragraph", ""),
                "obligation_text": text.strip()[:500],
                "hazard_category": hazard,
                "obligation_type": obl_type,
                "ce_relevance": "high" if hazard != "general" else "medium",
                "filename": payload.get("filename", ""),
            })

        if next_offset is None or not points:
            break
        offset = next_offset

        # Skip the progress line while nothing has been found yet
        if chunks and len(chunks) % 500 == 0:
            logger.info(" Scanned... %d obligations found so far", len(chunks))

    return chunks


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="/tmp/ce_obligations.json")
    args = parser.parse_args()

    logger.info("Scanning %s for CE obligations...", COLLECTION)
    obligations = scroll_chunks()

    logger.info("Found %d CE-relevant obligations", len(obligations))

    # Stats
    by_source = {}
    by_hazard = {}
    for o in obligations:
        src = o["source"][:30]
        by_source[src] = by_source.get(src, 0) + 1
        by_hazard[o["hazard_category"]] = by_hazard.get(o["hazard_category"], 0) + 1

    logger.info("\nBy source:")
    for src, cnt in sorted(by_source.items(), key=lambda x: -x[1])[:20]:
        logger.info(" %4d %s", cnt, src)

    logger.info("\nBy hazard category:")
    for cat, cnt in sorted(by_hazard.items(), key=lambda x: -x[1]):
        logger.info(" %4d %s", cnt, cat)

    # Save (explicit UTF-8, since ensure_ascii=False keeps umlauts as-is)
    Path(args.output).write_text(
        json.dumps(obligations, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    logger.info("\nSaved to %s", args.output)


if __name__ == "__main__":
    main()
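Review note: the patterns in `HAZARD_CATEGORIES` are matched as plain substrings, so a short keyword such as `cut` also matches inside `execute`. Below is a minimal sketch of one possible mitigation, assuming that anchoring each keyword at a word start is acceptable (German stems such as `quetsch` still match compounds like `Quetschgefahr`). The names `SAMPLE_CATEGORIES`, `COMPILED`, and `classify_with_boundaries` are hypothetical and not part of the script.

```python
import re

# Hypothetical, shortened copy of the category table. Insertion order
# decides which category wins when several patterns match.
SAMPLE_CATEGORIES = {
    "quetsch|crush|squeeze": "mechanical_crushing",
    "schneid|cut": "mechanical_cutting",
    "brand|fire|feuer": "fire_hazard",
}

# Compile once; \b anchors each alternative at the start of a word.
COMPILED = [
    (re.compile(rf"\b(?:{pattern})", re.IGNORECASE), category)
    for pattern, category in SAMPLE_CATEGORIES.items()
]


def classify_with_boundaries(text: str) -> str:
    """Return the first category whose keyword starts a word in text."""
    for regex, category in COMPILED:
        if regex.search(text):
            return category
    return "general"
```

With this variant, a sentence containing `execute` no longer lands in `mechanical_cutting`, while `Cut hazards` and `Quetschgefahr` are still classified. Whether word-start anchoring suits every keyword in the full table would need checking against real chunks.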
@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Add L2 sub-topics to broad tokens. Instead of just "incident",
produces "incident_response_plan", "incident_detection", etc.

Only processes tokens with >500 controls AND <90% audit accuracy.

Usage:
    python3 /app/scripts/gpre0_add_subtopics.py --dry-run
    python3 /app/scripts/gpre0_add_subtopics.py
"""

import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-subtopics")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_subtopic_checkpoints")

# Tokens that are too broad and need L2 sub-topics
BROAD_TOKENS = {
    # Round 1 (already done)
    "risk_management", "policy", "audit_logging", "incident",
    "access_control", "compliance_audit", "asset_management",
    "key_management", "third_party_management", "monitoring",
    "financial_reporting", "data_classification", "change_management",
    "alerting", "multi_factor_auth", "api_security",
    "certificate_management", "human_resources_security",
    "training", "data_processing_agreement", "data_processing_register",
    "consumer_protection", "input_validation", "vulnerability",
    "dpia", "data_breach_notification", "backup",
    "supply_chain_due_diligence", "awareness",
    "privacy_by_design", "credentials", "logging_configuration",
    # Round 2 (remaining large tokens)
    "supervisory_authority", "certification", "secure_development",
    "product_safety", "personal_data", "data_subject_rights", "consent",
    "ai_system", "encryption", "data_retention", "disaster_recovery",
    "data_transfer", "aml", "transport_encryption", "network_security",
    "physical_security", "medical_device", "patch_management",
    "cookie_consent", "video_surveillance", "network_segmentation",
    "telecommunications", "privileged_access", "session_management",
    "password_policy", "governance", "whistleblowing", "payment_services",
    "health_data", "sensitive_data", "ecommerce", "sustainability_reporting",
    "critical_infrastructure", "regulatory",
}

# German system prompt (control titles are German). It asks the model to
# assign one specific L2 sub-topic per control (1-3 English words,
# snake_case, not repeating L1) and to answer only with a JSON array.
SYSTEM_PROMPT = """Du bist ein Compliance-Spezialist. Jeder Control hat bereits ein Hauptthema (L1 Token).
Deine Aufgabe: Bestimme ein SPEZIFISCHES Sub-Thema (L2) innerhalb des Hauptthemas.

Das L2 Sub-Thema soll den KONKRETEN Aspekt beschreiben. Verwende kurze, klare englische Bezeichnungen.

Beispiele:
- L1=incident, Titel="Incident Response Plan erstellen" → L2="response_plan"
- L1=incident, Titel="Sicherheitsvorfälle erkennen" → L2="detection"
- L1=incident, Titel="Recovery nach Vorfall dokumentieren" → L2="recovery"
- L1=incident, Titel="Forensische Analyse durchführen" → L2="forensics"
- L1=risk_management, Titel="Risikobewertung durchführen" → L2="assessment"
- L1=risk_management, Titel="Risikominderungsmaßnahmen umsetzen" → L2="treatment"
- L1=risk_management, Titel="Restrisiko akzeptieren" → L2="acceptance"
- L1=access_control, Titel="Rollenbasierte Zugriffskontrolle" → L2="rbac"
- L1=access_control, Titel="Zugriffsrechte regelmäßig prüfen" → L2="access_review"
- L1=access_control, Titel="Identitätsmanagement implementieren" → L2="identity_management"
- L1=monitoring, Titel="Systemverfügbarkeit überwachen" → L2="availability"
- L1=monitoring, Titel="Sicherheitsereignisse überwachen" → L2="security_events"
- L1=policy, Titel="Datenschutzrichtlinie erstellen" → L2="data_protection"
- L1=policy, Titel="Acceptable Use Policy definieren" → L2="acceptable_use"
- L1=policy, Titel="Passwortrichtlinie festlegen" → L2="password"
- L1=financial_reporting, Titel="Jahresabschluss erstellen" → L2="annual_accounts"
- L1=financial_reporting, Titel="Steuererklärung einreichen" → L2="tax"
- L1=alerting, Titel="Datenpanne an Behörde melden" → L2="breach_notification"
- L1=alerting, Titel="Sicherheitswarnung eskalieren" → L2="escalation"

REGELN:
- L2 soll 1-3 Wörter sein, snake_case
- L2 soll SPEZIFISCH sein (nicht das L1 wiederholen)
- Verwende konsistente L2-Bezeichnungen für ähnliche Controls

Antworte NUR als JSON-Array: [{"id":"...","l2":"subtopic"}, ...]"""


def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
    """Send batch to Claude for L2 sub-topic assignment."""
    items = []
    for c in controls_batch:
        items.append(
            f'- id="{c["control_id"]}" '
            f'L1="{c["current_object"]}" '
            f't="{c["title"]}" '
            f'o="{c["objective"][:80]}"'
        )

    prompt = "Bestimme L2 Sub-Topics:\n" + "\n".join(items)

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 1500,
        "temperature": 0.0,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }

    try:
        resp = httpx.post(
            ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
        )
        resp.raise_for_status()
        data = resp.json()
        content = data.get("content", [{}])[0].get("text", "")
        usage = data.get("usage", {})
        start = content.find("[")
        end = content.rfind("]") + 1
        if start >= 0 and end > start:
            return json.loads(content[start:end]), usage
        return [], usage
    except httpx.TimeoutException:
        logger.error("TIMEOUT: skipping")
        return [], {}
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            logger.warning("Rate limited: waiting 60s")
            time.sleep(60)
        else:
            logger.error("API error %d", e.response.status_code)
        return [], {}
    except Exception as e:
        logger.error("Failed: %s", e)
        return [], {}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Build LIKE patterns for broad tokens
    like_clauses = " OR ".join(
        f"cc.generation_metadata->>'merge_group_hint' LIKE '%:{tok}:%'"
        for tok in BROAD_TOKENS
    )

    with engine.connect() as c:
        rows = c.execute(text(f"""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
              AND cc.release_state NOT IN ('deprecated', 'rejected')
              AND ({like_clauses})
        """)).fetchall()

    controls = []
    for uuid, cid, title, objective, hint in rows:
        parts = hint.split(":", 2) if hint else []
        obj = parts[1] if len(parts) > 1 else ""
        if obj in BROAD_TOKENS:  # exact match; LIKE treats "_" as a wildcard
            controls.append({
                "uuid": str(uuid), "control_id": cid,
                "title": title or "", "objective": objective or "",
                "current_hint": hint, "current_object": obj,
            })

    logger.info("Found %d controls in broad tokens to add L2 sub-topics", len(controls))

    # Process
    total_tagged = 0
    total_skipped = 0
    total_input_tokens = 0
    total_output_tokens = 0
    corrections = []
    l2_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    for i in range(0, len(controls), args.batch_size):
        batch = controls[i:i + args.batch_size]
        results, usage = call_claude(batch)

        total_input_tokens += usage.get("input_tokens", 0)
        total_output_tokens += usage.get("output_tokens", 0)

        if not results:
            total_skipped += len(batch)
            continue

        result_map = {r.get("id", ""): r for r in results if isinstance(r, dict)}
        for ctrl in batch:
            r = result_map.get(ctrl["control_id"], {})
            l2 = str(r.get("l2") or "").strip()
            if not l2 or ":" in l2:  # a colon would break action:object:phase
                total_skipped += 1
                continue

            total_tagged += 1
            old_hint = ctrl["current_hint"]
            parts = old_hint.split(":", 2)
            action = parts[0] if parts else "implement"
            l1 = parts[1] if len(parts) > 1 else "unknown"
            phase = parts[2] if len(parts) > 2 else "implementation"
            # New format: action:L1_L2:phase
            new_obj = f"{l1}_{l2}"
            new_hint = f"{action}:{new_obj}:{phase}"
            corrections.append({
                "uuid": ctrl["uuid"],
                "old_hint": old_hint,
                "new_hint": new_hint,
            })
            l2_stats[l1][l2] += 1

        processed = min(i + args.batch_size, len(controls))
        if processed % 5000 < args.batch_size or processed >= len(controls):
            logger.info(
                "Progress: %d/%d (tagged=%d skip=%d)",
                processed, len(controls), total_tagged, total_skipped,
            )

        time.sleep(0.3)

    # Report
    cost_in = total_input_tokens / 1_000_000 * 0.80  # USD per 1M input tokens; hard-coded, check current pricing
    cost_out = total_output_tokens / 1_000_000 * 4.00  # USD per 1M output tokens; hard-coded, check current pricing
    logger.info("\n" + "=" * 60)
    logger.info("SUBTOPIC REPORT")
    logger.info("=" * 60)
    logger.info("Total: %d | Tagged: %d | Skipped: %d", len(controls), total_tagged, total_skipped)
    logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)

    # Show L2 distribution per L1
    for l1, subs in sorted(l2_stats.items()):
        top_subs = sorted(subs.items(), key=lambda x: -x[1])[:10]
        logger.info("\n%s (%d unique L2):", l1, len(subs))
        for l2, cnt in top_subs:
            logger.info(" %4d %s_%s", cnt, l1, l2)

    # Save corrections
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    corr_file = CHECKPOINT_DIR / "corrections_subtopics.json"
    corr_file.write_text(json.dumps(corrections))
    logger.info("\nSaved %d corrections to %s", len(corrections), corr_file)

    if args.dry_run:
        logger.info("DRY RUN: not updating DB")
        return

    if corrections:
        logger.info("Applying %d corrections...", len(corrections))
        with engine.begin() as c:
            c.execute(text("SET search_path TO compliance, public"))
            for corr in corrections:
                c.execute(text("""
                    UPDATE canonical_controls
                    SET generation_metadata = jsonb_set(
                        generation_metadata,
                        '{merge_group_hint}',
                        to_jsonb(CAST(:new_hint AS text))
                    )
                    WHERE id = CAST(:uuid AS uuid)
                """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
        logger.info("Done. %d hints updated.", len(corrections))


if __name__ == "__main__":
    main()
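Review note: the hint rewrite in the loop above (`action:L1:phase` becomes `action:L1_L2:phase`) can be isolated into a small pure function, which makes it easier to test and to reject malformed model output. The sketch below is an illustration under stated assumptions: `add_subtopic` and `L2_PATTERN` are hypothetical names, the validation rule is derived from the prompt's "1-3 words, snake_case" requirement, and unlike the script it returns `None` for an incomplete hint instead of filling in defaults.

```python
import re
from typing import Optional

# Assumed rule: one to three lowercase words joined by underscores.
L2_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+){0,2}")


def add_subtopic(hint: str, l2: str) -> Optional[str]:
    """Return 'action:L1_L2:phase', or None when hint or l2 is malformed."""
    parts = hint.split(":", 2)
    if len(parts) != 3 or not L2_PATTERN.fullmatch(l2):
        return None
    action, l1, phase = parts
    return f"{action}:{l1}_{l2}:{phase}"
```

For example, `add_subtopic("verify:incident:testing", "response_plan")` yields `verify:incident_response_plan:testing`, while an L2 value containing spaces, capitals, or a colon is rejected before it can reach the database.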
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Apply saved corrections from JSON file to DB (crash recovery)."""

import argparse
import json
import logging
import os
from pathlib import Path

from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("apply-corrections")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="Path to corrections JSON file")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    corrections = json.loads(Path(args.file).read_text())
    logger.info("Loaded %d corrections from %s", len(corrections), args.file)

    if args.dry_run:
        for c in corrections[:10]:
            logger.info(" %s: %s → %s", c["uuid"][:8], c["old_hint"], c["new_hint"])
        logger.info("DRY RUN: not applying")
        return

    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
    applied = 0
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        for corr in corrections:
            result = c.execute(text("""
                UPDATE canonical_controls
                SET generation_metadata = jsonb_set(
                    generation_metadata,
                    '{merge_group_hint}',
                    to_jsonb(CAST(:new_hint AS text))
                )
                WHERE id = CAST(:uuid AS uuid)
            """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
            applied += result.rowcount  # count rows actually updated
    logger.info("Applied %d corrections.", applied)


if __name__ == "__main__":
    main()
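Review note: the recovery script trusts the JSON file as-is, so a single malformed entry (for example an invalid UUID) would abort the whole transaction. A minimal sketch of a pre-check that could run before the `UPDATE` loop is shown below. The helper names `is_valid_correction` and `split_corrections` are hypothetical; the required keys are the ones the scripts in this change write (`uuid`, `old_hint`, `new_hint`).

```python
import uuid

REQUIRED_KEYS = {"uuid", "old_hint", "new_hint"}


def is_valid_correction(entry) -> bool:
    """Check shape, UUID syntax, and the three-part hint format."""
    if not isinstance(entry, dict) or not REQUIRED_KEYS.issubset(entry):
        return False
    try:
        uuid.UUID(str(entry["uuid"]))
    except ValueError:
        return False
    # action:object:phase needs at least two colons
    return str(entry["new_hint"]).count(":") >= 2


def split_corrections(corrections):
    """Partition entries into (valid, invalid) lists, preserving order."""
    valid = [e for e in corrections if is_valid_correction(e)]
    invalid = [e for e in corrections if not is_valid_correction(e)]
    return valid, invalid
```

Logging the `invalid` list and applying only `valid` would keep one bad record from rolling back an otherwise good batch; whether that is preferable to failing fast is a design decision for the authors.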
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Fix bad L2 subtopics: stakeholder_*, escalation fragments, *_approval*, *_documentation."""

import json
import logging
import os
import time
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("fix-subtopics")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"

# German system prompt. It forbids generic L2 words (stakeholder,
# documentation, approval, communication) and asks for a specific
# L1_L2 token per control, returned only as a JSON array.
SYSTEM_PROMPT = """Du klassifizierst Controls mit einem L1_L2 Token. Das L2 soll den KONKRETEN fachlichen Aspekt beschreiben.

VERBOTENE L2-Wörter (zu generisch):
- stakeholder (zu vage: WER sind die Stakeholder? WAS wird getan?)
- documentation (ist eine Handlung, kein Thema)
- approval (ist eine Handlung)
- communication (zu vage)

Stattdessen SPEZIFISCH:
- "stakeholder_notification" bei Behördenmeldung → "authority_reporting"
- "stakeholder_consultation" bei DSFA → "impact_consultation"
- "stakeholder_engagement" bei Training → "participant_selection"
- "escalation_procedure" → "severity_classification" oder "response_plan"
- "access_documentation" → "access_policy" oder "permission_matrix"
- "approval_process" → "authorization_workflow" oder "sign_off"

L2 = 1-3 Wörter, snake_case, FACHLICH SPEZIFISCH.

Antworte NUR als JSON-Array: [{"id":"...","token":"L1_L2"}, ...]"""


def main():
    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})

    with engine.connect() as c:
        rows = c.execute(text("""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.release_state NOT IN ('deprecated', 'rejected')
              AND cc.generation_metadata->>'merge_group_hint' IS NOT NULL
              AND (
                cc.generation_metadata->>'merge_group_hint' LIKE '%stakeholder%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%_escalation_%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%_approval_%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%response_time%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%machine_re%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%management_app%'
              )
        """)).fetchall()

    controls = []
    for uuid, cid, title, objective, hint in rows:
        parts = hint.split(":", 2) if hint else []
        controls.append({
            "uuid": str(uuid), "control_id": cid,
            "title": title or "", "objective": objective or "",
            "current_hint": hint,
            "current_object": parts[1] if len(parts) > 1 else "",
        })

    logger.info("Found %d controls with bad subtopics to fix", len(controls))

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }

    corrections = []
    total_fixed = 0
    batch_size = 20

    for i in range(0, len(controls), batch_size):
        batch = controls[i:i + batch_size]
        items = [
            f'- id="{c["control_id"]}" cur="{c["current_object"]}" t="{c["title"]}" o="{c["objective"][:80]}"'
            for c in batch
        ]

        try:
            resp = httpx.post(ANTHROPIC_URL, headers=headers, json={
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 1500, "temperature": 0.0,
                "system": SYSTEM_PROMPT,
                "messages": [{"role": "user", "content": "Fix:\n" + "\n".join(items)}],
            }, timeout=45.0)
            resp.raise_for_status()
            content = resp.json().get("content", [{}])[0].get("text", "")
            start = content.find("[")
            end = content.rfind("]") + 1
            results = json.loads(content[start:end]) if start >= 0 and end > start else []
        except Exception as e:
            logger.error("Batch %d failed: %s", i, e)
            continue

        result_map = {r.get("id", ""): r for r in results if isinstance(r, dict)}
        for ctrl in batch:
            r = result_map.get(ctrl["control_id"], {})
            new_token = r.get("token", "")
            if not new_token or new_token == ctrl["current_object"]:
                continue
            if "stakeholder" in new_token or "approval" in new_token:
                continue  # Still bad

            parts = ctrl["current_hint"].split(":", 2)
            action = parts[0] if parts else "implement"
            phase = parts[2] if len(parts) > 2 else "implementation"
            corrections.append({
                "uuid": ctrl["uuid"],
                "old_hint": ctrl["current_hint"],
                "new_hint": f"{action}:{new_token}:{phase}",
            })
            total_fixed += 1

        if (i + batch_size) % 200 < batch_size:
            logger.info("Progress: %d/%d (fixed=%d)", min(i + batch_size, len(controls)), len(controls), total_fixed)
        time.sleep(0.3)

    logger.info("Fixed: %d of %d controls", total_fixed, len(controls))

    # Save + apply
    Path("/tmp/corrections_bad_subtopics.json").write_text(json.dumps(corrections))

    if corrections:
        logger.info("Applying %d corrections...", len(corrections))
        with engine.begin() as c:
            c.execute(text("SET search_path TO compliance, public"))
            for corr in corrections:
                c.execute(text("""
                    UPDATE canonical_controls
                    SET generation_metadata = jsonb_set(
                        generation_metadata,
                        '{merge_group_hint}',
                        to_jsonb(CAST(:new_hint AS text))
                    )
                    WHERE id = CAST(:uuid AS uuid)
                """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
        logger.info("Done.")


if __name__ == "__main__":
    main()
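Review note: three of the scripts in this change pull a JSON array out of the model's reply by slicing from the first `[` to the last `]`. A shared helper would keep the edge cases (no bracket at all, an opening bracket without a closing one, invalid JSON, a non-list result) in one place. The sketch below is one possible shape; `extract_json_array` is a hypothetical name, and returning an empty list on every failure mirrors how the scripts already treat an unusable reply.

```python
import json


def extract_json_array(content: str) -> list:
    """Return the JSON array embedded in content, or [] if none parses."""
    start = content.find("[")
    end = content.rfind("]") + 1  # 0 when no closing bracket exists
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(content[start:end])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []
```

A reply that wraps the array in prose still parses, while a truncated reply yields an empty list instead of raising inside the batch loop.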
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Fix generic tokens: Re-classify controls that were assigned to
action-based tokens (documentation, procedure, process, etc.)
instead of topic-based tokens.

Runs sequentially in 5 batches. NO retry on timeout.

Usage:
    python3 /app/scripts/gpre0_fix_generic_tokens.py --dry-run
    python3 /app/scripts/gpre0_fix_generic_tokens.py
"""

import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-fix-generic")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_fix_checkpoints")

# Tokens that are ACTION-based, not TOPIC-based → must be re-classified
FORBIDDEN_TOKENS = {
    "documentation", "procedure", "process",
    "compliance_reporting", "records_management",
}

# German system prompt. It tells the model to classify each control by
# its subject matter rather than by the action, bans the five tokens in
# FORBIDDEN_TOKENS, lists the allowed tokens, and asks for a JSON array.
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control dem THEMA zu, nicht der Handlung.

KRITISCH: Die Tokens "documentation", "procedure", "process", "compliance_reporting",
"records_management" sind VERBOTEN. Klassifiziere nach dem INHALTLICHEN THEMA.

Beispiele:
- "Risikobewertung dokumentieren" → risk_management (NICHT documentation)
- "Incident-Verfahren definieren" → incident (NICHT procedure)
- "Verschlüsselungsprozess implementieren" → encryption (NICHT process)
- "Audit-Ergebnisse berichten" → compliance_audit (NICHT compliance_reporting)
- "Datenschutz-Unterlagen verwalten" → personal_data (NICHT records_management)
- "Löschkonzept dokumentieren" → data_retention (NICHT documentation)
- "Zertifizierungsverfahren definieren" → certification (NICHT procedure)
- "Schulungsprozess durchführen" → training (NICHT process)

ERLAUBTE TOKENS:

SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration

DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance

GOVERNANCE: policy, training, awareness, incident, risk_management,
third_party_management, change_management, asset_management,
human_resources_security

REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting

Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]"""


def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
    """Send batch to Claude. NO retry on timeout."""
    items = []
    for c in controls_batch:
        items.append(
            f'- id="{c["control_id"]}" '
            f'cur="{c["current_object"]}" '
            f't="{c["title"]}" '
            f'o="{c["objective"][:100]}"'
        )

    prompt = "Klassifiziere nach THEMA (nicht Handlung):\n" + "\n".join(items)

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 1500,
        "temperature": 0.0,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }

    try:
        resp = httpx.post(
            ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
        )
        resp.raise_for_status()
        data = resp.json()
        content = data.get("content", [{}])[0].get("text", "")
        usage = data.get("usage", {})
        start = content.find("[")
        end = content.rfind("]") + 1
        if start >= 0 and end > start:
            return json.loads(content[start:end]), usage
        return [], usage
    except httpx.TimeoutException:
        logger.error("TIMEOUT: skipping batch")
        return [], {}
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            logger.warning("Rate limited: waiting 60s")
            time.sleep(60)
        else:
            logger.error("API error %d", e.response.status_code)
        return [], {}
    except Exception as e:
        logger.error("Failed: %s", e)
        return [], {}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Load only controls whose hint carries a forbidden token. The LIKE
    # clauses in the query below are written out by hand, one per entry
    # in FORBIDDEN_TOKENS, so the two must be kept in sync when the set
    # changes.
    with engine.connect() as c:
        rows = c.execute(text("""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
              AND cc.release_state NOT IN ('deprecated', 'rejected')
              AND (
                cc.generation_metadata->>'merge_group_hint' LIKE '%:documentation:%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%:procedure:%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%:process:%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%:compliance_reporting:%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%:records_management:%'
              )
        """)).fetchall()

    controls = []
    for uuid, cid, title, objective, hint in rows:
        parts = hint.split(":", 2) if hint else []
        controls.append({
            "uuid": str(uuid), "control_id": cid,
            "title": title or "", "objective": objective or "",
            "current_hint": hint,
            "current_object": parts[1] if len(parts) > 1 else hint,
        })

    logger.info("Found %d controls with forbidden tokens to re-classify", len(controls))

    # Process
    total_fixed = 0
    total_kept = 0
    total_skipped = 0
    total_input_tokens = 0
    total_output_tokens = 0
    corrections = []
    change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    for i in range(0, len(controls), args.batch_size):
        batch = controls[i:i + args.batch_size]
        results, usage = call_claude(batch)

        total_input_tokens += usage.get("input_tokens", 0)
        total_output_tokens += usage.get("output_tokens", 0)

        if not results:
            total_skipped += len(batch)
            continue

        result_map = {r.get("id", ""): r for r in results if isinstance(r, dict)}
        for ctrl in batch:
            r = result_map.get(ctrl["control_id"], {})
            new_token = r.get("token", "")
            if not new_token or new_token in FORBIDDEN_TOKENS:
                total_kept += 1
                continue

            old_obj = ctrl["current_object"]
            if new_token != old_obj:
                total_fixed += 1
                parts = ctrl["current_hint"].split(":", 2)
                action = parts[0] if parts else "implement"
                phase = parts[2] if len(parts) > 2 else "implementation"
                corrections.append({
                    "uuid": ctrl["uuid"],
                    "old_hint": ctrl["current_hint"],
                    "new_hint": f"{action}:{new_token}:{phase}",
                })
                change_stats[old_obj][new_token] += 1
            else:
                total_kept += 1

        processed = min(i + args.batch_size, len(controls))
        if processed % 2000 < args.batch_size or processed >= len(controls):
            logger.info(
                "Progress: %d/%d (fixed=%d kept=%d skip=%d)",
                processed, len(controls), total_fixed, total_kept, total_skipped,
            )

        time.sleep(0.3)

    # Report
    cost_in = total_input_tokens / 1_000_000 * 0.80  # USD per 1M input tokens; hard-coded, check current pricing
    cost_out = total_output_tokens / 1_000_000 * 4.00  # USD per 1M output tokens; hard-coded, check current pricing
    logger.info("\n" + "=" * 60)
    logger.info("GENERIC TOKEN FIX REPORT")
    logger.info("=" * 60)
    logger.info("Total: %d controls", len(controls))
    logger.info("Fixed: %d", total_fixed)
    logger.info("Kept: %d (LLM also chose forbidden → kept as-is)", total_kept)
|
||||
logger.info("Skipped: %d", total_skipped)
|
||||
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
|
||||
|
||||
logger.info("\nTop changes:")
|
||||
flat = []
|
||||
for old, news in change_stats.items():
|
||||
for new, cnt in news.items():
|
||||
flat.append((cnt, old, new))
|
||||
for cnt, old, new in sorted(flat, reverse=True)[:30]:
|
||||
logger.info(" %4d × %s → %s", cnt, old, new)
|
||||
|
||||
# Save corrections
|
||||
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
|
||||
corr_file = CHECKPOINT_DIR / "corrections_generic_fix.json"
|
||||
corr_file.write_text(json.dumps(corrections))
|
||||
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN — not updating DB")
|
||||
return
|
||||
|
||||
if corrections:
|
||||
logger.info("Applying %d corrections...", len(corrections))
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
for corr in corrections:
|
||||
c.execute(text("""
|
||||
UPDATE canonical_controls
|
||||
SET generation_metadata = jsonb_set(
|
||||
generation_metadata,
|
||||
'{merge_group_hint}',
|
||||
to_jsonb(CAST(:new_hint AS text))
|
||||
)
|
||||
WHERE id = CAST(:uuid AS uuid)
|
||||
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
|
||||
logger.info("Done. %d hints corrected.", len(corrections))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
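The correction step in the script above keeps the action and phase of a `merge_group_hint` and swaps only the object token. A minimal sketch of that rewrite, assuming hints follow the `action:object:phase` layout seen in the script; the helper name `replace_object_token` is hypothetical and is not part of the repository:

```python
def replace_object_token(hint: str, new_token: str) -> str:
    """Return `hint` with its object part replaced by `new_token`.

    Hypothetical helper: mirrors the split/default logic of the script,
    where a missing phase falls back to "implementation".
    """
    # Split into at most three parts so extra colons stay inside the phase.
    parts = hint.split(":", 2)
    action = parts[0] if parts else "implement"
    phase = parts[2] if len(parts) > 2 else "implementation"
    return f"{action}:{new_token}:{phase}"


if __name__ == "__main__":
    print(replace_object_token("verify:documentation:testing", "multi_factor_auth"))
    print(replace_object_token("define:process", "change_management"))
```

Keeping the rewrite in one small function would make it possible to unit-test the hint format separately from the API call and the database update.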
|
||||
@@ -0,0 +1,37 @@
|
||||
#!/bin/bash
# Run all 10 batches sequentially. Safe: if one fails, the rest don't run.
# Each batch saves corrections to JSON before applying to DB.
#
# Usage: bash /app/scripts/gpre0_run_all.sh
#        bash /app/scripts/gpre0_run_all.sh 5   # start from batch 5

set -e

START=${1:-1}
TOTAL=10

echo "=== Starting from batch $START of $TOTAL ==="

for i in $(seq "$START" "$TOTAL"); do
    echo ""
    echo "================================================================"
    echo "  BATCH $i/$TOTAL — $(date)"
    echo "================================================================"

    # Capture the status with `||`: under `set -e` a bare failure would exit
    # the script before the resume hint below could be printed.
    EXIT_CODE=0
    PYTHONPATH=/app python3 /app/scripts/gpre0_validate_hints.py \
        --batch-id "$i" \
        --total-batches "$TOTAL" \
        --batch-size 20 || EXIT_CODE=$?

    if [ "$EXIT_CODE" -ne 0 ]; then
        echo "BATCH $i FAILED with exit code $EXIT_CODE"
        echo "Resume with: bash /app/scripts/gpre0_run_all.sh $i"
        exit "$EXIT_CODE"
    fi

    echo "BATCH $i DONE — $(date)"
done

echo ""
echo "ALL $TOTAL BATCHES COMPLETE!"
|
||||
@@ -0,0 +1,351 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Phase 2: Validate and correct merge_group_hints using Claude Haiku.
|
||||
|
||||
Re-classifies each control's object token against the expanded ontology
|
||||
(74 canonical tokens). Corrects wrong hints in the DB.
|
||||
|
||||
SAFETY: Split into --total-batches slices (default 10). NEVER retries on timeout (double-billing!).
|
||||
Writes checkpoint after each API call for safe resume.
|
||||
|
||||
Usage:
|
||||
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1 --dry-run
|
||||
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1
|
||||
python3 /app/scripts/gpre0_validate_hints.py --batch-id 2
|
||||
python3 /app/scripts/gpre0_validate_hints.py --batch-id 3
|
||||
python3 /app/scripts/gpre0_validate_hints.py --batch-id 4
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("gpre0-validate")
|
||||
|
||||
DB_URL = os.getenv(
|
||||
"DATABASE_URL",
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
)
|
||||
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
|
||||
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
|
||||
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||
CHECKPOINT_DIR = Path("/tmp/gpre0_checkpoints")
|
||||
|
||||
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control GENAU EINEM Token zu.
|
||||
|
||||
REGEL: Waehle IMMER den naechstbesten Token aus der Liste. OTHER nur wenn ABSOLUT
|
||||
kein Token auch nur entfernt passt (<1% der Faelle). Im Zweifel: den breitesten
|
||||
passenden Token waehlen (z.B. "policy" fuer Governance-Dokumente, "procedure" fuer
|
||||
Ablauf-Definitionen, "risk_management" fuer Bewertungen).
|
||||
|
||||
TOKENS:
|
||||
|
||||
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
|
||||
privileged_access, access_control, encryption, transport_encryption,
|
||||
key_management, certificate_management, network_security, network_segmentation,
|
||||
firewall, vpn, remote_access, monitoring (NUR Echtzeit-Systemueberwachung),
|
||||
audit_logging (Protokollierung/Audit Trail), siem, alerting (Meldepflichten),
|
||||
compliance_audit (externe Pruefungen), vulnerability, patch_management,
|
||||
backup, disaster_recovery, physical_security, secure_development,
|
||||
api_security, input_validation, container_security, logging_configuration
|
||||
|
||||
DATA_PROTECTION: personal_data (DSGVO-Verarbeitung), sensitive_data (Art.9),
|
||||
health_data, consent, data_subject_rights, data_retention, data_transfer,
|
||||
data_breach_notification, dpia, data_processing_agreement, privacy_by_design,
|
||||
data_processing_register, data_classification, cookie_consent, video_surveillance
|
||||
|
||||
GOVERNANCE: policy (Richtlinie definieren), procedure (Verfahren definieren),
|
||||
process (Betriebsprozess ausfuehren), training (Schulung), awareness,
|
||||
incident (Vorfallsbehandlung), risk_management, third_party_management,
|
||||
change_management, documentation, records_management, compliance_reporting,
|
||||
asset_management, human_resources_security
|
||||
|
||||
REGULATORY: supervisory_authority, certification (Zertifizierung/Konformitaet),
|
||||
product_safety, ai_system, financial_reporting, aml, whistleblowing,
|
||||
consumer_protection, ecommerce, telecommunications, medical_device,
|
||||
payment_services, critical_infrastructure, supply_chain_due_diligence,
|
||||
sustainability_reporting
|
||||
|
||||
ABGRENZUNGEN:
|
||||
- monitoring = NUR Echtzeit-Systemueberwachung, NICHT Audit/Schulung/Bewertung
|
||||
- audit_logging = Protokollierung, NICHT externe Pruefung (→ compliance_audit)
|
||||
- procedure = Verfahren DEFINIEREN, NICHT Vorfaelle behandeln (→ incident)
|
||||
- personal_data = DSGVO-Verarbeitung, NICHT Zertifizierung (→ certification)
|
||||
- alerting = Meldepflichten, NICHT Vorfallsbehandlung (→ incident)
|
||||
|
||||
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]
|
||||
KEIN weiterer Text. Nur das Array."""
|
||||
|
||||
|
||||
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
|
||||
"""Send batch to Claude. NO RETRY on timeout (double-billing risk!)."""
|
||||
items = []
|
||||
for c in controls_batch:
|
||||
items.append(
|
||||
f'- id="{c["control_id"]}" '
|
||||
f'cur="{c["current_object"]}" '
|
||||
f't="{c["title"]}" '
|
||||
f'o="{c["objective"][:100]}"'
|
||||
)
|
||||
|
||||
prompt = "Klassifiziere:\n" + "\n".join(items)
|
||||
|
||||
headers = {
|
||||
"x-api-key": ANTHROPIC_API_KEY,
|
||||
"anthropic-version": "2023-06-01",
|
||||
"content-type": "application/json",
|
||||
}
|
||||
payload = {
|
||||
"model": ANTHROPIC_MODEL,
|
||||
"max_tokens": 1500,
|
||||
"temperature": 0.0,
|
||||
"system": SYSTEM_PROMPT,
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
}
|
||||
|
||||
try:
|
||||
resp = httpx.post(
|
||||
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
content = data.get("content", [{}])[0].get("text", "")
|
||||
usage = data.get("usage", {})
|
||||
start = content.find("[")
|
||||
end = content.rfind("]") + 1
|
||||
if start >= 0 and end > start:
|
||||
return json.loads(content[start:end]), usage
|
||||
logger.warning("No JSON array in response")
|
||||
return [], usage
|
||||
except httpx.TimeoutException:
|
||||
# CRITICAL: Do NOT retry! Log and skip.
|
||||
logger.error("TIMEOUT — skipping batch (NOT retrying to avoid double-billing)")
|
||||
return [], {}
|
||||
except httpx.HTTPStatusError as e:
|
||||
if e.response.status_code == 429:
|
||||
logger.warning("Rate limited — waiting 60s then skipping")
|
||||
time.sleep(60)
|
||||
else:
|
||||
logger.error("API error %d — skipping batch", e.response.status_code)
|
||||
return [], {}
|
||||
except Exception as e:
|
||||
logger.error("Request failed — skipping: %s", e)
|
||||
return [], {}
|
||||
|
||||
|
||||
def load_checkpoint(batch_id: int) -> int:
|
||||
"""Load last processed index for this batch."""
|
||||
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
|
||||
if cp_file.exists():
|
||||
data = json.loads(cp_file.read_text())
|
||||
return data.get("last_index", 0)
|
||||
return 0
|
||||
|
||||
|
||||
def save_checkpoint(batch_id: int, last_index: int, stats: dict):
|
||||
"""Save progress checkpoint."""
|
||||
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
|
||||
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
|
||||
cp_file.write_text(json.dumps({
|
||||
"batch_id": batch_id,
|
||||
"last_index": last_index,
|
||||
**stats,
|
||||
}))
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--batch-id", type=int, required=True)
|
||||
parser.add_argument("--total-batches", type=int, default=10)
|
||||
parser.add_argument("--batch-size", type=int, default=20)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--resume", action="store_true",
|
||||
help="Resume from checkpoint")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
# Load ALL control IDs ordered deterministically, then select this batch's slice
|
||||
with engine.connect() as c:
|
||||
all_ids = c.execute(text("""
|
||||
SELECT cc.id
|
||||
FROM canonical_controls cc
|
||||
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND cc.generation_metadata->>'merge_group_hint' != ''
|
||||
AND cc.release_state NOT IN ('deprecated', 'rejected')
|
||||
ORDER BY cc.id
|
||||
""")).fetchall()
|
||||
|
||||
total = len(all_ids)
|
||||
chunk = total // args.total_batches
|
||||
start_idx = (args.batch_id - 1) * chunk
|
||||
end_idx = total if args.batch_id == args.total_batches else args.batch_id * chunk
|
||||
batch_ids = [str(r[0]) for r in all_ids[start_idx:end_idx]]
|
||||
|
||||
logger.info("Batch %d/%d: controls %d-%d (%d controls of %d total)",
|
||||
args.batch_id, args.total_batches, start_idx, end_idx, len(batch_ids), total)
|
||||
|
||||
# Load full data for this batch
|
||||
id_list = ",".join(f"'{uid}'" for uid in batch_ids)
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text(f"""
|
||||
SELECT cc.id, cc.control_id, cc.title,
|
||||
COALESCE(cc.objective, '') as objective,
|
||||
cc.generation_metadata->>'merge_group_hint' as hint
|
||||
FROM canonical_controls cc
|
||||
WHERE cc.id IN ({id_list})
|
||||
ORDER BY cc.id
|
||||
""")).fetchall()
|
||||
|
||||
controls = []
|
||||
for uuid, cid, title, objective, hint in rows:
|
||||
parts = hint.split(":", 2) if hint else []
|
||||
controls.append({
|
||||
"uuid": str(uuid), "control_id": cid,
|
||||
"title": title or "", "objective": objective or "",
|
||||
"current_hint": hint, "current_object": parts[1] if len(parts) > 1 else hint,
|
||||
})
|
||||
|
||||
# Resume from checkpoint?
|
||||
start_from = 0
|
||||
if args.resume:
|
||||
start_from = load_checkpoint(args.batch_id)
|
||||
if start_from > 0:
|
||||
logger.info("Resuming from index %d", start_from)
|
||||
|
||||
# Process
|
||||
total_same = 0
|
||||
total_changed = 0
|
||||
total_other = 0
|
||||
total_skipped = 0
|
||||
total_input_tokens = 0
|
||||
total_output_tokens = 0
|
||||
corrections: list[dict] = []
|
||||
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
||||
|
||||
for i in range(start_from, len(controls), args.batch_size):
|
||||
batch = controls[i:i + args.batch_size]
|
||||
results, usage = call_claude(batch)
|
||||
|
||||
total_input_tokens += usage.get("input_tokens", 0)
|
||||
total_output_tokens += usage.get("output_tokens", 0)
|
||||
|
||||
if not results:
|
||||
total_skipped += len(batch)
|
||||
save_checkpoint(args.batch_id, i + args.batch_size, {
|
||||
"same": total_same, "changed": total_changed,
|
||||
"other": total_other, "skipped": total_skipped,
|
||||
})
|
||||
continue
|
||||
|
||||
result_map = {r.get("id", ""): r for r in results}
|
||||
for ctrl in batch:
|
||||
r = result_map.get(ctrl["control_id"], {})
|
||||
new_token = r.get("token", "")
|
||||
if not new_token:
|
||||
total_skipped += 1
|
||||
continue
|
||||
|
||||
old_obj = ctrl["current_object"]
|
||||
if new_token == "OTHER":
|
||||
total_other += 1
|
||||
elif new_token == old_obj:
|
||||
total_same += 1
|
||||
else:
|
||||
total_changed += 1
|
||||
parts = ctrl["current_hint"].split(":", 2)
|
||||
action = parts[0] if parts else "implement"
|
||||
phase = parts[2] if len(parts) > 2 else "implementation"
|
||||
corrections.append({
|
||||
"uuid": ctrl["uuid"],
|
||||
"old_hint": ctrl["current_hint"],
|
||||
"new_hint": f"{action}:{new_token}:{phase}",
|
||||
})
|
||||
change_stats[old_obj][new_token] += 1
|
||||
|
||||
# Checkpoint every batch
|
||||
save_checkpoint(args.batch_id, i + args.batch_size, {
|
||||
"same": total_same, "changed": total_changed,
|
||||
"other": total_other, "skipped": total_skipped,
|
||||
})
|
||||
|
||||
processed = min(i + args.batch_size, len(controls))
|
||||
if processed % 1000 < args.batch_size or processed >= len(controls):
|
||||
logger.info(
|
||||
"Batch %d: %d/%d (same=%d changed=%d other=%d skip=%d)",
|
||||
args.batch_id, processed, len(controls),
|
||||
total_same, total_changed, total_other, total_skipped,
|
||||
)
|
||||
|
||||
time.sleep(0.3)
|
||||
|
||||
# Report
|
||||
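cost_in = total_input_tokens / 1_000_000 * 0.80  # USD per 1M input tokens (Claude 3.5 Haiku list price; confirm the rate for ANTHROPIC_MODEL)
cost_out = total_output_tokens / 1_000_000 * 4.00  # USD per 1M output tokens (same caveat)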
|
||||
total_cost = cost_in + cost_out
|
||||
total_proc = total_same + total_changed + total_other
|
||||
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("BATCH %d REPORT", args.batch_id)
|
||||
logger.info("=" * 60)
|
||||
logger.info("Processed: %d | Skipped: %d", total_proc, total_skipped)
|
||||
logger.info("Same: %d (%.1f%%)", total_same, total_same / max(total_proc, 1) * 100)
|
||||
logger.info("Changed: %d (%.1f%%)", total_changed, total_changed / max(total_proc, 1) * 100)
|
||||
logger.info("OTHER: %d (%.1f%%)", total_other, total_other / max(total_proc, 1) * 100)
|
||||
logger.info("Cost: $%.2f (Haiku)", total_cost)
|
||||
logger.info("Cost/ctrl: $%.5f", total_cost / max(total_proc, 1))
|
||||
|
||||
# Top changes
|
||||
flat = []
|
||||
for old, news in change_stats.items():
|
||||
for new, cnt in news.items():
|
||||
flat.append((cnt, old, new))
|
||||
logger.info("\nTop Changes:")
|
||||
for cnt, old, new in sorted(flat, reverse=True)[:20]:
|
||||
logger.info(" %4d × %s → %s", cnt, old, new)
|
||||
|
||||
# Always save corrections to file (recovery safety)
|
||||
corr_file = CHECKPOINT_DIR / f"corrections_batch_{args.batch_id}.json"
|
||||
if corrections:
|
||||
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
|
||||
corr_file.write_text(json.dumps(corrections))
|
||||
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("\nDRY RUN — not updating DB")
|
||||
return
|
||||
|
||||
# Apply corrections in single transaction
|
||||
if corrections:
|
||||
logger.info("\nApplying %d corrections...", len(corrections))
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
for corr in corrections:
|
||||
c.execute(text("""
|
||||
UPDATE canonical_controls
|
||||
SET generation_metadata = jsonb_set(
|
||||
generation_metadata,
|
||||
'{merge_group_hint}',
|
||||
to_jsonb(CAST(:new_hint AS text))
|
||||
)
|
||||
WHERE id = CAST(:uuid AS uuid)
|
||||
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
|
||||
logger.info("Done. %d hints corrected.", len(corrections))
|
||||
else:
|
||||
logger.info("No corrections needed.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
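The script above splits the deterministically ordered ID list into `--total-batches` slices, with the last slice absorbing the remainder of the integer division. A small sketch of that arithmetic, assuming 1-based batch IDs as in the usage lines; the function name `batch_bounds` is hypothetical:

```python
def batch_bounds(total: int, batch_id: int, total_batches: int) -> tuple[int, int]:
    """Return the half-open [start, end) index range for one batch.

    Hypothetical helper reproducing the slice arithmetic of the script:
    every batch gets `total // total_batches` items, and the final batch
    also takes whatever is left over.
    """
    chunk = total // total_batches
    start = (batch_id - 1) * chunk
    end = total if batch_id == total_batches else batch_id * chunk
    return start, end


if __name__ == "__main__":
    # 103 controls in 10 batches: nine batches of 10, the last one of 13.
    for batch_id in range(1, 11):
        print(batch_id, batch_bounds(103, batch_id, 10))
```

Because the slices depend on the total row count, the boundaries shift if controls are added or change `release_state` between runs, so a full pass is best done against a stable table.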
|
||||
@@ -0,0 +1,214 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
G-pre2 v2: Build Master Controls directly from canonical tokens.
|
||||
|
||||
No K-Means needed — Phase 2 already normalized merge_group_hints
|
||||
to 74 canonical tokens. Each token = one object group.
|
||||
|
||||
Groups controls by (canonical_token, phase) and creates MCs
|
||||
for tokens with >=2 distinct phases.
|
||||
|
||||
Usage:
|
||||
python3 /app/scripts/gpre2_direct_mc.py --dry-run
|
||||
python3 /app/scripts/gpre2_direct_mc.py --min-phases 2
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from collections import defaultdict
|
||||
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("gpre2-direct")
|
||||
|
||||
DB_URL = os.getenv(
|
||||
"DATABASE_URL",
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
)
|
||||
|
||||
PHASE_ORDER = {
|
||||
"scope": 0, "definition": 1, "governance": 1,
|
||||
"design": 2, "implementation": 3, "configuration": 3,
|
||||
"operation": 4, "training": 4, "monitoring": 5,
|
||||
"testing": 6, "review": 7, "assessment": 8, "remediation": 8,
|
||||
"validation": 9, "reporting": 10, "evidence": 11,
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--min-phases", type=int, default=2)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
# Step 1: Load all controls with merge_group_hint
|
||||
logger.info("Loading controls...")
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT id, control_id,
|
||||
generation_metadata->>'merge_group_hint' AS hint
|
||||
FROM canonical_controls
|
||||
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND generation_metadata->>'merge_group_hint' != ''
|
||||
AND release_state NOT IN ('deprecated', 'rejected')
|
||||
""")).fetchall()
|
||||
|
||||
logger.info("Loaded %d controls", len(rows))
|
||||
|
||||
# Step 2: Group by (object_token, phase)
|
||||
token_phases: dict[str, dict[str, list]] = defaultdict(
|
||||
lambda: defaultdict(list)
|
||||
)
|
||||
|
||||
for uuid, control_id, hint in rows:
|
||||
parts = hint.split(":", 2)
|
||||
if len(parts) < 2:
|
||||
continue
|
||||
action = parts[0]
|
||||
obj = parts[1]
|
||||
phase = parts[2] if len(parts) > 2 else "implementation"
|
||||
token_phases[obj][phase].append((str(uuid), control_id, action))
|
||||
|
||||
logger.info("Found %d unique object tokens", len(token_phases))
|
||||
|
||||
# Step 3: Create Master Controls
|
||||
master_controls = []
|
||||
master_members = []
|
||||
|
||||
for token, phases in token_phases.items():
|
||||
if len(phases) < args.min_phases:
|
||||
continue
|
||||
|
||||
sorted_phases = sorted(
|
||||
phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99)
|
||||
)
|
||||
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
|
||||
total = sum(phase_counts.values())
|
||||
|
||||
master_controls.append({
|
||||
"canonical_name": token,
|
||||
"phases_covered": json.dumps(sorted_phases),
|
||||
"phase_control_count": json.dumps(phase_counts),
|
||||
"total_controls": total,
|
||||
})
|
||||
|
||||
for phase, controls in phases.items():
|
||||
for ctrl_uuid, ctrl_id, action in controls:
|
||||
master_members.append({
|
||||
"canonical_name": token,
|
||||
"control_uuid": ctrl_uuid,
|
||||
"phase": phase,
|
||||
"action": action,
|
||||
})
|
||||
|
||||
logger.info(
|
||||
"Created %d Master Controls with %d members (min %d phases)",
|
||||
len(master_controls), len(master_members), args.min_phases,
|
||||
)
|
||||
|
||||
# Stats
|
||||
if master_controls:
|
||||
counts = [mc["total_controls"] for mc in master_controls]
|
||||
phases_per = [
|
||||
len(json.loads(mc["phases_covered"])) for mc in master_controls
|
||||
]
|
||||
logger.info(" Avg controls/MC: %.1f", sum(counts) / len(counts))
|
||||
logger.info(" Max controls/MC: %d", max(counts))
|
||||
logger.info(" Avg phases/MC: %.1f", sum(phases_per) / len(phases_per))
|
||||
logger.info(" Max phases/MC: %d", max(phases_per))
|
||||
|
||||
# Size distribution
|
||||
logger.info("\n Size distribution:")
|
||||
logger.info(" ≤10: %d", sum(1 for c in counts if c <= 10))
|
||||
logger.info(" 11-50: %d", sum(1 for c in counts if 11 <= c <= 50))
|
||||
logger.info(" 51-200: %d", sum(1 for c in counts if 51 <= c <= 200))
|
||||
logger.info(" 201-500: %d", sum(1 for c in counts if 201 <= c <= 500))
|
||||
logger.info(" 501-2K: %d", sum(1 for c in counts if 501 <= c <= 2000))
|
||||
logger.info(" >2K: %d", sum(1 for c in counts if c > 2000))
|
||||
|
||||
# Top 15
|
||||
top = sorted(master_controls, key=lambda x: -x["total_controls"])[:15]
|
||||
logger.info("\n Top 15 Master Controls:")
|
||||
for mc in top:
|
||||
logger.info(
|
||||
" %6d %s (%d phases)",
|
||||
mc["total_controls"],
|
||||
mc["canonical_name"],
|
||||
len(json.loads(mc["phases_covered"])),
|
||||
)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("\nDRY RUN — not writing to DB")
|
||||
return
|
||||
|
||||
# Step 4: Write to DB
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
c.execute(text("DELETE FROM master_control_members"))
|
||||
c.execute(text("DELETE FROM master_controls"))
|
||||
|
||||
# Get next object_group_id
|
||||
max_gid = c.execute(
|
||||
text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
|
||||
).scalar()
|
||||
next_gid = max_gid + 1
|
||||
|
||||
mc_uuids = {}
|
||||
for mc in master_controls:
|
||||
gid = next_gid
|
||||
next_gid += 1
|
||||
mc_id = f"MC-{gid}"
|
||||
|
||||
c.execute(text("""
|
||||
INSERT INTO master_controls
|
||||
(master_control_id, object_group_id, canonical_name,
|
||||
phases_covered, phase_control_count, total_controls)
|
||||
VALUES (:mcid, :gid, :name,
|
||||
CAST(:phases AS jsonb),
|
||||
CAST(:pcounts AS jsonb), :total)
|
||||
"""), {
|
||||
"mcid": mc_id, "gid": gid,
|
||||
"name": mc["canonical_name"],
|
||||
"phases": mc["phases_covered"],
|
||||
"pcounts": mc["phase_control_count"],
|
||||
"total": mc["total_controls"],
|
||||
})
|
||||
|
||||
mc_uuid = c.execute(text(
|
||||
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
|
||||
), {"mcid": mc_id}).scalar()
|
||||
mc_uuids[mc["canonical_name"]] = str(mc_uuid)
|
||||
|
||||
# Insert members
|
||||
mem_count = 0
|
||||
for mem in master_members:
|
||||
mc_uuid = mc_uuids.get(mem["canonical_name"])
|
||||
if not mc_uuid:
|
||||
continue
|
||||
c.execute(text("""
|
||||
INSERT INTO master_control_members
|
||||
(master_control_uuid, control_uuid, phase, action)
|
||||
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid),
|
||||
:phase, :action)
|
||||
"""), {
|
||||
"mc": mc_uuid,
|
||||
"ctrl": mem["control_uuid"],
|
||||
"phase": mem["phase"],
|
||||
"action": mem["action"],
|
||||
})
|
||||
mem_count += 1
|
||||
|
||||
logger.info("Wrote %d MCs + %d members to DB", len(master_controls), mem_count)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
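The grouping in Steps 2 and 3 of the script above can be tried without a database. The following sketch assumes the same `action:object:phase` hint layout and uses a shortened `PHASE_ORDER`; the function name `group_by_token` and the sample control IDs are hypothetical:

```python
from collections import defaultdict

# Shortened copy of the phase ranking, for illustration only.
PHASE_ORDER = {"scope": 0, "definition": 1, "implementation": 3, "testing": 6, "evidence": 11}


def group_by_token(hints: list[tuple[str, str]], min_phases: int = 2) -> dict[str, list[str]]:
    """Map each object token to its phases in lifecycle order.

    Hypothetical helper: tokens covering fewer than `min_phases` distinct
    phases are dropped, as in the Master Control step of the script.
    """
    token_phases: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for control_id, hint in hints:
        parts = hint.split(":", 2)
        if len(parts) < 2:
            continue  # malformed hint without an object token
        phase = parts[2] if len(parts) > 2 else "implementation"
        token_phases[parts[1]][phase].append(control_id)

    result: dict[str, list[str]] = {}
    for token, phases in token_phases.items():
        if len(phases) < min_phases:
            continue
        # Unknown phases sort last, matching the default rank of 99.
        result[token] = sorted(phases, key=lambda p: PHASE_ORDER.get(p, 99))
    return result


if __name__ == "__main__":
    sample = [
        ("A-1", "implement:backup:implementation"),
        ("A-2", "verify:backup:testing"),
        ("A-3", "define:vpn:definition"),
        ("A-4", "broken"),
    ]
    print(group_by_token(sample))
```

In this sample only `backup` spans two phases, so `vpn` and the malformed hint are left out of the result.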
|
||||
@@ -0,0 +1,298 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
G-pre3: Split large Master Controls by regulation source.
|
||||
|
||||
For each MC with >200 controls:
|
||||
1. Load member controls with parent's source_citation->>'source'
|
||||
2. Group by regulation source
|
||||
3. Sources with >= MIN_SOURCE_SIZE → new sub-MC
|
||||
4. Small sources → merge into "mixed" bucket
|
||||
5. UNKNOWN (no source_citation) → sub-cluster by embedding if >MAX_MC
|
||||
6. Delete original large MC, create new sub-MCs
|
||||
|
||||
Usage:
|
||||
python3 /app/scripts/gpre3_regulation_split.py --dry-run
|
||||
python3 /app/scripts/gpre3_regulation_split.py --min-source 15 --max-mc 100
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
from collections import defaultdict
|
||||
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
from services.embedding_utils import subcluster_controls
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("gpre3")
|
||||
|
||||
DB_URL = os.getenv(
|
||||
"DATABASE_URL",
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
)
|
||||
|
||||
# ── Source key normalization ────────────────────────────────────────
|
||||
# fmt: off
|
||||
_SOURCE_SHORT: dict[str, str] = {
|
||||
"DSGVO (EU) 2016/679": "dsgvo", "NIS2-Richtlinie (EU) 2022/2555": "nis2",
|
||||
"KI-Verordnung (EU) 2024/1689": "ai_act", "Cyber Resilience Act (CRA)": "cra",
|
||||
"Digital Services Act (DSA)": "dsa", "Digital Markets Act (DMA)": "dma",
|
||||
"Digital Operational Resilience Act": "dora", "Data Governance Act (DGA)": "dga",
|
||||
"Data Act": "data_act", "Maschinenverordnung (EU) 2023/1230": "machinery_reg",
|
||||
"Medizinprodukteverordnung (EU) 2017/745 (MDR)": "mdr",
|
||||
"European Health Data Space": "ehds", "European Accessibility Act": "eaa",
|
||||
"EU Cybersecurity Act": "eu_csa", "EU Blue Guide 2022": "eu_blue_guide",
|
||||
"EU-US Data Privacy Framework": "eu_us_dpf", "Markets in Crypto-Assets (MiCA)": "mica",
|
||||
"Standardvertragsklauseln (SCC)": "scc", "ePrivacy-Richtlinie": "eprivacy",
|
||||
"Batterieverordnung (EU) 2023/1542": "battery_reg",
|
||||
"Bundesdatenschutzgesetz (BDSG)": "bdsg",
|
||||
"BSI-Gesetz (BSIG 2025, NIS2-Umsetzung)": "bsig",
|
||||
"BSI-Kritisverordnung (BSI-KritisV)": "bsi_kritisv",
|
||||
"Geldwaeschegesetz (GwG)": "gwg", "Hinweisgeberschutzgesetz (HinSchG)": "hinschg",
|
||||
"Lieferkettensorgfaltspflichtengesetz (LkSG)": "lksg",
|
||||
"KRITIS-Dachgesetz (KRITISDachG)": "kritisdachg",
|
||||
"NIST SP 800-53 Rev. 5": "nist_800_53", "NIST Cybersecurity Framework 2.0": "nist_csf",
|
||||
"NIST Privacy Framework 1.0": "nist_privacy",
|
||||
"NIST SP 800-207 (Zero Trust)": "nist_zero_trust",
|
||||
"NIST SP 800-218 (SSDF)": "nist_ssdf", "NIST SP 800-63-3": "nist_800_63",
|
||||
"NIST AI Risk Management Framework": "nist_ai_rmf",
|
||||
"NISTIR 8259A IoT Security": "nist_iot",
|
||||
"OWASP Top 10 (2021)": "owasp_top10", "OWASP API Security Top 10 (2023)": "owasp_api",
|
||||
"OWASP ASVS 4.0": "owasp_asvs", "OWASP SAMM 2.0": "owasp_samm",
|
||||
"OWASP MASVS 2.0": "owasp_masvs", "OWASP Mobile Top 10": "owasp_mobile",
|
||||
"ENISA": "enisa", "TDDDG": "tdddg", "TKG": "tkg", "TMG": "tmg",
|
||||
"BGB": "bgb", "UWG": "uwg", "UrhG": "urhg",
|
||||
"BAIT (BaFin 2024)": "bait", "VAIT (BaFin 2022)": "vait",
|
||||
"AML-Verordnung": "aml_reg", "Zahlungsdiensterichtlinie 2": "psd2",
|
||||
"Telekommunikationsgesetz Oesterreich": "at_tkg",
|
||||
"Österreichisches Datenschutzgesetz (DSG)": "at_dsg",
|
||||
"Allgemeines Gleichbehandlungsgesetz (AGG)": "agg",
|
||||
"Aktiengesetz (AktG)": "aktg", "Handelsgesetzbuch (HGB)": "hgb",
|
||||
"GmbH-Gesetz (GmbHG)": "gmbhg", "Insolvenzordnung (InsO)": "inso",
|
||||
"Gewerbeordnung (GewO)": "gewo", "Abgabenordnung (AO)": "ao",
|
||||
}
|
||||
# fmt: on
|
||||
|
||||
|
||||
def source_to_key(source: str) -> str:
|
||||
"""Normalize regulation source name to a short slug key."""
|
||||
if source in _SOURCE_SHORT:
|
||||
return _SOURCE_SHORT[source]
|
||||
s = source.lower()
|
||||
s = re.sub(r"\(.*?\)", "", s)
|
||||
s = re.sub(r"[^a-z0-9äöüß]+", "_", s)
|
||||
s = re.sub(r"_+", "_", s).strip("_")
|
||||
return s[:40] if s else "unknown"
|
||||
|
||||
|
||||
# ── Main ───────────────────────────────────────────────────────────
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--min-source", type=int, default=15,
|
||||
help="Min controls per source for own sub-MC")
|
||||
parser.add_argument("--max-mc", type=int, default=100,
|
||||
help="Max controls per sub-MC before sub-clustering")
|
||||
parser.add_argument("--threshold", type=int, default=200,
|
||||
help="Only split MCs with more than N controls")
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
# Step 1: Find large master controls
|
||||
with engine.connect() as c:
|
||||
large_mcs = c.execute(text("""
|
||||
SELECT mc.id, mc.master_control_id, mc.object_group_id,
|
||||
mc.canonical_name, mc.total_controls
|
||||
FROM master_controls mc
|
||||
WHERE mc.total_controls > :threshold
|
||||
ORDER BY mc.total_controls DESC
|
||||
"""), {"threshold": args.threshold}).fetchall()
|
||||
|
||||
logger.info("Found %d MCs with >%d controls", len(large_mcs), args.threshold)
|
||||
if not large_mcs:
|
||||
return
|
||||
|
||||
# Step 2: Build split plans
|
||||
all_splits = []
|
||||
for mc_uuid, mc_id, og_id, canonical, total in large_mcs:
|
||||
plan = _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args)
|
||||
all_splits.append(plan)
|
||||
|
||||
total_new = sum(len(sp["sub_groups"]) for sp in all_splits)
|
||||
total_covered = sum(
|
||||
sum(len(sg["controls"]) for sg in sp["sub_groups"]) for sp in all_splits
|
||||
)
|
||||
logger.info("SUMMARY: %d large MCs → %d sub-MCs (%d controls)", len(all_splits), total_new, total_covered)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN — not writing to DB")
|
||||
return
|
||||
|
||||
_write_splits(engine, all_splits)
|
||||
|
||||
|
||||
def _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args) -> dict:
|
||||
"""Build a regulation-source split plan for one large MC."""
|
||||
logger.info("\n━━━ %s: %s (%d controls) ━━━", mc_id, canonical, total)
|
||||
|
||||
with engine.connect() as c:
|
||||
members = c.execute(text("""
|
||||
SELECT mcm.control_uuid, mcm.phase, mcm.action,
|
||||
cc.control_id, cc.title,
|
||||
COALESCE(pc.source_citation->>'source', 'UNKNOWN') AS src
|
||||
FROM master_control_members mcm
|
||||
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
|
||||
LEFT JOIN canonical_controls pc ON pc.id = cc.parent_control_uuid
|
||||
WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
|
||||
"""), {"mc_uuid": str(mc_uuid)}).fetchall()
|
||||
|
||||
by_source: dict[str, list[dict]] = defaultdict(list)
|
||||
for ctrl_uuid, phase, action, cid, title, src in members:
|
||||
by_source[src].append({
|
||||
"control_uuid": str(ctrl_uuid), "phase": phase,
|
||||
"action": action, "control_id": cid, "title": title,
|
||||
})
|
||||
|
||||
sorted_sources = sorted(by_source.items(), key=lambda x: -len(x[1]))
|
||||
for src, ctrls in sorted_sources[:8]:
|
||||
logger.info(" %4d %s", len(ctrls), src)
|
||||
if len(sorted_sources) > 8:
|
||||
logger.info(" ... +%d more sources", len(sorted_sources) - 8)
|
||||
|
||||
plan = {"mc_uuid": str(mc_uuid), "mc_id": mc_id, "og_id": og_id,
|
||||
"canonical": canonical, "total": total, "sub_groups": []}
|
||||
|
||||
own_mc_sources = []
|
||||
mixed_controls = []
|
||||
for src, ctrls in sorted_sources:
|
||||
if src == "UNKNOWN":
|
||||
continue
|
||||
if len(ctrls) >= args.min_source:
|
||||
own_mc_sources.append((src, ctrls))
|
||||
else:
|
||||
mixed_controls.extend(ctrls)
|
||||
|
||||
unknown_controls = by_source.get("UNKNOWN", [])
|
||||
|
||||
# (a) Named regulation sub-MCs
|
||||
for src, ctrls in own_mc_sources:
|
||||
key = source_to_key(src)
|
||||
name = f"{canonical}_{key}"
|
||||
_add_subgroups(plan, name, src, ctrls, args.max_mc)
|
||||
|
||||
# (b) Mixed small-source bucket
|
||||
if mixed_controls:
|
||||
_add_subgroups(plan, f"{canonical}_mixed", "mixed", mixed_controls, args.max_mc)
|
||||
|
||||
# (c) UNKNOWN bucket
|
||||
if unknown_controls:
|
||||
_add_subgroups(plan, f"{canonical}_general", "general", unknown_controls, args.max_mc)
|
||||
|
||||
logger.info(" → %d sub-groups:", len(plan["sub_groups"]))
|
||||
for sg in sorted(plan["sub_groups"], key=lambda x: -len(x["controls"])):
|
||||
logger.info(" %4d %s", len(sg["controls"]), sg["name"])
|
||||
|
||||
return plan
|
||||
|
||||
|
||||
def _add_subgroups(plan: dict, name: str, source: str,
|
||||
controls: list[dict], max_mc: int):
|
||||
"""Add controls as one or more sub-groups to the plan."""
|
||||
if len(controls) <= max_mc:
|
||||
plan["sub_groups"].append({"name": name, "source": source, "controls": controls})
|
||||
else:
|
||||
clusters = subcluster_controls(controls, max_mc)
|
||||
for i, cluster in enumerate(clusters):
|
||||
sub_name = f"{name}_{i+1}" if len(clusters) > 1 else name
|
||||
plan["sub_groups"].append({"name": sub_name, "source": source, "controls": cluster})
|
||||
|
||||
|
||||
def _write_splits(engine, splits: list[dict]):
|
||||
"""Apply split plan: delete old MCs, create new object_groups + MCs."""
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
max_gid = c.execute(
|
||||
text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
|
||||
).scalar()
|
||||
next_gid = max_gid + 1
|
||||
total_mc = 0
|
||||
total_mem = 0
|
||||
|
||||
for sp in splits:
|
||||
c.execute(text(
|
||||
"DELETE FROM master_control_members "
|
||||
"WHERE master_control_uuid = CAST(:u AS uuid)"
|
||||
), {"u": sp["mc_uuid"]})
|
||||
c.execute(text(
|
||||
"DELETE FROM master_controls WHERE id = CAST(:u AS uuid)"
|
||||
), {"u": sp["mc_uuid"]})
|
||||
logger.info("Deleted %s (%s)", sp["mc_id"], sp["canonical"])
|
||||
|
||||
for sg in sp["sub_groups"]:
|
||||
if not sg["controls"]:
|
||||
continue
|
||||
gid = next_gid
|
||||
next_gid += 1
|
||||
|
||||
members_list = list({ctrl["control_id"] for ctrl in sg["controls"]})
|
||||
c.execute(text("""
|
||||
INSERT INTO object_groups
|
||||
(group_id, canonical_name, member_count, members, top_controls_count)
|
||||
VALUES (:gid, :name, :cnt, CAST(:members AS jsonb), 0)
|
||||
"""), {"gid": gid, "name": sg["name"], "cnt": len(members_list),
|
||||
"members": json.dumps(members_list)})
|
||||
|
||||
by_phase: dict[str, list[dict]] = defaultdict(list)
|
||||
for ctrl in sg["controls"]:
|
||||
by_phase[ctrl["phase"]].append(ctrl)
|
||||
|
||||
sorted_phases = sorted(by_phase.keys())
|
||||
phase_counts = {p: len(v) for p, v in by_phase.items()}
|
||||
mc_id = f"MC-{gid}"
|
||||
|
||||
c.execute(text("""
|
||||
INSERT INTO master_controls
|
||||
(master_control_id, object_group_id, canonical_name,
|
||||
phases_covered, phase_control_count, total_controls)
|
||||
VALUES (:mcid, :gid, :name,
|
||||
CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
|
||||
"""), {"mcid": mc_id, "gid": gid, "name": sg["name"],
|
||||
"phases": json.dumps(sorted_phases),
|
||||
"pcounts": json.dumps(phase_counts),
|
||||
"total": sum(phase_counts.values())})
|
||||
|
||||
mc_uuid = c.execute(text(
|
||||
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
|
||||
), {"mcid": mc_id}).scalar()
|
||||
|
||||
for ctrl in sg["controls"]:
|
||||
c.execute(text("""
|
||||
INSERT INTO master_control_members
|
||||
(master_control_uuid, control_uuid, phase, action)
|
||||
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
|
||||
"""), {"mc": str(mc_uuid), "ctrl": ctrl["control_uuid"],
|
||||
"phase": ctrl["phase"], "action": ctrl["action"]})
|
||||
total_mem += 1
|
||||
total_mc += 1
|
||||
|
||||
logger.info("Created %d new MCs with %d members", total_mc, total_mem)
|
||||
|
||||
with engine.connect() as c:
|
||||
stats = c.execute(text("""
|
||||
SELECT count(*), count(CASE WHEN total_controls > 200 THEN 1 END),
|
||||
AVG(total_controls)::int
|
||||
FROM compliance.master_controls
|
||||
""")).fetchone()
|
||||
logger.info("Final: %d MCs, %d still >200, avg %d controls/MC", stats[0], stats[1], stats[2])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,310 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Phase 0: Quality Audit for Master Control Assignments.
|
||||
|
||||
Uses Claude Sonnet to validate whether controls are correctly assigned
|
||||
to their Master Controls. Samples controls from large and small MCs.
|
||||
|
||||
Usage:
|
||||
python3 /app/scripts/gpre_quality_audit.py
|
||||
python3 /app/scripts/gpre_quality_audit.py --large-sample 50 --small-sample 10
|
||||
python3 /app/scripts/gpre_quality_audit.py --mc MC-8292 # single MC
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import random
|
||||
import time
|
||||
from collections import defaultdict
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("quality-audit")
|
||||
|
||||
DB_URL = os.getenv(
|
||||
"DATABASE_URL",
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
)
|
||||
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
|
||||
ANTHROPIC_MODEL = os.getenv("AUDIT_MODEL", "claude-sonnet-4-20250514")
|
||||
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||
|
||||
SYSTEM_PROMPT = """Du bist ein Compliance-Experte der prüft ob Controls korrekt zu Master Controls zugeordnet sind.
|
||||
|
||||
Für jeden Control beantworte:
|
||||
1. MATCH: Gehört dieser Control thematisch zum Master Control Topic?
|
||||
2. CONFIDENCE: Wie sicher bist du? (0.0-1.0)
|
||||
3. REASON: Kurze Begründung (max 1 Satz)
|
||||
4. SUGGESTED_TOPIC: Falls MATCH=false, welches Topic wäre korrekt?
|
||||
|
||||
Wichtige Unterscheidungen:
|
||||
- "monitoring" = kontinuierliche Überwachung, Alerting, Log-Analyse
|
||||
- "training" = Schulung, Awareness, Lernmaterialien
|
||||
- "personal_data" = personenbezogene Daten, DSGVO-Betroffenenrechte
|
||||
- "procedure" = Verfahren, Prozesse (aber NICHT wenn es spezifisch um Incidents geht)
|
||||
- "incident" = Sicherheitsvorfälle, Breach Notification, Recovery
|
||||
- "policy" = Richtlinien, Regelwerke, Governance-Dokumente
|
||||
- "encryption" = Verschlüsselung, Kryptografie, Key Management
|
||||
- "audit_logging" = Protokollierung, Audit Trail, Nachvollziehbarkeit
|
||||
|
||||
Antworte NUR als JSON-Array, ein Objekt pro Control."""
|
||||
|
||||
|
||||
def call_claude(controls_batch: list[dict], mc_topic: str) -> tuple[list[dict], dict]:
    """Send a batch of controls to Claude for validation.

    Returns a (results, usage) tuple; both are empty if the request fails.
    """
    items = []
    for c in controls_batch:
        items.append(
            f"- Control '{c['control_id']}': "
            f"Titel=\"{c['title']}\", "
            f"Objective=\"{c['objective'][:150]}...\", "
            f"Phase={c['phase']}, Action={c['action']}"
        )

    prompt = (
        f"Master Control Topic: \"{mc_topic}\"\n\n"
        f"Prüfe diese {len(controls_batch)} Controls:\n\n"
        + "\n".join(items)
        + "\n\nAntwort als JSON-Array mit Feldern: "
        "control_id, match (bool), confidence (float), reason (str), "
        "suggested_topic (str, nur wenn match=false)."
    )

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 2048,
        "temperature": 0.1,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }

    for attempt in range(3):
        try:
            resp = httpx.post(
                ANTHROPIC_URL,
                headers=headers,
                json=payload,
                timeout=60.0,
            )
            resp.raise_for_status()
            data = resp.json()
            content = data.get("content", [{}])[0].get("text", "")
            usage = data.get("usage", {})

            # Parse JSON from response
            start = content.find("[")
            end = content.rfind("]") + 1
            if start >= 0 and end > start:
                results = json.loads(content[start:end])
                return results, usage
            logger.warning("No JSON array in response: %s", content[:200])
            return [], usage
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                wait = 30 * (attempt + 1)
                logger.warning("Rate limited, waiting %ds...", wait)
                time.sleep(wait)
            else:
                logger.error("API error: %s", e)
                return [], {}
        except Exception as e:
            logger.error("Request failed (attempt %d): %s", attempt + 1, e)
            if attempt < 2:
                time.sleep(5)
    return [], {}
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--large-sample", type=int, default=50,
|
||||
help="Controls to sample per large MC")
|
||||
parser.add_argument("--small-sample", type=int, default=10,
|
||||
help="Controls to sample per small MC")
|
||||
parser.add_argument("--small-mc-count", type=int, default=50,
|
||||
help="Number of small MCs to audit")
|
||||
parser.add_argument("--mc", type=str, default=None,
|
||||
help="Audit a single MC by ID (e.g., MC-8292)")
|
||||
parser.add_argument("--batch-size", type=int, default=10,
|
||||
help="Controls per API call")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
# Load MCs to audit
|
||||
with engine.connect() as c:
|
||||
if args.mc:
|
||||
mcs = c.execute(text("""
|
||||
SELECT id, master_control_id, canonical_name, total_controls
|
||||
FROM master_controls WHERE master_control_id = :mc
|
||||
"""), {"mc": args.mc}).fetchall()
|
||||
else:
|
||||
# Large MCs (>200) + random small MCs
|
||||
large = c.execute(text("""
|
||||
SELECT id, master_control_id, canonical_name, total_controls
|
||||
FROM master_controls WHERE total_controls > 200
|
||||
ORDER BY total_controls DESC
|
||||
""")).fetchall()
|
||||
|
||||
small = c.execute(text("""
|
||||
SELECT id, master_control_id, canonical_name, total_controls
|
||||
FROM master_controls WHERE total_controls BETWEEN 10 AND 200
|
||||
ORDER BY RANDOM() LIMIT :cnt
|
||||
"""), {"cnt": args.small_mc_count}).fetchall()
|
||||
|
||||
mcs = list(large) + list(small)
|
||||
|
||||
logger.info("Auditing %d Master Controls", len(mcs))
|
||||
|
||||
# Results tracking
|
||||
total_checked = 0
|
||||
total_match = 0
|
||||
total_mismatch = 0
|
||||
total_input_tokens = 0
|
||||
total_output_tokens = 0
|
||||
mc_results: dict[str, dict] = {}
|
||||
all_mismatches: list[dict] = []
|
||||
|
||||
for mc_uuid, mc_id, canonical, total in mcs:
|
||||
is_large = total > 200
|
||||
sample_size = args.large_sample if is_large else args.small_sample
|
||||
|
||||
# Sample controls
|
||||
with engine.connect() as c:
|
||||
controls = c.execute(text("""
|
||||
SELECT mcm.control_uuid, mcm.phase, mcm.action,
|
||||
cc.control_id, cc.title,
|
||||
COALESCE(cc.objective, '') as objective
|
||||
FROM master_control_members mcm
|
||||
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
|
||||
WHERE mcm.master_control_uuid = CAST(:mc AS uuid)
|
||||
ORDER BY RANDOM()
|
||||
LIMIT :n
|
||||
"""), {"mc": str(mc_uuid), "n": sample_size}).fetchall()
|
||||
|
||||
if not controls:
|
||||
continue
|
||||
|
||||
control_dicts = [
|
||||
{"control_uuid": str(r[0]), "phase": r[1], "action": r[2],
|
||||
"control_id": r[3], "title": r[4] or "", "objective": r[5] or ""}
|
||||
for r in controls
|
||||
]
|
||||
|
||||
logger.info("\n%s: %s (%d total, sampling %d)",
|
||||
mc_id, canonical, total, len(control_dicts))
|
||||
|
||||
mc_match = 0
|
||||
mc_mismatch = 0
|
||||
|
||||
# Process in batches
|
||||
for i in range(0, len(control_dicts), args.batch_size):
|
||||
batch = control_dicts[i:i + args.batch_size]
|
||||
results, usage = call_claude(batch, canonical)
|
||||
|
||||
total_input_tokens += usage.get("input_tokens", 0)
|
||||
total_output_tokens += usage.get("output_tokens", 0)
|
||||
|
||||
for r in results:
|
||||
if r.get("match", True):
|
||||
mc_match += 1
|
||||
total_match += 1
|
||||
else:
|
||||
mc_mismatch += 1
|
||||
total_mismatch += 1
|
||||
mismatch = {
|
||||
"mc_id": mc_id,
|
||||
"mc_topic": canonical,
|
||||
"control_id": r.get("control_id", "?"),
|
||||
"confidence": r.get("confidence", 0),
|
||||
"reason": r.get("reason", ""),
|
||||
"suggested_topic": r.get("suggested_topic", ""),
|
||||
}
|
||||
all_mismatches.append(mismatch)
|
||||
|
||||
total_checked += len(results)
|
||||
|
||||
# Rate limit
|
||||
time.sleep(1)
|
||||
|
||||
accuracy = mc_match / (mc_match + mc_mismatch) if (mc_match + mc_mismatch) > 0 else 1.0
|
||||
mc_results[mc_id] = {
|
||||
"canonical": canonical, "total": total,
|
||||
"checked": mc_match + mc_mismatch,
|
||||
"match": mc_match, "mismatch": mc_mismatch,
|
||||
"accuracy": accuracy,
|
||||
}
|
||||
logger.info(" → %d/%d correct (%.1f%%)",
|
||||
mc_match, mc_match + mc_mismatch, accuracy * 100)
|
||||
|
||||
# Final report
|
||||
_print_report(mc_results, all_mismatches, total_checked, total_match,
|
||||
total_mismatch, total_input_tokens, total_output_tokens)
|
||||
|
||||
|
||||
def _print_report(mc_results, mismatches, checked, match, mismatch,
|
||||
input_tok, output_tok):
|
||||
"""Print the quality audit report."""
|
||||
logger.info("\n" + "=" * 70)
|
||||
logger.info("QUALITY AUDIT REPORT")
|
||||
logger.info("=" * 70)
|
||||
logger.info("Total controls checked: %d", checked)
|
||||
logger.info("Correct assignments: %d (%.1f%%)",
|
||||
match, match / max(checked, 1) * 100)
|
||||
logger.info("Wrong assignments: %d (%.1f%%)",
|
||||
mismatch, mismatch / max(checked, 1) * 100)
|
||||
|
||||
# Cost estimate
|
||||
cost_input = input_tok / 1_000_000 * 3.0 # Sonnet input: $3/MTok
|
||||
cost_output = output_tok / 1_000_000 * 15.0 # Sonnet output: $15/MTok
|
||||
logger.info("\nAPI Usage: %d input + %d output tokens",
|
||||
input_tok, output_tok)
|
||||
logger.info("Estimated cost: $%.2f", cost_input + cost_output)
|
||||
|
||||
# Per-MC breakdown (worst first)
|
||||
logger.info("\n--- Per-MC Accuracy (worst first) ---")
|
||||
sorted_mcs = sorted(mc_results.values(), key=lambda x: x["accuracy"])
|
||||
for mc in sorted_mcs:
|
||||
flag = "❌" if mc["accuracy"] < 0.9 else "⚠️" if mc["accuracy"] < 0.95 else "✅"
|
||||
logger.info(" %s %s (%s): %d/%d = %.1f%% [total: %d]",
|
||||
flag, mc["canonical"][:30].ljust(30),
|
||||
"large" if mc["total"] > 200 else "small",
|
||||
mc["match"], mc["checked"],
|
||||
mc["accuracy"] * 100, mc["total"])
|
||||
|
||||
# Top mismatches
|
||||
if mismatches:
|
||||
logger.info("\n--- Mismatches (all %d) ---", len(mismatches))
|
||||
for m in sorted(mismatches, key=lambda x: -x.get("confidence", 0)):
|
||||
logger.info(" %s in %s (%s) → should be '%s': %s",
|
||||
m["control_id"], m["mc_id"], m["mc_topic"],
|
||||
m["suggested_topic"], m["reason"])
|
||||
|
||||
# Size-class breakdown
|
||||
large_mcs = [m for m in mc_results.values() if m["total"] > 200]
|
||||
small_mcs = [m for m in mc_results.values() if m["total"] <= 200]
|
||||
|
||||
if large_mcs:
|
||||
lg_acc = sum(m["match"] for m in large_mcs) / max(sum(m["checked"] for m in large_mcs), 1)
|
||||
logger.info("\nLarge MCs (>200): %.1f%% accuracy (%d MCs)",
|
||||
lg_acc * 100, len(large_mcs))
|
||||
if small_mcs:
|
||||
sm_acc = sum(m["match"] for m in small_mcs) / max(sum(m["checked"] for m in small_mcs), 1)
|
||||
logger.info("Small MCs (≤200): %.1f%% accuracy (%d MCs)",
|
||||
sm_acc * 100, len(small_mcs))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -460,12 +460,50 @@ WICHTIGE REGELN:
|
||||
|
||||
7. MERGE-KEY: Erzeuge im JSON-Output ein zusaetzliches Feld "merge_key" mit
|
||||
dem Format: "action_type:normalized_object:control_phase"
|
||||
|
||||
WICHTIG: Waehle normalized_object NUR aus dieser Liste kanonischer Tokens:
|
||||
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
|
||||
privileged_access, access_control, encryption, transport_encryption,
|
||||
key_management, certificate_management, network_security, network_segmentation,
|
||||
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
|
||||
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
|
||||
physical_security, secure_development, api_security, input_validation,
|
||||
container_security, logging_configuration
|
||||
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
|
||||
data_subject_rights, data_retention, data_transfer, data_breach_notification,
|
||||
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
|
||||
data_classification, cookie_consent, video_surveillance
|
||||
GOVERNANCE: policy, procedure, process, training, awareness, incident,
|
||||
risk_management, third_party_management, change_management, documentation,
|
||||
records_management, compliance_reporting, asset_management,
|
||||
human_resources_security
|
||||
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
|
||||
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
|
||||
telecommunications, medical_device, payment_services, critical_infrastructure,
|
||||
supply_chain_due_diligence, sustainability_reporting
|
||||
|
||||
Wenn KEIN Token passt: "OTHER:kurzbeschreibung" (z.B. "OTHER:battery_recycling")
|
||||
|
||||
ABGRENZUNGEN (haeufige Fehler vermeiden!):
|
||||
- monitoring = NUR kontinuierliche Echtzeit-Ueberwachung von Systemen
|
||||
- audit_logging = Protokollierung, Audit Trail, Nachvollziehbarkeit
|
||||
- compliance_audit = externe Pruefungen, Zertifizierungsaudits
|
||||
- training = Schulungen DURCHFUEHREN (nicht "ueberwachen")
|
||||
- procedure = Verfahren DEFINIEREN (nicht Incident-Behandlung)
|
||||
- incident = Sicherheitsvorfaelle BEHANDELN
|
||||
- alerting = Meldepflichten und Benachrichtigungen
|
||||
- personal_data = DSGVO-Verarbeitungsgrundsaetze (nicht Zertifizierung!)
|
||||
- certification = Zertifizierung/Konformitaet (nicht Datenschutz)
|
||||
|
||||
Beispiele:
|
||||
- "implement:api_rate_limiting:implementation"
|
||||
- "define:access_control_policy:definition"
|
||||
- "monitor:third_party_vulnerabilities:monitoring"
|
||||
- "test:authentication_mechanism:testing"
|
||||
- "implement:multi_factor_auth:implementation"
|
||||
- "define:access_control:definition"
|
||||
- "monitor:network_security:monitoring"
|
||||
- "test:vulnerability:testing"
|
||||
- "report:supervisory_authority:reporting"
|
||||
- "implement:audit_logging:implementation" (NICHT monitoring!)
|
||||
- "define:incident:definition" (Incident-Verfahren, NICHT procedure!)
|
||||
- "train:training:operation" (Schulung, NICHT monitoring!)
|
||||
|
||||
8. APPLICABILITY + SCANNER: Bestimme fuer jedes Control:
|
||||
- applicability: Unter welchen Bedingungen gilt dieses Control?
|
||||
@@ -2472,6 +2510,81 @@ def _ensure_list(val) -> list:
|
||||
return []
|
||||
|
||||
|
||||
# Canonical object tokens from object_ontology (loaded once)
|
||||
_CANONICAL_OBJECTS: set[str] | None = None
|
||||
|
||||
|
||||
def _load_canonical_objects() -> set[str]:
|
||||
"""Load canonical tokens from DB, fallback to hardcoded set."""
|
||||
global _CANONICAL_OBJECTS
|
||||
if _CANONICAL_OBJECTS is not None:
|
||||
return _CANONICAL_OBJECTS
|
||||
try:
|
||||
from db.session import get_engine
|
||||
from sqlalchemy import text
|
||||
engine = get_engine()
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text(
|
||||
"SELECT canonical_token FROM compliance.object_ontology"
|
||||
)).fetchall()
|
||||
_CANONICAL_OBJECTS = {r[0] for r in rows}
|
||||
except Exception:
|
||||
_CANONICAL_OBJECTS = set()
|
||||
if not _CANONICAL_OBJECTS:
|
||||
_CANONICAL_OBJECTS = {
|
||||
"multi_factor_auth", "password_policy", "credentials",
|
||||
"session_management", "privileged_access", "access_control",
|
||||
"encryption", "transport_encryption", "key_management",
|
||||
"certificate_management", "network_security",
|
||||
"network_segmentation", "firewall", "vpn", "remote_access",
|
||||
"monitoring", "audit_logging", "siem", "alerting",
|
||||
"compliance_audit", "vulnerability", "patch_management",
|
||||
"backup", "disaster_recovery", "personal_data",
|
||||
"sensitive_data", "consent", "data_subject_rights",
|
||||
"data_retention", "data_transfer", "data_breach_notification",
|
||||
"dpia", "data_processing_agreement", "privacy_by_design",
|
||||
"policy", "procedure", "process", "training", "awareness",
|
||||
"incident", "risk_management", "third_party_management",
|
||||
"change_management", "documentation", "supervisory_authority",
|
||||
"certification", "product_safety", "ai_system", "aml",
|
||||
"critical_infrastructure", "medical_device",
|
||||
}
|
||||
return _CANONICAL_OBJECTS
|
||||
|
||||
|
||||
def _validate_merge_key(merge_key: str) -> str:
    """Validate merge_key object against canonical ontology.

    Returns the merge_key (possibly corrected). Logs unknown objects
    at debug level so they can be tracked.
    """
    parts = merge_key.split(":", 2)
    if len(parts) < 2:
        return merge_key

    action, obj = parts[0], parts[1]
    phase = parts[2] if len(parts) > 2 else "implementation"

    # Accept OTHER: prefix (LLM signaling unknown object). Splitting
    # "action:OTHER:description:phase" on ":" leaves the bare token "OTHER"
    # in obj, so compare against that rather than "OTHER:".
    if obj == "OTHER":
        return merge_key

    # Check against canonical ontology
    canonical = _load_canonical_objects()
    if obj in canonical:
        return merge_key

    # Try normalize_object() as fallback
    from services.control_dedup import normalize_object
    normed = normalize_object(obj)
    if normed in canonical:
        return f"{action}:{normed}:{phase}"

    # Unknown object: log and keep as-is (will be clustered by embedding)
    logger.debug("merge_key unknown object: %s (normed: %s)", obj, normed)
    return merge_key
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Decomposition Pass
|
||||
# ---------------------------------------------------------------------------
|
||||
@@ -3025,10 +3138,10 @@ class DecompositionPass:
|
||||
evidence_type=parsed.get("evidence_type", ""),
|
||||
provides_context=_ensure_list(parsed.get("provides_context", [])),
|
||||
)
|
||||
# Store merge_key from LLM output in metadata
|
||||
# Store merge_key from LLM output in metadata — with validation
|
||||
llm_merge_key = parsed.get("merge_key", "")
|
||||
if llm_merge_key:
|
||||
atomic.merge_group_hint = llm_merge_key
|
||||
atomic.merge_group_hint = _validate_merge_key(llm_merge_key)
|
||||
|
||||
atomic.parent_control_uuid = obl["parent_uuid"]
|
||||
atomic.obligation_candidate_id = obl["candidate_id"]
|
||||
|
||||
@@ -0,0 +1,84 @@
|
||||
"""Shared embedding + sub-clustering utilities for the control pipeline."""
|
||||
|
||||
import logging
|
||||
import os
|
||||
from collections import defaultdict
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from sklearn.cluster import MiniBatchKMeans
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
EMBEDDING_URL = os.getenv(
|
||||
"EMBEDDING_SERVICE_URL", "http://embedding-service:8087"
|
||||
)
|
||||
|
||||
|
||||
def embed_texts(texts: list[str]) -> np.ndarray | None:
    """Embed texts via the embedding-service in batches of 64.

    Rows of a batch that fails all three attempts are left as zeros.
    """
    import time

    try:
        result = np.zeros((len(texts), 1024), dtype=np.float32)
        batch_size = 64
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            for attempt in range(3):
                try:
                    with httpx.Client(
                        timeout=httpx.Timeout(60.0, connect=10.0)
                    ) as client:
                        resp = client.post(
                            f"{EMBEDDING_URL}/embed", json={"texts": batch}
                        )
                        resp.raise_for_status()
                        embs = resp.json().get("embeddings", [])
                        end = min(i + len(embs), len(texts))
                        result[i:end] = np.array(embs, dtype=np.float32)
                        break
                except Exception as e:
                    if attempt == 2:
                        logger.error("Embed batch %d failed: %s", i, e)
                    else:
                        # Back off briefly before the next attempt
                        time.sleep(2)
        return result
    except Exception as e:
        logger.error("Embedding failed: %s", e)
        return None
|
||||
|
||||
|
||||
def subcluster_controls(
    controls: list[dict], target_size: int = 50
) -> list[list[dict]]:
    """Sub-cluster controls by embedding similarity.

    Returns a list of clusters. Falls back to naive chunking
    if embedding fails. Cluster sizes are not capped: k-means is run
    with roughly len(controls) // target_size clusters (between 2 and 30),
    so individual clusters can exceed target_size.
    """
    if len(controls) <= target_size:
        return [controls]

    texts = [c.get("title", "") or c.get("control_id", "") for c in controls]
    embeddings = embed_texts(texts)
    if embeddings is None:
        return [
            controls[i : i + target_size]
            for i in range(0, len(controls), target_size)
        ]

    # L2-normalize so Euclidean k-means approximates cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1
    normalized = embeddings / norms

    k = max(2, min(len(controls) // target_size, 30))
    kmeans = MiniBatchKMeans(
        n_clusters=k,
        batch_size=min(100, len(controls)),
        max_iter=50,
        random_state=42,
    )
    labels = kmeans.fit_predict(normalized)

    clusters: dict[int, list[dict]] = defaultdict(list)
    for i, ctrl in enumerate(controls):
        clusters[int(labels[i])].append(ctrl)
    return list(clusters.values())
|
||||
@@ -0,0 +1,97 @@
|
||||
# International standards mappings: ISO/EN ↔ national equivalents

## Goal
Load freely accessible national standards whose content is equivalent to the paid
DIN/EN/ISO standards. Own translation + mapping = legally safe (Rule 3).

## Status: IDT = Identical, MOD = Modified, NEQ = Not Equivalent
|
||||
|
||||
---
|
||||
|
||||
## China (GB/T): free on openstd.samr.gov.cn

| ISO/EN standard | GB/T equivalent | Status | Topic |
|---|---|---|---|
| ISO 12100:2010 | GB/T 15706-2012 | IDT | Risk assessment, basic standard |
| ISO 13849-1:2023 | GB/T 16855.1-2018 | IDT | Safety-related control systems (PL) |
| ISO 13849-2:2012 | GB/T 16855.2-2015 | IDT | Validation of control systems |
| IEC 62061:2021 | GB/T 16855.3 | IDT | SIL control systems |
| IEC 60204-1:2016 | GB/T 5226.1-2019 | IDT | Electrical equipment |
| ISO 13855:2010 | GB/T 19876-2012 | IDT | Safety distances |
| ISO 13850:2015 | GB/T 16754-2022 | IDT | Emergency stop |
| ISO 14119:2013 | GB/T 18831 | IDT | Interlocking devices |
| ISO 14120:2015 | GB/T 8196-2018 | IDT | Guards |
| ISO 13857:2019 | GB/T 23821-2022 | IDT | Safety distances, limbs |
| ISO 10218-1:2011 | GB 11291.1-2011 | IDT | Industrial robot safety |

Source: https://openstd.samr.gov.cn (SAMR/SAC, freely accessible)
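
The China (GB/T) table can also be held as structured data, for example to look up the national equivalent of an ISO/IEC standard regardless of edition year. The following is a minimal sketch, not code from the repository: the names `Equivalence`, `StandardMapping`, `GB_T_MAPPINGS` and `find_equivalent` are hypothetical, and only three rows of the table are copied in.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Equivalence(Enum):
    """Degree of correspondence, matching the Status column (hypothetical type)."""
    IDT = "identical"
    MOD = "modified"
    NEQ = "not equivalent"


@dataclass(frozen=True)
class StandardMapping:
    iso: str          # ISO/IEC designation, optionally with ":year"
    national: str     # national designation
    status: Equivalence
    topic: str


# Three rows copied from the China (GB/T) table; the others follow the same shape.
GB_T_MAPPINGS = [
    StandardMapping("ISO 12100:2010", "GB/T 15706-2012", Equivalence.IDT, "Risk assessment"),
    StandardMapping("ISO 13850:2015", "GB/T 16754-2022", Equivalence.IDT, "Emergency stop"),
    StandardMapping("ISO 14120:2015", "GB/T 8196-2018", Equivalence.IDT, "Guards"),
]


def find_equivalent(iso_number: str, mappings: list[StandardMapping]) -> StandardMapping | None:
    """Return the first mapping for iso_number, ignoring the edition year."""
    wanted = iso_number.split(":")[0].strip()
    for m in mappings:
        if m.iso.split(":")[0].strip() == wanted:
            return m
    return None


hit = find_equivalent("ISO 12100", GB_T_MAPPINGS)
print(hit.national if hit else "no mapping")
```

Matching on the part before the colon is a design choice for this sketch: the tables list some standards with and some without an edition year, so an exact string comparison would miss rows.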
|
||||
|
||||
---
|
||||
|
||||
## USA (OSHA/ANSI): free on osha.gov

| ISO/EN standard | US equivalent | Free? | Topic |
|---|---|---|---|
| ISO 12100 | ANSI/ISO 12100 (identical) | ❌ ANSI, paid | Risk assessment |
| Machinery Directive | OSHA 29 CFR 1910 Subpart O | ✅ Free | Machine guarding |
| EN 60204-1 | NFPA 79 | ❌ Paid | Electrical equipment |
| General | OSHA Technical Manual | ✅ Free | Comprehensive guidance |

Freely usable: OSHA Standards (29 CFR) + Technical Manual
Source: https://www.osha.gov/otm
|
||||
|
||||
---
|
||||
|
||||
## Korea (KS): partly free on standard.go.kr

| ISO/EN standard | KS equivalent | Status | Topic |
|---|---|---|---|
| ISO 12100:2010 | KS B ISO 12100:2014 | IDT | Risk assessment |
| ISO 13849-1 | KS B ISO 13849-1 | IDT | Safety-related control systems |
| IEC 60204-1 | KS C IEC 60204-1 | IDT | Electrical equipment |

Source: https://standard.go.kr (Korean Agency for Technology and Standards, KATS)
|
||||
|
||||
---
|
||||
|
||||
## India (BIS): partly free on bis.gov.in

| ISO/EN standard | IS equivalent | Status | Topic |
|---|---|---|---|
| ISO 12100:2010 | IS/ISO 12100:2010 | IDT | Risk assessment |
| IEC 60204-1 | IS/IEC 60204-1 | IDT | Electrical equipment |

Source: https://www.services.bis.gov.in (Bureau of Indian Standards)
|
||||
|
||||
---
|
||||
|
||||
## Download-Status (Stand 2026-05-09)
|
||||
|
||||
| Quelle | Sprache | Volltext frei? | Status |
|
||||
|---|---|---|---|
|
||||
| China GB/T (openstd.samr.gov.cn) | Chinesisch | ❌ "Copyright protection" fuer ISO-basierte | Nur Metadaten frei |
|
||||
| USA OSHA 29 CFR 1910 (osha.gov) | Englisch | ✅ Public Domain | ✅ 1910.212 geladen |
|
||||
| USA OSHA Technical Manual | Englisch | ✅ Public Domain | Teilweise geladen |
|
||||
| Korea KS (standard.go.kr) | Koreanisch | ❌ Kostenpflichtig | Nur Metadaten |
|
||||
| Indien BIS (bis.gov.in) | Englisch | ❌ Kostenpflichtig | Nur Metadaten |
|
||||
|
||||
**Sobering result:** China, Korea, and India also protect the ISO copyright on their identical national adoptions (IDT). The full text of these adoptions is not freely accessible ANYWHERE; only the USA (OSHA) has its own independent regulatory texts.
|
||||
|
||||
**What is still usable:**
1. OSHA 29 CFR 1910 Subpart O: independent US requirements, free, in English
2. OSHA Technical Manual: detailed guidance, free
3. Metadata from all countries: standard numbers, titles, mappings (for the reference table)
4. Chinese GB standards that are NOT based on ISO (purely Chinese standards)
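
The split between full-text sources and metadata-only sources can be sketched as a small lookup. This is an illustration only: the `SOURCES` list and both helper functions are hypothetical names, the flags are copied from the download-status table, and the special case of purely Chinese GB standards (item 4) is not modelled.

```python
# Download status as recorded in the table above (state of 2026-05-09).
# SOURCES and the helpers below are hypothetical names, not part of the repo.
SOURCES = [
    {"source": "China GB/T", "language": "Chinese", "full_text_free": False},
    {"source": "USA OSHA 29 CFR 1910", "language": "English", "full_text_free": True},
    {"source": "USA OSHA Technical Manual", "language": "English", "full_text_free": True},
    {"source": "Korea KS", "language": "Korean", "full_text_free": False},
    {"source": "India BIS", "language": "English", "full_text_free": False},
]


def full_text_sources(sources: list[dict]) -> list[str]:
    """Names of sources whose full text is marked as free."""
    return [s["source"] for s in sources if s["full_text_free"]]


def metadata_only_sources(sources: list[dict]) -> list[str]:
    """Names of sources that contribute only numbers, titles and mappings."""
    return [s["source"] for s in sources if not s["full_text_free"]]
```

Calling `full_text_sources(SOURCES)` yields the two OSHA entries; everything else feeds only the reference table.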
|
||||
|
||||
---
|
||||
|
||||
## Legal assessment

- National equivalents are marked "IDT" (identical), i.e. same content
- We load the NATIONAL versions (not the ISO version)
- Our own translation into German = our own work (transformative use)
- The mapping table shows the provenance transparently
- We say "equivalent to ISO 12100", not "identical to ISO 12100"
- No ISO standard text is reproduced; only our own wording is used
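
A minimal sketch of how the mapping table and the "equivalent to" wording could be represented. The class `NormMapping`, the list `MAPPINGS`, the two-letter country codes and both functions are assumptions made for illustration; the rows themselves are taken from the Korea and India tables, with topics given in English.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormMapping:
    """One row of the reference table: ISO/IEC norm -> national adoption."""
    iso_norm: str       # e.g. "ISO 12100:2010"
    national_norm: str  # e.g. "KS B ISO 12100:2014"
    country: str        # two-letter country code (assumed convention)
    status: str         # "IDT" = identical adoption
    topic: str


# Rows taken from the Korea and India tables; the structure is hypothetical.
MAPPINGS = [
    NormMapping("ISO 12100:2010", "KS B ISO 12100:2014", "KR", "IDT", "Risk assessment"),
    NormMapping("ISO 13849-1", "KS B ISO 13849-1", "KR", "IDT", "Safety-related control systems"),
    NormMapping("IEC 60204-1", "KS C IEC 60204-1", "KR", "IDT", "Electrical equipment"),
    NormMapping("ISO 12100:2010", "IS/ISO 12100:2010", "IN", "IDT", "Risk assessment"),
    NormMapping("IEC 60204-1", "IS/IEC 60204-1", "IN", "IDT", "Electrical equipment"),
]


def provenance_label(m: NormMapping) -> str:
    """Word the origin as 'equivalent to', never 'identical to'."""
    return f"equivalent to {m.iso_norm} (via {m.national_norm}, {m.country})"


def equivalents_for(iso_norm: str) -> list[NormMapping]:
    """Return all national adoptions recorded for one ISO/IEC norm."""
    return [m for m in MAPPINGS if m.iso_norm == iso_norm]
```

Keeping the national norm and country in the label makes the provenance visible wherever the label is shown.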
|
||||
@@ -0,0 +1,69 @@
|
||||
# OSHA 29 CFR 1910 Subpart O: Machinery and Machine Guarding
# Source: https://www.osha.gov/laws-regs/regulations/standardnumber/1910/1910SubpartO
# License: US federal law, public domain
# Loaded: 2026-05-09

## 1910.211 - Definitions
Definitions for woodworking, abrasive wheels, rubber/plastics mills, power presses, forging, and power transmission.
|
||||
|
||||
## 1910.212 - General Requirements for All Machines
- (a)(1) Guarding: barrier guards, two-hand tripping, electronic safety devices
- (a)(2) Guards affixed to machine, must not create hazards
- (a)(3) Point of operation guarding: guillotine cutters, shears, power presses, milling, saws, jointers, portable tools, forming rolls
- (a)(4) Revolving equipment: interlocked enclosure
- (a)(5) Fan blades below 7 feet: guards with max 1/2 inch openings
- (b) Fixed machinery anchoring

## 1910.213 - Woodworking Machinery
- Machine construction: no excessive vibration, secure bearings
- Controls: accessible power cutoff, locking belt shifters, anti-restart
- Hand-fed ripsaws: complete hoods, spreaders, non-kickback devices
- Crosscut saws: hood requirements
- Radial saws: upper/lower blade guarding, forward travel stops
- Bandsaws: full wheel encasement (0.037" min wire mesh)
- Jointers: automatic guards, max 2.5" throat, knife projection limits
- Shapers: cage or adjustable guards
- Sanding machines: feed roll guards, enclosed drums

## 1910.214 - Cooperage Machinery
[Reserved]
|
||||
|
||||
## 1910.215 - Abrasive Wheel Machinery
- Safety guards required (except internal work, mounted wheels ≤2")
- Angular exposure: bench/floor max 90°, cylindrical max 180°, surface/cutting max 150°
- Flanges: min 1/3 wheel diameter
- Speed limits: ≤8000 SFPM cast iron OK, 8000-16000 cast/structural steel
- Ring test before mounting
- Work rests: max 1/8" opening
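
As an illustration of how one of these numeric limits could become a checkable rule, the angular-exposure values from the list above can be put into a lookup. This is a sketch under assumptions: `MAX_EXPOSURE_DEG` and `exposure_ok` are hypothetical names, and the values mirror this summary, not the legal text itself.

```python
# Maximum angular exposure of the grinding wheel, in degrees,
# as summarised in the 1910.215 list above (not the regulation text itself).
MAX_EXPOSURE_DEG = {
    "bench": 90,
    "floor": 90,
    "cylindrical": 180,
    "surface": 150,
    "cutting": 150,
}


def exposure_ok(machine_type: str, exposure_deg: float) -> bool:
    """True if the measured guard opening stays within the summarised limit."""
    try:
        limit = MAX_EXPOSURE_DEG[machine_type]
    except KeyError:
        raise ValueError(f"unknown machine type: {machine_type!r}") from None
    return exposure_deg <= limit
```

Unknown machine types raise an error instead of silently passing, so a missing entry cannot be mistaken for compliance.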
|
||||
|
||||
## 1910.216 - Mills and Calenders (Rubber/Plastics)
- Top rolls min 50" above operator level
- Safety trip controls: pressure-sensitive body bars, triprods, tripwire
- Stopping limits: mills ≤1.5% peripheral speed, calenders ≤1.75%
- Manual reset required (no automatic)

## 1910.217 - Mechanical Power Presses
- Brakes: self-engaging
- Clutches: single-stroke with compression springs
- Two-hand controls: concurrent use, antirepeat
- Point of operation: Table O-10 max permissible openings
- Guard types: die enclosure, fixed barrier, interlocked, adjustable, presence sensing, pull-out, two-hand
- PSDI mode: light curtain, annual certification, min safety distance
- Injury reporting to OSHA within 30 days

## 1910.218 - Forging Machines
- Periodic inspection with documented certification
- Ram blocking during die changes (Table O-11)
- Tongs sufficient length to prevent kickback contact
- Scale guards at hammer/press backs
- Safety cylinder heads, quick-closing emergency valves
- Power lockout requirements

## 1910.219 - Mechanical Power-Transmission Apparatus
- Belts: guard if ≤7 feet from floor, 15" above belt minimum
- Overhead belts: full enclosure if >1800 ft/min or >8" wide
- Pulleys: guard if ≤7 feet, no cracked/broken pulleys
- Shafts: stationary casing ≤7 feet, projecting ends smooth with caps
- Gears: complete enclosure or 7-foot guard extending 6" above mesh
- Sprockets/chains: enclosure unless >7 feet
- Inspection: max 60-day intervals
|
||||
@@ -0,0 +1,79 @@
|
||||
#!/usr/bin/env python3
"""Crawl OSHA Technical Manual: all chapters as HTML."""

import json
import logging
import time
from pathlib import Path
from urllib.parse import urlparse

from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("osha-crawl")

OUTPUT_DIR = Path(__file__).parent / "otm_chapters"
BASE = "https://www.osha.gov"


def main():
    OUTPUT_DIR.mkdir(exist_ok=True)
    registry = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Step 1: Get all chapter URLs
        page.goto(f"{BASE}/otm", timeout=30000)
        time.sleep(5)

        links = page.query_selector_all('a[href*="/otm/"]')
        chapters = []
        seen = set()
        for link in links:
            href = link.get_attribute("href") or ""
            text = (link.inner_text() or "").strip()
            if href and "chapter" in href and href not in seen and text:
                seen.add(href)
                chapters.append({"url": href, "title": text})

        logger.info("Found %d chapters", len(chapters))

        # Step 2: Download each chapter
        for i, ch in enumerate(chapters):
            url = ch["url"] if ch["url"].startswith("http") else BASE + ch["url"]
            # Build the slug from the URL path only, so absolute hrefs do not
            # leak the scheme and host into the file name.
            slug = urlparse(url).path.replace("/otm/", "").replace("/", "_")
            outfile = OUTPUT_DIR / f"{slug}.html"

            logger.info("[%d/%d] %s", i + 1, len(chapters), ch["title"][:60])

            if outfile.exists():
                logger.info(" Already exists, skipping")
                ch["local_path"] = str(outfile)
                registry.append(ch)
                continue

            try:
                page.goto(url, timeout=30000)
                time.sleep(3)
                content = page.content()
                outfile.write_text(content, encoding="utf-8")
                ch["local_path"] = str(outfile)
                logger.info(" Saved: %s (%.1f KB)", outfile.name, len(content) / 1024)
            except Exception as e:
                logger.error(" Failed: %s", e)
                ch["local_path"] = None

            registry.append(ch)
            time.sleep(1)

        browser.close()

    reg_file = Path(__file__).parent / "otm_registry.json"
    reg_file.write_text(json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8")
    ok = sum(1 for r in registry if r.get("local_path"))
    logger.info("Done: %d/%d chapters saved", ok, len(registry))


if __name__ == "__main__":
    main()
|
||||
@@ -0,0 +1,177 @@
|
||||
# TRBS + TRGS + ASR: download URLs

**As of:** 2026-05-09
**Source:** BAuA (Bundesanstalt für Arbeitsschutz und Arbeitsmedizin)
**License:** Public domain (§5 UrhG, official notices)

## Instructions

BAuA uses bot protection. The PDFs must be downloaded **manually in the browser**.
Each URL leads to the BAuA detail page; click the PDF download link there.

Place all downloaded PDFs in this directory:
```
legal-sources/trbs-trgs-asr/
```

File naming convention: `trbs_1111.pdf`, `trgs_400.pdf`, `asr_a1_3.pdf`
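
The naming convention can be applied mechanically. The helper below is a minimal sketch, not part of the repository: `rule_filename` is a hypothetical name, and its handling of part numbers (for example `TRBS 1112 Teil 1` becoming `trbs_1112_teil_1.pdf`) is an assumption, since the convention above only shows three simple cases.

```python
import re


def rule_filename(rule_id: str) -> str:
    """Turn a rule ID such as 'ASR A1.3' into 'asr_a1_3.pdf'."""
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", rule_id.strip().lower()).strip("_")
    if not slug:
        raise ValueError("empty rule ID")
    return f"{slug}.pdf"
```

For example, `rule_filename("TRBS 1111")` gives `trbs_1111.pdf` and `rule_filename("ASR A1.3")` gives `asr_a1_3.pdf`.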
|
||||
|
||||
---
|
||||
|
||||
## TRBS: Technische Regeln für Betriebssicherheit (34 documents)

### 1000 series (general)
1. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1001.html - TRBS 1001: Struktur und Anwendung
2. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1111.html - TRBS 1111: Gefährdungsbeurteilung
3. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1112.html - TRBS 1112: Instandhaltung
4. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1112-Teil-1.html - TRBS 1112 Teil 1: Explosionsgefährdungen bei Instandhaltung
5. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1115.html - TRBS 1115: Sicherheitsrelevante MSR-Einrichtungen
6. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1115-Teil-1.html - TRBS 1115 Teil 1: Cybersicherheit für MSR
7. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1116.html - TRBS 1116: Qualifikation und Unterweisung
8. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1121.html - TRBS 1121: Änderungen an Aufzugsanlagen
9. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1122.html - TRBS 1122: Änderungen an Anlagen (§1 Abs.2 Nr.4)
10. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1123.html - TRBS 1123: Änderungen an Anlagen (§1 Abs.2 Nr.3)
11. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1151.html - TRBS 1151: Mensch-Arbeitsmittel-Schnittstelle, Ergonomie
12. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201.html - TRBS 1201: Prüfungen von Arbeitsmitteln
13. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-1.html - TRBS 1201 Teil 1: Prüfung in Ex-Bereichen
14. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-2.html - TRBS 1201 Teil 2: Prüfung bei Dampf/Druck
15. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-4.html - TRBS 1201 Teil 4: Prüfung von Aufzugsanlagen
16. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-5.html - TRBS 1201 Teil 5: Prüfung Lager-/Tankstellen
17. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1203.html - TRBS 1203: Befähigte Personen
|
||||
|
||||
### 2000 series (hazard-related)
18. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111.html - TRBS 2111: Mechanische Gefährdungen
19. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-1.html - TRBS 2111 Teil 1: Kontrolliert bewegte Teile
20. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-2.html - TRBS 2111 Teil 2: Unkontrolliert bewegte Teile
21. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-3.html - TRBS 2111 Teil 3: Gefährliche Oberflächen
22. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-4.html - TRBS 2111 Teil 4: Mobile Arbeitsmittel
23. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2121.html - TRBS 2121: Absturzgefährdung
24. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2141.html - TRBS 2141: Dampf und Druck
25. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2141-Teil-1.html - TRBS 2141 Teil 1: Versagen drucktragender Wandung
26. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152.html - TRBS 2152: Explosionsfähige Atmosphäre
27. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-1.html - TRBS 2152 Teil 1: Beurteilung Explosionsgefährdung
28. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-2.html - TRBS 2152 Teil 2: Vermeidung Ex-Atmosphäre
29. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-3.html - TRBS 2152 Teil 3: Vermeidung Entzündung
30. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-4.html - TRBS 2152 Teil 4: Konstruktiver Explosionsschutz
31. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2181.html - TRBS 2181: Eingeschlossensein in Personenaufnahmemitteln
32. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2210.html - TRBS 2210: Wechselwirkungen
|
||||
|
||||
### 3000 series (specific)
33. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-3121.html - TRBS 3121: Betrieb von Aufzugsanlagen
34. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-3151.html - TRBS 3151: Brand-/Explosionsschutz Tankstellen
|
||||
|
||||
---
|
||||
|
||||
## TRGS: Technische Regeln für Gefahrstoffe (64 documents)

### 200 series (classification/labelling)
35. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-200.html - TRGS 200: Einstufung und Kennzeichnung
36. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-201.html - TRGS 201: Einstufung und Kennzeichnung bei Tätigkeiten
37. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-220.html - TRGS 220: Sicherheitsdatenblatt
|
||||
|
||||
### 400 series (risk assessment)
38. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-400.html - TRGS 400: Gefährdungsbeurteilung Gefahrstoffe
39. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-401.html - TRGS 401: Hautgefährdung
40. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-402.html - TRGS 402: Inhalative Exposition
41. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-406.html - TRGS 406: Sensibilisierende Stoffe
42. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-407.html - TRGS 407: Tätigkeiten mit Gasen
43. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-410.html - TRGS 410: Expositionsverzeichnis krebserzeugende Stoffe
44. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-420.html - TRGS 420: Verfahrens- und stoffspezifische Kriterien
45. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-430.html - TRGS 430: Isocyanate
46. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-460.html - TRGS 460: Stand der Technik
|
||||
|
||||
### 500 series (protective measures)
47. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-500.html - TRGS 500: Schutzmaßnahmen
48. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-504.html - TRGS 504: Tätigkeiten mit Blei
49. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-505.html - TRGS 505: Oberflächenbehandlung in Räumen
50. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-507.html - TRGS 507: Oberflächenbehandlung in Räumen und Behältern
51. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-509.html - TRGS 509: Lagern von flüssigen/festen Gefahrstoffen in ortsfesten Behältern
52. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-510.html - TRGS 510: Lagerung von Gefahrstoffen in ortsbeweglichen Behältern
53. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-512.html - TRGS 512: Begasungen
54. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-513.html - TRGS 513: Tätigkeiten an Sterilisatoren mit ETO
55. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-519.html - TRGS 519: Asbest
56. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-520.html - TRGS 520: Errichtung und Betrieb von Sammelstellen
57. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-521.html - TRGS 521: Abbruch/Sanierung alte Mineralwolle
58. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-522.html - TRGS 522: Raumdesinfektion mit Formaldehyd
59. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-523.html - TRGS 523: Schädlingsbekämpfung mit sehr giftigen/giftigen Stoffen
60. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-524.html - TRGS 524: Schutzmaßnahmen bei kontaminierten Bereichen
61. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-525.html - TRGS 525: Gefahrstoffe in Einrichtungen der medizinischen Versorgung
62. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-526.html - TRGS 526: Laboratorien
63. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-527.html - TRGS 527: Tätigkeiten mit Nanomaterialien
64. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-528.html - TRGS 528: Schweißtechnische Arbeiten
65. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-529.html - TRGS 529: Tätigkeiten bei Biogasanlagen
66. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-530.html - TRGS 530: Friseurhandwerk
67. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-551.html - TRGS 551: Teer und andere PAK-haltige Stoffe
68. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-552.html - TRGS 552: N-Nitrosamine
69. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-553.html - TRGS 553: Holzstaub
70. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-554.html - TRGS 554: Abgase von Dieselmotoren
71. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-555.html - TRGS 555: Betriebsanweisung und Information
72. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-557.html - TRGS 557: Dioxine
73. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-558.html - TRGS 558: Quarzfeinstaub
74. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-559.html - TRGS 559: Mineralischer Staub
75. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-561.html - TRGS 561: Krebserzeugende Metalle
|
||||
|
||||
### 600 series (substitution)
76. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-600.html - TRGS 600: Substitution
77. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-610.html - TRGS 610: Ersatzstoffe und Ersatzverfahren für chrysotilhaltigen Asbest
78. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-617.html - TRGS 617: Ersatzstoffe für Kühlschmierstoffe
79. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-619.html - TRGS 619: Substitution für chromat-haltige Beschichtungsstoffe
|
||||
|
||||
### 700/800 series (fire and explosion protection)
80. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-720.html - TRGS 720: Gefährliche explosionsfähige Gemische
81. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-721.html - TRGS 721: Beurteilung Explosionsgefährdung
82. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-722.html - TRGS 722: Vermeidung explosionsfähiger Gemische
83. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-723.html - TRGS 723: Gefährliche explosionsfähige Gemische, Vermeidung Entzündung
84. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-724.html - TRGS 724: Gefährliche explosionsfähige Gemische, Konstruktiver Schutz
85. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-725.html - TRGS 725: Gefährliche explosionsfähige Gemische, MSR-Einrichtungen
86. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-726.html - TRGS 726: Sauerstoffgrenzkonzentration
87. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-727.html - TRGS 727: Vermeidung von Zündgefahren (elektrostatisch)
88. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-741.html - TRGS 741: Organische Peroxide
89. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-745.html - TRGS 745: Ortsbewegliche Druckgasbehälter
90. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-746.html - TRGS 746: Ortsfeste Druckanlagen für Gase
91. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-751.html - TRGS 751: Vermeidung von Brand-/Explosionsgefahren Tankstellen
92. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-800.html - TRGS 800: Brandschutzmaßnahmen
|
||||
|
||||
### 900 series (limit values)
93. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-900.html - TRGS 900: Arbeitsplatzgrenzwerte
94. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-903.html - TRGS 903: Biologische Grenzwerte
95. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-905.html - TRGS 905: Verzeichnis krebserzeugender Stoffe
96. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-906.html - TRGS 906: Verzeichnis krebserzeugender Verfahren
97. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-907.html - TRGS 907: Verzeichnis sensibilisierender Stoffe
98. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-910.html - TRGS 910: Risikobezogenes Maßnahmenkonzept krebserzeugende Stoffe
|
||||
|
||||
---
|
||||
|
||||
## ASR: Arbeitsstättenregeln (22 documents)

99. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-V3.html - ASR V3: Gefährdungsbeurteilung
100. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-V3a-2.html - ASR V3a.2: Barrierefreie Gestaltung
101. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-2.html - ASR A1.2: Raumabmessungen und Bewegungsflächen
102. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-3.html - ASR A1.3: Sicherheits-/Gesundheitsschutzkennzeichnung
103. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-5-1-2.html - ASR A1.5/1,2: Fußböden
104. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-6.html - ASR A1.6: Fenster, Oberlichter
105. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-7.html - ASR A1.7: Türen und Tore
106. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-8.html - ASR A1.8: Verkehrswege
107. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A2-1.html - ASR A2.1: Schutz vor Absturz
108. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A2-2.html - ASR A2.2: Maßnahmen gegen Brände
109. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A2-3.html - ASR A2.3: Fluchtwege und Notausgänge
110. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-4.html - ASR A3.4: Beleuchtung und Sichtverbindung
111. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-4-3.html - ASR A3.4/3: Sicherheitsbeleuchtung
112. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-5.html - ASR A3.5: Raumtemperatur
113. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-6.html - ASR A3.6: Lüftung
114. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-7.html - ASR A3.7: Lärm
115. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-1.html - ASR A4.1: Sanitärräume
116. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-2.html - ASR A4.2: Pausen-/Bereitschaftsräume
117. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-3.html - ASR A4.3: Erste-Hilfe-Räume
118. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-4.html - ASR A4.4: Unterkünfte
119. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A5-2.html - ASR A5.2: Baustellen
120. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A6.html - ASR A6: Bildschirmarbeit
|
||||
|
||||
---
|
||||
|
||||
**Total: 120 documents** (34 TRBS + 64 TRGS + 22 ASR)

**Note:** Some URLs may differ slightly (hyphens vs. dots). In the browser, use the BAuA overview pages and download the PDFs individually from there:
- https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS.html
- https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS.html
- https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR.html
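
Because a URL may use hyphens or dots, it can help to generate the plausible spellings for a rule and try them in order. The sketch below is an assumption-laden illustration: `candidate_urls` is a hypothetical helper, and the extension-less variant is included only as a guess that BAuA might serve detail pages without `.html`.

```python
BASE = "https://www.baua.de/DE/Angebote/Regelwerk"


def candidate_urls(category: str, rule_id: str) -> list[str]:
    """List plausible detail-page URLs for one rule, hyphen variant first."""
    cat = category.upper()
    # "ASR A1.3" -> "ASR-A1.3"; whitespace always becomes a hyphen.
    dotted = "-".join(rule_id.split())
    # "ASR-A1.3" -> "ASR-A1-3"; this matches the URLs listed above.
    hyphenated = dotted.replace(".", "-")
    variants = []
    for slug in (hyphenated, dotted):
        for suffix in (".html", ""):
            url = f"{BASE}/{cat}/{slug}{suffix}"
            if url not in variants:  # skip duplicates when the ID has no dot
                variants.append(url)
    return variants
```

For `candidate_urls("asr", "ASR A1.3")` the first entry is the URL listed above for ASR A1.3, followed by the extension-less and dotted variants.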
|
||||
@@ -0,0 +1,256 @@
|
||||
#!/usr/bin/env python3
"""
BAuA Regulatory Crawler: TRBS, TRGS, ASR

Crawls the BAuA website using Playwright (headless browser),
extracts PDF links, downloads all documents.

Usage:
    python3 crawl_baua.py                    # download all
    python3 crawl_baua.py --category trbs    # only TRBS
    python3 crawl_baua.py --dry-run          # list PDFs without downloading
"""

import argparse
import hashlib
import json
import logging
import re
import time
from pathlib import Path
from urllib.parse import urljoin

from playwright.sync_api import sync_playwright

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("baua-crawler")

BASE_URL = "https://www.baua.de"
OUTPUT_DIR = Path(__file__).parent / "pdfs"
REGISTRY_FILE = Path(__file__).parent / "source_registry.json"

CATEGORIES = {
    "trbs": {
        "url": f"{BASE_URL}/DE/Angebote/Regelwerk/TRBS/TRBS.html",
        "name": "Technische Regeln für Betriebssicherheit",
        "source_type": "technical_rule",
        "legal_basis": "BetrSichV",
    },
    "trgs": {
        "url": f"{BASE_URL}/DE/Angebote/Regelwerk/TRGS/TRGS.html",
        "name": "Technische Regeln für Gefahrstoffe",
        "source_type": "technical_rule",
        "legal_basis": "GefStoffV",
    },
    "asr": {
        "url": f"{BASE_URL}/DE/Angebote/Regelwerk/ASR/ASR.html",
        "name": "Arbeitsstättenregeln",
        "source_type": "technical_rule",
        "legal_basis": "ArbStättV",
    },
}
|
||||
|
||||
|
||||
def crawl_index(page, category: str, config: dict) -> list[dict]:
    """Crawl index page and extract detail page links."""
    logger.info("Crawling %s index: %s", category.upper(), config["url"])
    page.goto(config["url"], wait_until="networkidle", timeout=30000)
    time.sleep(3)  # Wait for BunnyShield

    # Extract all links to detail pages
    links = page.query_selector_all("a[href]")
    detail_urls = []
    seen = set()

    for link in links:
        href = link.get_attribute("href") or ""
        text = (link.inner_text() or "").strip()

        # Match pattern: /DE/Angebote/Regelwerk/TRBS/TRBS-1111 (no .html!)
        # ASR uses ASR-A1-3 (not ASR-ASR-A1-3)
        base_pattern = f"/DE/Angebote/Regelwerk/{category.upper()}/"
        is_detail = (base_pattern in href
                     and "#" not in href and "?" not in href
                     and href != base_pattern.rstrip("/")
                     and href.split("/")[-1] != category.upper())
        if is_detail and href not in seen:
            full_url = urljoin(BASE_URL, href)
            seen.add(href)

            # Extract regulation number from URL
            filename = href.split("/")[-1]
            detail_urls.append({
                "detail_url": full_url,
                "title": text[:200] if text else filename,
                "filename": filename,
                "category": category,
            })

    logger.info("Found %d detail pages for %s", len(detail_urls), category.upper())
    return detail_urls
|
||||
|
||||
|
||||
def extract_pdf_url(page, detail: dict) -> dict:
    """Visit detail page and extract PDF download link."""
    try:
        page.goto(detail["detail_url"], wait_until="networkidle", timeout=30000)
        time.sleep(2)

        # Strategy 1: Direct PDF link
        pdf_links = page.query_selector_all('a[href$=".pdf"]')
        for link in pdf_links:
            href = link.get_attribute("href") or ""
            if href:
                detail["pdf_url"] = urljoin(BASE_URL, href)
                return detail

        # Strategy 2: Download button with data attribute
        download_btns = page.query_selector_all("[data-download-url]")
        for btn in download_btns:
            url = btn.get_attribute("data-download-url") or ""
            if url and ".pdf" in url:
                detail["pdf_url"] = urljoin(BASE_URL, url)
                return detail

        # Strategy 3: Links containing "pdf" or "download"
        all_links = page.query_selector_all("a[href]")
        for link in all_links:
            href = link.get_attribute("href") or ""
            text = (link.inner_text() or "").lower()
            if (".pdf" in href or "download" in text) and href:
                detail["pdf_url"] = urljoin(BASE_URL, href)
                return detail

        # Strategy 4: Check for blob/dynamic download
        download_links = page.query_selector_all(
            'a[href*="blob"], a[href*="download"], a[href*="__blob"]'
        )
        for link in download_links:
            href = link.get_attribute("href") or ""
            if href:
                detail["pdf_url"] = urljoin(BASE_URL, href)
                return detail

        logger.warning("No PDF found for %s", detail["filename"])
        detail["pdf_url"] = None
        return detail

    except Exception as e:
        logger.error("Error on %s: %s", detail["detail_url"], e)
        detail["pdf_url"] = None
        return detail
|
||||
|
||||
|
||||
def download_pdf(page, detail: dict, output_dir: Path) -> dict:
    """Download PDF and compute hash."""
    if not detail.get("pdf_url"):
        return detail

    cat = detail["category"]
    safe_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", detail["filename"]).lower()
    pdf_path = output_dir / cat / f"{safe_name}.pdf"
    pdf_path.parent.mkdir(parents=True, exist_ok=True)

    if pdf_path.exists():
        logger.info(" Already exists: %s", pdf_path.name)
        detail["local_path"] = str(pdf_path)
        detail["sha256"] = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
        return detail

    try:
        with page.expect_download(timeout=60000) as download_info:
            page.goto(detail["pdf_url"], timeout=30000)
        download = download_info.value
        download.save_as(str(pdf_path))
    except Exception:
        # Fallback: direct download via response
        try:
            response = page.request.get(detail["pdf_url"])
            if response.ok:
                pdf_path.write_bytes(response.body())
            else:
                logger.error(" Download failed: %s (HTTP %d)",
                             detail["filename"], response.status)
                return detail
        except Exception as e:
            logger.error(" Download failed: %s - %s", detail["filename"], e)
            return detail

    size = pdf_path.stat().st_size
    detail["local_path"] = str(pdf_path)
    detail["sha256"] = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    detail["size_bytes"] = size
    logger.info(" Downloaded: %s (%.1f KB)", pdf_path.name, size / 1024)
    return detail
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--category", choices=["trbs", "trgs", "asr"],
                        help="Only crawl one category")
    parser.add_argument("--dry-run", action="store_true",
                        help="List PDFs without downloading")
    parser.add_argument("--headless", action="store_true", default=True,
                        help="Run without a visible browser window (default)")
    parser.add_argument("--no-headless", action="store_true",
                        help="Show the browser window")
    args = parser.parse_args()

    # Only --no-headless changes behaviour; --headless is already the default.
    headless = not args.no_headless
    categories = [args.category] if args.category else list(CATEGORIES.keys())

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    registry = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()

        for cat in categories:
            config = CATEGORIES[cat]
            logger.info("\n=== %s ===", cat.upper())

            # Step 1: Crawl index
            details = crawl_index(page, cat, config)

            # Step 2: Extract PDF URLs
            for i, detail in enumerate(details):
                logger.info("[%d/%d] %s", i + 1, len(details), detail["filename"])
                extract_pdf_url(page, detail)
                time.sleep(1)  # Be polite

            # Step 3: Download PDFs
            if not args.dry_run:
                for detail in details:
                    download_pdf(page, detail, OUTPUT_DIR)
                    time.sleep(0.5)

            # Add metadata
            for detail in details:
                detail["source_type"] = config["source_type"]
                detail["legal_basis"] = config["legal_basis"]
                detail["license_rule"] = 1  # §5 UrhG, public domain
                detail["jurisdiction"] = "DE"

            registry.extend(details)

        browser.close()

    # Save registry (UTF-8, because ensure_ascii=False keeps umlauts as-is)
    REGISTRY_FILE.write_text(
        json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    logger.info("\nRegistry saved: %s (%d entries)", REGISTRY_FILE, len(registry))

    # Summary
    total = len(registry)
    with_pdf = sum(1 for r in registry if r.get("pdf_url"))
    downloaded = sum(1 for r in registry if r.get("local_path"))
    logger.info("Total: %d | PDF found: %d | Downloaded: %d", total, with_pdf, downloaded)


if __name__ == "__main__":
    main()
|
||||
@@ -0,0 +1,119 @@
|
||||
#!/usr/bin/env python3
"""
Ingest downloaded TRBS/TRGS/ASR PDFs into Qdrant via RAG Service.

Reads the source_registry.json and uploads each PDF to the RAG service.

Usage:
    python3 ingest_to_qdrant.py                    # ingest all
    python3 ingest_to_qdrant.py --category trbs    # only TRBS
    python3 ingest_to_qdrant.py --dry-run          # list without uploading
"""

import argparse
import json
import logging
import time
from pathlib import Path

import httpx

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("ingest-trbs")

REGISTRY_FILE = Path(__file__).parent / "source_registry.json"
RAG_URL = "https://macmini:8097/api/v1/documents/upload"
COLLECTION = "bp_compliance_ce"  # Same collection as other CE documents
|
||||
|
||||
|
||||
def ingest_pdf(entry: dict) -> dict:
    """Upload a single PDF to the RAG service."""
    local_path = entry.get("local_path", "")
    if not local_path or not Path(local_path).exists():
        return {"status": "skipped", "reason": "no local file"}

    pdf_path = Path(local_path)
    category = entry.get("category", "unknown")
    filename = entry.get("filename", pdf_path.name)
    title = entry.get("title", filename)

    metadata = {
        "source": title,
        "regulation_id": f"{category}_{filename}".lower().replace("-", "_"),
        "jurisdiction": "DE",
        "source_type": "technical_rule",
        "license_rule": 1,
        "category": category,
        "legal_basis": entry.get("legal_basis", ""),
    }

    try:
        with open(pdf_path, "rb") as f:
            files = {"file": (pdf_path.name, f, "application/pdf")}
            data = {
                "collection": COLLECTION,
                "data_type": "legal",
                "use_case": "compliance",
                "year": "2026",
                "chunk_size": "512",
                "chunk_overlap": "50",
                "metadata_json": json.dumps(metadata),
            }
            # NOTE: verify=False turns off TLS certificate verification for this
            # request. Only acceptable for a trusted internal host.
            resp = httpx.post(RAG_URL, files=files, data=data, timeout=300.0, verify=False)
            resp.raise_for_status()
            result = resp.json()
            return {
                "status": "ok",
                "document_id": result.get("document_id", ""),
                "chunks": result.get("chunks_count", 0),
            }
    except Exception as e:
        return {"status": "error", "reason": str(e)}
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--category", choices=["trbs", "trgs", "asr"])
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    # The registry is written with ensure_ascii=False, so read it as UTF-8 explicitly
    registry = json.loads(REGISTRY_FILE.read_text(encoding="utf-8"))
    if args.category:
        registry = [e for e in registry if e.get("category") == args.category]

    logger.info("Ingesting %d documents into Qdrant (%s)", len(registry), COLLECTION)

    total_ok = 0
    total_chunks = 0
    total_err = 0

    for i, entry in enumerate(registry):
        if not entry.get("local_path"):
            continue

        if args.dry_run:
            logger.info("[%d/%d] %s - %s (dry-run)",
                        i + 1, len(registry), entry["filename"], entry.get("title", "")[:60])
            continue

        logger.info("[%d/%d] %s", i + 1, len(registry), entry["filename"])
        result = ingest_pdf(entry)

        if result["status"] == "ok":
            total_ok += 1
            total_chunks += result["chunks"]
            logger.info("  → %d chunks indexed", result["chunks"])
        else:
            # "skipped" results (file missing on disk) are counted as errors too
            total_err += 1
            logger.error("  → %s: %s", result["status"], result.get("reason", ""))

        time.sleep(1)  # Be gentle

    logger.info("\nDone: %d OK (%d chunks), %d errors, %d total",
                total_ok, total_chunks, total_err, len(registry))


if __name__ == "__main__":
    main()
|
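The `regulation_id` built in `ingest_pdf()` is easiest to judge with a concrete value. The sketch below mirrors that expression and, as a purely hypothetical variant that is not part of the script, shows what dropping the file extension first would produce. The filename `TRBS-1111.pdf` is an assumed example, not an entry taken from the registry.

```python
from pathlib import Path


def regulation_id(category: str, filename: str) -> str:
    """Same expression as in ingest_pdf(): lower-case, hyphens to underscores."""
    return f"{category}_{filename}".lower().replace("-", "_")


def regulation_id_without_suffix(category: str, filename: str) -> str:
    """Hypothetical variant: strip the file extension before building the ID."""
    stem = Path(filename).stem
    return f"{category}_{stem}".lower().replace("-", "_")


if __name__ == "__main__":
    # Assumed example filename; real registry entries may look different.
    print(regulation_id("trbs", "TRBS-1111.pdf"))
    print(regulation_id_without_suffix("trbs", "TRBS-1111.pdf"))
```

With the current expression the extension stays part of the ID, and the category appears twice whenever the filename already starts with it. Whether that matters depends on how downstream consumers match `regulation_id`.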
||||
File diff suppressed because it is too large
@@ -0,0 +1,114 @@
|
||||
# Rulings to download, prioritized by scannability

## Priority 1: Website-scannable (11 entries)

### 1. LG Muenchen I: Google Fonts
- Case no.: 3 O 17493/20 (20.01.2022)
- URL: https://www.gesetze-bayern.de/Content/Document/Y-300-Z-GRURRS-B-2022-N-612
- Scanner: fonts.googleapis.com, fonts.gstatic.com in the HTML

### 2. DSB Oesterreich: Google Analytics
- Case no.: D155.027 (22.12.2021)
- URL: https://noyb.eu/de/oesterreichische-dsb-eu-us-datenuebermittlung-google-analytics-illegal
- Original decision: https://noyb.eu/sites/default/files/2022-01/E-Bescheid%20%20redacted.pdf
- Scanner: google-analytics.com, gtag/js, analytics.js

### 3. CNIL: Cookie banner, EUR 150 million
- Sanction decision against Google (31.12.2021)
- URL: https://www.cnil.fr/en/cookies-cnil-fines-google-150-million-euros
- Scanner: cookie banner DOM (parity between the reject and accept buttons)

### 4. BGH: Planet49 / opt-in
- Case no.: I ZR 7/16 (28.05.2020, following CJEU C-673/17)
- URL: https://juris.bundesgerichtshof.de/cgi-bin/rechtsprechung/document.py?Gericht=bgh&Art=en&nr=107124
- Scanner: cookies set before consent, pre-ticked checkboxes

### 5. CJEU: Schrems II
- Case no.: C-311/18 (16.07.2020)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-311/18
- Scanner: HTTP requests to US servers (IP geolocation)

### 6. OLG Koeln: Dark patterns in cookie banners
- Case no.: 6 U 58/21 (19.11.2021)
- Scanner: button size, color, and hierarchy in the consent banner

### 7. CJEU: Button solution (Amazon)
- Case no.: C-649/17 (07.04.2022)
- Scanner: order button text ("zahlungspflichtig bestellen"?)

### 8. BGH: Legal notice (Impressum) on social media
- Case no.: I ZR 169/22 (09.09.2021)
- Scanner: complete Impressum reachable within 2 clicks

### 9. BGH: Unit price (Grundpreis), PAngV
- Case no.: I ZR 46/20 (20.01.2022)
- Scanner: unit price shown next to the final price for products sold by quantity

### 10. LG Berlin: Completeness of the privacy policy
- Case no.: 16 O 341/15
- Scanner: mandatory information under Art. 13/14 GDPR in the privacy policy

### 11. DSK: Telemedia guidance (Orientierungshilfe)
- Already in the RAG as: dsk_oh_telemedien (589 chunks)
- NO download needed ✅

## Priority 2: Document/process checks (9 entries)

### 12. CJEU: SCHUFA scoring / Art. 22
- Case no.: C-634/21 (07.12.2023)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-634/21

### 13. BAG: Working time recording
- Case no.: 1 ABR 22/21 (13.09.2022)
- Already in the RAG as: bag_1_abr_22_21 (237 chunks)
- NO download needed ✅

### 14. CJEU: Damages after a data leak (fear of misuse can suffice)
- Case no.: C-340/21 (14.12.2023)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-340/21

### 15. CJEU: Meta / legitimate interest
- Case no.: C-252/21 (04.07.2023)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-252/21

### 16. LAG Hamm: Microsoft 365 co-determination
- Case no.: 11 Sa 1108/22 (20.06.2023)

### 17. OLG Muenchen: Withdrawal instructions (Widerrufsbelehrung)
- Case no.: 29 U 2698/19

### 18. BVerfG: Right to be forgotten
- Case no.: 1 BvR 1547/19 (06.11.2019)

### 19. 1&1 fine (BfDI)
- EUR 9.55 million (09.12.2019)
- Insufficient authentication in customer service

### 20. BFSG/EAA
- Already in the RAG as: bfsg (219 chunks)
- NO download needed ✅

## Already in the RAG (no download):
- dsk_oh_telemedien (589 chunks) ✅
- bag_1_abr_22_21, working time recording (237 chunks) ✅
- bfsg (219 chunks) ✅
- 13 further BAG rulings ✅

## Download status:
- [ ] 1. Google Fonts
- [ ] 2. Google Analytics (DSB AT)
- [ ] 3. CNIL cookie banner
- [ ] 4. BGH Planet49
- [ ] 5. CJEU Schrems II
- [ ] 6. OLG Koeln dark patterns
- [ ] 7. CJEU button solution
- [ ] 8. BGH Impressum
- [ ] 9. BGH unit price
- [ ] 10. LG Berlin privacy policy
- [ ] 12. CJEU SCHUFA
- [ ] 14. CJEU damages after a data leak
- [ ] 15. CJEU Meta
- [ ] 16. LAG Hamm M365
- [ ] 17. OLG Muenchen withdrawal
- [ ] 18. BVerfG right to be forgotten
- [ ] 19. 1&1 fine
|
||||
@@ -0,0 +1,28 @@
|
||||
JUDGMENT OF THE COURT (Third Chamber)
4 May 2023
Case C-300/21: UI v Oesterreichische Post AG

OPERATIVE PART:

1. Article 82(1) of Regulation (EU) 2016/679 (GDPR) must be interpreted as meaning that the mere infringement of the provisions of that regulation is not sufficient to confer a right to compensation.

2. Article 82(1) of the GDPR must be interpreted as precluding a national rule or practice which makes compensation for non-material damage subject to the condition that the damage suffered by the data subject has reached a certain degree of seriousness.

3. Article 82 of the GDPR must be interpreted as meaning that, when determining the amount of damages, national courts must apply their domestic rules, provided that the EU-law principles of equivalence and effectiveness are complied with.

KEY POINTS:
- A GDPR infringement alone does NOT give rise to a claim for damages; actual damage is required
- However: NO seriousness threshold for non-material damage (any demonstrable damage suffices)
- 3 cumulative conditions under Art. 82: infringement + damage + causal link
- "Damage" is to be interpreted broadly (recital 146 GDPR)
- No punitive damages; compensation has a purely compensatory function (full and effective compensation)
- National courts apply national law to determine the amount (procedural autonomy)
- The principles of equivalence and effectiveness must be observed
- Unpleasant feelings can constitute non-material damage (no de minimis threshold)

RELEVANT PROVISIONS:
- Art. 82 GDPR (liability and right to compensation)
- Art. 83 GDPR (administrative fines; complement compensation but are independent of it)
- Art. 84 GDPR (penalties)
- Recital 146 GDPR (broad interpretation of the concept of damage)
- Recitals 75, 85 GDPR (possible types of damage)
|
||||
@@ -0,0 +1,44 @@
|
||||
JUDGMENT OF THE COURT (Grand Chamber)
16 July 2020
Case C-311/18: Data Protection Commissioner v Facebook Ireland Ltd, Maximillian Schrems

OPERATIVE PART:

1. Article 2(1) and (2) of Regulation (EU) 2016/679 (GDPR) must be interpreted as meaning that a transfer of personal data for commercial purposes by an economic operator established in a Member State to another economic operator established in a third country falls within the scope of that regulation, irrespective of whether, at the time of the transfer or thereafter, the data may be processed by the authorities of that third country for the purposes of public security, defence and State security.

2. Article 46(1) and Article 46(2)(c) of the GDPR must be interpreted as meaning that the appropriate safeguards, enforceable rights and effective legal remedies required by those provisions must ensure that the rights of persons whose personal data are transferred to a third country on the basis of standard data protection clauses are afforded a level of protection essentially equivalent to that guaranteed within the EU by the GDPR, read in the light of the Charter.

3. Article 58(2)(f) and (j) of the GDPR must be interpreted as meaning that the competent supervisory authority is required to suspend or prohibit a transfer of personal data to a third country based on standard data protection clauses if those clauses are not or cannot be complied with in that third country and the protection required by EU law cannot be ensured by other means.

4. Examination of Decision 2010/87/EU (standard contractual clauses) in the light of Articles 7, 8 and 47 of the Charter has disclosed nothing to affect its validity.

5. Implementing Decision (EU) 2016/1250 (EU-US Privacy Shield) is INVALID.

KEY POINTS:
- The Privacy Shield (EU-US) is invalid
- US surveillance programmes (PRISM, UPSTREAM under Section 702 FISA + E.O. 12333) infringe EU fundamental rights
- Neither Section 702 FISA nor E.O. 12333 satisfies the principle of proportionality
- PPD-28 does not grant affected EU citizens any enforceable rights
- The Privacy Shield Ombudsperson is NOT an independent tribunal within the meaning of Art. 47 of the Charter
- Standard contractual clauses (SCCs) remain valid, BUT:
  - The controller must verify BEFORE the transfer whether the third country offers adequate protection
  - Supplementary measures must be taken where necessary
  - If adequate protection is not possible: suspend or prohibit the transfer
- Supervisory authorities are OBLIGED to prohibit transfers if protection is not ensured
- The GDPR applies even if third-country authorities could use the data for national security purposes

RELEVANT PROVISIONS:
- Art. 44-49 GDPR (transfers to third countries)
- Art. 45 GDPR (adequacy decision)
- Art. 46 GDPR (appropriate safeguards / standard contractual clauses)
- Art. 58(2) GDPR (powers of the supervisory authorities)
- Art. 7, 8, 47 of the EU Charter of Fundamental Rights
- Art. 52(1) of the EU Charter of Fundamental Rights (proportionality)
- Section 702 FISA (US foreign intelligence surveillance)
- Executive Order 12333 (US intelligence services)
- PPD-28 (Presidential Policy Directive)

IMPLICATIONS:
- Every data transfer to the USA must be assessed individually (Transfer Impact Assessment)
- Supplementary technical measures (e.g. encryption) are required
- Successor: EU-US Data Privacy Framework (2023)
|
||||
@@ -0,0 +1,30 @@
|
||||
JUDGMENT OF THE COURT (First Chamber)

7 December 2023

Reference for a preliminary ruling – Protection of natural persons with regard to the processing of personal data – Regulation (EU) 2016/679 – Article 22 – Automated individual decision-making – Credit information agencies – Automated establishment of a probability value concerning a person's ability to meet future payment commitments (scoring) – Use of that probability value by third parties

In Case C-634/21: OQ v Land Hessen, intervener: SCHUFA Holding AG

OPERATIVE PART:

Article 22(1) of Regulation (EU) 2016/679 (GDPR) must be interpreted as meaning that the automated establishment, by a credit information agency, of a probability value based on personal data relating to a person and concerning that person's ability to meet future payment commitments constitutes automated individual decision-making within the meaning of that provision, where a third party to which that probability value is transmitted draws strongly on it to establish, implement or terminate a contractual relationship with that person.

KEY POINTS:
- SCHUFA scoring is automated individual decision-making under Art. 22 GDPR
- The score value itself is already the "decision" (not only the third party's subsequent action)
- Art. 22 GDPR lays down a general PROHIBITION of automated decisions
- Exceptions only under Art. 22(2) GDPR (contract, legal provision, consent)
- Data subjects have a right of access to information about the logic involved (Art. 15(1)(h))
- National rules (such as § 31 BDSG) must comply with Art. 5, 6 and 22 GDPR
- A narrow interpretation would create a gap in legal protection (three-party problem)
- Suitable safeguards: right to human intervention, to express one's point of view, and to contest the decision

RELEVANT PROVISIONS:
- Art. 22 GDPR (automated individual decision-making)
- Art. 4(4) GDPR (definition of profiling)
- Art. 15(1)(h) GDPR (right of access in the case of automated decisions)
- Art. 13(2)(f) GDPR (information obligation)
- Art. 5 GDPR (principles of processing)
- Art. 6 GDPR (lawfulness)
- § 31 BDSG (scoring; compatibility with EU law doubtful)
|
||||
@@ -0,0 +1,19 @@
|
||||
JUDGMENT OF THE COURT (Grand Chamber)

1 October 2019

Reference for a preliminary ruling – Directive 95/46/EC – Directive 2002/58/EC – Regulation (EU) 2016/679 – Processing of personal data and protection of privacy in the electronic communications sector – Cookies – Concept of consent of the data subject – Declaration of consent by means of a pre-ticked checkbox

In Case C-673/17

Bundesverband der Verbraucherzentralen und Verbraucherverbände – Verbraucherzentrale Bundesverband e. V. v Planet49 GmbH

OPERATIVE PART:

1. Article 2(f) and Article 5(3) of Directive 2002/58/EC, read in conjunction with Article 2(h) of Directive 95/46/EC and with Article 4(11) and Article 6(1)(a) of Regulation 2016/679, must be interpreted as meaning that consent within the meaning of those provisions is not validly given if the storage of information, or access to information already stored in the terminal equipment of a website user, by means of cookies is permitted by way of a pre-ticked checkbox which the user must deselect to refuse consent.

2. Article 2(f) and Article 5(3) of Directive 2002/58 are not to be interpreted differently according to whether or not the information stored or accessed on the terminal equipment of a website user is personal data.

3. Article 5(3) of Directive 2002/58 must be interpreted as meaning that the information which the service provider must give to a website user includes the duration of the operation of the cookies and whether or not third parties may have access to those cookies.

Delivered in open court in Luxembourg on 1 October 2019.
|
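Point 1 of the operative part translates fairly directly into a static scanner check: a consent checkbox that is already ticked in the delivered HTML does not count as consent. The following is a minimal sketch using only the Python standard library. The function and class names are assumptions made for illustration, and the check only sees markup as delivered, not state set later by JavaScript.

```python
from html.parser import HTMLParser


class PretickedCheckboxFinder(HTMLParser):
    """Collects checkbox inputs that carry the `checked` attribute in the markup."""

    def __init__(self):
        super().__init__()
        self.preticked = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_checkbox = (attr.get("type") or "").lower() == "checkbox"
        if is_checkbox and "checked" in attr:
            # Prefer a human-readable identifier for the report
            self.preticked.append(attr.get("name") or attr.get("id") or "(unnamed)")


def find_preticked_checkboxes(html: str) -> list[str]:
    """Return the names/ids of checkboxes that are ticked by default."""
    finder = PretickedCheckboxFinder()
    finder.feed(html)
    return finder.preticked


if __name__ == "__main__":
    sample = (
        '<form>'
        '<input type="checkbox" name="cookies" checked>'
        '<input type="checkbox" name="newsletter">'
        '</form>'
    )
    print(find_preticked_checkboxes(sample))
```

A real scanner would additionally have to decide which checkboxes relate to consent at all, which this sketch does not attempt.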
||||
@@ -0,0 +1,29 @@
|
||||
# LG München I: Google Fonts ruling
# Case no.: 3 O 17493/20 (20.01.2022)
# Source: gesetze-bayern.de

## Operative part (decision)

1. The defendant is ordered to refrain from disclosing the claimant's dynamic IP address to Google when the claimant visits the defendant's website without the claimant having consented to that disclosure. Threatened penalty: administrative fine of up to EUR 250,000 or coercive detention of up to 6 months.

2. The defendant is ordered to inform the claimant which personal data about him are being processed.

3. The defendant is ordered to pay EUR 100 in damages for pain and suffering (Schmerzensgeld) plus interest.

## Core reasoning

**GDPR infringement through IP transmission:** The court found that the automatic transmission of dynamic IP addresses to Google when Google Fonts are loaded violates the right to informational self-determination (§ 823 BGB) and Art. 6(1) GDPR.

**IP addresses = personal data:** Dynamic IP addresses are personal data because the website operator has the abstract legal means (authorities, provider) to have the person identified.

**No legitimate interest:** The defendant cannot rely on a legitimate interest, because "Google Fonts can also be used without a connection to Google servers being established when the website is accessed and without the IP address of the website visitors being transmitted." (local hosting is possible)

## Compliance requirement

Website operators must host Google Fonts locally or use alternatives that do not cause automatic IP transmission to external servers without explicit consent.

## Scanner checkpoints
- Check HTML for: fonts.googleapis.com, fonts.gstatic.com
- Check CSS for: @import url('https://fonts.googleapis.com/...')
- Check JS for: WebFont.load with the google provider
- If found: FAIL, external Google Fonts embedding without consent
|
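The scanner checkpoints above can be sketched as a simple string and pattern check over fetched page sources. This is an illustrative sketch, not the project's scanner: the function name, the result shape, and the regular expression for `WebFont.load` are assumptions, and it treats every hit as a failure without looking at whether a consent mechanism gates the request.

```python
import re

# Hostnames named in the checkpoints; a substring match covers HTML links and CSS @import
GOOGLE_FONTS_HOSTS = ("fonts.googleapis.com", "fonts.gstatic.com")

# Assumed pattern for a Web Font Loader call that configures the google provider
WEBFONT_LOADER = re.compile(r"WebFont\.load\s*\(\s*\{[^}]*\bgoogle\s*:")


def check_google_fonts(source: str) -> dict:
    """Scan one HTML, CSS, or JS source string for external Google Fonts usage."""
    hits = [host for host in GOOGLE_FONTS_HOSTS if host in source]
    if WEBFONT_LOADER.search(source):
        hits.append("WebFont.load(google)")
    return {"status": "FAIL" if hits else "PASS", "hits": hits}


if __name__ == "__main__":
    print(check_google_fonts('<link href="https://fonts.googleapis.com/css?family=Roboto">'))
    print(check_google_fonts('<link href="/assets/fonts/roboto.css">'))
```

Scanning sources statically keeps the check cheap, but fonts injected at runtime would only show up in a network-level capture.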
||||
@@ -3,9 +3,11 @@ from fastapi import APIRouter
|
||||
from api.collections import router as collections_router
|
||||
from api.documents import router as documents_router
|
||||
from api.search import router as search_router
|
||||
from api.tenant_documents import router as tenant_documents_router
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
router.include_router(collections_router, tags=["Collections"])
|
||||
router.include_router(documents_router, tags=["Documents"])
|
||||
router.include_router(tenant_documents_router, tags=["Tenant Documents"])
|
||||
router.include_router(search_router, tags=["Search"])
|
||||
|
||||
@@ -0,0 +1,289 @@
|
||||
"""
|
||||
Tenant-isolated document upload, listing, and deletion.

Each tenant gets their own Qdrant collection (bp_docs_tenant_{short_id}).
Documents are stored in MinIO under tenant-specific paths.
No data crosses tenant boundaries.

Endpoints:
    POST   /api/v1/tenant/documents                  - Upload + process PDF
    GET    /api/v1/tenant/documents                  - List tenant's documents
    DELETE /api/v1/tenant/documents/{doc_id}         - Delete document + vectors
    GET    /api/v1/tenant/documents/{doc_id}/status  - Processing status
"""

import json
import logging
import uuid
from typing import Optional

from fastapi import APIRouter, File, Form, HTTPException, Header, Request, UploadFile
from pydantic import BaseModel

from api.auth import optional_jwt_auth
from embedding_client import embedding_client
from html_utils import decode_html_bytes, looks_like_html, strip_html
from minio_client_wrapper import minio_wrapper
from qdrant_client_wrapper import qdrant_wrapper

logger = logging.getLogger("rag-service.api.tenant-documents")

router = APIRouter(prefix="/api/v1/tenant/documents")

VECTOR_DIM = 1024  # bge-m3 dimension
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB
ALLOWED_TYPES = {"application/pdf", "text/html", "text/plain"}
PDF_MAGIC = b"%PDF"


def _collection_name(tenant_id: str) -> str:
    """Derive tenant-specific Qdrant collection name."""
    short = tenant_id.replace("-", "")[:12]
    return f"bp_docs_tenant_{short}"


def _storage_path(tenant_id: str, document_id: str, filename: str) -> str:
    """Derive tenant-isolated storage path."""
    short = tenant_id.replace("-", "")[:12]
    return f"tenant_docs/{short}/{document_id}/{filename}"


def _extract_tenant_id(
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
) -> str:
    """Extract tenant ID from header. Required for all tenant endpoints."""
    tid = x_tenant_id or request.headers.get("x-tenant-id", "")
    if not tid:
        raise HTTPException(status_code=400, detail="X-Tenant-ID header required")
    return tid


# ── Response models ────────────────────────────────────────────────

class DocumentResponse(BaseModel):
    id: str
    filename: str
    file_size: int
    status: str
    chunk_count: int
    collection: str
    created_at: Optional[str] = None


class DocumentListResponse(BaseModel):
    documents: list[DocumentResponse]
    total: int
|
||||
|
||||
|
||||
# ── Endpoints ──────────────────────────────────────────────────────

@router.post("", response_model=DocumentResponse)
async def upload_tenant_document(
    request: Request,
    file: UploadFile = File(...),
    x_tenant_id: Optional[str] = Header(None),
    chunk_size: int = Form(default=512),
    chunk_overlap: int = Form(default=50),
    metadata_json: Optional[str] = Form(default=None),
):
    """Upload a document, process it, and index in tenant-specific collection."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)

    # Read + validate
    file_bytes = await file.read()
    if len(file_bytes) == 0:
        raise HTTPException(status_code=400, detail="Empty file")
    if len(file_bytes) > MAX_FILE_SIZE:
        raise HTTPException(status_code=413, detail=f"File too large (max {MAX_FILE_SIZE // 1024 // 1024} MB)")

    filename = file.filename or "document.pdf"
    content_type = file.content_type or "application/octet-stream"

    # PDF magic bytes check
    if filename.lower().endswith(".pdf") and not file_bytes[:4].startswith(PDF_MAGIC):
        raise HTTPException(status_code=400, detail="File claims to be PDF but magic bytes don't match")

    document_id = str(uuid.uuid4())
    collection = _collection_name(tenant_id)
    object_name = _storage_path(tenant_id, document_id, filename)

    # Ensure collection exists
    await qdrant_wrapper.create_collection(collection, VECTOR_DIM)

    # Store in MinIO
    try:
        await minio_wrapper.upload_document(
            object_name=object_name,
            data=file_bytes,
            content_type=content_type,
            metadata={"document_id": document_id, "tenant_id": tenant_id},
        )
    except Exception as exc:
        logger.error("MinIO upload failed for tenant %s: %s", tenant_id, exc)
        raise HTTPException(status_code=500, detail="Storage failed")

    # Extract text
    try:
        text = await _extract_text(file_bytes, filename, content_type)
    except Exception as exc:
        logger.error("Text extraction failed: %s", exc)
        raise HTTPException(status_code=500, detail=f"Text extraction failed: {exc}")

    if not text or not text.strip():
        raise HTTPException(status_code=400, detail="No text could be extracted")

    # Chunk
    chunk_result = await embedding_client.chunk_text(
        text=text, strategy="recursive",
        chunk_size=chunk_size, overlap=chunk_overlap,
    )
    chunks = chunk_result.chunks
    chunks_meta = chunk_result.chunks_with_metadata

    if not chunks:
        raise HTTPException(status_code=400, detail="Chunking produced zero chunks")

    # Embed
    embeddings = await embedding_client.generate_embeddings(chunks)

    # Parse extra metadata; invalid JSON or a non-object value is ignored
    extra_metadata = {}
    if metadata_json:
        try:
            parsed = json.loads(metadata_json)
        except json.JSONDecodeError:
            logger.warning("Ignoring invalid metadata_json for tenant %s", tenant_id[:8])
        else:
            if isinstance(parsed, dict):
                extra_metadata = parsed

    # Build payloads with tenant isolation. Caller-supplied metadata is spread
    # first so it cannot overwrite document_id, tenant_id, or the other fixed keys.
    _STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
    payloads = []
    for i, chunk in enumerate(chunks):
        payload = {
            **extra_metadata,
            "document_id": document_id,
            "tenant_id": tenant_id,
            "filename": filename,
            "chunk_index": i,
            "chunk_text": chunk,
        }
        if i < len(chunks_meta):
            for field in _STRUCT_FIELDS:
                value = chunks_meta[i].get(field)
                if value is not None and value != "":
                    payload[field] = value
        payloads.append(payload)

    # Index in tenant collection
    indexed = await qdrant_wrapper.index_documents(
        collection=collection, vectors=embeddings, payloads=payloads,
    )

    logger.info(
        "Tenant %s: uploaded %s (%d chunks, %d vectors) to %s",
        tenant_id[:8], filename, len(chunks), indexed, collection,
    )

    return DocumentResponse(
        id=document_id, filename=filename,
        file_size=len(file_bytes), status="indexed",
        chunk_count=len(chunks), collection=collection,
    )
|
||||
|
||||
|
||||
@router.get("", response_model=DocumentListResponse)
async def list_tenant_documents(
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
):
    """List all documents for this tenant."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)

    collection = _collection_name(tenant_id)

    try:
        # Get unique document_ids from Qdrant
        docs = await qdrant_wrapper.get_unique_documents(collection)
    except Exception:
        # Collection doesn't exist yet → no documents
        docs = []

    return DocumentListResponse(documents=docs, total=len(docs))


@router.delete("/{doc_id}")
async def delete_tenant_document(
    doc_id: str,
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
):
    """Delete a document and all its vectors from tenant collection."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)

    collection = _collection_name(tenant_id)
    errors = []

    # Delete vectors from Qdrant
    try:
        await qdrant_wrapper.delete_by_filter(
            collection=collection,
            filter_conditions={"document_id": doc_id},
        )
    except Exception as exc:
        errors.append(f"Qdrant: {exc}")

    # Delete file from MinIO
    try:
        prefix = f"tenant_docs/{tenant_id.replace('-', '')[:12]}/{doc_id}/"
        await minio_wrapper.delete_by_prefix(prefix)
    except Exception as exc:
        errors.append(f"MinIO: {exc}")

    if errors:
        logger.warning("Partial delete for %s/%s: %s", tenant_id[:8], doc_id[:8], errors)
        return {"deleted": True, "warnings": errors}

    logger.info("Tenant %s: deleted document %s", tenant_id[:8], doc_id[:8])
    return {"deleted": True, "document_id": doc_id}


@router.get("/{doc_id}/status")
async def document_status(
    doc_id: str,
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
):
    """Get processing status for a document."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)

    collection = _collection_name(tenant_id)
    try:
        count = await qdrant_wrapper.count_by_filter(
            collection=collection,
            filter_conditions={"document_id": doc_id},
        )
        status = "indexed" if count > 0 else "not_found"
    except Exception:
        count = 0
        status = "not_found"

    return {"document_id": doc_id, "status": status, "chunk_count": count}


# ── Helpers ────────────────────────────────────────────────────────

async def _extract_text(file_bytes: bytes, filename: str, content_type: str) -> str:
    """Extract text from PDF, HTML, or plain text."""
    if content_type == "application/pdf" or filename.lower().endswith(".pdf"):
        return await embedding_client.extract_pdf(file_bytes)
    if filename.lower().endswith((".html", ".htm")):
        text = decode_html_bytes(file_bytes)
        return strip_html(text)
    text = file_bytes.decode("utf-8", errors="replace")
    if looks_like_html(text):
        return strip_html(text)
    return text
|
||||
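Review note: `_extract_text` tries the file extension first and falls back to sniffing the decoded text for HTML. The sketch below illustrates only the HTML and plain-text branches using the standard library. `strip_html_sketch`, `looks_like_html_sketch`, and `extract_text_sketch` are hypothetical stand-ins; the real `strip_html`, `looks_like_html`, and `decode_html_bytes` helpers are not shown in this diff and may behave differently (for example with respect to encodings or `<script>` content).

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html_sketch(markup: str) -> str:
    """Drop tags and collapse whitespace (assumed behaviour, not the real helper)."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def looks_like_html_sketch(text: str) -> bool:
    """Cheap heuristic: a common HTML tag near the start of the text."""
    return re.search(r"<(html|body|div|p|span)\b", text[:2000], re.IGNORECASE) is not None


def extract_text_sketch(file_bytes: bytes, filename: str) -> str:
    """Extension check first, content sniffing second, plain text last."""
    text = file_bytes.decode("utf-8", errors="replace")
    if filename.lower().endswith((".html", ".htm")) or looks_like_html_sketch(text):
        return strip_html_sketch(text)
    return text
```

The content sniffing matters for uploads such as an HTML export saved with a `.txt` extension, which would otherwise be indexed with its markup intact.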
@@ -122,6 +122,16 @@ class MinioClientWrapper:
            logger.error("Failed to delete '%s': %s", object_name, exc)
            raise

    async def delete_by_prefix(self, prefix: str) -> int:
        """Remove all objects under a prefix."""
        objects = self.client.list_objects(settings.MINIO_BUCKET, prefix=prefix, recursive=True)
        count = 0
        for obj in objects:
            self.client.remove_object(settings.MINIO_BUCKET, obj.object_name)
            count += 1
        logger.info("Deleted %d objects with prefix '%s'", count, prefix)
        return count

    # ------------------------------------------------------------------
    # Presigned URL
    # ------------------------------------------------------------------

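Review note: `delete_by_prefix` is declared `async` but calls `list_objects` and `remove_object` directly. If `self.client` is the standard synchronous MinIO Python client (an assumption, since the constructor is not part of this hunk), those calls block the event loop while the objects are removed. One possible variant offloads the loop to a worker thread with `asyncio.to_thread`. `FakeMinio` and `_delete_sync` below are hypothetical and exist only to keep the sketch runnable without a MinIO server.

```python
import asyncio
from types import SimpleNamespace


class FakeMinio:
    """In-memory stand-in exposing the two client methods used by the wrapper."""

    def __init__(self, names):
        self._names = set(names)

    def list_objects(self, bucket, prefix="", recursive=False):
        matches = sorted(n for n in self._names if n.startswith(prefix))
        return [SimpleNamespace(object_name=n) for n in matches]

    def remove_object(self, bucket, object_name):
        self._names.discard(object_name)


def _delete_sync(client, bucket, prefix):
    """Blocking part: list and remove every object under the prefix."""
    count = 0
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        client.remove_object(bucket, obj.object_name)
        count += 1
    return count


async def delete_by_prefix(client, bucket, prefix):
    """Run the blocking loop in a thread so the event loop stays responsive."""
    return await asyncio.to_thread(_delete_sync, client, bucket, prefix)


client = FakeMinio([
    "tenant_docs/abc/d1/a.pdf",
    "tenant_docs/abc/d1/b.txt",
    "tenant_docs/abc/d2/c.pdf",
])
deleted = asyncio.run(delete_by_prefix(client, "bucket", "tenant_docs/abc/d1/"))
print(deleted)
```

Note the trailing slash on the prefix: without it, `tenant_docs/abc/d1` would also match a sibling such as `tenant_docs/abc/d10/`.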
@@ -235,6 +235,74 @@ class QdrantClientWrapper:
        logger.info("Deleted points from '%s' with filter %s", collection, filter_conditions)
        return True

    # ------------------------------------------------------------------
    # Tenant document helpers
    # ------------------------------------------------------------------

    async def get_unique_documents(self, collection: str) -> list[dict]:
        """Get unique documents from a collection by scrolling and grouping."""
        try:
            self.client.get_collection(collection)
        except Exception:
            return []

        docs: dict[str, dict] = {}
        offset = None
        while True:
            result = self.client.scroll(
                collection_name=collection,
                scroll_filter=None,
                limit=100,
                offset=offset,
                with_payload=True,
                with_vectors=False,
            )
            points, next_offset = result
            for pt in points:
                payload = pt.payload or {}
                doc_id = payload.get("document_id", "")
                if doc_id and doc_id not in docs:
                    docs[doc_id] = {
                        "id": doc_id,
                        "filename": payload.get("filename", ""),
                        "file_size": payload.get("file_size", 0),
                        "status": "indexed",
                        "chunk_count": 0,
                        "collection": collection,
                    }
                if doc_id:
                    docs[doc_id]["chunk_count"] += 1

            if next_offset is None:
                break
            offset = next_offset

        return list(docs.values())

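Review note: `get_unique_documents` pages through the whole collection and groups points by `document_id`, so its cost grows with the number of chunks rather than the number of documents. The loop can be sketched without a Qdrant instance as follows. `fake_scroll` is a hypothetical stand-in that mimics the `(points, next_offset)` return shape of the client's scroll call, and the small page size is chosen only to force several pages.

```python
from types import SimpleNamespace


def fake_scroll(points, limit, offset):
    """Return one page plus the offset of the next page (None when done)."""
    start = offset or 0
    page = points[start:start + limit]
    next_offset = start + limit if start + limit < len(points) else None
    return page, next_offset


def group_by_document(points, limit=2):
    """Scroll page by page and count chunks per document_id."""
    docs = {}
    offset = None
    while True:
        page, next_offset = fake_scroll(points, limit, offset)
        for pt in page:
            payload = pt.payload or {}
            doc_id = payload.get("document_id", "")
            if not doc_id:
                continue  # points without a document_id are ignored
            entry = docs.setdefault(
                doc_id,
                {"id": doc_id, "filename": payload.get("filename", ""), "chunk_count": 0},
            )
            entry["chunk_count"] += 1
        if next_offset is None:
            break
        offset = next_offset
    return list(docs.values())


points = [
    SimpleNamespace(payload={"document_id": "a", "filename": "a.pdf"}),
    SimpleNamespace(payload={"document_id": "a", "filename": "a.pdf"}),
    SimpleNamespace(payload={"document_id": "b", "filename": "b.txt"}),
    SimpleNamespace(payload=None),
    SimpleNamespace(payload={"document_id": "a", "filename": "a.pdf"}),
]
documents = group_by_document(points)
print(documents)
```

As in the original, the metadata of a document is taken from the first chunk encountered, so later chunks with a different `filename` would not change the entry.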
    async def count_by_filter(
        self, collection: str, filter_conditions: dict[str, Any]
    ) -> int:
        """Count points matching filter."""
        try:
            self.client.get_collection(collection)
        except Exception:
            return 0

        must_conditions = []
        for key, value in filter_conditions.items():
            must_conditions.append(
                qmodels.FieldCondition(
                    key=key, match=qmodels.MatchValue(value=value)
                )
            )

        result = self.client.count(
            collection_name=collection,
            count_filter=qmodels.Filter(must=must_conditions),
            exact=True,
        )
        return result.count

    # ------------------------------------------------------------------
    # Info
    # ------------------------------------------------------------------