3 Commits

Author SHA1 Message Date
Benjamin Admin 0bb9726ddd Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
2026-05-10 15:09:51 +02:00
Benjamin Admin 8510af46eb feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs
Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 15:08:15 +02:00
Benjamin Admin 81db904b3e feat(legal-sources): add OSHA machinery safety standards + international norms mapping
OSHA 29 CFR 1910 Subpart O (1910.211-1910.219) — complete machine
guarding requirements. US federal law, public domain.

International norms mapping table: China GB/T, Korea KS, India BIS
equivalents to ISO/EN standards. Unfortunately all countries protect
ISO copyright even for identical national adoptions (IDT).

Only OSHA provides truly free machinery safety content.
EU Excel harmonised standards list included for reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-09 10:50:43 +02:00
31 changed files with 6024 additions and 6 deletions
@@ -0,0 +1,158 @@
# Using the Controls: Guide for Other Sessions
**As of:** 2026-05-07, updated continuously
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)
---
## What are the controls?
174,497 atomic compliance controls are stored in the database. Each control is a **single verifiable requirement** derived from a legal source (GDPR/DSGVO, NIS2, NIST, AI Act, etc.).
### Example
```
Control-ID: AUTH-2956-A14
Title: "Implementierung von Multi-Faktor-Authentifizierung prüfen"
Objective: "Sicherstellen, dass MFA korrekt implementiert ist..."
Merge-Key: "verify:multi_factor_auth:testing"
Severity: high
```
## Where are the controls stored?
### Database (PostgreSQL on the Mac Mini)
```sql
-- Query all controls
SELECT id, control_id, title, objective, severity,
source_citation, -- legal source (JSON)
generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE release_state NOT IN ('deprecated', 'rejected');
```
**Connection:**
```bash
# From the MacBook:
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db"
# Or via the control-pipeline container:
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline curl -sf http://127.0.0.1:8098/..."
```
### API (port 8098, reachable only via docker exec)
```bash
# List master controls
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
curl -sf 'http://127.0.0.1:8098/v1/master-controls?limit=50&sort=total_controls'"
# Master control detail with all members
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
curl -sf 'http://127.0.0.1:8098/v1/master-controls/MC-8292'"
```
## Structure of the controls
### merge_group_hint (key field!)
Each control has a `merge_group_hint` in the format `action:object:phase`:
```
implement:encryption:implementation
define:access_control:definition
monitor:network_security:monitoring
report:supervisory_authority:reporting
```
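A minimal sketch of how a consumer might split such a hint into its parts. It assumes every hint has exactly three colon-separated, non-empty fields, as the format above describes; the names `MergeKey` and `parse_merge_group_hint` are hypothetical and not part of the repo.

```python
from typing import NamedTuple


class MergeKey(NamedTuple):
    """The three parts of an `action:object:phase` hint."""
    action: str
    object_token: str
    phase: str


def parse_merge_group_hint(hint: str) -> MergeKey:
    """Split a hint into (action, object_token, phase).

    Assumption: exactly three non-empty fields separated by ':'.
    Anything else is rejected instead of being guessed at.
    """
    parts = hint.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected action:object:phase, got {hint!r}")
    return MergeKey(*parts)


key = parse_merge_group_hint("implement:encryption:implementation")
print(key.object_token)  # encryption
```

Rejecting malformed hints keeps free-form values that have not yet been normalized from silently ending up in the wrong group.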
**74 canonical object tokens** (as of 2026-05-07):
| Category | Tokens |
|----------|--------|
| **Security** | multi_factor_auth, password_policy, credentials, session_management, privileged_access, access_control, encryption, transport_encryption, key_management, certificate_management, network_security, network_segmentation, firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting, compliance_audit, vulnerability, patch_management, backup, disaster_recovery, physical_security, secure_development, api_security, input_validation, container_security, logging_configuration |
| **Data Protection** | personal_data, sensitive_data, health_data, consent, data_subject_rights, data_retention, data_transfer, data_breach_notification, dpia, data_processing_agreement, privacy_by_design, data_processing_register, data_classification, cookie_consent, video_surveillance |
| **Governance** | policy, procedure, process, training, awareness, incident, risk_management, third_party_management, change_management, documentation, records_management, compliance_reporting, asset_management, human_resources_security |
| **Regulatory** | supervisory_authority, certification, product_safety, ai_system, financial_reporting, aml, whistleblowing, consumer_protection, ecommerce, telecommunications, medical_device, payment_services, critical_infrastructure, supply_chain_due_diligence, sustainability_reporting |
### Legal sources (source_citation)
Only the **parent controls** (not the atomic ones!) carry a `source_citation`:
```sql
-- Find controls with a legal source
SELECT cc.control_id, cc.title,
pc.source_citation->>'source' AS regulation,
pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation IS NOT NULL
AND pc.source_citation->>'source' LIKE '%DSGVO%';
```
148 distinct legal sources (GDPR/DSGVO, NIS2, NIST, OWASP, BSI, TKG, etc.)
## Filtering controls (use cases)
### Example: all GDPR (DSGVO) Art. 13 controls (for the DSI review)
```sql
SELECT cc.control_id, cc.title, cc.objective,
cc.generation_metadata->>'merge_group_hint' AS merge_key,
pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation->>'source' = 'DSGVO (EU) 2016/679'
AND pc.source_citation->>'article' LIKE '%13%'
AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.control_id;
```
### Example: all encryption controls
```sql
SELECT control_id, title, objective
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:encryption:%'
AND release_state NOT IN ('deprecated', 'rejected');
```
### Example: filter controls by object token
```sql
-- All controls on a given topic
SELECT control_id, title,
generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:data_retention:%'
AND release_state NOT IN ('deprecated', 'rejected');
```
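In SQL `LIKE`, the underscore matches any single character, so a pattern such as `'%:data_retention:%'` is slightly broader than an exact token match. One possible alternative, sketched here, compares the second colon-separated field with PostgreSQL's `split_part` and passes the token as a bound parameter. The function name `object_token_query` and the `%s` placeholder style (as used by psycopg-style drivers) are assumptions, not something the repo prescribes.

```python
from typing import Tuple


def object_token_query(token: str) -> Tuple[str, Tuple[str, ...]]:
    """Build a parameterized query for one exact object token.

    Returns (sql, params) so the caller can hand both to a DB driver;
    the token is never interpolated into the SQL string itself.
    """
    sql = (
        "SELECT control_id, title, "
        "generation_metadata->>'merge_group_hint' AS merge_key "
        "FROM compliance.canonical_controls "
        # split_part(..., ':', 2) extracts the object field of action:object:phase
        "WHERE split_part(generation_metadata->>'merge_group_hint', ':', 2) = %s "
        "AND release_state NOT IN ('deprecated', 'rejected')"
    )
    return sql, (token,)


sql, params = object_token_query("data_retention")
print(sql)
print(params)
```

The exact comparison also avoids matching a token that merely appears in the action or phase position of the hint.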
## Important tables
| Table | Rows | Description |
|-------|------|-------------|
| `compliance.canonical_controls` | ~294K | All controls (rich + atomic) |
| `compliance.master_controls` | ~5,329 | Grouped master controls |
| `compliance.master_control_members` | ~172K | Mapping control → MC |
| `compliance.object_ontology` | 74 | Canonical object definitions |
| `compliance.regulation_registry` | 223 | Registry of legal sources |
## What is happening right now (2026-05-07)
**Phase 2 is running:** All 174K controls are being re-classified with Claude Haiku. The `merge_group_hint` values are being normalized from free-form LLM objects to the 74 canonical tokens. Afterwards:
- Phase 3: Re-clustering (gpre1 with K=20000)
- Phase 4: New master controls (gpre2)
- Phase 5: Regulation-source split (gpre3)
**DO NOT MODIFY:** The `canonical_controls`, `master_controls`, and `object_ontology` tables are being actively modified.
## DB access quick reference
```bash
# Quick query (one-liner)
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db -c \"SELECT count(*) FROM compliance.canonical_controls\""
# Interactive session (ssh -t allocates the TTY that docker exec -it needs)
ssh -t macmini "/usr/local/bin/docker exec -it bp-core-postgres psql -U breakpilot -d breakpilot_db"
```
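For scripted access, the one-liner above could be wrapped roughly as follows. This is a sketch under assumptions: the helper names `build_psql_command` and `run_query` are hypothetical, the host alias `macmini` and the container name are taken from the commands above, and `-At` (unaligned, tuples only) is just one convenient choice of psql output flags.

```python
import shlex
import subprocess
from typing import List

DOCKER = "/usr/local/bin/docker"  # docker path on the remote host, as used above


def build_psql_command(sql: str, host: str = "macmini") -> List[str]:
    """Return the argv for running one SQL statement through ssh + docker exec."""
    remote = [
        DOCKER, "exec", "bp-core-postgres",
        "psql", "-U", "breakpilot", "-d", "breakpilot_db",
        "-At",       # unaligned output, tuples only
        "-c", sql,
    ]
    # ssh hands one string to the remote shell, so quote the remote argv once.
    return ["ssh", host, shlex.join(remote)]


def run_query(sql: str) -> str:
    """Execute the statement and return psql's stdout (raises on a non-zero exit)."""
    result = subprocess.run(
        build_psql_command(sql), capture_output=True, text=True, check=True
    )
    return result.stdout


print(build_psql_command("SELECT count(*) FROM compliance.canonical_controls"))
```

Quoting with `shlex.join` avoids the hand-escaped `\"` of the shell one-liner; calling `run_query(...)` would then perform the actual round trip.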
@@ -0,0 +1,162 @@
-- Migration 010: Expanded Object Ontology
-- Expands from 31 to 74 canonical object tokens with clear semantic boundaries.
-- Each token has a description to prevent ambiguous classification.
--
-- IMPORTANT: This migration ADDS new tokens. Existing synonyms are preserved.
SET search_path TO compliance, public;
-- Add description column to object_synonyms if it does not exist yet
-- (IF NOT EXISTS already makes this idempotent, so no exception handler is needed)
ALTER TABLE object_synonyms ADD COLUMN IF NOT EXISTS description TEXT;
-- New table: canonical object definitions with clear boundaries
CREATE TABLE IF NOT EXISTS object_ontology (
canonical_token VARCHAR(100) PRIMARY KEY,
category VARCHAR(50) NOT NULL, -- security, data_protection, governance, regulatory, technical
description_de TEXT NOT NULL, -- German description for LLM prompts
description_en TEXT NOT NULL, -- English description
NOT_confused_with TEXT, -- Explicit disambiguation
examples TEXT, -- Example controls that belong here
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- ═══════════════════════════════════════════════════════════════
-- SECURITY & TECHNICAL
-- ═══════════════════════════════════════════════════════════════
-- Authentication & Identity
INSERT INTO object_ontology VALUES
('multi_factor_auth', 'security', 'Multi-Faktor-Authentifizierung (2FA/MFA)', 'Multi-factor authentication', 'NOT password_policy (Passwortregeln) oder session_management (Sitzungen)', 'MFA implementieren, 2FA-Pflicht, Authentifizierungsfaktoren'),
('password_policy', 'security', 'Passwortrichtlinien und -komplexität', 'Password policies and complexity', 'NOT credentials (allg. Zugangsdaten) oder multi_factor_auth (MFA)', 'Passwortlänge, Komplexität, Rotation, Passwort-Historie'),
('credentials', 'security', 'Zugangsdaten-Verwaltung (Tokens, API-Keys, Secrets)', 'Credential management', 'NOT password_policy (Passwortregeln) oder key_management (kryptografisch)', 'API-Key-Rotation, Token-Verwaltung, Secret Storage'),
('session_management', 'security', 'Sitzungsverwaltung (Session Timeout, Token-Lifecycle)', 'Session management', 'NOT multi_factor_auth (Login) oder access_control (Berechtigungen)', 'Session Timeout, Token-Invalidierung, Concurrent Sessions'),
('privileged_access', 'security', 'Verwaltung privilegierter Zugriffe (Admin, Root)', 'Privileged access management', 'NOT access_control (allg. Zugriffskontrolle)', 'Admin-Konten, Root-Zugriff, PAM, Just-in-Time-Access'),
('access_control', 'security', 'Allgemeine Zugriffskontrolle (RBAC, Berechtigungen)', 'Access control (RBAC, permissions)', 'NOT privileged_access (Admin) oder authentication (Login)', 'Rollenbasierte Zugriffskontrolle, Berechtigungsvergabe, Least Privilege')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Encryption & Cryptography
INSERT INTO object_ontology VALUES
('encryption', 'security', 'Verschlüsselung at-rest (Datenverschlüsselung)', 'Encryption at rest', 'NOT transport_encryption (in-transit) oder key_management (Schlüssel)', 'AES-256, Festplattenverschlüsselung, DB-Verschlüsselung'),
('transport_encryption', 'security', 'Transportverschlüsselung (TLS, HTTPS)', 'Transport encryption (TLS)', 'NOT encryption (at-rest)', 'TLS 1.3, HTTPS, mTLS, Zertifikats-Pinning'),
('key_management', 'security', 'Kryptografische Schlüsselverwaltung', 'Cryptographic key management', 'NOT credentials (API-Keys) oder certificate_management (Zertifikate)', 'Key Rotation, HSM, Key Escrow, Schlüsselerzeugung'),
('certificate_management', 'security', 'Zertifikatsverwaltung (PKI, X.509)', 'Certificate management (PKI)', 'NOT key_management (Schlüssel) oder encryption (Verschlüsselung)', 'X.509-Zertifikate, PKI, Zertifikatsrückruf, CA-Verwaltung')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Network Security
INSERT INTO object_ontology VALUES
('network_security', 'security', 'Allgemeine Netzwerksicherheit', 'General network security', 'NOT network_segmentation (Segmentierung) oder firewall (Regeln)', 'Netzwerk-Hardening, Port-Management, DNS-Sicherheit'),
('network_segmentation', 'security', 'Netzwerksegmentierung (VLANs, Zonen)', 'Network segmentation', 'NOT network_security (allg.) oder firewall (Regeln)', 'VLANs, DMZ, Micro-Segmentation, Zero Trust Network'),
('firewall', 'security', 'Firewall-Regeln und -Verwaltung', 'Firewall rules and management', 'NOT network_security (allg.)', 'WAF, Firewall-Regeln, Ingress/Egress, Whitelist'),
('vpn', 'security', 'VPN-Konfiguration und -Verwaltung', 'VPN configuration', NULL, 'IPSec, WireGuard, Site-to-Site VPN'),
('remote_access', 'security', 'Fernzugriff und Remote-Arbeit', 'Remote access', 'NOT vpn (Technologie)', 'Remote Desktop, Bastion Hosts, Jump Server')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Monitoring & Logging (CRITICAL: clear boundaries!)
INSERT INTO object_ontology VALUES
('monitoring', 'security', 'Kontinuierliche Echtzeit-Überwachung von Systemen/Metriken', 'Continuous real-time monitoring of systems', 'NOT audit_logging (Protokollierung), NOT training (Schulung), NOT procedure (Verfahren), NOT risk_assessment (Bewertung)', 'System-Health-Monitoring, Verfügbarkeitsüberwachung, Performance-Monitoring, Anomalie-Erkennung in Echtzeit'),
('audit_logging', 'security', 'Protokollierung und Audit-Trail (Nachvollziehbarkeit)', 'Audit logging and trail', 'NOT monitoring (Echtzeit-Überwachung), NOT compliance_audit (Prüfungen)', 'Log-Aufzeichnung, Audit Trail, Zeitstempel, Nachvollziehbarkeit, Protokollierung von Zugriffen'),
('siem', 'security', 'Security Information and Event Management', 'SIEM', 'NOT monitoring (allg.) oder audit_logging (Protokollierung)', 'SIEM-Korrelation, Security Events, Log-Aggregation'),
('alerting', 'security', 'Benachrichtigungen und Meldepflichten bei Sicherheitsereignissen', 'Security alerting and notification obligations', 'NOT monitoring (Überwachung) oder incident (Vorfallsbehandlung)', 'Sicherheitsmeldungen, Breach Notification, Benachrichtigungspflichten'),
('compliance_audit', 'governance', 'Compliance-Prüfungen und externe Audits', 'Compliance audits and external reviews', 'NOT audit_logging (technische Protokollierung), NOT monitoring (Überwachung)', 'Externe Prüfung, Jahresabschlussprüfung, Zertifizierungsaudit, Lieferanten-Audit')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Vulnerability & Patch Management
INSERT INTO object_ontology VALUES
('vulnerability', 'security', 'Schwachstellenmanagement und -scanning', 'Vulnerability management', 'NOT patch_management (Updates)', 'Vulnerability Scanning, CVE-Tracking, Penetration Testing'),
('patch_management', 'security', 'Software-Updates und Patch-Verwaltung', 'Patch management', 'NOT vulnerability (Scanning)', 'Patch-Zyklus, Update-Policy, Hotfix-Prozess')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Backup & Recovery
INSERT INTO object_ontology VALUES
('backup', 'security', 'Datensicherung und Backup-Strategien', 'Backup strategies', 'NOT disaster_recovery (Wiederherstellung)', 'Backup-Rotation, Offsite-Backup, Backup-Verschlüsselung'),
('disaster_recovery', 'security', 'Notfallwiederherstellung und Business Continuity', 'Disaster recovery', 'NOT backup (Datensicherung) oder incident (Vorfälle)', 'DR-Plan, RTO/RPO, Failover, Business Continuity')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- DATA PROTECTION (CRITICAL: clear boundaries!)
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('personal_data', 'data_protection', 'Verarbeitung personenbezogener Daten (DSGVO-Grundsätze)', 'Personal data processing principles', 'NOT sensitive_data (besondere Kategorien), NOT data_subject_rights (Betroffenenrechte), NOT consent (Einwilligung)', 'Datenminimierung, Zweckbindung, Speicherbegrenzung, Rechtmäßigkeit der Verarbeitung'),
('sensitive_data', 'data_protection', 'Besondere Kategorien personenbezogener Daten (Art. 9 DSGVO)', 'Special categories of personal data', 'NOT personal_data (allg.), NOT health_data (Gesundheit)', 'Biometrische Daten, ethnische Herkunft, politische Meinungen, Gewerkschaftszugehörigkeit'),
('health_data', 'data_protection', 'Gesundheitsdaten und Medizindaten', 'Health and medical data', 'NOT sensitive_data (allg. besondere Kategorien)', 'Patientendaten, Medizinprodukte-Daten, klinische Daten'),
('consent', 'data_protection', 'Einwilligungsmanagement', 'Consent management', 'NOT data_subject_rights (andere Betroffenenrechte)', 'Einwilligung einholen, Widerruf, Opt-In, Consent-Banner'),
('data_subject_rights', 'data_protection', 'Betroffenenrechte (Auskunft, Löschung, Portabilität)', 'Data subject rights (access, erasure, portability)', 'NOT consent (Einwilligung), NOT personal_data (Verarbeitung)', 'Auskunftsrecht, Recht auf Löschung, Datenportabilität, Widerspruchsrecht'),
('data_retention', 'data_protection', 'Aufbewahrungsfristen und Löschkonzept', 'Data retention and deletion', 'NOT backup (technische Sicherung)', 'Löschfristen, Aufbewahrungspflichten, Löschkonzept, Archivierung'),
('data_transfer', 'data_protection', 'Internationale Datenübermittlung (Drittländer, SCC)', 'International data transfer', 'NOT data_processing (Verarbeitung)', 'Drittlandtransfer, Standardvertragsklauseln, Angemessenheitsbeschluss, BCR'),
('data_breach_notification', 'data_protection', 'Meldung von Datenschutzverletzungen (Art. 33/34 DSGVO)', 'Data breach notification', 'NOT incident (allg. Sicherheitsvorfälle), NOT alerting (techn. Alerts)', 'Breach-Meldung an Aufsichtsbehörde, Benachrichtigung Betroffener, 72-Stunden-Frist'),
('dpia', 'data_protection', 'Datenschutz-Folgenabschätzung (Art. 35 DSGVO)', 'Data protection impact assessment', NULL, 'DSFA, Schwellwertanalyse, Risikobewertung für Betroffene'),
('data_processing_agreement', 'data_protection', 'Auftragsverarbeitung (Art. 28 DSGVO)', 'Data processing agreements', NULL, 'AVV, Auftragsverarbeiter, Sub-Auftragsverarbeiter, TOMs'),
('privacy_by_design', 'data_protection', 'Datenschutz durch Technikgestaltung (Art. 25 DSGVO)', 'Privacy by design and default', NULL, 'Privacy by Default, Datenminimierung in der Architektur'),
('data_processing_register', 'data_protection', 'Verzeichnis von Verarbeitungstätigkeiten (Art. 30 DSGVO)', 'Records of processing activities', NULL, 'VVT, Verarbeitungsverzeichnis')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- GOVERNANCE & ORGANIZATION
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('policy', 'governance', 'Richtlinien und Leitlinien ERSTELLEN/DEFINIEREN', 'Creating/defining policies', 'NOT procedure (Verfahrensablauf), NOT compliance_audit (Prüfung)', 'Sicherheitsrichtlinie erstellen, Policy-Framework definieren, Leitlinie verabschieden'),
('procedure', 'governance', 'Verfahren und Prozessabläufe DEFINIEREN/DOKUMENTIEREN', 'Defining/documenting procedures', 'NOT incident (Vorfallsbehandlung), NOT process (laufender Betrieb)', 'Verfahrensanweisung, Ablaufbeschreibung, Standardprozess definieren'),
('process', 'governance', 'Laufende betriebliche Prozesse AUSFÜHREN', 'Executing operational processes', 'NOT procedure (Definition), NOT monitoring (Überwachung)', 'Betriebsprozess, Geschäftsprozess, Workflow-Ausführung'),
('training', 'governance', 'Schulung und Weiterbildung DURCHFÜHREN', 'Training and education', 'NOT awareness (Sensibilisierung), NOT monitoring (Überwachung!)', 'Mitarbeiterschulung, Zertifizierungskurs, Pflichtunterweisung'),
('awareness', 'governance', 'Sicherheitsbewusstsein und Sensibilisierung', 'Security awareness', 'NOT training (formale Schulung)', 'Phishing-Simulation, Awareness-Kampagne, Sicherheitskultur'),
('incident', 'governance', 'Sicherheitsvorfälle BEHANDELN (Incident Response)', 'Incident response and handling', 'NOT alerting (Benachrichtigung), NOT data_breach_notification (DSGVO-Meldung)', 'Incident Response Plan, Vorfallsanalyse, Containment, Recovery, Lessons Learned'),
('risk_management', 'governance', 'Risikomanagement und -bewertung', 'Risk management and assessment', 'NOT vulnerability (techn. Schwachstellen), NOT monitoring (Überwachung)', 'Risikobewertung, Risikobehandlung, Risikoakzeptanz, Risikomatrix'),
('third_party_management', 'governance', 'Lieferanten- und Drittanbieter-Management', 'Third-party and vendor management', 'NOT data_processing_agreement (AVV)', 'Lieferantenbewertung, Vendor Risk Assessment, Supply Chain Security'),
('change_management', 'governance', 'Änderungsmanagement', 'Change management', 'NOT patch_management (Updates)', 'Change Request, Change Advisory Board, Rollback-Verfahren'),
('documentation', 'governance', 'Allgemeine Dokumentationspflichten', 'General documentation requirements', 'NOT audit_logging (technische Logs), NOT data_processing_register (VVT)', 'Betriebshandbuch, Systemdokumentation, Verfahrensdokumentation'),
('records_management', 'governance', 'Akten- und Unterlagenverwaltung', 'Records management', 'NOT data_retention (Löschfristen)', 'Archivierung, Aktenführung, Aufbewahrungspflichten nach HGB/AO'),
('compliance_reporting', 'governance', 'Compliance-Berichterstattung', 'Compliance reporting', 'NOT alerting (techn. Alerts), NOT supervisory_authority (Behördenkommunikation)', 'Compliance-Bericht, Management-Reporting, KPI-Tracking'),
('asset_management', 'governance', 'IT-Asset-Verwaltung und Inventar', 'IT asset management', NULL, 'Asset-Inventar, CMDB, Hardware-Lifecycle, Software-Inventar'),
('physical_security', 'security', 'Physische Sicherheit und Zutrittskontrolle', 'Physical security and access', NULL, 'Zutrittskontrolle, Videoüberwachung (physisch), Serverraum-Sicherheit'),
('human_resources_security', 'governance', 'Personalsicherheit', 'HR security', 'NOT training (Schulung)', 'Background-Checks, Geheimhaltungsvereinbarungen, Onboarding/Offboarding')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- REGULATORY SPECIFIC
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('supervisory_authority', 'regulatory', 'Kommunikation mit Aufsichtsbehörden', 'Supervisory authority communication', 'NOT compliance_reporting (interne Berichte)', 'Meldung an BaFin, Abstimmung mit DPA, behördliche Anfragen'),
('certification', 'regulatory', 'Zertifizierung und Konformitätsbewertung', 'Certification and conformity assessment', 'NOT compliance_audit (Prüfung), NOT personal_data (Datenschutz)', 'CE-Kennzeichnung, ISO-Zertifizierung, Konformitätserklärung'),
('product_safety', 'regulatory', 'Produktsicherheit und Marktüberwachung', 'Product safety and market surveillance', 'NOT certification (Zertifizierung)', 'Rückrufmanagement, Sicherheitsbewertung, RAPEX-Meldung'),
('ai_system', 'regulatory', 'KI-System-Regulierung (AI Act)', 'AI system regulation', NULL, 'KI-Risikobewertung, Hochrisiko-KI, Transparenzpflichten, FRIA'),
('financial_reporting', 'regulatory', 'Finanzberichterstattung und Rechnungslegung', 'Financial reporting and accounting', NULL, 'Jahresabschluss, HGB-Pflichten, IFRS, Buchführung'),
('aml', 'regulatory', 'Geldwäscheprävention und KYC', 'Anti-money laundering and KYC', NULL, 'KYC, Verdachtsmeldung, PEP-Prüfung, Transaktionsmonitoring'),
('whistleblowing', 'regulatory', 'Hinweisgeberschutz und Meldekanäle', 'Whistleblower protection', NULL, 'Hinweisgebersystem, Meldekanal, Hinweisgeberschutzgesetz'),
('consumer_protection', 'regulatory', 'Verbraucherschutz und AGB', 'Consumer protection', NULL, 'AGB-Prüfung, Widerrufsrecht, Informationspflichten, Preistransparenz'),
('ecommerce', 'regulatory', 'E-Commerce-Pflichten (Impressum, Fernabsatz)', 'E-commerce obligations', NULL, 'Impressumspflicht, Fernabsatzrecht, Online-Handel-Pflichten'),
('telecommunications', 'regulatory', 'Telekommunikationsregulierung', 'Telecommunications regulation', NULL, 'TKG-Pflichten, Vorratsdatenspeicherung, Notruf'),
('medical_device', 'regulatory', 'Medizinprodukte-Regulierung (MDR)', 'Medical device regulation', NULL, 'UDI, klinische Bewertung, Post-Market Surveillance'),
('payment_services', 'regulatory', 'Zahlungsdienste-Regulierung (PSD2)', 'Payment services regulation', NULL, 'Starke Kundenauthentifizierung, PSD2-Compliance, Open Banking'),
('critical_infrastructure', 'regulatory', 'KRITIS und NIS2-Pflichten', 'Critical infrastructure (NIS2)', NULL, 'KRITIS-Meldepflichten, NIS2-Maßnahmen, Mindeststandards'),
('supply_chain_due_diligence', 'regulatory', 'Lieferkettensorgfaltspflicht (LkSG)', 'Supply chain due diligence', 'NOT third_party_management (allg. Lieferanten)', 'Menschenrechts-Due-Diligence, Umwelt-Sorgfaltspflicht, LkSG-Bericht'),
('sustainability_reporting', 'regulatory', 'Nachhaltigkeitsberichterstattung (CSRD)', 'Sustainability reporting', NULL, 'ESG-Reporting, CSRD, Nachhaltigkeitsbericht'),
('cookie_consent', 'regulatory', 'Cookie-Consent und Tracking (TDDDG/ePrivacy)', 'Cookie consent and tracking', 'NOT consent (allg. Einwilligung)', 'Cookie-Banner, Tracking-Einwilligung, TDDDG §25'),
('video_surveillance', 'regulatory', 'Videoüberwachung (datenschutzrechtlich)', 'Video surveillance (data protection)', 'NOT physical_security (physische Sicherheit), NOT monitoring (IT-Monitoring)', 'Kamera-Überwachung, Speicherfristen, Kennzeichnungspflicht')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- APPLICATION SECURITY
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('secure_development', 'technical', 'Sichere Softwareentwicklung (SDLC)', 'Secure software development lifecycle', NULL, 'Secure Coding, Code Review, SAST/DAST, DevSecOps'),
('api_security', 'technical', 'API-Sicherheit', 'API security', NULL, 'API-Authentifizierung, Rate Limiting, Input Validation'),
('input_validation', 'technical', 'Eingabevalidierung und Output Encoding', 'Input validation and output encoding', NULL, 'XSS-Prävention, SQL-Injection-Schutz, Parametervalidierung'),
('container_security', 'technical', 'Container- und Cloud-Sicherheit', 'Container and cloud security', NULL, 'Docker-Hardening, Kubernetes-Security, Image-Scanning'),
('logging_configuration', 'technical', 'Log-Konfiguration und -Format', 'Log configuration and format', 'NOT audit_logging (Nachvollziehbarkeit), NOT monitoring (Überwachung)', 'Log-Format, Log-Rotation, Log-Shipping, Structured Logging'),
('data_classification', 'governance', 'Datenklassifizierung und -kennzeichnung', 'Data classification and labeling', 'NOT sensitive_data (besondere Kategorien)', 'Vertraulichkeitsstufen, Datenklassifizierung, Labeling')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Count results
DO $$
DECLARE cnt INTEGER;
BEGIN
SELECT count(*) INTO cnt FROM object_ontology;
RAISE NOTICE 'object_ontology: % canonical tokens defined', cnt;
END $$;
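Several `NOT_confused_with` entries above name tokens that this migration does not define (`authentication`, `risk_assessment`, `data_processing`). A small consistency check along the following lines could flag such dangling references. It is a sketch only: the helper name `dangling_references` is hypothetical, and the regex assumes the `token (explanation)` convention used in the rows above.

```python
import re
from typing import Dict, List, Optional

# A reference is a snake_case word directly followed by " (", e.g. "vpn (Technologie)".
TOKEN_REF = re.compile(r"\b([a-z][a-z0-9_]*) \(")


def dangling_references(ontology: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each token to the referenced tokens that are not defined in `ontology`.

    `ontology` maps canonical_token -> NOT_confused_with text (or None).
    """
    known = set(ontology)
    problems: Dict[str, List[str]] = {}
    for token, hint in ontology.items():
        if not hint:
            continue
        missing = [ref for ref in TOKEN_REF.findall(hint) if ref not in known]
        if missing:
            problems[token] = missing
    return problems


sample = {
    "access_control": "NOT privileged_access (Admin) oder authentication (Login)",
    "privileged_access": "NOT access_control (allg. Zugriffskontrolle)",
    "vpn": None,
}
print(dangling_references(sample))  # {'access_control': ['authentication']}
```

Run against the full table, a check like this would show whether the disambiguation hints fed into LLM prompts only point at tokens the classifier can actually choose.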
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Extract CE-relevant obligations from TRBS/TRGS/ASR/OSHA chunks in Qdrant.
Searches for MUSS/SOLL patterns in chunk texts and classifies them.
Output: JSON file with structured obligations for the CE session.
Usage:
python3 /app/scripts/extract_ce_obligations.py
python3 /app/scripts/extract_ce_obligations.py --output /tmp/ce_obligations.json
"""
import argparse
import json
import logging
import os
import re
from pathlib import Path
import httpx
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("ce-obligations")
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
COLLECTION = "bp_compliance_ce"
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
LLM_MODEL = "qwen3.5:35b-a3b"
# Obligation patterns (DE + EN)
OBLIGATION_PATTERNS = re.compile(
r"(muss|müssen|hat\s+[\w\s]*zu\s|ist\s+[\w\s]*sicherzustellen|"
r"ist\s+verpflichtet|sind\s+verpflichtet|darf\s+nicht|"
r"shall|must|required\s+to|is\s+required|shall\s+not)",
re.IGNORECASE,
)
# CE relevance keywords
CE_KEYWORDS = re.compile(
r"(maschine|schutzeinrichtung|gefährdung|quetsch|scher|stoß|"
r"schneid|fang|einzug|absturz|druck|explosion|brand|"
r"elektrisch|spannung|erdung|schutzleiter|not-halt|"
r"betriebsanleitung|kennzeichnung|prüfung|prüfpflicht|"
r"instandhaltung|wartung|sicherheitsabstand|"
r"schutzmaßnahme|persönliche schutzausrüstung|psa|"
r"machine|guard|hazard|crush|shear|cut|entangle|"
r"lockout|tagout|electrical|grounding|emergency stop|"
r"safety distance|protective device|ppe|inspection)",
re.IGNORECASE,
)
HAZARD_CATEGORIES = {
"quetsch|crush|squeeze": "mechanical_crushing",
"schneid|cut": "mechanical_cutting",
"fang|einzug|entangle|draw": "mechanical_entanglement",
"absturz|fall": "fall_hazard",
"explosion|ex-bereich|atex": "explosion_hazard",
"brand|fire|feuer": "fire_hazard",
"elektrisch|electrical|spannung|voltage": "electrical_hazard",
"lärm|noise|schall": "noise_hazard",
"gefahrstoff|hazardous substance|chemical": "chemical_hazard",
"ergonomie|ergonomic|heben|lift": "ergonomic_hazard",
"temperatur|heat|hitze|kälte|cold": "thermal_hazard",
"strahlung|radiation|laser": "radiation_hazard",
"not-halt|emergency stop|e-stop": "emergency_stop",
"lockout|tagout|loto": "lockout_tagout",
"kennzeichnung|label|marking|sign": "safety_marking",
"prüfung|inspection|test": "inspection_requirement",
"instandhaltung|maintenance|wartung": "maintenance",
"schutzeinrichtung|guard|protective device": "protective_device",
"betriebsanleitung|instruction|manual": "operating_instructions",
"druck|pressure|behälter|vessel|kessel|boiler": "pressure_hazard",
}
# Source-based overrides: TRGS docs about chemicals/storage
# should never be classified as mechanical hazards
_CHEMICAL_SOURCES = re.compile(
r"trgs\s*(5[0-9]{2}|7[0-9]{2}|9[0-9]{2}|4[0-9]{2}|6[0-9]{2})",
re.IGNORECASE,
)
def _classify_hazard(text: str, source: str) -> str:
    """Classify hazard with source-aware overrides."""
    # TRGS sources → chemical/pressure/explosion, never mechanical
    if _CHEMICAL_SOURCES.search(source):
        if re.search(r"explosion|ex-bereich|atex|zündfähig", text, re.IGNORECASE):
            return "explosion_hazard"
        if re.search(r"druck|pressure|behälter|vessel", text, re.IGNORECASE):
            return "pressure_hazard"
        if re.search(r"brand|fire|feuer", text, re.IGNORECASE):
            return "fire_hazard"
        return "chemical_hazard"
    # Standard pattern matching (order matters: specific first)
    for pattern, category in HAZARD_CATEGORIES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return category
    return "general"
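The patterns in `HAZARD_CATEGORIES` are matched as plain substrings, so a short keyword such as `sign` also fires inside "design", and `test` inside "Attest". The following is a minimal sketch of a word-start anchored variant, not part of the commit. `SAMPLE_CATEGORIES`, `classify_naive` and `classify_with_boundaries` are hypothetical names, and the table is a reduced copy for illustration only.

```python
import re

# Reduced, illustrative copy of the category table (not the full mapping).
SAMPLE_CATEGORIES = {
    "quetsch|crush|squeeze": "mechanical_crushing",
    "kennzeichnung|label|marking|sign": "safety_marking",
    "prüfung|inspection|test": "inspection_requirement",
}

# Compile once; \b anchors every alternative at the start of a word,
# while still allowing German compounds such as "Quetschstelle".
_COMPILED = [
    (re.compile(rf"\b(?:{pattern})", re.IGNORECASE), category)
    for pattern, category in SAMPLE_CATEGORIES.items()
]


def classify_naive(text: str) -> str:
    """Substring matching, as in the original loop."""
    for pattern, category in SAMPLE_CATEGORIES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return category
    return "general"


def classify_with_boundaries(text: str) -> str:
    """Same loop, but keywords must begin at a word boundary."""
    for regex, category in _COMPILED:
        if regex.search(text):
            return category
    return "general"


if __name__ == "__main__":
    sample = "The design of the enclosure"
    print(classify_naive(sample), classify_with_boundaries(sample))
```

The trade-off is that a stem in the middle of a compound (for example "Maschinenbrand" for `brand`) no longer matches, so whether the anchored form is preferable depends on the corpus.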
def scroll_chunks(source_filter: str = None) -> list[dict]:
    """Scroll through Qdrant and collect all CE-relevant obligation chunks.

    Note: source_filter is accepted but not applied yet.
    """
    chunks = []
    offset = None
    batch = 100
    scanned = 0
    while True:
        scroll_body = {
            "limit": batch,
            "with_payload": True,
            "with_vector": False,
        }
        if offset is not None:
            scroll_body["offset"] = offset
        resp = httpx.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/scroll",
            json=scroll_body,
            timeout=30.0,
        )
        # Fail loudly on HTTP errors instead of silently ending the scan
        resp.raise_for_status()
        data = resp.json()
        points = data.get("result", {}).get("points", [])
        next_offset = data.get("result", {}).get("next_page_offset")
        scanned += len(points)
        for pt in points:
            payload = pt.get("payload", {})
            source = payload.get("source", payload.get("filename", ""))
            text = payload.get("chunk_text", "")
            # Filter for TRBS/TRGS/ASR/OSHA
            source_lower = source.lower()
            is_relevant = any(k in source_lower for k in
                              ["trbs", "trgs", "asr", "osha"])
            if not is_relevant:
                continue
            # Check for obligation patterns
            if not OBLIGATION_PATTERNS.search(text):
                continue
            # Check CE relevance
            if not CE_KEYWORDS.search(text):
                continue
            # Classify hazard category (source-aware)
            hazard = _classify_hazard(text, source)
            # Determine obligation type; default to MUSS when no modal verb matches
            if re.search(r"muss|müssen|shall|must|required", text, re.IGNORECASE):
                obl_type = "MUSS"
            elif re.search(r"soll|sollte|should", text, re.IGNORECASE):
                obl_type = "SOLL"
            else:
                obl_type = "MUSS"
            chunks.append({
                "source": source,
                "section": payload.get("section", ""),
                "paragraph": payload.get("paragraph", ""),
                "obligation_text": text.strip()[:500],
                "hazard_category": hazard,
                "obligation_type": obl_type,
                "ce_relevance": "high" if hazard != "general" else "medium",
                "filename": payload.get("filename", ""),
            })
        if next_offset is None or not points:
            break
        offset = next_offset
        # Log by scanned points; the old check on len(chunks) fired on every
        # page while no obligations had been found yet (0 % 500 == 0).
        if scanned % 5000 == 0:
            logger.info(" Scanned %d points, %d obligations found so far",
                        scanned, len(chunks))
    return chunks
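The scroll loop above follows `next_page_offset` until the server returns none. As a minimal sketch, the pagination can be separated from the filtering by injecting the page fetcher, which also makes it testable without a running Qdrant instance. `iter_points`, `fake_fetch` and `_FAKE_PAGES` are hypothetical names; the response shape (`result.points`, `result.next_page_offset`) is the one the function above already relies on.

```python
from typing import Callable, Iterator, Optional


def iter_points(fetch_page: Callable[[Optional[object]], dict]) -> Iterator[dict]:
    """Yield points page by page until no further offset is reported."""
    offset = None
    while True:
        result = fetch_page(offset).get("result", {})
        points = result.get("points", [])
        yield from points
        offset = result.get("next_page_offset")
        # Stop on the last page, or defensively on an empty page.
        if offset is None or not points:
            break


# Stand-in for the HTTP call: two pages keyed by offset (made-up data).
_FAKE_PAGES = {
    None: {"result": {"points": [{"id": 1}, {"id": 2}], "next_page_offset": 3}},
    3: {"result": {"points": [{"id": 3}], "next_page_offset": None}},
}


def fake_fetch(offset):
    return _FAKE_PAGES[offset]


if __name__ == "__main__":
    print([p["id"] for p in iter_points(fake_fetch)])
```

In the real script, `fetch_page` would wrap the `httpx.post(...)` call and return `resp.json()`.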
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--output", default="/tmp/ce_obligations.json")
args = parser.parse_args()
logger.info("Scanning %s for CE obligations...", COLLECTION)
obligations = scroll_chunks()
logger.info("Found %d CE-relevant obligations", len(obligations))
# Stats
by_source = {}
by_hazard = {}
for o in obligations:
src = o["source"][:30]
by_source[src] = by_source.get(src, 0) + 1
by_hazard[o["hazard_category"]] = by_hazard.get(o["hazard_category"], 0) + 1
logger.info("\nBy source:")
for src, cnt in sorted(by_source.items(), key=lambda x: -x[1])[:20]:
logger.info(" %4d %s", cnt, src)
logger.info("\nBy hazard category:")
for cat, cnt in sorted(by_hazard.items(), key=lambda x: -x[1]):
logger.info(" %4d %s", cnt, cat)
# Save
Path(args.output).write_text(
json.dumps(obligations, indent=2, ensure_ascii=False)
)
logger.info("\nSaved to %s", args.output)
if __name__ == "__main__":
main()
@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Add L2 sub-topics to broad tokens. Instead of just "incident",
produces "incident:response", "incident:detection", etc.
Only processes tokens with >500 controls AND <90% audit accuracy.
Usage:
python3 /app/scripts/gpre0_add_subtopics.py --dry-run
python3 /app/scripts/gpre0_add_subtopics.py
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-subtopics")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_subtopic_checkpoints")
# Tokens that are too broad — need L2 sub-topics
BROAD_TOKENS = {
# Round 1 (already done)
"risk_management", "policy", "audit_logging", "incident",
"access_control", "compliance_audit", "asset_management",
"key_management", "third_party_management", "monitoring",
"financial_reporting", "data_classification", "change_management",
"alerting", "multi_factor_auth", "api_security",
"certificate_management", "human_resources_security",
"training", "data_processing_agreement", "data_processing_register",
"consumer_protection", "input_validation", "vulnerability",
"dpia", "data_breach_notification", "backup",
"supply_chain_due_diligence", "awareness",
"privacy_by_design", "credentials", "logging_configuration",
# Round 2 (remaining large tokens)
"supervisory_authority", "certification", "secure_development",
"product_safety", "personal_data", "data_subject_rights", "consent",
"ai_system", "encryption", "data_retention", "disaster_recovery",
"data_transfer", "aml", "transport_encryption", "network_security",
"physical_security", "medical_device", "patch_management",
"cookie_consent", "video_surveillance", "network_segmentation",
"telecommunications", "privileged_access", "session_management",
"password_policy", "governance", "whistleblowing", "payment_services",
"health_data", "sensitive_data", "ecommerce", "sustainability_reporting",
"critical_infrastructure", "regulatory",
}
SYSTEM_PROMPT = """Du bist ein Compliance-Spezialist. Jeder Control hat bereits ein Hauptthema (L1 Token).
Deine Aufgabe: Bestimme ein SPEZIFISCHES Sub-Thema (L2) innerhalb des Hauptthemas.
Das L2 Sub-Thema soll den KONKRETEN Aspekt beschreiben. Verwende kurze, klare englische Bezeichnungen.
Beispiele:
- L1=incident, Titel="Incident Response Plan erstellen" → L2="response_plan"
- L1=incident, Titel="Sicherheitsvorfälle erkennen" → L2="detection"
- L1=incident, Titel="Recovery nach Vorfall dokumentieren" → L2="recovery"
- L1=incident, Titel="Forensische Analyse durchführen" → L2="forensics"
- L1=risk_management, Titel="Risikobewertung durchführen" → L2="assessment"
- L1=risk_management, Titel="Risikominderungsmaßnahmen umsetzen" → L2="treatment"
- L1=risk_management, Titel="Restrisiko akzeptieren" → L2="acceptance"
- L1=access_control, Titel="Rollenbasierte Zugriffskontrolle" → L2="rbac"
- L1=access_control, Titel="Zugriffsrechte regelmäßig prüfen" → L2="access_review"
- L1=access_control, Titel="Identitätsmanagement implementieren" → L2="identity_management"
- L1=monitoring, Titel="Systemverfügbarkeit überwachen" → L2="availability"
- L1=monitoring, Titel="Sicherheitsereignisse überwachen" → L2="security_events"
- L1=policy, Titel="Datenschutzrichtlinie erstellen" → L2="data_protection"
- L1=policy, Titel="Acceptable Use Policy definieren" → L2="acceptable_use"
- L1=policy, Titel="Passwortrichtlinie festlegen" → L2="password"
- L1=financial_reporting, Titel="Jahresabschluss erstellen" → L2="annual_accounts"
- L1=financial_reporting, Titel="Steuererklärung einreichen" → L2="tax"
- L1=alerting, Titel="Datenpanne an Behörde melden" → L2="breach_notification"
- L1=alerting, Titel="Sicherheitswarnung eskalieren" → L2="escalation"
REGELN:
- L2 soll 1-3 Wörter sein, snake_case
- L2 soll SPEZIFISCH sein (nicht das L1 wiederholen)
- Verwende konsistente L2-Bezeichnungen für ähnliche Controls
Antworte NUR als JSON-Array: [{"id":"...","l2":"subtopic"}, ...]"""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude for L2 sub-topic assignment."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'L1="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:80]}"'
)
prompt = "Bestimme L2 Sub-Topics:\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
return [], usage
except httpx.TimeoutException:
logger.error("TIMEOUT — skipping")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s")
time.sleep(60)
else:
logger.error("API error %d", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Failed: %s", e)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Build LIKE patterns for broad tokens
like_clauses = " OR ".join(
f"cc.generation_metadata->>'merge_group_hint' LIKE '%:{tok}:%'"
for tok in BROAD_TOKENS
)
with engine.connect() as c:
rows = c.execute(text(f"""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.release_state NOT IN ('deprecated', 'rejected')
AND ({like_clauses})
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
obj = parts[1] if len(parts) > 1 else ""
if obj in BROAD_TOKENS:
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint, "current_object": obj,
})
logger.info("Found %d controls in broad tokens to add L2 sub-topics", len(controls))
# Process
total_tagged = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections = []
l2_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(0, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
l2 = r.get("l2", "")
if not l2:
total_skipped += 1
continue
total_tagged += 1
old_hint = ctrl["current_hint"]
parts = old_hint.split(":", 2)
action = parts[0] if parts else "implement"
l1 = parts[1] if len(parts) > 1 else "unknown"
phase = parts[2] if len(parts) > 2 else "implementation"
# New format: action:L1_L2:phase
new_obj = f"{l1}_{l2}"
new_hint = f"{action}:{new_obj}:{phase}"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": old_hint,
"new_hint": new_hint,
})
l2_stats[l1][l2] += 1
processed = min(i + args.batch_size, len(controls))
if processed % 5000 < args.batch_size or processed >= len(controls):
logger.info(
"Progress: %d/%d (tagged=%d skip=%d)",
processed, len(controls), total_tagged, total_skipped,
)
time.sleep(0.3)
# Report
cost_in = total_input_tokens / 1_000_000 * 0.80
cost_out = total_output_tokens / 1_000_000 * 4.00
logger.info("\n" + "=" * 60)
logger.info("SUBTOPIC REPORT")
logger.info("=" * 60)
logger.info("Total: %d | Tagged: %d | Skipped: %d", len(controls), total_tagged, total_skipped)
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
# Show L2 distribution per L1
for l1, subs in sorted(l2_stats.items()):
top_subs = sorted(subs.items(), key=lambda x: -x[1])[:10]
logger.info("\n%s (%d unique L2):", l1, len(subs))
for l2, cnt in top_subs:
logger.info(" %4d %s_%s", cnt, l1, l2)
# Save corrections
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file = CHECKPOINT_DIR / "corrections_subtopics.json"
corr_file.write_text(json.dumps(corrections))
logger.info("\nSaved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("DRY RUN — not updating DB")
return
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints updated.", len(corrections))
if __name__ == "__main__":
main()
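The hint rewrite in this script turns `action:L1:phase` into `action:L1_L2:phase`. A minimal sketch of that transformation as a pure function follows, using the same defaults as the loop above. `add_l2` and `normalize_l2` are hypothetical helper names; the normalization step is an addition that the script itself does not perform, included on the assumption that the model may occasionally return something other than snake_case.

```python
import re


def normalize_l2(l2: str) -> str:
    """Lowercase and collapse anything non-alphanumeric into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", l2.lower()).strip("_")


def add_l2(hint: str, l2: str) -> str:
    """Rewrite 'action:L1:phase' to 'action:L1_L2:phase'."""
    parts = hint.split(":", 2)
    action = parts[0] if parts else "implement"
    l1 = parts[1] if len(parts) > 1 else "unknown"
    phase = parts[2] if len(parts) > 2 else "implementation"
    return f"{action}:{l1}_{normalize_l2(l2)}:{phase}"


if __name__ == "__main__":
    print(add_l2("implement:incident:implementation", "Response Plan"))
```

Because the rewritten object (`incident_response_plan`) is no longer a member of `BROAD_TOKENS`, a second run of the script skips controls that were already tagged.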
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Apply saved corrections from JSON file to DB (crash recovery)."""
import argparse
import json
import logging
import os
from pathlib import Path
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("apply-corrections")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("file", help="Path to corrections JSON file")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
corrections = json.loads(Path(args.file).read_text())
logger.info("Loaded %d corrections from %s", len(corrections), args.file)
if args.dry_run:
for c in corrections[:10]:
logger.info(" %s: %s → %s", c["uuid"][:8], c["old_hint"], c["new_hint"])
logger.info("DRY RUN — not applying")
return
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
applied = 0
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
applied += 1
logger.info("Applied %d corrections.", applied)
if __name__ == "__main__":
main()
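The recovery script applies whatever the JSON file contains. A minimal sketch of a pre-flight check that could run before the transaction is opened is shown below. `validate_corrections` is a hypothetical helper, and the rule that a hint contains at least two colons is inferred from how the other scripts build `new_hint`.

```python
import uuid


def validate_corrections(corrections: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the file looks applicable."""
    problems = []
    for idx, corr in enumerate(corrections):
        missing = {"uuid", "new_hint"} - corr.keys()
        if missing:
            problems.append(f"entry {idx}: missing keys {sorted(missing)}")
            continue
        try:
            uuid.UUID(str(corr["uuid"]))
        except ValueError:
            problems.append(f"entry {idx}: invalid uuid {corr['uuid']!r}")
        # Hints are built as action:object:phase, so expect at least two colons.
        if str(corr["new_hint"]).count(":") < 2:
            problems.append(f"entry {idx}: new_hint is not action:object:phase")
    return problems


if __name__ == "__main__":
    sample = [{"uuid": "nope", "new_hint": "a:b"}]
    for problem in validate_corrections(sample):
        print(problem)
```

Aborting when the returned list is non-empty keeps a malformed entry from failing halfway through the update loop.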
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Fix bad L2 subtopics: stakeholder_*, escalation fragments, *_approval*, *_documentation."""
import json
import logging
import os
import time
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("fix-subtopics")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
SYSTEM_PROMPT = """Du klassifizierst Controls mit einem L1_L2 Token. Das L2 soll den KONKRETEN fachlichen Aspekt beschreiben.
VERBOTENE L2-Wörter (zu generisch):
- stakeholder (zu vage — WER sind die Stakeholder? WAS wird getan?)
- documentation (ist eine Handlung, kein Thema)
- approval (ist eine Handlung)
- communication (zu vage)
Stattdessen SPEZIFISCH:
- "stakeholder_notification" bei Behördenmeldung → "authority_reporting"
- "stakeholder_consultation" bei DSFA → "impact_consultation"
- "stakeholder_engagement" bei Training → "participant_selection"
- "escalation_procedure" → "severity_classification" oder "response_plan"
- "access_documentation" → "access_policy" oder "permission_matrix"
- "approval_process" → "authorization_workflow" oder "sign_off"
L2 = 1-3 Wörter, snake_case, FACHLICH SPEZIFISCH.
Antworte NUR als JSON-Array: [{"id":"...","token":"L1_L2"}, ...]"""
def main():
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.release_state NOT IN ('deprecated', 'rejected')
AND cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND (
cc.generation_metadata->>'merge_group_hint' LIKE '%stakeholder%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%_escalation_%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%_approval_%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%response_time%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%machine_re%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%management_app%'
)
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint,
"current_object": parts[1] if len(parts) > 1 else "",
})
logger.info("Found %d controls with bad subtopics to fix", len(controls))
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
corrections = []
total_fixed = 0
batch_size = 20
for i in range(0, len(controls), batch_size):
batch = controls[i:i + batch_size]
items = [
f'- id="{c["control_id"]}" cur="{c["current_object"]}" t="{c["title"]}" o="{c["objective"][:80]}"'
for c in batch
]
try:
resp = httpx.post(ANTHROPIC_URL, headers=headers, json={
"model": "claude-haiku-4-5-20251001",
"max_tokens": 1500, "temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": "Fix:\n" + "\n".join(items)}],
}, timeout=45.0)
resp.raise_for_status()
content = resp.json().get("content", [{}])[0].get("text", "")
start = content.find("[")
end = content.rfind("]") + 1
results = json.loads(content[start:end]) if start >= 0 else []
except Exception as e:
logger.error("Batch %d failed: %s", i, e)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token or new_token == ctrl["current_object"]:
continue
if "stakeholder" in new_token or "approval" in new_token:
continue # Still bad
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
total_fixed += 1
if (i + batch_size) % 200 < batch_size:
logger.info("Progress: %d/%d (fixed=%d)", min(i + batch_size, len(controls)), len(controls), total_fixed)
time.sleep(0.3)
logger.info("Fixed: %d of %d controls", total_fixed, len(controls))
# Save + apply
Path("/tmp/corrections_bad_subtopics.json").write_text(json.dumps(corrections))
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done.")
if __name__ == "__main__":
main()
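The scripts in this commit all recover the model's answer by slicing from the first `[` to the last `]` and passing that to `json.loads`. A minimal sketch of that step as one shared helper, returning an empty list for every failure mode, could look like this. `extract_json_array` is a hypothetical name and is not defined anywhere in the commit.

```python
import json


def extract_json_array(content: str) -> list:
    """Return the JSON array embedded in content, or [] if none can be parsed."""
    start = content.find("[")
    end = content.rfind("]") + 1
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(content[start:end])
    except json.JSONDecodeError:
        return []
    # Guard against a valid JSON value that is not a list.
    return parsed if isinstance(parsed, list) else []


if __name__ == "__main__":
    reply = 'Here you go: [{"id": "A-1", "token": "incident_detection"}]'
    print(extract_json_array(reply))
```

Handling the parse error locally lets the caller tell an unparseable reply (empty list) apart from a transport error (exception), which the broad `except Exception` in the loop above currently merges.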
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Fix generic tokens: Re-classify controls that were assigned to
action-based tokens (documentation, procedure, process, etc.)
instead of topic-based tokens.
Processes controls sequentially in API batches (--batch-size, default 20). NO retry on timeout.
Usage:
python3 /app/scripts/gpre0_fix_generic_tokens.py --dry-run
python3 /app/scripts/gpre0_fix_generic_tokens.py
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-fix-generic")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_fix_checkpoints")
# Tokens that are ACTION-based, not TOPIC-based → must be re-classified
FORBIDDEN_TOKENS = {
"documentation", "procedure", "process",
"compliance_reporting", "records_management",
}
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control dem THEMA zu, nicht der Handlung.
KRITISCH: Die Tokens "documentation", "procedure", "process", "compliance_reporting",
"records_management" sind VERBOTEN. Klassifiziere nach dem INHALTLICHEN THEMA.
Beispiele:
- "Risikobewertung dokumentieren" → risk_management (NICHT documentation)
- "Incident-Verfahren definieren" → incident (NICHT procedure)
- "Verschlüsselungsprozess implementieren" → encryption (NICHT process)
- "Audit-Ergebnisse berichten" → compliance_audit (NICHT compliance_reporting)
- "Datenschutz-Unterlagen verwalten" → personal_data (NICHT records_management)
- "Löschkonzept dokumentieren" → data_retention (NICHT documentation)
- "Zertifizierungsverfahren definieren" → certification (NICHT procedure)
- "Schulungsprozess durchführen" → training (NICHT process)
ERLAUBTE TOKENS:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy, training, awareness, incident, risk_management,
third_party_management, change_management, asset_management,
human_resources_security
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]"""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude. NO retry on timeout."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'cur="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:100]}"'
)
prompt = "Klassifiziere nach THEMA (nicht Handlung):\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
return [], usage
except httpx.TimeoutException:
logger.error("TIMEOUT — skipping batch")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s")
time.sleep(60)
else:
logger.error("API error %d", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Failed: %s", e)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load only controls with forbidden tokens (the LIKE clauses below mirror FORBIDDEN_TOKENS)
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.release_state NOT IN ('deprecated', 'rejected')
AND (
cc.generation_metadata->>'merge_group_hint' LIKE '%:documentation:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:procedure:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:process:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:compliance_reporting:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:records_management:%'
)
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint,
"current_object": parts[1] if len(parts) > 1 else hint,
})
logger.info("Found %d controls with forbidden tokens to re-classify", len(controls))
# Process
total_fixed = 0
total_kept = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections = []
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(0, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token or new_token in FORBIDDEN_TOKENS:
total_kept += 1
continue
old_obj = ctrl["current_object"]
if new_token != old_obj:
total_fixed += 1
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
change_stats[old_obj][new_token] += 1
else:
total_kept += 1
processed = min(i + args.batch_size, len(controls))
if processed % 2000 < args.batch_size or processed >= len(controls):
logger.info(
"Progress: %d/%d (fixed=%d kept=%d skip=%d)",
processed, len(controls), total_fixed, total_kept, total_skipped,
)
time.sleep(0.3)
# Report
cost_in = total_input_tokens / 1_000_000 * 0.80
cost_out = total_output_tokens / 1_000_000 * 4.00
logger.info("\n" + "=" * 60)
logger.info("GENERIC TOKEN FIX REPORT")
logger.info("=" * 60)
logger.info("Total: %d controls", len(controls))
logger.info("Fixed: %d", total_fixed)
logger.info("Kept: %d (LLM also chose forbidden → kept as-is)", total_kept)
logger.info("Skipped: %d", total_skipped)
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
logger.info("\nTop changes:")
flat = []
for old, news in change_stats.items():
for new, cnt in news.items():
flat.append((cnt, old, new))
for cnt, old, new in sorted(flat, reverse=True)[:30]:
logger.info(" %4d × %s → %s", cnt, old, new)
# Save corrections
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file = CHECKPOINT_DIR / "corrections_generic_fix.json"
corr_file.write_text(json.dumps(corrections))
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("DRY RUN — not updating DB")
return
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints corrected.", len(corrections))
if __name__ == "__main__":
main()
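The report section computes cost from the summed `usage` counters with per-million-token rates hardcoded as 0.80 and 4.00. A minimal sketch with the rates passed as parameters follows. `estimate_cost_usd` is a hypothetical helper, and the rates in the example are simply the ones used in the script; they have not been checked against current pricing.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Cost in USD given token counts and per-million-token rates."""
    return (
        input_tokens / 1_000_000 * usd_per_mtok_in
        + output_tokens / 1_000_000 * usd_per_mtok_out
    )


if __name__ == "__main__":
    # Rates copied from the script above (assumption: they match the billed model).
    print(f"${estimate_cost_usd(1_000_000, 500_000, 0.80, 4.00):.2f}")
```

Keeping the rates next to `ANTHROPIC_MODEL` as named constants would make it harder for the two to drift apart when the model is changed.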
@@ -0,0 +1,37 @@
#!/bin/bash
# Run all 10 batches sequentially. Safe: if one fails, the rest don't run.
# Each batch saves corrections to JSON before applying to DB.
#
# Usage: bash /app/scripts/gpre0_run_all.sh
# bash /app/scripts/gpre0_run_all.sh 5 # start from batch 5
set -e
START=${1:-1}
TOTAL=10
echo "=== Starting from batch $START of $TOTAL ==="
for i in $(seq $START $TOTAL); do
echo ""
echo "================================================================"
echo " BATCH $i/$TOTAL - $(date)"
echo "================================================================"
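# Capture the exit code explicitly: with `set -e`, a bare failing command
# would abort the script before the resume hint below is printed.
EXIT_CODE=0
PYTHONPATH=/app python3 /app/scripts/gpre0_validate_hints.py \
--batch-id "$i" \
--total-batches "$TOTAL" \
--batch-size 20 || EXIT_CODE=$?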
if [ $EXIT_CODE -ne 0 ]; then
echo "BATCH $i FAILED with exit code $EXIT_CODE"
echo "Resume with: bash /app/scripts/gpre0_run_all.sh $i"
exit $EXIT_CODE
fi
echo "BATCH $i DONE — $(date)"
done
echo ""
echo "ALL $TOTAL BATCHES COMPLETE!"
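The runner passes `--batch-id` and `--total-batches` to `gpre0_validate_hints.py`, but how that script maps a batch id to a set of controls is not visible in this excerpt. The following sketch shows one plausible scheme (contiguous, near-equal slices) purely as an illustration. `batch_slice` is a hypothetical function and should not be read as the script's actual logic.

```python
def batch_slice(items: list, batch_id: int, total_batches: int) -> list:
    """Return the contiguous slice for a 1-based batch_id out of total_batches."""
    if not 1 <= batch_id <= total_batches:
        raise ValueError("batch_id must be in 1..total_batches")
    size, remainder = divmod(len(items), total_batches)
    # The first `remainder` batches receive one extra item each.
    start = (batch_id - 1) * size + min(batch_id - 1, remainder)
    end = start + size + (1 if batch_id <= remainder else 0)
    return items[start:end]


if __name__ == "__main__":
    controls = list(range(23))
    for batch_id in range(1, 11):
        print(batch_id, batch_slice(controls, batch_id, 10))
```

Whatever the real scheme is, resuming from batch N only works if the ordering of controls is stable between runs, for example through an `ORDER BY` on a fixed column.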
@@ -0,0 +1,351 @@
#!/usr/bin/env python3
"""
Phase 2: Validate and correct merge_group_hints using Claude Haiku.
Re-classifies each control's object token against the expanded ontology
(74 canonical tokens). Corrects wrong hints in the DB.
SAFETY: Split into batches via --batch-id/--total-batches (gpre0_run_all.sh runs 10). NEVER retries on timeout (risk of double billing).
Writes checkpoint after each API call for safe resume.
Usage:
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1 --dry-run
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1
python3 /app/scripts/gpre0_validate_hints.py --batch-id 2
python3 /app/scripts/gpre0_validate_hints.py --batch-id 3
python3 /app/scripts/gpre0_validate_hints.py --batch-id 4
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-validate")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_checkpoints")
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control GENAU EINEM Token zu.
REGEL: Waehle IMMER den naechstbesten Token aus der Liste. OTHER nur wenn ABSOLUT
kein Token auch nur entfernt passt (<1% der Faelle). Im Zweifel: den breitesten
passenden Token waehlen (z.B. "policy" fuer Governance-Dokumente, "procedure" fuer
Ablauf-Definitionen, "risk_management" fuer Bewertungen).
TOKENS:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring (NUR Echtzeit-Systemueberwachung),
audit_logging (Protokollierung/Audit Trail), siem, alerting (Meldepflichten),
compliance_audit (externe Pruefungen), vulnerability, patch_management,
backup, disaster_recovery, physical_security, secure_development,
api_security, input_validation, container_security, logging_configuration
DATA_PROTECTION: personal_data (DSGVO-Verarbeitung), sensitive_data (Art.9),
health_data, consent, data_subject_rights, data_retention, data_transfer,
data_breach_notification, dpia, data_processing_agreement, privacy_by_design,
data_processing_register, data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy (Richtlinie definieren), procedure (Verfahren definieren),
process (Betriebsprozess ausfuehren), training (Schulung), awareness,
incident (Vorfallsbehandlung), risk_management, third_party_management,
change_management, documentation, records_management, compliance_reporting,
asset_management, human_resources_security
REGULATORY: supervisory_authority, certification (Zertifizierung/Konformitaet),
product_safety, ai_system, financial_reporting, aml, whistleblowing,
consumer_protection, ecommerce, telecommunications, medical_device,
payment_services, critical_infrastructure, supply_chain_due_diligence,
sustainability_reporting
ABGRENZUNGEN:
- monitoring = NUR Echtzeit-Systemueberwachung, NICHT Audit/Schulung/Bewertung
- audit_logging = Protokollierung, NICHT externe Pruefung (→ compliance_audit)
- procedure = Verfahren DEFINIEREN, NICHT Vorfaelle behandeln (→ incident)
- personal_data = DSGVO-Verarbeitung, NICHT Zertifizierung (→ certification)
- alerting = Meldepflichten, NICHT Vorfallsbehandlung (→ incident)
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]
KEIN weiterer Text. Nur das Array."""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
    """Send batch to Claude. NO RETRY on timeout (double-billing risk!)."""
    items = []
    for c in controls_batch:
        items.append(
            f'- id="{c["control_id"]}" '
            f'cur="{c["current_object"]}" '
            f't="{c["title"]}" '
            f'o="{c["objective"][:100]}"'
        )
    prompt = "Klassifiziere:\n" + "\n".join(items)
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 1500,
        "temperature": 0.0,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }
    try:
        resp = httpx.post(
            ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
        )
        resp.raise_for_status()
        data = resp.json()
        content = data.get("content", [{}])[0].get("text", "")
        usage = data.get("usage", {})
        # Extract the outermost JSON array from the response text
        start = content.find("[")
        end = content.rfind("]") + 1
        if start >= 0 and end > start:
            return json.loads(content[start:end]), usage
        logger.warning("No JSON array in response")
        return [], usage
    except httpx.TimeoutException:
        # CRITICAL: Do NOT retry! Log and skip.
        logger.error("TIMEOUT - skipping batch (NOT retrying to avoid double-billing)")
        return [], {}
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            logger.warning("Rate limited - waiting 60s then skipping")
            time.sleep(60)
        else:
            logger.error("API error %d - skipping batch", e.response.status_code)
        return [], {}
    except Exception as e:
        logger.error("Request failed - skipping: %s", e)
        return [], {}
def load_checkpoint(batch_id: int) -> int:
    """Load last processed index for this batch."""
    cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
    if cp_file.exists():
        data = json.loads(cp_file.read_text())
        return data.get("last_index", 0)
    return 0


def save_checkpoint(batch_id: int, last_index: int, stats: dict):
    """Save progress checkpoint."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
    cp_file.write_text(json.dumps({
        "batch_id": batch_id,
        "last_index": last_index,
        **stats,
    }))
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-id", type=int, required=True)
    parser.add_argument("--total-batches", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from checkpoint")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Load ALL control IDs ordered deterministically, then select this batch's slice
    with engine.connect() as c:
        all_ids = c.execute(text("""
            SELECT cc.id
            FROM canonical_controls cc
            WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
              AND cc.generation_metadata->>'merge_group_hint' != ''
              AND cc.release_state NOT IN ('deprecated', 'rejected')
            ORDER BY cc.id
        """)).fetchall()
    total = len(all_ids)
    chunk = total // args.total_batches
    start_idx = (args.batch_id - 1) * chunk
    # The last batch also takes the remainder of the integer division
    end_idx = total if args.batch_id == args.total_batches else args.batch_id * chunk
    batch_ids = [str(r[0]) for r in all_ids[start_idx:end_idx]]
    logger.info("Batch %d/%d: controls %d-%d (%d controls of %d total)",
                args.batch_id, args.total_batches, start_idx, end_idx, len(batch_ids), total)
    if not batch_ids:
        # An empty IN () list would be a SQL syntax error, so stop early
        logger.info("No controls in this batch, nothing to do.")
        return

    # Load full data for this batch
    id_list = ",".join(f"'{uid}'" for uid in batch_ids)
    with engine.connect() as c:
        rows = c.execute(text(f"""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.id IN ({id_list})
            ORDER BY cc.id
        """)).fetchall()
    controls = []
    for uuid, cid, title, objective, hint in rows:
        # Hint format is "action:object:phase"
        parts = hint.split(":", 2) if hint else []
        controls.append({
            "uuid": str(uuid), "control_id": cid,
            "title": title or "", "objective": objective or "",
            "current_hint": hint, "current_object": parts[1] if len(parts) > 1 else hint,
        })

    # Resume from checkpoint?
    start_from = 0
    if args.resume:
        start_from = load_checkpoint(args.batch_id)
        if start_from > 0:
            logger.info("Resuming from index %d", start_from)

    # Process
    total_same = 0
    total_changed = 0
    total_other = 0
    total_skipped = 0
    total_input_tokens = 0
    total_output_tokens = 0
    corrections: list[dict] = []
    change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for i in range(start_from, len(controls), args.batch_size):
        batch = controls[i:i + args.batch_size]
        results, usage = call_claude(batch)
        total_input_tokens += usage.get("input_tokens", 0)
        total_output_tokens += usage.get("output_tokens", 0)
        if not results:
            total_skipped += len(batch)
            save_checkpoint(args.batch_id, i + args.batch_size, {
                "same": total_same, "changed": total_changed,
                "other": total_other, "skipped": total_skipped,
            })
            continue
        result_map = {r.get("id", ""): r for r in results}
        for ctrl in batch:
            r = result_map.get(ctrl["control_id"], {})
            new_token = r.get("token", "")
            if not new_token:
                total_skipped += 1
                continue
            old_obj = ctrl["current_object"]
            if new_token == "OTHER":
                total_other += 1
            elif new_token == old_obj:
                total_same += 1
            else:
                total_changed += 1
                parts = ctrl["current_hint"].split(":", 2)
                action = parts[0] if parts else "implement"
                phase = parts[2] if len(parts) > 2 else "implementation"
                corrections.append({
                    "uuid": ctrl["uuid"],
                    "old_hint": ctrl["current_hint"],
                    "new_hint": f"{action}:{new_token}:{phase}",
                })
                change_stats[old_obj][new_token] += 1
        # Checkpoint every batch
        save_checkpoint(args.batch_id, i + args.batch_size, {
            "same": total_same, "changed": total_changed,
            "other": total_other, "skipped": total_skipped,
        })
        processed = min(i + args.batch_size, len(controls))
        if processed % 1000 < args.batch_size or processed >= len(controls):
            logger.info(
                "Batch %d: %d/%d (same=%d changed=%d other=%d skip=%d)",
                args.batch_id, processed, len(controls),
                total_same, total_changed, total_other, total_skipped,
            )
        time.sleep(0.3)

    # Report
    # Rates are hard-coded USD per million tokens; verify against current pricing
    cost_in = total_input_tokens / 1_000_000 * 0.80  # Haiku input
    cost_out = total_output_tokens / 1_000_000 * 4.00  # Haiku output
    total_cost = cost_in + cost_out
    total_proc = total_same + total_changed + total_other
    logger.info("\n" + "=" * 60)
    logger.info("BATCH %d REPORT", args.batch_id)
    logger.info("=" * 60)
    logger.info("Processed: %d | Skipped: %d", total_proc, total_skipped)
    logger.info("Same: %d (%.1f%%)", total_same, total_same / max(total_proc, 1) * 100)
    logger.info("Changed: %d (%.1f%%)", total_changed, total_changed / max(total_proc, 1) * 100)
    logger.info("OTHER: %d (%.1f%%)", total_other, total_other / max(total_proc, 1) * 100)
    logger.info("Cost: $%.2f (Haiku)", total_cost)
    logger.info("Cost/ctrl: $%.5f", total_cost / max(total_proc, 1))

    # Top changes
    flat = []
    for old, news in change_stats.items():
        for new, cnt in news.items():
            flat.append((cnt, old, new))
    logger.info("\nTop Changes:")
    for cnt, old, new in sorted(flat, reverse=True)[:20]:
        logger.info(" %4d × %s → %s", cnt, old, new)

    # Always save corrections to file (recovery safety)
    corr_file = CHECKPOINT_DIR / f"corrections_batch_{args.batch_id}.json"
    if corrections:
        CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
        corr_file.write_text(json.dumps(corrections))
        logger.info("Saved %d corrections to %s", len(corrections), corr_file)
    if args.dry_run:
        logger.info("\nDRY RUN - not updating DB")
        return

    # Apply corrections in single transaction
    if corrections:
        logger.info("\nApplying %d corrections...", len(corrections))
        with engine.begin() as c:
            c.execute(text("SET search_path TO compliance, public"))
            for corr in corrections:
                c.execute(text("""
                    UPDATE canonical_controls
                    SET generation_metadata = jsonb_set(
                        generation_metadata,
                        '{merge_group_hint}',
                        to_jsonb(CAST(:new_hint AS text))
                    )
                    WHERE id = CAST(:uuid AS uuid)
                """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
        logger.info("Done. %d hints corrected.", len(corrections))
    else:
        logger.info("No corrections needed.")


if __name__ == "__main__":
    main()
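The batch selection in `main()` splits the deterministically ordered ID list into `--total-batches` contiguous slices, and the last batch absorbs the remainder of the integer division. The sketch below isolates that arithmetic so it can be checked without a database. The helper name `batch_bounds` and the sample total are assumptions for illustration and are not part of the script.

```python
def batch_bounds(total: int, batch_id: int, total_batches: int) -> tuple[int, int]:
    """Return the half-open [start, end) slice for a 1-based batch_id.

    Hypothetical helper mirroring the slicing arithmetic in main().
    """
    if not 1 <= batch_id <= total_batches:
        raise ValueError("batch_id must be in 1..total_batches")
    chunk = total // total_batches
    start = (batch_id - 1) * chunk
    # The last batch runs to the end so no trailing controls are dropped.
    end = total if batch_id == total_batches else batch_id * chunk
    return start, end


if __name__ == "__main__":
    total = 103  # illustrative value, not a figure from the pipeline
    for b in range(1, 11):
        print(b, batch_bounds(total, b, 10))
```

Because the slices depend only on `total` and the sort order, every worker has to see the same set of eligible controls. If hints are rewritten while other batches are still being sliced, the boundaries stay stable only as long as no control enters or leaves the `WHERE` filter.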
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
G-pre2 v2: Build Master Controls directly from canonical tokens.
No K-Means needed — Phase 2 already normalized merge_group_hints
to 74 canonical tokens. Each token = one object group.
Groups controls by (canonical_token, phase) and creates MCs
for tokens with >=2 distinct phases.
Usage:
python3 /app/scripts/gpre2_direct_mc.py --dry-run
python3 /app/scripts/gpre2_direct_mc.py --min-phases 2
"""
import argparse
import json
import logging
import os
from collections import defaultdict
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre2-direct")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
PHASE_ORDER = {
"scope": 0, "definition": 1, "governance": 1,
"design": 2, "implementation": 3, "configuration": 3,
"operation": 4, "training": 4, "monitoring": 5,
"testing": 6, "review": 7, "assessment": 8, "remediation": 8,
"validation": 9, "reporting": 10, "evidence": 11,
}
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-phases", type=int, default=2)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Step 1: Load all controls with merge_group_hint
    logger.info("Loading controls...")
    with engine.connect() as c:
        rows = c.execute(text("""
            SELECT id, control_id,
                   generation_metadata->>'merge_group_hint' AS hint
            FROM canonical_controls
            WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
              AND generation_metadata->>'merge_group_hint' != ''
              AND release_state NOT IN ('deprecated', 'rejected')
        """)).fetchall()
    logger.info("Loaded %d controls", len(rows))

    # Step 2: Group by (object_token, phase)
    token_phases: dict[str, dict[str, list]] = defaultdict(
        lambda: defaultdict(list)
    )
    for uuid, control_id, hint in rows:
        # Hint format is "action:object:phase"; skip hints without an object
        parts = hint.split(":", 2)
        if len(parts) < 2:
            continue
        action = parts[0]
        obj = parts[1]
        phase = parts[2] if len(parts) > 2 else "implementation"
        token_phases[obj][phase].append((str(uuid), control_id, action))
    logger.info("Found %d unique object tokens", len(token_phases))

    # Step 3: Create Master Controls
    master_controls = []
    master_members = []
    for token, phases in token_phases.items():
        if len(phases) < args.min_phases:
            continue
        sorted_phases = sorted(
            phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99)
        )
        phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
        total = sum(phase_counts.values())
        master_controls.append({
            "canonical_name": token,
            "phases_covered": json.dumps(sorted_phases),
            "phase_control_count": json.dumps(phase_counts),
            "total_controls": total,
        })
        for phase, controls in phases.items():
            for ctrl_uuid, ctrl_id, action in controls:
                master_members.append({
                    "canonical_name": token,
                    "control_uuid": ctrl_uuid,
                    "phase": phase,
                    "action": action,
                })
    logger.info(
        "Created %d Master Controls with %d members (min %d phases)",
        len(master_controls), len(master_members), args.min_phases,
    )

    # Stats
    if master_controls:
        counts = [mc["total_controls"] for mc in master_controls]
        phases_per = [
            len(json.loads(mc["phases_covered"])) for mc in master_controls
        ]
        logger.info(" Avg controls/MC: %.1f", sum(counts) / len(counts))
        logger.info(" Max controls/MC: %d", max(counts))
        logger.info(" Avg phases/MC: %.1f", sum(phases_per) / len(phases_per))
        logger.info(" Max phases/MC: %d", max(phases_per))
        # Size distribution
        logger.info("\n Size distribution:")
        logger.info(" ≤10: %d", sum(1 for c in counts if c <= 10))
        logger.info(" 11-50: %d", sum(1 for c in counts if 11 <= c <= 50))
        logger.info(" 51-200: %d", sum(1 for c in counts if 51 <= c <= 200))
        logger.info(" 201-500: %d", sum(1 for c in counts if 201 <= c <= 500))
        logger.info(" 501-2K: %d", sum(1 for c in counts if 501 <= c <= 2000))
        logger.info(" >2K: %d", sum(1 for c in counts if c > 2000))
        # Top 15
        top = sorted(master_controls, key=lambda x: -x["total_controls"])[:15]
        logger.info("\n Top 15 Master Controls:")
        for mc in top:
            logger.info(
                " %6d %s (%d phases)",
                mc["total_controls"],
                mc["canonical_name"],
                len(json.loads(mc["phases_covered"])),
            )
    if args.dry_run:
        logger.info("\nDRY RUN - not writing to DB")
        return

    # Step 4: Write to DB
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        c.execute(text("DELETE FROM master_control_members"))
        c.execute(text("DELETE FROM master_controls"))
        # Get next object_group_id
        max_gid = c.execute(
            text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
        ).scalar()
        next_gid = max_gid + 1
        mc_uuids = {}
        for mc in master_controls:
            gid = next_gid
            next_gid += 1
            mc_id = f"MC-{gid}"
            c.execute(text("""
                INSERT INTO master_controls
                    (master_control_id, object_group_id, canonical_name,
                     phases_covered, phase_control_count, total_controls)
                VALUES (:mcid, :gid, :name,
                        CAST(:phases AS jsonb),
                        CAST(:pcounts AS jsonb), :total)
            """), {
                "mcid": mc_id, "gid": gid,
                "name": mc["canonical_name"],
                "phases": mc["phases_covered"],
                "pcounts": mc["phase_control_count"],
                "total": mc["total_controls"],
            })
            mc_uuid = c.execute(text(
                "SELECT id FROM master_controls WHERE master_control_id = :mcid"
            ), {"mcid": mc_id}).scalar()
            mc_uuids[mc["canonical_name"]] = str(mc_uuid)
        # Insert members
        mem_count = 0
        for mem in master_members:
            mc_uuid = mc_uuids.get(mem["canonical_name"])
            if not mc_uuid:
                continue
            c.execute(text("""
                INSERT INTO master_control_members
                    (master_control_uuid, control_uuid, phase, action)
                VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid),
                        :phase, :action)
            """), {
                "mc": mc_uuid,
                "ctrl": mem["control_uuid"],
                "phase": mem["phase"],
                "action": mem["action"],
            })
            mem_count += 1
    logger.info("Wrote %d MCs + %d members to DB", len(master_controls), mem_count)


if __name__ == "__main__":
    main()
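The grouping in Steps 2 and 3 of `gpre2_direct_mc.py` can be tried without a database. The sketch below parses `action:object:phase` hints, collects the distinct phases per object token, drops tokens with fewer than `min_phases` phases, and orders the remaining phases by lifecycle rank. The function name `group_hints`, the reduced `PHASE_ORDER` subset, and the sample hints are assumptions for illustration only.

```python
from collections import defaultdict

# Subset of the script's PHASE_ORDER, copied here so the sketch runs standalone.
PHASE_ORDER = {
    "scope": 0, "design": 2, "implementation": 3,
    "monitoring": 5, "review": 7,
}


def group_hints(hints: list[str], min_phases: int = 2) -> dict[str, list[str]]:
    """Map each object token to its lifecycle-ordered phases.

    Hypothetical helper: tokens with fewer than min_phases distinct
    phases are dropped, as in Step 3 of the script.
    """
    token_phases: dict[str, set[str]] = defaultdict(set)
    for hint in hints:
        parts = hint.split(":", 2)
        if len(parts) < 2:
            continue  # malformed hint without an object token
        # A missing phase falls back to "implementation", as in the script.
        phase = parts[2] if len(parts) > 2 else "implementation"
        token_phases[parts[1]].add(phase)
    return {
        token: sorted(phases, key=lambda p: PHASE_ORDER.get(p, 99))
        for token, phases in token_phases.items()
        if len(phases) >= min_phases
    }


if __name__ == "__main__":
    sample = [
        "review:encryption:review",
        "implement:encryption:implementation",
        "define:encryption",
        "monitor:backup:monitoring",
        "broken",
    ]
    print(group_hints(sample))
```

Phases missing from `PHASE_ORDER` all receive rank 99, so their relative order among themselves is not defined by the key. Adding the phase name as a secondary sort key would make that order deterministic.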
@@ -0,0 +1,298 @@
#!/usr/bin/env python3
"""
G-pre3: Split large Master Controls by regulation source.
For each MC with >200 controls:
1. Load member controls with parent's source_citation->>'source'
2. Group by regulation source
3. Sources with >= MIN_SOURCE_SIZE → new sub-MC
4. Small sources → merge into "mixed" bucket
5. UNKNOWN (no source_citation) → sub-cluster by embedding if >MAX_MC
6. Delete original large MC, create new sub-MCs
Usage:
python3 /app/scripts/gpre3_regulation_split.py --dry-run
python3 /app/scripts/gpre3_regulation_split.py --min-source 15 --max-mc 100
"""
import argparse
import json
import logging
import os
import re
from collections import defaultdict
from sqlalchemy import create_engine, text
from services.embedding_utils import subcluster_controls
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre3")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
# ── Source key normalization ────────────────────────────────────────
# fmt: off
_SOURCE_SHORT: dict[str, str] = {
"DSGVO (EU) 2016/679": "dsgvo", "NIS2-Richtlinie (EU) 2022/2555": "nis2",
"KI-Verordnung (EU) 2024/1689": "ai_act", "Cyber Resilience Act (CRA)": "cra",
"Digital Services Act (DSA)": "dsa", "Digital Markets Act (DMA)": "dma",
"Digital Operational Resilience Act": "dora", "Data Governance Act (DGA)": "dga",
"Data Act": "data_act", "Maschinenverordnung (EU) 2023/1230": "machinery_reg",
"Medizinprodukteverordnung (EU) 2017/745 (MDR)": "mdr",
"European Health Data Space": "ehds", "European Accessibility Act": "eaa",
"EU Cybersecurity Act": "eu_csa", "EU Blue Guide 2022": "eu_blue_guide",
"EU-US Data Privacy Framework": "eu_us_dpf", "Markets in Crypto-Assets (MiCA)": "mica",
"Standardvertragsklauseln (SCC)": "scc", "ePrivacy-Richtlinie": "eprivacy",
"Batterieverordnung (EU) 2023/1542": "battery_reg",
"Bundesdatenschutzgesetz (BDSG)": "bdsg",
"BSI-Gesetz (BSIG 2025, NIS2-Umsetzung)": "bsig",
"BSI-Kritisverordnung (BSI-KritisV)": "bsi_kritisv",
"Geldwaeschegesetz (GwG)": "gwg", "Hinweisgeberschutzgesetz (HinSchG)": "hinschg",
"Lieferkettensorgfaltspflichtengesetz (LkSG)": "lksg",
"KRITIS-Dachgesetz (KRITISDachG)": "kritisdachg",
"NIST SP 800-53 Rev. 5": "nist_800_53", "NIST Cybersecurity Framework 2.0": "nist_csf",
"NIST Privacy Framework 1.0": "nist_privacy",
"NIST SP 800-207 (Zero Trust)": "nist_zero_trust",
"NIST SP 800-218 (SSDF)": "nist_ssdf", "NIST SP 800-63-3": "nist_800_63",
"NIST AI Risk Management Framework": "nist_ai_rmf",
"NISTIR 8259A IoT Security": "nist_iot",
"OWASP Top 10 (2021)": "owasp_top10", "OWASP API Security Top 10 (2023)": "owasp_api",
"OWASP ASVS 4.0": "owasp_asvs", "OWASP SAMM 2.0": "owasp_samm",
"OWASP MASVS 2.0": "owasp_masvs", "OWASP Mobile Top 10": "owasp_mobile",
"ENISA": "enisa", "TDDDG": "tdddg", "TKG": "tkg", "TMG": "tmg",
"BGB": "bgb", "UWG": "uwg", "UrhG": "urhg",
"BAIT (BaFin 2024)": "bait", "VAIT (BaFin 2022)": "vait",
"AML-Verordnung": "aml_reg", "Zahlungsdiensterichtlinie 2": "psd2",
"Telekommunikationsgesetz Oesterreich": "at_tkg",
"Österreichisches Datenschutzgesetz (DSG)": "at_dsg",
"Allgemeines Gleichbehandlungsgesetz (AGG)": "agg",
"Aktiengesetz (AktG)": "aktg", "Handelsgesetzbuch (HGB)": "hgb",
"GmbH-Gesetz (GmbHG)": "gmbhg", "Insolvenzordnung (InsO)": "inso",
"Gewerbeordnung (GewO)": "gewo", "Abgabenordnung (AO)": "ao",
}
# fmt: on
def source_to_key(source: str) -> str:
    """Normalize regulation source name to a short slug key."""
    if source in _SOURCE_SHORT:
        return _SOURCE_SHORT[source]
    s = source.lower()
    s = re.sub(r"\(.*?\)", "", s)  # drop parenthesised suffixes
    s = re.sub(r"[^a-z0-9äöüß]+", "_", s)
    s = re.sub(r"_+", "_", s).strip("_")
    return s[:40] if s else "unknown"
# ── Main ───────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-source", type=int, default=15,
                        help="Min controls per source for own sub-MC")
    parser.add_argument("--max-mc", type=int, default=100,
                        help="Max controls per sub-MC before sub-clustering")
    parser.add_argument("--threshold", type=int, default=200,
                        help="Only split MCs with more than N controls")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Step 1: Find large master controls
    with engine.connect() as c:
        large_mcs = c.execute(text("""
            SELECT mc.id, mc.master_control_id, mc.object_group_id,
                   mc.canonical_name, mc.total_controls
            FROM master_controls mc
            WHERE mc.total_controls > :threshold
            ORDER BY mc.total_controls DESC
        """), {"threshold": args.threshold}).fetchall()
    logger.info("Found %d MCs with >%d controls", len(large_mcs), args.threshold)
    if not large_mcs:
        return

    # Step 2: Build split plans
    all_splits = []
    for mc_uuid, mc_id, og_id, canonical, total in large_mcs:
        plan = _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args)
        all_splits.append(plan)
    total_new = sum(len(sp["sub_groups"]) for sp in all_splits)
    total_covered = sum(
        sum(len(sg["controls"]) for sg in sp["sub_groups"]) for sp in all_splits
    )
    logger.info("SUMMARY: %d large MCs → %d sub-MCs (%d controls)", len(all_splits), total_new, total_covered)
    if args.dry_run:
        logger.info("DRY RUN - not writing to DB")
        return
    _write_splits(engine, all_splits)
def _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args) -> dict:
    """Build a regulation-source split plan for one large MC."""
    logger.info("\n━━━ %s: %s (%d controls) ━━━", mc_id, canonical, total)
    with engine.connect() as c:
        members = c.execute(text("""
            SELECT mcm.control_uuid, mcm.phase, mcm.action,
                   cc.control_id, cc.title,
                   COALESCE(pc.source_citation->>'source', 'UNKNOWN') AS src
            FROM master_control_members mcm
            JOIN canonical_controls cc ON cc.id = mcm.control_uuid
            LEFT JOIN canonical_controls pc ON pc.id = cc.parent_control_uuid
            WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
        """), {"mc_uuid": str(mc_uuid)}).fetchall()
    by_source: dict[str, list[dict]] = defaultdict(list)
    for ctrl_uuid, phase, action, cid, title, src in members:
        by_source[src].append({
            "control_uuid": str(ctrl_uuid), "phase": phase,
            "action": action, "control_id": cid, "title": title,
        })
    sorted_sources = sorted(by_source.items(), key=lambda x: -len(x[1]))
    for src, ctrls in sorted_sources[:8]:
        logger.info(" %4d %s", len(ctrls), src)
    if len(sorted_sources) > 8:
        logger.info(" ... +%d more sources", len(sorted_sources) - 8)
    plan = {"mc_uuid": str(mc_uuid), "mc_id": mc_id, "og_id": og_id,
            "canonical": canonical, "total": total, "sub_groups": []}
    own_mc_sources = []
    mixed_controls = []
    for src, ctrls in sorted_sources:
        if src == "UNKNOWN":
            continue
        if len(ctrls) >= args.min_source:
            own_mc_sources.append((src, ctrls))
        else:
            mixed_controls.extend(ctrls)
    unknown_controls = by_source.get("UNKNOWN", [])
    # (a) Named regulation sub-MCs
    for src, ctrls in own_mc_sources:
        key = source_to_key(src)
        name = f"{canonical}_{key}"
        _add_subgroups(plan, name, src, ctrls, args.max_mc)
    # (b) Mixed small-source bucket
    if mixed_controls:
        _add_subgroups(plan, f"{canonical}_mixed", "mixed", mixed_controls, args.max_mc)
    # (c) UNKNOWN bucket
    if unknown_controls:
        _add_subgroups(plan, f"{canonical}_general", "general", unknown_controls, args.max_mc)
    logger.info("%d sub-groups:", len(plan["sub_groups"]))
    for sg in sorted(plan["sub_groups"], key=lambda x: -len(x["controls"])):
        logger.info(" %4d %s", len(sg["controls"]), sg["name"])
    return plan
def _add_subgroups(plan: dict, name: str, source: str,
                   controls: list[dict], max_mc: int):
    """Add controls as one or more sub-groups to the plan."""
    if len(controls) <= max_mc:
        plan["sub_groups"].append({"name": name, "source": source, "controls": controls})
    else:
        # Too large for one sub-MC: split further by embedding similarity
        clusters = subcluster_controls(controls, max_mc)
        for i, cluster in enumerate(clusters):
            sub_name = f"{name}_{i+1}" if len(clusters) > 1 else name
            plan["sub_groups"].append({"name": sub_name, "source": source, "controls": cluster})
def _write_splits(engine, splits: list[dict]):
    """Apply split plan: delete old MCs, create new object_groups + MCs."""
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        max_gid = c.execute(
            text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
        ).scalar()
        next_gid = max_gid + 1
        total_mc = 0
        total_mem = 0
        for sp in splits:
            c.execute(text(
                "DELETE FROM master_control_members "
                "WHERE master_control_uuid = CAST(:u AS uuid)"
            ), {"u": sp["mc_uuid"]})
            c.execute(text(
                "DELETE FROM master_controls WHERE id = CAST(:u AS uuid)"
            ), {"u": sp["mc_uuid"]})
            logger.info("Deleted %s (%s)", sp["mc_id"], sp["canonical"])
            for sg in sp["sub_groups"]:
                if not sg["controls"]:
                    continue
                gid = next_gid
                next_gid += 1
                members_list = list({ctrl["control_id"] for ctrl in sg["controls"]})
                c.execute(text("""
                    INSERT INTO object_groups
                        (group_id, canonical_name, member_count, members, top_controls_count)
                    VALUES (:gid, :name, :cnt, CAST(:members AS jsonb), 0)
                """), {"gid": gid, "name": sg["name"], "cnt": len(members_list),
                       "members": json.dumps(members_list)})
                by_phase: dict[str, list[dict]] = defaultdict(list)
                for ctrl in sg["controls"]:
                    by_phase[ctrl["phase"]].append(ctrl)
                sorted_phases = sorted(by_phase.keys())
                phase_counts = {p: len(v) for p, v in by_phase.items()}
                mc_id = f"MC-{gid}"
                c.execute(text("""
                    INSERT INTO master_controls
                        (master_control_id, object_group_id, canonical_name,
                         phases_covered, phase_control_count, total_controls)
                    VALUES (:mcid, :gid, :name,
                            CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
                """), {"mcid": mc_id, "gid": gid, "name": sg["name"],
                       "phases": json.dumps(sorted_phases),
                       "pcounts": json.dumps(phase_counts),
                       "total": sum(phase_counts.values())})
                mc_uuid = c.execute(text(
                    "SELECT id FROM master_controls WHERE master_control_id = :mcid"
                ), {"mcid": mc_id}).scalar()
                for ctrl in sg["controls"]:
                    c.execute(text("""
                        INSERT INTO master_control_members
                            (master_control_uuid, control_uuid, phase, action)
                        VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
                    """), {"mc": str(mc_uuid), "ctrl": ctrl["control_uuid"],
                           "phase": ctrl["phase"], "action": ctrl["action"]})
                    total_mem += 1
                total_mc += 1
    logger.info("Created %d new MCs with %d members", total_mc, total_mem)
    with engine.connect() as c:
        stats = c.execute(text("""
            SELECT count(*), count(CASE WHEN total_controls > 200 THEN 1 END),
                   AVG(total_controls)::int
            FROM compliance.master_controls
        """)).fetchone()
    logger.info("Final: %d MCs, %d still >200, avg %d controls/MC", stats[0], stats[1], stats[2])


if __name__ == "__main__":
    main()
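The bucketing rule in `_build_split_plan` can be exercised on plain dictionaries. The sketch below shows only that partitioning step: sources with at least `min_source` controls keep their own bucket, smaller sources are pooled into `mixed`, and controls without a source citation land in `general`. The function name `bucket_sources` and the sample data are assumptions for illustration. The embedding-based sub-clustering that `subcluster_controls` applies to oversized buckets is deliberately left out.

```python
def bucket_sources(
    by_source: dict[str, list[str]], min_source: int = 15
) -> dict[str, list[str]]:
    """Partition controls into named, 'mixed', and 'general' buckets.

    Hypothetical helper mirroring the rule in _build_split_plan.
    """
    buckets: dict[str, list[str]] = {}
    mixed: list[str] = []
    # Largest sources first, as in the script's sorted_sources.
    for src, ctrls in sorted(by_source.items(), key=lambda kv: -len(kv[1])):
        if src == "UNKNOWN":
            continue  # handled separately below
        if len(ctrls) >= min_source:
            buckets[src] = list(ctrls)
        else:
            mixed.extend(ctrls)
    if mixed:
        buckets["mixed"] = mixed
    if by_source.get("UNKNOWN"):
        buckets["general"] = list(by_source["UNKNOWN"])
    return buckets


if __name__ == "__main__":
    sample = {
        "DSGVO (EU) 2016/679": ["a", "b", "c"],
        "TKG": ["d"],
        "BGB": ["e"],
        "UNKNOWN": ["f", "g"],
    }
    print(bucket_sources(sample, min_source=2))
```

Every input control ends up in exactly one bucket, so the bucket sizes should sum to the original MC's member count. That sum is a cheap invariant to check in a dry run before any rows are deleted.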
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Phase 0: Quality Audit for Master Control Assignments.
Uses Claude Sonnet to validate whether controls are correctly assigned
to their Master Controls. Samples controls from large and small MCs.
Usage:
python3 /app/scripts/gpre_quality_audit.py
python3 /app/scripts/gpre_quality_audit.py --large-sample 50 --small-sample 10
python3 /app/scripts/gpre_quality_audit.py --mc MC-8292 # single MC
"""
import argparse
import json
import logging
import os
import random
import time
from collections import defaultdict
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("quality-audit")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = os.getenv("AUDIT_MODEL", "claude-sonnet-4-20250514")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
SYSTEM_PROMPT = """Du bist ein Compliance-Experte der prüft ob Controls korrekt zu Master Controls zugeordnet sind.
Für jeden Control beantworte:
1. MATCH: Gehört dieser Control thematisch zum Master Control Topic?
2. CONFIDENCE: Wie sicher bist du? (0.0-1.0)
3. REASON: Kurze Begründung (max 1 Satz)
4. SUGGESTED_TOPIC: Falls MATCH=false, welches Topic wäre korrekt?
Wichtige Unterscheidungen:
- "monitoring" = kontinuierliche Überwachung, Alerting, Log-Analyse
- "training" = Schulung, Awareness, Lernmaterialien
- "personal_data" = personenbezogene Daten, DSGVO-Betroffenenrechte
- "procedure" = Verfahren, Prozesse (aber NICHT wenn es spezifisch um Incidents geht)
- "incident" = Sicherheitsvorfälle, Breach Notification, Recovery
- "policy" = Richtlinien, Regelwerke, Governance-Dokumente
- "encryption" = Verschlüsselung, Kryptografie, Key Management
- "audit_logging" = Protokollierung, Audit Trail, Nachvollziehbarkeit
Antworte NUR als JSON-Array, ein Objekt pro Control."""
def call_claude(controls_batch: list[dict], mc_topic: str) -> tuple[list[dict], dict]:
    """Send a batch of controls to Claude for validation.

    Returns (results, usage); both are empty if the request ultimately fails.
    """
    items = []
    for c in controls_batch:
        items.append(
            f"- Control '{c['control_id']}': "
            f"Titel=\"{c['title']}\", "
            f"Objective=\"{c['objective'][:150]}...\", "
            f"Phase={c['phase']}, Action={c['action']}"
        )
    prompt = (
        f"Master Control Topic: \"{mc_topic}\"\n\n"
        f"Prüfe diese {len(controls_batch)} Controls:\n\n"
        + "\n".join(items)
        + "\n\nAntwort als JSON-Array mit Feldern: "
        "control_id, match (bool), confidence (float), reason (str), "
        "suggested_topic (str, nur wenn match=false)."
    )
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 2048,
        "temperature": 0.1,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }
    for attempt in range(3):
        try:
            resp = httpx.post(
                ANTHROPIC_URL,
                headers=headers,
                json=payload,
                timeout=60.0,
            )
            resp.raise_for_status()
            data = resp.json()
            content = data.get("content", [{}])[0].get("text", "")
            usage = data.get("usage", {})
            # Parse JSON from response
            start = content.find("[")
            end = content.rfind("]") + 1
            if start >= 0 and end > start:
                results = json.loads(content[start:end])
                return results, usage
            logger.warning("No JSON array in response: %s", content[:200])
            return [], usage
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                # Linear backoff, then retry on the next loop iteration
                wait = 30 * (attempt + 1)
                logger.warning("Rate limited, waiting %ds...", wait)
                time.sleep(wait)
            else:
                logger.error("API error: %s", e)
                return [], {}
        except Exception as e:
            logger.error("Request failed (attempt %d): %s", attempt + 1, e)
            if attempt < 2:
                time.sleep(5)
    return [], {}
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--large-sample", type=int, default=50,
                        help="Controls to sample per large MC")
    parser.add_argument("--small-sample", type=int, default=10,
                        help="Controls to sample per small MC")
    parser.add_argument("--small-mc-count", type=int, default=50,
                        help="Number of small MCs to audit")
    parser.add_argument("--mc", type=str, default=None,
                        help="Audit a single MC by ID (e.g., MC-8292)")
    parser.add_argument("--batch-size", type=int, default=10,
                        help="Controls per API call")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Load MCs to audit
    with engine.connect() as c:
        if args.mc:
            mcs = c.execute(text("""
                SELECT id, master_control_id, canonical_name, total_controls
                FROM master_controls WHERE master_control_id = :mc
            """), {"mc": args.mc}).fetchall()
        else:
            # Large MCs (>200) + random small MCs
            large = c.execute(text("""
                SELECT id, master_control_id, canonical_name, total_controls
                FROM master_controls WHERE total_controls > 200
                ORDER BY total_controls DESC
            """)).fetchall()
            small = c.execute(text("""
                SELECT id, master_control_id, canonical_name, total_controls
                FROM master_controls WHERE total_controls BETWEEN 10 AND 200
                ORDER BY RANDOM() LIMIT :cnt
            """), {"cnt": args.small_mc_count}).fetchall()
            mcs = list(large) + list(small)
    logger.info("Auditing %d Master Controls", len(mcs))

    # Results tracking
    total_checked = 0
    total_match = 0
    total_mismatch = 0
    total_input_tokens = 0
    total_output_tokens = 0
    mc_results: dict[str, dict] = {}
    all_mismatches: list[dict] = []
    for mc_uuid, mc_id, canonical, total in mcs:
        is_large = total > 200
        sample_size = args.large_sample if is_large else args.small_sample
        # Sample controls
        with engine.connect() as c:
            controls = c.execute(text("""
                SELECT mcm.control_uuid, mcm.phase, mcm.action,
                       cc.control_id, cc.title,
                       COALESCE(cc.objective, '') as objective
                FROM master_control_members mcm
                JOIN canonical_controls cc ON cc.id = mcm.control_uuid
                WHERE mcm.master_control_uuid = CAST(:mc AS uuid)
                ORDER BY RANDOM()
                LIMIT :n
            """), {"mc": str(mc_uuid), "n": sample_size}).fetchall()
        if not controls:
            continue
        control_dicts = [
            {"control_uuid": str(r[0]), "phase": r[1], "action": r[2],
             "control_id": r[3], "title": r[4] or "", "objective": r[5] or ""}
            for r in controls
        ]
        logger.info("\n%s: %s (%d total, sampling %d)",
                    mc_id, canonical, total, len(control_dicts))
        mc_match = 0
        mc_mismatch = 0
        # Process in batches
        for i in range(0, len(control_dicts), args.batch_size):
            batch = control_dicts[i:i + args.batch_size]
            results, usage = call_claude(batch, canonical)
            total_input_tokens += usage.get("input_tokens", 0)
            total_output_tokens += usage.get("output_tokens", 0)
            for r in results:
                if r.get("match", True):
                    mc_match += 1
                    total_match += 1
                else:
                    mc_mismatch += 1
                    total_mismatch += 1
                    mismatch = {
                        "mc_id": mc_id,
                        "mc_topic": canonical,
                        "control_id": r.get("control_id", "?"),
                        "confidence": r.get("confidence", 0),
                        "reason": r.get("reason", ""),
                        "suggested_topic": r.get("suggested_topic", ""),
                    }
                    all_mismatches.append(mismatch)
            total_checked += len(results)
            # Rate limit
            time.sleep(1)
        accuracy = mc_match / (mc_match + mc_mismatch) if (mc_match + mc_mismatch) > 0 else 1.0
        mc_results[mc_id] = {
            "canonical": canonical, "total": total,
            "checked": mc_match + mc_mismatch,
            "match": mc_match, "mismatch": mc_mismatch,
            "accuracy": accuracy,
        }
        logger.info("%d/%d correct (%.1f%%)",
                    mc_match, mc_match + mc_mismatch, accuracy * 100)

    # Final report
    _print_report(mc_results, all_mismatches, total_checked, total_match,
                  total_mismatch, total_input_tokens, total_output_tokens)
def _print_report(mc_results, mismatches, checked, match, mismatch,
input_tok, output_tok):
"""Print the quality audit report."""
logger.info("\n" + "=" * 70)
logger.info("QUALITY AUDIT REPORT")
logger.info("=" * 70)
logger.info("Total controls checked: %d", checked)
logger.info("Correct assignments: %d (%.1f%%)",
match, match / max(checked, 1) * 100)
logger.info("Wrong assignments: %d (%.1f%%)",
mismatch, mismatch / max(checked, 1) * 100)
# Cost estimate
cost_input = input_tok / 1_000_000 * 3.0 # Sonnet input: $3/MTok
cost_output = output_tok / 1_000_000 * 15.0 # Sonnet output: $15/MTok
logger.info("\nAPI Usage: %d input + %d output tokens",
input_tok, output_tok)
logger.info("Estimated cost: $%.2f", cost_input + cost_output)
# Per-MC breakdown (worst first)
logger.info("\n--- Per-MC Accuracy (worst first) ---")
sorted_mcs = sorted(mc_results.values(), key=lambda x: x["accuracy"])
for mc in sorted_mcs:
flag = "❌" if mc["accuracy"] < 0.9 else "⚠️" if mc["accuracy"] < 0.95 else "✅"
logger.info(" %s %s (%s): %d/%d = %.1f%% [total: %d]",
flag, mc["canonical"][:30].ljust(30),
"large" if mc["total"] > 200 else "small",
mc["match"], mc["checked"],
mc["accuracy"] * 100, mc["total"])
# Top mismatches
if mismatches:
logger.info("\n--- Mismatches (all %d) ---", len(mismatches))
for m in sorted(mismatches, key=lambda x: -x.get("confidence", 0)):
logger.info(" %s in %s (%s) → should be '%s': %s",
m["control_id"], m["mc_id"], m["mc_topic"],
m["suggested_topic"], m["reason"])
# Size-class breakdown
large_mcs = [m for m in mc_results.values() if m["total"] > 200]
small_mcs = [m for m in mc_results.values() if m["total"] <= 200]
if large_mcs:
lg_acc = sum(m["match"] for m in large_mcs) / max(sum(m["checked"] for m in large_mcs), 1)
logger.info("\nLarge MCs (>200): %.1f%% accuracy (%d MCs)",
lg_acc * 100, len(large_mcs))
if small_mcs:
sm_acc = sum(m["match"] for m in small_mcs) / max(sum(m["checked"] for m in small_mcs), 1)
logger.info("Small MCs (≤200): %.1f%% accuracy (%d MCs)",
sm_acc * 100, len(small_mcs))
if __name__ == "__main__":
main()
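The size-class breakdown and cost estimate in the report can be checked in isolation. The following is a minimal sketch, not the script's own code: `pooled_accuracy` and `estimate_cost` are hypothetical helper names, the sample numbers are invented, and the per-token prices are simply the defaults hard-coded in the report above.

```python
from typing import Dict

LARGE_THRESHOLD = 200  # same cutoff the audit script uses for "large" MCs


def pooled_accuracy(mc_results: Dict[str, dict]) -> Dict[str, float]:
    """Pool match/checked counts per size class (micro-average).

    Assumption: an empty class reports 1.0, mirroring the per-MC fallback;
    the report itself simply skips an empty class.
    """
    totals = {"large": [0, 0], "small": [0, 0]}
    for mc in mc_results.values():
        bucket = "large" if mc["total"] > LARGE_THRESHOLD else "small"
        totals[bucket][0] += mc["match"]
        totals[bucket][1] += mc["checked"]
    return {
        name: (match / checked if checked else 1.0)
        for name, (match, checked) in totals.items()
    }


def estimate_cost(input_tok: int, output_tok: int,
                  usd_in: float = 3.0, usd_out: float = 15.0) -> float:
    """Cost in USD given prices per million tokens."""
    return input_tok / 1_000_000 * usd_in + output_tok / 1_000_000 * usd_out


# Invented sample data, shaped like the mc_results entries in the script.
sample = {
    "MC-A": {"total": 500, "match": 45, "checked": 50},
    "MC-B": {"total": 300, "match": 20, "checked": 20},
    "MC-C": {"total": 40, "match": 8, "checked": 10},
}
acc = pooled_accuracy(sample)
cost = estimate_cost(1_000_000, 100_000)
```

Pooling the counts weights each MC by how many controls were sampled from it, so a large MC with 50 samples influences the class accuracy more than one with 20.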
@@ -460,12 +460,50 @@ WICHTIGE REGELN:
7. MERGE-KEY: Erzeuge im JSON-Output ein zusaetzliches Feld "merge_key" mit
dem Format: "action_type:normalized_object:control_phase"
WICHTIG: Waehle normalized_object NUR aus dieser Liste kanonischer Tokens:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy, procedure, process, training, awareness, incident,
risk_management, third_party_management, change_management, documentation,
records_management, compliance_reporting, asset_management,
human_resources_security
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting
Wenn KEIN Token passt: "OTHER:kurzbeschreibung" (z.B. "OTHER:battery_recycling")
ABGRENZUNGEN (haeufige Fehler vermeiden!):
- monitoring = NUR kontinuierliche Echtzeit-Ueberwachung von Systemen
- audit_logging = Protokollierung, Audit Trail, Nachvollziehbarkeit
- compliance_audit = externe Pruefungen, Zertifizierungsaudits
- training = Schulungen DURCHFUEHREN (nicht "ueberwachen")
- procedure = Verfahren DEFINIEREN (nicht Incident-Behandlung)
- incident = Sicherheitsvorfaelle BEHANDELN
- alerting = Meldepflichten und Benachrichtigungen
- personal_data = DSGVO-Verarbeitungsgrundsaetze (nicht Zertifizierung!)
- certification = Zertifizierung/Konformitaet (nicht Datenschutz)
Beispiele:
- "implement:api_rate_limiting:implementation"
- "define:access_control_policy:definition"
- "monitor:third_party_vulnerabilities:monitoring"
- "test:authentication_mechanism:testing"
- "implement:multi_factor_auth:implementation"
- "define:access_control:definition"
- "monitor:network_security:monitoring"
- "test:vulnerability:testing"
- "report:supervisory_authority:reporting"
- "implement:audit_logging:implementation" (NICHT monitoring!)
- "define:incident:definition" (Incident-Verfahren, NICHT procedure!)
- "train:training:operation" (Schulung, NICHT monitoring!)
8. APPLICABILITY + SCANNER: Bestimme fuer jedes Control:
- applicability: Unter welchen Bedingungen gilt dieses Control?
@@ -2472,6 +2510,81 @@ def _ensure_list(val) -> list:
return []
# Canonical object tokens from object_ontology (loaded once)
_CANONICAL_OBJECTS: set[str] | None = None
def _load_canonical_objects() -> set[str]:
"""Load canonical tokens from DB, fallback to hardcoded set."""
global _CANONICAL_OBJECTS
if _CANONICAL_OBJECTS is not None:
return _CANONICAL_OBJECTS
try:
from db.session import get_engine
from sqlalchemy import text
engine = get_engine()
with engine.connect() as c:
rows = c.execute(text(
"SELECT canonical_token FROM compliance.object_ontology"
)).fetchall()
_CANONICAL_OBJECTS = {r[0] for r in rows}
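except Exception as e:
# DB unavailable: log it and fall through to the hardcoded set below
logger.warning("Could not load object_ontology, using fallback: %s", e)
_CANONICAL_OBJECTS = set()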
if not _CANONICAL_OBJECTS:
_CANONICAL_OBJECTS = {
"multi_factor_auth", "password_policy", "credentials",
"session_management", "privileged_access", "access_control",
"encryption", "transport_encryption", "key_management",
"certificate_management", "network_security",
"network_segmentation", "firewall", "vpn", "remote_access",
"monitoring", "audit_logging", "siem", "alerting",
"compliance_audit", "vulnerability", "patch_management",
"backup", "disaster_recovery", "personal_data",
"sensitive_data", "consent", "data_subject_rights",
"data_retention", "data_transfer", "data_breach_notification",
"dpia", "data_processing_agreement", "privacy_by_design",
"policy", "procedure", "process", "training", "awareness",
"incident", "risk_management", "third_party_management",
"change_management", "documentation", "supervisory_authority",
"certification", "product_safety", "ai_system", "aml",
"critical_infrastructure", "medical_device",
}
return _CANONICAL_OBJECTS
def _validate_merge_key(merge_key: str) -> str:
"""Validate merge_key object against canonical ontology.
Returns the merge_key (possibly corrected). Logs warnings for
unknown objects so they can be tracked.
"""
parts = merge_key.split(":", 2)
if len(parts) < 2:
return merge_key
action, obj = parts[0], parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
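# Accept OTHER prefix (LLM signaling unknown object). The key is split on
# ":", so "action:OTHER:desc:phase" yields obj == "OTHER" without the colon.
if obj == "OTHER" or obj.startswith("OTHER:"):
return merge_key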
# Check against canonical ontology
canonical = _load_canonical_objects()
if obj in canonical:
return merge_key
# Try normalize_object() as fallback
from services.control_dedup import normalize_object
normed = normalize_object(obj)
if normed in canonical:
return f"{action}:{normed}:{phase}"
# Unknown object — log and keep as-is (will be clustered by embedding)
logger.debug("merge_key unknown object: %s (normed: %s)", obj, normed)
return merge_key
# ---------------------------------------------------------------------------
# Decomposition Pass
# ---------------------------------------------------------------------------
@@ -3025,10 +3138,10 @@ class DecompositionPass:
evidence_type=parsed.get("evidence_type", ""),
provides_context=_ensure_list(parsed.get("provides_context", [])),
)
# Store merge_key from LLM output in metadata
# Store merge_key from LLM output in metadata — with validation
llm_merge_key = parsed.get("merge_key", "")
if llm_merge_key:
atomic.merge_group_hint = llm_merge_key
atomic.merge_group_hint = _validate_merge_key(llm_merge_key)
atomic.parent_control_uuid = obl["parent_uuid"]
atomic.obligation_candidate_id = obl["candidate_id"]
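The merge-key validation in this file can be exercised without the database. The sketch below is a standalone approximation under stated assumptions: `CANONICAL` is a three-token subset chosen for illustration, and `toy_normalize` is a hypothetical stand-in for `normalize_object()`, whose real behavior is not shown in this diff.

```python
from typing import Callable, Set

# Illustrative subset of the canonical tokens (assumption, not the full ontology).
CANONICAL: Set[str] = {"audit_logging", "monitoring", "incident"}


def toy_normalize(obj: str) -> str:
    """Hypothetical stand-in for services.control_dedup.normalize_object."""
    return obj.strip().lower().replace(" ", "_").replace("-", "_")


def validate_merge_key(
    merge_key: str,
    canonical: Set[str] = CANONICAL,
    normalize: Callable[[str], str] = toy_normalize,
) -> str:
    """Return the merge_key, corrected only if normalization hits a canonical token."""
    parts = merge_key.split(":", 2)
    if len(parts) < 2:
        return merge_key  # malformed key: pass through untouched
    action, obj = parts[0], parts[1]
    phase = parts[2] if len(parts) > 2 else "implementation"
    # "action:OTHER:desc:phase" splits into obj == "OTHER", so match the bare token.
    if obj == "OTHER" or obj in canonical:
        return merge_key
    normed = normalize(obj)
    if normed in canonical:
        return f"{action}:{normed}:{phase}"
    return merge_key  # unknown object: keep as-is for embedding-based clustering


fixed = validate_merge_key("implement:Audit-Logging:implementation")
other = validate_merge_key("implement:OTHER:battery_recycling:implementation")
```

Keys that are already canonical or carry the `OTHER` marker pass through unchanged; only keys whose object normalizes onto a canonical token are rewritten, and a missing phase is then filled with `implementation`.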
@@ -0,0 +1,84 @@
"""Shared embedding + sub-clustering utilities for the control pipeline."""
import logging
import os
from collections import defaultdict
import httpx
import numpy as np
from sklearn.cluster import MiniBatchKMeans
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv(
"EMBEDDING_SERVICE_URL", "http://embedding-service:8087"
)
def embed_texts(texts: list[str]) -> np.ndarray | None:
"""Embed texts via the embedding-service in batches of 64."""
try:
result = np.zeros((len(texts), 1024), dtype=np.float32)
batch_size = 64
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
for attempt in range(3):
try:
with httpx.Client(
timeout=httpx.Timeout(60.0, connect=10.0)
) as client:
resp = client.post(
f"{EMBEDDING_URL}/embed", json={"texts": batch}
)
resp.raise_for_status()
embs = resp.json().get("embeddings", [])
end = min(i + len(embs), len(texts))
result[i:end] = np.array(embs, dtype=np.float32)
break
except Exception as e:
if attempt == 2:
logger.error("Embed batch %d failed: %s", i, e)
import time
time.sleep(2)
return result
except Exception as e:
logger.error("Embedding failed: %s", e)
return None
def subcluster_controls(
controls: list[dict], target_size: int = 50
) -> list[list[dict]]:
"""Sub-cluster controls by embedding similarity.
Returns a list of clusters. Falls back to naive chunking
if embedding fails.
"""
if len(controls) <= target_size:
return [controls]
texts = [c.get("title", "") or c.get("control_id", "") for c in controls]
embeddings = embed_texts(texts)
if embeddings is None:
return [
controls[i : i + target_size]
for i in range(0, len(controls), target_size)
]
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized = embeddings / norms
k = max(2, min(len(controls) // target_size, 30))
kmeans = MiniBatchKMeans(
n_clusters=k,
batch_size=min(100, len(controls)),
max_iter=50,
random_state=42,
)
labels = kmeans.fit_predict(normalized)
clusters: dict[int, list[dict]] = defaultdict(list)
for i, ctrl in enumerate(controls):
clusters[int(labels[i])].append(ctrl)
return list(clusters.values())
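The cluster-count rule and the fallback path of `subcluster_controls` can be illustrated without the embedding service or scikit-learn. This is a minimal sketch: `choose_k` and `naive_chunks` are hypothetical names that restate the expressions used above, not functions from the module.

```python
from typing import List


def choose_k(n_controls: int, target_size: int = 50, max_k: int = 30) -> int:
    """Number of clusters: n // target_size, clamped to the range [2, max_k]."""
    return max(2, min(n_controls // target_size, max_k))


def naive_chunks(items: List[dict], target_size: int = 50) -> List[List[dict]]:
    """Fallback used when embedding fails: fixed-size slices in input order."""
    return [items[i:i + target_size] for i in range(0, len(items), target_size)]


controls = [{"control_id": f"C-{n}"} for n in range(120)]
chunk_sizes = [len(chunk) for chunk in naive_chunks(controls)]
```

Because `k` is capped at 30, the average cluster size exceeds `target_size` once there are more than 1,500 controls; with 5,000 controls it is roughly 167 per cluster.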
@@ -0,0 +1,97 @@
# International Standards Mappings: ISO/EN ↔ National Equivalents
## Goal
Load freely accessible national standards whose content is equivalent to paid
DIN/EN/ISO standards. Own translation + mapping = legally safe (Rule 3).
## Status: IDT = Identical, MOD = Modified, NEQ = Not Equivalent
---
## China (GB/T): free at openstd.samr.gov.cn
| ISO/EN standard | GB/T equivalent | Status | Topic |
|---|---|---|---|
| ISO 12100:2010 | GB/T 15706-2012 | IDT | Risk assessment (basic standard) |
| ISO 13849-1:2023 | GB/T 16855.1-2018 | IDT | Safety-related control systems (PL) |
| ISO 13849-2:2012 | GB/T 16855.2-2015 | IDT | Validation of control systems |
| IEC 62061:2021 | GB/T 16855.3 | IDT | SIL control systems |
| IEC 60204-1:2016 | GB/T 5226.1-2019 | IDT | Electrical equipment |
| ISO 13855:2010 | GB/T 19876-2012 | IDT | Safety distances |
| ISO 13850:2015 | GB/T 16754-2022 | IDT | Emergency stop |
| ISO 14119:2013 | GB/T 18831 | IDT | Interlocking devices |
| ISO 14120:2015 | GB/T 8196-2018 | IDT | Guards |
| ISO 13857:2019 | GB/T 23821-2022 | IDT | Safety distances (limbs) |
| ISO 10218-1:2011 | GB 11291.1-2011 | IDT | Industrial robot safety |
Source: https://openstd.samr.gov.cn (SAMR/SAC, freely accessible)
---
## USA (OSHA/ANSI): free at osha.gov
| ISO/EN standard | US equivalent | Free? | Topic |
|---|---|---|---|
| ISO 12100 | ANSI/ISO 12100 (identical) | ❌ ANSI, paid | Risk assessment |
| Machinery Directive | OSHA 29 CFR 1910 Subpart O | ✅ Free | Machine guarding |
| EN 60204-1 | NFPA 79 | ❌ Paid | Electrical equipment |
| General | OSHA Technical Manual | ✅ Free | Comprehensive guidance |
Freely usable: OSHA Standards (29 CFR) + Technical Manual
Source: https://www.osha.gov/otm
---
## Korea (KS): partially free at standard.go.kr
| ISO/EN standard | KS equivalent | Status | Topic |
|---|---|---|---|
| ISO 12100:2010 | KS B ISO 12100:2014 | IDT | Risk assessment |
| ISO 13849-1 | KS B ISO 13849-1 | IDT | Safety-related control systems |
| IEC 60204-1 | KS C IEC 60204-1 | IDT | Electrical equipment |
Source: https://standard.go.kr (Korean Agency for Technology and Standards, KATS)
---
## India (BIS): partially free at bis.gov.in
| ISO/EN standard | IS equivalent | Status | Topic |
|---|---|---|---|
| ISO 12100:2010 | IS/ISO 12100:2010 | IDT | Risk assessment |
| IEC 60204-1 | IS/IEC 60204-1 | IDT | Electrical equipment |
Source: https://www.services.bis.gov.in (Bureau of Indian Standards)
---
## Download status (as of 2026-05-09)
| Source | Language | Full text free? | Status |
|---|---|---|---|
| China GB/T (openstd.samr.gov.cn) | Chinese | ❌ "Copyright protection" for ISO-based standards | Metadata only |
| USA OSHA 29 CFR 1910 (osha.gov) | English | ✅ Public domain | ✅ 1910.212 loaded |
| USA OSHA Technical Manual | English | ✅ Public domain | Partially loaded |
| Korea KS (standard.go.kr) | Korean | ❌ Paid | Metadata only |
| India BIS (bis.gov.in) | English | ❌ Paid | Metadata only |
**Sobering result:** China, Korea, and India also protect the ISO copyright
for their identical national adoptions (IDT). The full text of these adoptions is not
freely accessible anywhere; only the USA (OSHA) has its own, independent regulatory texts.
**What is still usable:**
1. OSHA 29 CFR 1910 Subpart O: independent US requirements, free, in English
2. OSHA Technical Manual: detailed guidance, free
3. Metadata from all countries: standard numbers, titles, mappings (for the reference table)
4. Chinese GB standards that are NOT based on ISO (purely Chinese standards)
---
## Legal assessment
- National equivalents are marked "IDT" (identical) = same content
- We load the NATIONAL versions (not the ISO version)
- Our own translation into German = our own work (transformative use)
- The mapping table shows the origin transparently
- We say "equivalent to ISO 12100", not "identical to ISO 12100"
- No ISO standard text is reproduced, only our own wording
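
The mapping tables above are usable as metadata even where the full text is not free. One way to hold them for lookup is sketched below; the `Equivalent` type, the `MAPPINGS` dict, and the two-letter country codes are assumptions made for illustration, and only a few rows from the tables are included.

```python
from typing import Dict, List, NamedTuple


class Equivalent(NamedTuple):
    country: str      # assumed ISO 3166 alpha-2 code
    designation: str  # national standard number as listed in the tables
    status: str       # IDT / MOD / NEQ


# A few rows copied from the tables above (not the complete mapping).
MAPPINGS: Dict[str, List[Equivalent]] = {
    "ISO 12100:2010": [
        Equivalent("CN", "GB/T 15706-2012", "IDT"),
        Equivalent("KR", "KS B ISO 12100:2014", "IDT"),
        Equivalent("IN", "IS/ISO 12100:2010", "IDT"),
    ],
    "ISO 13850:2015": [Equivalent("CN", "GB/T 16754-2022", "IDT")],
}


def equivalents(iso_norm: str, status: str = "IDT") -> List[str]:
    """National designations for an ISO/EN standard, filtered by adoption status."""
    return [e.designation for e in MAPPINGS.get(iso_norm, []) if e.status == status]


iso_12100 = equivalents("ISO 12100:2010")
```

Keying on the dated ISO designation keeps different editions of a standard apart, which matters because the national adoptions reference specific editions.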
@@ -0,0 +1,69 @@
# OSHA 29 CFR 1910 Subpart O — Machinery and Machine Guarding
# Source: https://www.osha.gov/laws-regs/regulations/standardnumber/1910/1910SubpartO
# License: US federal law (public domain)
# Loaded: 2026-05-09
## 1910.211 — Definitions
Definitions for woodworking, abrasive wheels, rubber/plastics mills, power presses, forging, and power transmission.
## 1910.212 — General Requirements for All Machines
- (a)(1) Guarding: barrier guards, two-hand tripping, electronic safety devices
- (a)(2) Guards affixed to machine, must not create hazards
- (a)(3) Point of operation guarding: guillotine cutters, shears, power presses, milling, saws, jointers, portable tools, forming rolls
- (a)(4) Revolving equipment: interlocked enclosure
- (a)(5) Fan blades below 7 feet: guards with max 1/2 inch openings
- (b) Fixed machinery anchoring
## 1910.213 — Woodworking Machinery
- Machine construction: no excessive vibration, secure bearings
- Controls: accessible power cutoff, locking belt shifters, anti-restart
- Hand-fed ripsaws: complete hoods, spreaders, non-kickback devices
- Crosscut saws: hood requirements
- Radial saws: upper/lower blade guarding, forward travel stops
- Bandsaws: full wheel encasement (0.037" min wire mesh)
- Jointers: automatic guards, max 2.5" throat, knife projection limits
- Shapers: cage or adjustable guards
- Sanding machines: feed roll guards, enclosed drums
## 1910.214 — Cooperage Machinery
[Reserved]
## 1910.215 — Abrasive Wheel Machinery
- Safety guards required (except internal work, mounted wheels ≤2")
- Angular exposure: bench/floor max 90°, cylindrical max 180°, surface/cutting max 150°
- Flanges: min 1/3 wheel diameter
- Speed limits: ≤8000 SFPM cast iron OK, 8000-16000 cast/structural steel
- Ring test before mounting
- Work rests: max 1/8" opening
## 1910.216 — Mills and Calenders (Rubber/Plastics)
- Top rolls min 50" above operator level
- Safety trip controls: pressure-sensitive body bars, triprods, tripwire
- Stopping limits: mills ≤1.5% peripheral speed, calenders ≤1.75%
- Manual reset required (no automatic)
## 1910.217 — Mechanical Power Presses
- Brakes: self-engaging
- Clutches: single-stroke with compression springs
- Two-hand controls: concurrent use, antirepeat
- Point of operation: Table O-10 max permissible openings
- Guard types: die enclosure, fixed barrier, interlocked, adjustable, presence sensing, pull-out, two-hand
- PSDI mode: light curtain, annual certification, min safety distance
- Injury reporting to OSHA within 30 days
## 1910.218 — Forging Machines
- Periodic inspection with documented certification
- Ram blocking during die changes (Table O-11)
- Tongs sufficient length to prevent kickback contact
- Scale guards at hammer/press backs
- Safety cylinder heads, quick-closing emergency valves
- Power lockout requirements
## 1910.219 — Mechanical Power-Transmission Apparatus
- Belts: guard if ≤7 feet from floor, 15" above belt minimum
- Overhead belts: full enclosure if >1800 ft/min or >8" wide
- Pulleys: guard if ≤7 feet, no cracked/broken pulleys
- Shafts: stationary casing ≤7 feet, projecting ends smooth with caps
- Gears: complete enclosure or 7-foot guard extending 6" above mesh
- Sprockets/chains: enclosure unless >7 feet
- Inspection: max 60-day intervals
@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""Crawl OSHA Technical Manual — all chapters as HTML."""
import json
import logging
import time
from pathlib import Path
from playwright.sync_api import sync_playwright
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("osha-crawl")
OUTPUT_DIR = Path(__file__).parent / "otm_chapters"
BASE = "https://www.osha.gov"
def main():
OUTPUT_DIR.mkdir(exist_ok=True)
registry = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
# Step 1: Get all chapter URLs
page.goto(f"{BASE}/otm", timeout=30000)
time.sleep(5)
links = page.query_selector_all('a[href*="/otm/"]')
chapters = []
seen = set()
for link in links:
href = link.get_attribute("href") or ""
text = (link.inner_text() or "").strip()
if href and "chapter" in href and href not in seen and text:
seen.add(href)
chapters.append({"url": href, "title": text})
logger.info("Found %d chapters", len(chapters))
# Step 2: Download each chapter
for i, ch in enumerate(chapters):
url = ch["url"] if ch["url"].startswith("http") else BASE + ch["url"]
slug = ch["url"].replace("/otm/", "").replace("/", "_")
outfile = OUTPUT_DIR / f"{slug}.html"
logger.info("[%d/%d] %s", i + 1, len(chapters), ch["title"][:60])
if outfile.exists():
logger.info(" Already exists, skipping")
ch["local_path"] = str(outfile)
registry.append(ch)
continue
try:
page.goto(url, timeout=30000)
time.sleep(3)
content = page.content()
outfile.write_text(content, encoding="utf-8")
ch["local_path"] = str(outfile)
logger.info(" Saved: %s (%.1f KB)", outfile.name, len(content) / 1024)
except Exception as e:
logger.error(" Failed: %s", e)
ch["local_path"] = None
registry.append(ch)
time.sleep(1)
browser.close()
reg_file = Path(__file__).parent / "otm_registry.json"
reg_file.write_text(json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8")
ok = sum(1 for r in registry if r.get("local_path"))
logger.info("Done: %d/%d chapters saved", ok, len(registry))
if __name__ == "__main__":
main()
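The crawler builds the download URL and the output filename with string operations, which gives an awkward slug if a link is absolute. A possible alternative using `urllib.parse` is sketched below; `resolve` and `slug_for` are hypothetical helpers, and the chapter path in the example is invented for illustration.

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.osha.gov"


def resolve(href: str) -> str:
    """Absolute URL for a link that may be relative or already absolute."""
    return urljoin(BASE, href)


def slug_for(href: str) -> str:
    """Filename slug derived from the URL path only, ignoring scheme and host."""
    path = urlparse(href).path
    return path.removeprefix("/otm/").strip("/").replace("/", "_")


relative = "/otm/section-2/chapter-1"  # invented example path
absolute = BASE + relative
```

Working on the parsed path means relative and absolute links to the same chapter map to the same file, so the `outfile.exists()` skip check behaves consistently.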
@@ -0,0 +1,177 @@
# TRBS + TRGS + ASR: Download URLs
**As of:** 2026-05-09
**Source:** BAuA (Bundesanstalt für Arbeitsschutz und Arbeitsmedizin)
**License:** Public domain (§5 UrhG, official publications)
## Instructions
BAuA uses bot protection, so the PDFs must be downloaded **manually in a browser**.
Each URL leads to the BAuA detail page; click the PDF download link there.
Place all downloaded PDFs in this directory:
```
legal-sources/trbs-trgs-asr/
```
File naming convention: `trbs_1111.pdf`, `trgs_400.pdf`, `asr_a1_3.pdf`
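
The naming convention can be derived mechanically from the detail-page URLs listed below. This is a minimal sketch with a hypothetical helper `pdf_name`; how multi-part rules such as "Teil 1" should be named is an assumption, since the convention above only shows three examples.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pdf_name(detail_url: str) -> str:
    """Map a BAuA detail-page URL to the local PDF filename."""
    stem = PurePosixPath(urlparse(detail_url).path).stem  # e.g. "TRBS-1111"
    return stem.lower().replace("-", "_") + ".pdf"


name = pdf_name("https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1111.html")
```

Deriving the name from the URL keeps manually downloaded files consistent with the list, so a later ingestion step can match each PDF back to its source entry.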
---
## TRBS: Technische Regeln für Betriebssicherheit (Technical Rules for Operational Safety, ~35 documents)
### 1000 series (General)
1. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1001.html — TRBS 1001: Struktur und Anwendung
2. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1111.html — TRBS 1111: Gefährdungsbeurteilung
3. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1112.html — TRBS 1112: Instandhaltung
4. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1112-Teil-1.html — TRBS 1112 Teil 1: Explosionsgefährdungen bei Instandhaltung
5. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1115.html — TRBS 1115: Sicherheitsrelevante MSR-Einrichtungen
6. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1115-Teil-1.html — TRBS 1115 Teil 1: Cybersicherheit für MSR
7. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1116.html — TRBS 1116: Qualifikation und Unterweisung
8. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1121.html — TRBS 1121: Änderungen an Aufzugsanlagen
9. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1122.html — TRBS 1122: Änderungen an Anlagen (§1 Abs.2 Nr.4)
10. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1123.html — TRBS 1123: Änderungen an Anlagen (§1 Abs.2 Nr.3)
11. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1151.html — TRBS 1151: Mensch-Arbeitsmittel-Schnittstelle, Ergonomie
12. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201.html — TRBS 1201: Prüfungen von Arbeitsmitteln
13. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-1.html — TRBS 1201 Teil 1: Prüfung in Ex-Bereichen
14. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-2.html — TRBS 1201 Teil 2: Prüfung bei Dampf/Druck
15. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-4.html — TRBS 1201 Teil 4: Prüfung von Aufzugsanlagen
16. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1201-Teil-5.html — TRBS 1201 Teil 5: Prüfung Lager-/Tankstellen
17. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-1203.html — TRBS 1203: Befähigte Personen
### 2000 series (Hazard-related)
18. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111.html — TRBS 2111: Mechanische Gefährdungen
19. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-1.html — TRBS 2111 Teil 1: Kontrolliert bewegte Teile
20. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-2.html — TRBS 2111 Teil 2: Unkontrolliert bewegte Teile
21. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-3.html — TRBS 2111 Teil 3: Gefährliche Oberflächen
22. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2111-Teil-4.html — TRBS 2111 Teil 4: Mobile Arbeitsmittel
23. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2121.html — TRBS 2121: Absturzgefährdung
24. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2141.html — TRBS 2141: Dampf und Druck
25. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2141-Teil-1.html — TRBS 2141 Teil 1: Versagen drucktragender Wandung
26. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152.html — TRBS 2152: Explosionsfähige Atmosphäre
27. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-1.html — TRBS 2152 Teil 1: Beurteilung Explosionsgefährdung
28. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-2.html — TRBS 2152 Teil 2: Vermeidung Ex-Atmosphäre
29. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-3.html — TRBS 2152 Teil 3: Vermeidung Entzündung
30. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2152-Teil-4.html — TRBS 2152 Teil 4: Konstruktiver Explosionsschutz
31. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2181.html — TRBS 2181: Eingeschlossensein in Personenaufnahmemitteln
32. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-2210.html — TRBS 2210: Wechselwirkungen
### 3000 series (Specific)
33. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-3121.html — TRBS 3121: Betrieb von Aufzugsanlagen
34. https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS-3151.html — TRBS 3151: Brand-/Explosionsschutz Tankstellen
---
## TRGS: Technische Regeln für Gefahrstoffe (Technical Rules for Hazardous Substances, ~50 documents)
### 200 series (Classification/labelling)
35. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-200.html — TRGS 200: Einstufung und Kennzeichnung
36. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-201.html — TRGS 201: Einstufung und Kennzeichnung bei Tätigkeiten
37. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-220.html — TRGS 220: Sicherheitsdatenblatt
### 400 series (Risk assessment)
38. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-400.html — TRGS 400: Gefährdungsbeurteilung Gefahrstoffe
39. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-401.html — TRGS 401: Hautgefährdung
40. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-402.html — TRGS 402: Inhalative Exposition
41. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-406.html — TRGS 406: Sensibilisierende Stoffe
42. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-407.html — TRGS 407: Tätigkeiten mit Gasen
43. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-410.html — TRGS 410: Expositionsverzeichnis krebserzeugende Stoffe
44. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-420.html — TRGS 420: Verfahrens- und stoffspezifische Kriterien
45. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-430.html — TRGS 430: Isocyanate
46. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-460.html — TRGS 460: Stand der Technik
### 500 series (Protective measures)
47. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-500.html — TRGS 500: Schutzmaßnahmen
48. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-504.html — TRGS 504: Tätigkeiten mit Blei
49. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-505.html — TRGS 505: Oberflächenbehandlung in Räumen
50. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-507.html — TRGS 507: Oberflächenbehandlung in Räumen und Behältern
51. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-509.html — TRGS 509: Lagern von flüssigen/festen Gefahrstoffen in ortsfesten Behältern
52. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-510.html — TRGS 510: Lagerung von Gefahrstoffen in ortsbeweglichen Behältern
53. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-512.html — TRGS 512: Begasungen
54. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-513.html — TRGS 513: Tätigkeiten an Sterilisatoren mit ETO
55. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-519.html — TRGS 519: Asbest
56. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-520.html — TRGS 520: Errichtung und Betrieb von Sammelstellen
57. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-521.html — TRGS 521: Abbruch/Sanierung alte Mineralwolle
58. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-522.html — TRGS 522: Raumdesinfektion mit Formaldehyd
59. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-523.html — TRGS 523: Schädlingsbekämpfung mit sehr giftigen/giftigen Stoffen
60. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-524.html — TRGS 524: Schutzmaßnahmen bei kontaminierten Bereichen
61. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-525.html — TRGS 525: Gefahrstoffe in Einrichtungen der medizinischen Versorgung
62. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-526.html — TRGS 526: Laboratorien
63. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-527.html — TRGS 527: Tätigkeiten mit Nanomaterialien
64. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-528.html — TRGS 528: Schweißtechnische Arbeiten
65. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-529.html — TRGS 529: Tätigkeiten bei Biogasanlagen
66. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-530.html — TRGS 530: Friseurhandwerk
67. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-551.html — TRGS 551: Teer und andere PAK-haltige Stoffe
68. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-552.html — TRGS 552: N-Nitrosamine
69. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-553.html — TRGS 553: Holzstaub
70. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-554.html — TRGS 554: Abgase von Dieselmotoren
71. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-555.html — TRGS 555: Betriebsanweisung und Information
72. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-557.html — TRGS 557: Dioxine
73. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-558.html — TRGS 558: Quarzfeinstaub
74. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-559.html — TRGS 559: Mineralischer Staub
75. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-561.html — TRGS 561: Krebserzeugende Metalle
### 600 series (Substitution)
76. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-600.html — TRGS 600: Substitution
77. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-610.html — TRGS 610: Ersatzstoffe und Ersatzverfahren für chrysotilhaltigen Asbest
78. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-617.html — TRGS 617: Ersatzstoffe für Kühlschmierstoffe
79. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-619.html — TRGS 619: Substitution für chromat-haltige Beschichtungsstoffe
### 700/800 series (Fire and explosion protection)
80. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-720.html — TRGS 720: Gefährliche explosionsfähige Gemische
81. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-721.html — TRGS 721: Beurteilung Explosionsgefährdung
82. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-722.html — TRGS 722: Vermeidung explosionsfähiger Gemische
83. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-723.html — TRGS 723: Gefährliche explosionsfähige Gemische Vermeidung Entzündung
84. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-724.html — TRGS 724: Gefährliche explosionsfähige Gemische Konstruktiver Schutz
85. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-725.html — TRGS 725: Gefährliche explosionsfähige Gemische MSR-Einrichtungen
86. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-726.html — TRGS 726: Sauerstoffgrenzkonzentration
87. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-727.html — TRGS 727: Vermeidung von Zündgefahren (elektrostatisch)
88. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-741.html — TRGS 741: Organische Peroxide
89. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-745.html — TRGS 745: Ortsbewegliche Druckgasbehälter
90. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-746.html — TRGS 746: Ortsfeste Druckanlagen für Gase
91. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-751.html — TRGS 751: Vermeidung von Brand-/Explosionsgefahren Tankstellen
92. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-800.html — TRGS 800: Brandschutzmaßnahmen
### 900 series (Limit values)
93. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-900.html — TRGS 900: Arbeitsplatzgrenzwerte
94. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-903.html — TRGS 903: Biologische Grenzwerte
95. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-905.html — TRGS 905: Verzeichnis krebserzeugender Stoffe
96. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-906.html — TRGS 906: Verzeichnis krebserzeugender Verfahren
97. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-907.html — TRGS 907: Verzeichnis sensibilisierender Stoffe
98. https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS-910.html — TRGS 910: Risikobezogenes Maßnahmenkonzept krebserzeugende Stoffe
---
## ASR: Arbeitsstättenregeln (22 documents)
99. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-V3.html — ASR V3: Gefährdungsbeurteilung
100. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-V3a-2.html — ASR V3a.2: Barrierefreie Gestaltung
101. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-2.html — ASR A1.2: Raumabmessungen und Bewegungsflächen
102. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-3.html — ASR A1.3: Sicherheits-/Gesundheitsschutzkennzeichnung
103. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-5-1-2.html — ASR A1.5/1,2: Fußböden
104. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-6.html — ASR A1.6: Fenster, Oberlichter
105. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-7.html — ASR A1.7: Türen und Tore
106. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A1-8.html — ASR A1.8: Verkehrswege
107. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A2-1.html — ASR A2.1: Schutz vor Absturz
108. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A2-2.html — ASR A2.2: Maßnahmen gegen Brände
109. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A2-3.html — ASR A2.3: Fluchtwege und Notausgänge
110. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-4.html — ASR A3.4: Beleuchtung und Sichtverbindung
111. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-4-3.html — ASR A3.4/3: Sicherheitsbeleuchtung
112. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-5.html — ASR A3.5: Raumtemperatur
113. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-6.html — ASR A3.6: Lüftung
114. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A3-7.html — ASR A3.7: Lärm
115. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-1.html — ASR A4.1: Sanitärräume
116. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-2.html — ASR A4.2: Pausen-/Bereitschaftsräume
117. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-3.html — ASR A4.3: Erste-Hilfe-Räume
118. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A4-4.html — ASR A4.4: Unterkünfte
119. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A5-2.html — ASR A5.2: Baustellen
120. https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR-A6.html — ASR A6: Bildschirmarbeit
---
**Total: 120 documents** (34 TRBS + 64 TRGS + 22 ASR)
**Note:** Some URLs may differ slightly (hyphens vs. dots). In that case, open the BAuA overview pages in a browser and download the PDFs individually from there:
- https://www.baua.de/DE/Angebote/Regelwerk/TRBS/TRBS.html
- https://www.baua.de/DE/Angebote/Regelwerk/TRGS/TRGS.html
- https://www.baua.de/DE/Angebote/Regelwerk/ASR/ASR.html
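
Because the spelling of a detail URL may vary between hyphens and dots, it can help to generate a few candidate URLs for a rule name and try them before falling back to the overview pages. The following is a minimal sketch under stated assumptions: the helper name `candidate_urls` and the two slug variants are hypothetical and inferred only from the URL patterns in this list, not from any BAuA documentation.

```python
import re

BASE = "https://www.baua.de/DE/Angebote/Regelwerk"


def candidate_urls(rule: str) -> list[str]:
    """Return plausible detail-page URLs for a rule name such as 'ASR A1.5/1,2'."""
    category, _, number = rule.strip().partition(" ")
    category = category.upper()
    # Variant 1 (assumed): every run of dots, slashes, commas, or spaces becomes a hyphen.
    hyphenated = re.sub(r"[./,\s]+", "-", number)
    # Variant 2 (assumed): dots are kept; slashes, commas, and spaces become hyphens.
    dotted = re.sub(r"[/,\s]+", "-", number)
    slugs = []
    for slug in (hyphenated, dotted):
        if slug not in slugs:  # drop duplicates while keeping order
            slugs.append(slug)
    return [f"{BASE}/{category}/{category}-{slug}.html" for slug in slugs]
```

The first candidate follows the all-hyphen pattern used throughout this list; whether the dotted variant ever occurs on the live site is not confirmed here.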
@@ -0,0 +1,256 @@
#!/usr/bin/env python3
"""
BAuA Regulatory Crawler: TRBS, TRGS, ASR

Crawls the BAuA website using Playwright (headless browser),
extracts PDF links, and downloads all documents.

Usage:
    python3 crawl_baua.py                    # download all
    python3 crawl_baua.py --category trbs    # only TRBS
    python3 crawl_baua.py --dry-run          # list PDFs without downloading
"""
import argparse
import hashlib
import json
import logging
import re
import time
from pathlib import Path
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("baua-crawler")

BASE_URL = "https://www.baua.de"
OUTPUT_DIR = Path(__file__).parent / "pdfs"
REGISTRY_FILE = Path(__file__).parent / "source_registry.json"

CATEGORIES = {
    "trbs": {
        "url": f"{BASE_URL}/DE/Angebote/Regelwerk/TRBS/TRBS.html",
        "name": "Technische Regeln für Betriebssicherheit",
        "source_type": "technical_rule",
        "legal_basis": "BetrSichV",
    },
    "trgs": {
        "url": f"{BASE_URL}/DE/Angebote/Regelwerk/TRGS/TRGS.html",
        "name": "Technische Regeln für Gefahrstoffe",
        "source_type": "technical_rule",
        "legal_basis": "GefStoffV",
    },
    "asr": {
        "url": f"{BASE_URL}/DE/Angebote/Regelwerk/ASR/ASR.html",
        "name": "Arbeitsstättenregeln",
        "source_type": "technical_rule",
        "legal_basis": "ArbStättV",
    },
}


def crawl_index(page, category: str, config: dict) -> list[dict]:
    """Crawl index page and extract detail page links."""
    logger.info("Crawling %s index: %s", category.upper(), config["url"])
    page.goto(config["url"], wait_until="networkidle", timeout=30000)
    time.sleep(3)  # Wait for BunnyShield

    # Extract all links to detail pages
    links = page.query_selector_all("a[href]")
    detail_urls = []
    seen = set()

    for link in links:
        href = link.get_attribute("href") or ""
        text = (link.inner_text() or "").strip()

        # Match pattern: /DE/Angebote/Regelwerk/TRBS/TRBS-1111 (no .html!)
        # ASR uses ASR-A1-3 (not ASR-ASR-A1-3)
        base_pattern = f"/DE/Angebote/Regelwerk/{category.upper()}/"
        is_detail = (base_pattern in href
                     and "#" not in href and "?" not in href
                     and href != base_pattern.rstrip("/")
                     and href.split("/")[-1] != category.upper())

        if is_detail and href not in seen:
            full_url = urljoin(BASE_URL, href)
            seen.add(href)
            # Extract regulation number from URL
            filename = href.split("/")[-1]
            detail_urls.append({
                "detail_url": full_url,
                "title": text[:200] if text else filename,
                "filename": filename,
                "category": category,
            })

    logger.info("Found %d detail pages for %s", len(detail_urls), category.upper())
    return detail_urls


def extract_pdf_url(page, detail: dict) -> dict:
    """Visit detail page and extract PDF download link."""
    try:
        page.goto(detail["detail_url"], wait_until="networkidle", timeout=30000)
        time.sleep(2)

        # Strategy 1: Direct PDF link
        pdf_links = page.query_selector_all('a[href$=".pdf"]')
        for link in pdf_links:
            href = link.get_attribute("href") or ""
            if href:
                detail["pdf_url"] = urljoin(BASE_URL, href)
                return detail

        # Strategy 2: Download button with data attribute
        download_btns = page.query_selector_all("[data-download-url]")
        for btn in download_btns:
            url = btn.get_attribute("data-download-url") or ""
            if url and ".pdf" in url:
                detail["pdf_url"] = urljoin(BASE_URL, url)
                return detail

        # Strategy 3: Links containing "pdf" or "download"
        all_links = page.query_selector_all("a[href]")
        for link in all_links:
            href = link.get_attribute("href") or ""
            text = (link.inner_text() or "").lower()
            if (".pdf" in href or "download" in text) and href:
                detail["pdf_url"] = urljoin(BASE_URL, href)
                return detail

        # Strategy 4: Check for blob/dynamic download
        download_links = page.query_selector_all(
            'a[href*="blob"], a[href*="download"], a[href*="__blob"]'
        )
        for link in download_links:
            href = link.get_attribute("href") or ""
            if href:
                detail["pdf_url"] = urljoin(BASE_URL, href)
                return detail

        logger.warning("No PDF found for %s", detail["filename"])
        detail["pdf_url"] = None
        return detail

    except Exception as e:
        logger.error("Error on %s: %s", detail["detail_url"], e)
        detail["pdf_url"] = None
        return detail


def download_pdf(page, detail: dict, output_dir: Path) -> dict:
    """Download PDF and compute hash."""
    if not detail.get("pdf_url"):
        return detail

    cat = detail["category"]
    safe_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", detail["filename"]).lower()
    pdf_path = output_dir / cat / f"{safe_name}.pdf"
    pdf_path.parent.mkdir(parents=True, exist_ok=True)

    # Skip files that were already downloaded in an earlier run
    if pdf_path.exists():
        logger.info(" Already exists: %s", pdf_path.name)
        detail["local_path"] = str(pdf_path)
        detail["sha256"] = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
        return detail

    try:
        with page.expect_download(timeout=60000) as download_info:
            page.goto(detail["pdf_url"], timeout=30000)
        download = download_info.value
        download.save_as(str(pdf_path))
    except Exception:
        # Fallback: direct download via response
        try:
            response = page.request.get(detail["pdf_url"])
            if response.ok:
                pdf_path.write_bytes(response.body())
            else:
                logger.error(" Download failed: %s (HTTP %d)",
                             detail["filename"], response.status)
                return detail
        except Exception as e:
            logger.error(" Download failed: %s: %s", detail["filename"], e)
            return detail

    size = pdf_path.stat().st_size
    detail["local_path"] = str(pdf_path)
    detail["sha256"] = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    detail["size_bytes"] = size
    logger.info(" Downloaded: %s (%.1f KB)", pdf_path.name, size / 1024)
    return detail


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--category", choices=["trbs", "trgs", "asr"],
                        help="Only crawl one category")
    parser.add_argument("--dry-run", action="store_true",
                        help="List PDFs without downloading")
    parser.add_argument("--headless", action="store_true", default=True)
    parser.add_argument("--no-headless", action="store_true")
    args = parser.parse_args()

    headless = not args.no_headless
    categories = [args.category] if args.category else list(CATEGORIES.keys())
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    registry = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()

        for cat in categories:
            config = CATEGORIES[cat]
            logger.info("\n=== %s ===", cat.upper())

            # Step 1: Crawl index
            details = crawl_index(page, cat, config)

            # Step 2: Extract PDF URLs
            for i, detail in enumerate(details):
                logger.info("[%d/%d] %s", i + 1, len(details), detail["filename"])
                extract_pdf_url(page, detail)
                time.sleep(1)  # Be polite

            # Step 3: Download PDFs
            if not args.dry_run:
                for detail in details:
                    download_pdf(page, detail, OUTPUT_DIR)
                    time.sleep(0.5)

            # Add metadata
            for detail in details:
                detail["source_type"] = config["source_type"]
                detail["legal_basis"] = config["legal_basis"]
                detail["license_rule"] = 1  # §5 UrhG, public domain
                detail["jurisdiction"] = "DE"

            registry.extend(details)

        browser.close()

    # Save registry
    REGISTRY_FILE.write_text(json.dumps(registry, indent=2, ensure_ascii=False))
    logger.info("\nRegistry saved: %s (%d entries)", REGISTRY_FILE, len(registry))

    # Summary
    total = len(registry)
    with_pdf = sum(1 for r in registry if r.get("pdf_url"))
    downloaded = sum(1 for r in registry if r.get("local_path"))
    logger.info("Total: %d | PDF found: %d | Downloaded: %d", total, with_pdf, downloaded)


if __name__ == "__main__":
    main()
@@ -0,0 +1,119 @@
#!/usr/bin/env python3
"""
Ingest downloaded TRBS/TRGS/ASR PDFs into Qdrant via RAG Service.

Reads the source_registry.json and uploads each PDF to the RAG service.

Usage:
    python3 ingest_to_qdrant.py                    # ingest all
    python3 ingest_to_qdrant.py --category trbs    # only TRBS
    python3 ingest_to_qdrant.py --dry-run          # list without uploading
"""
import argparse
import json
import logging
import time
from pathlib import Path
import httpx

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("ingest-trbs")

REGISTRY_FILE = Path(__file__).parent / "source_registry.json"
RAG_URL = "https://macmini:8097/api/v1/documents/upload"
COLLECTION = "bp_compliance_ce"  # Same collection as other CE documents


def ingest_pdf(entry: dict) -> dict:
    """Upload a single PDF to the RAG service."""
    local_path = entry.get("local_path", "")
    if not local_path or not Path(local_path).exists():
        return {"status": "skipped", "reason": "no local file"}

    pdf_path = Path(local_path)
    category = entry.get("category", "unknown")
    filename = entry.get("filename", pdf_path.name)
    title = entry.get("title", filename)

    metadata = {
        "source": title,
        "regulation_id": f"{category}_{filename}".lower().replace("-", "_"),
        "jurisdiction": "DE",
        "source_type": "technical_rule",
        "license_rule": 1,
        "category": category,
        "legal_basis": entry.get("legal_basis", ""),
    }

    try:
        with open(pdf_path, "rb") as f:
            files = {"file": (pdf_path.name, f, "application/pdf")}
            data = {
                "collection": COLLECTION,
                "data_type": "legal",
                "use_case": "compliance",
                "year": "2026",
                "chunk_size": "512",
                "chunk_overlap": "50",
                "metadata_json": json.dumps(metadata),
            }
            # NOTE: verify=False disables TLS certificate verification for this request.
            resp = httpx.post(RAG_URL, files=files, data=data, timeout=300.0, verify=False)
        resp.raise_for_status()
        result = resp.json()
        return {
            "status": "ok",
            "document_id": result.get("document_id", ""),
            "chunks": result.get("chunks_count", 0),
        }
    except Exception as e:
        return {"status": "error", "reason": str(e)}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--category", choices=["trbs", "trgs", "asr"])
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    registry = json.loads(REGISTRY_FILE.read_text())
    if args.category:
        registry = [e for e in registry if e.get("category") == args.category]

    logger.info("Ingesting %d documents into Qdrant (%s)", len(registry), COLLECTION)

    total_ok = 0
    total_chunks = 0
    total_err = 0

    for i, entry in enumerate(registry):
        # Entries without a downloaded file cannot be ingested
        if not entry.get("local_path"):
            continue

        if args.dry_run:
            logger.info("[%d/%d] %s: %s (dry-run)",
                        i + 1, len(registry), entry["filename"], entry.get("title", "")[:60])
            continue

        logger.info("[%d/%d] %s", i + 1, len(registry), entry["filename"])
        result = ingest_pdf(entry)

        if result["status"] == "ok":
            total_ok += 1
            total_chunks += result["chunks"]
            logger.info("%d chunks indexed", result["chunks"])
        else:
            total_err += 1
            logger.error("%s: %s", result["status"], result.get("reason", ""))

        time.sleep(1)  # Be gentle

    logger.info("\nDone: %d OK (%d chunks), %d errors, %d total",
                total_ok, total_chunks, total_err, len(registry))


if __name__ == "__main__":
    main()
File diff suppressed because it is too large
@@ -0,0 +1,114 @@
# Rulings to Download, Prioritized by Scannability
## Priority 1: Website-scannable (11 entries)
### 1. LG Muenchen I: Google Fonts
- Case no.: 3 O 17493/20 (20.01.2022)
- URL: https://www.gesetze-bayern.de/Content/Document/Y-300-Z-GRURRS-B-2022-N-612
- Scanner: fonts.googleapis.com, fonts.gstatic.com in the HTML
### 2. DSB Oesterreich (Austrian data protection authority): Google Analytics
- Case no.: D155.027 (22.12.2021)
- URL: https://noyb.eu/de/oesterreichische-dsb-eu-us-datenuebermittlung-google-analytics-illegal
- Original decision: https://noyb.eu/sites/default/files/2022-01/E-Bescheid%20%20redacted.pdf
- Scanner: google-analytics.com, gtag/js, analytics.js
### 3. CNIL: Cookie Banner, EUR 150 million
- Sanction decision against Google (31.12.2021)
- URL: https://www.cnil.fr/en/cookies-cnil-fines-google-150-million-euros
- Scanner: cookie banner DOM (parity of the reject and accept buttons)
### 4. BGH: Planet49 / Opt-In
- Case no.: I ZR 7/16 (28.05.2020, following EuGH C-673/17)
- URL: https://juris.bundesgerichtshof.de/cgi-bin/rechtsprechung/document.py?Gericht=bgh&Art=en&nr=107124
- Scanner: cookies set before consent, pre-ticked checkboxes
### 5. EuGH: Schrems II
- Case no.: C-311/18 (16.07.2020)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-311/18
- Scanner: HTTP requests to US servers (IP geolocation)
### 6. OLG Koeln: Dark Patterns in Cookie Banners
- Case no.: 6 U 58/21 (19.11.2021)
- Scanner: button size, color, and hierarchy in the consent banner
### 7. EuGH: Button Solution (Fuhrmann-2)
- Case no.: C-249/21 (07.04.2022)
- Scanner: order button text ("zahlungspflichtig bestellen"?)
### 8. BGH: Legal Notice (Impressum) on Social Media
- Case no.: I ZR 169/22 (09.09.2021)
- Scanner: complete Impressum reachable within 2 clicks
### 9. BGH: Unit Price (PAngV)
- Case no.: I ZR 46/20 (20.01.2022)
- Scanner: unit price shown next to the final price for products sold by quantity
### 10. LG Berlin: Completeness of the Privacy Policy
- Case no.: 16 O 341/15
- Scanner: mandatory disclosures under Art. 13/14 GDPR in the privacy policy
### 11. DSK: Telemedia Guidance (Orientierungshilfe)
- Already in the RAG as: dsk_oh_telemedien (589 chunks)
- NO download needed ✅
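
The "pre-ticked checkboxes" check from entry 4 (Planet49) can be approximated on static HTML. The sketch below is an illustration only: the class `PreTickedCheckboxFinder` and the function `find_pre_ticked` are hypothetical names, it uses nothing beyond Python's standard `html.parser`, and it cannot see checkboxes that are ticked later by JavaScript. Whether a hit is actually a consent checkbox still needs a separate judgment.

```python
from html.parser import HTMLParser


class PreTickedCheckboxFinder(HTMLParser):
    """Collect <input type="checkbox"> elements that carry the `checked` attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.hits: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        # Valueless attributes such as `checked` arrive with value None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("type", "").lower() == "checkbox" and "checked" in attr_map:
            self.hits.append(attr_map)


def find_pre_ticked(html: str) -> list[dict[str, str]]:
    """Return the attribute maps of all pre-ticked checkboxes in the given HTML."""
    finder = PreTickedCheckboxFinder()
    finder.feed(html)
    return finder.hits
```

A scanner could report each hit together with its `name` or `id` attribute so that a reviewer can decide whether the box relates to consent.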
## Priority 2: Document/Process Checks (9 entries)
### 12. EuGH: SCHUFA Scoring / Art. 22
- Case no.: C-634/21 (07.12.2023)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-634/21
### 13. BAG: Working Time Recording
- Case no.: 1 ABR 22/21 (13.09.2022)
- Already in the RAG as: bag_1_abr_22_21 (237 chunks)
- NO download needed ✅
### 14. EuGH: Damages After a Data Breach (fear of misuse suffices)
- Case no.: C-340/21 (14.12.2023)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-340/21
### 15. EuGH: Meta / Legitimate Interest
- Case no.: C-252/21 (04.07.2023)
- URL: https://curia.europa.eu/juris/liste.jsf?num=C-252/21
### 16. LAG Hamm: Microsoft 365 Co-determination
- Case no.: 11 Sa 1108/22 (20.06.2023)
### 17. OLG Muenchen: Withdrawal Notice (Widerrufsbelehrung)
- Case no.: 29 U 2698/19
### 18. BVerfG: Right to Be Forgotten
- Case no.: 1 BvR 1547/19 (06.11.2019)
### 19. 1&1 Fine (BfDI)
- EUR 9.55 million (09.12.2019)
- Insufficient authentication in customer service
### 20. BFSG/EAA
- Already in the RAG as: bfsg (219 chunks)
- NO download needed ✅
## Already in the RAG (no download):
- dsk_oh_telemedien (589 chunks) ✅
- bag_1_abr_22_21: working time recording (237 chunks) ✅
- bfsg (219 chunks) ✅
- 13 further BAG rulings ✅
## Download status:
- [ ] 1. Google Fonts
- [ ] 2. Google Analytics (DSB AT)
- [ ] 3. CNIL cookie banner
- [ ] 4. BGH Planet49
- [ ] 5. EuGH Schrems II
- [ ] 6. OLG Koeln dark patterns
- [ ] 7. EuGH button solution
- [ ] 8. BGH Impressum
- [ ] 9. BGH unit price
- [ ] 10. LG Berlin privacy policy
- [ ] 12. EuGH SCHUFA
- [ ] 14. EuGH data breach damages
- [ ] 15. EuGH Meta
- [ ] 16. LAG Hamm M365
- [ ] 17. OLG Muenchen withdrawal notice
- [ ] 18. BVerfG right to be forgotten
- [ ] 19. 1&1 fine
@@ -0,0 +1,28 @@
JUDGMENT OF THE COURT (Third Chamber)
4 May 2023
Case C-300/21: UI v Oesterreichische Post AG
OPERATIVE PART:
1. Article 82(1) of Regulation (EU) 2016/679 (GDPR) must be interpreted as meaning that the mere infringement of the provisions of that regulation is not sufficient to confer a right to compensation.
2. Article 82(1) GDPR must be interpreted as precluding a national rule or practice which makes compensation for non-material damage subject to the condition that the damage suffered by the data subject has reached a certain degree of seriousness.
3. Article 82 GDPR must be interpreted as meaning that, when determining the amount of damages, national courts must apply their domestic rules, provided that the EU-law principles of equivalence and effectiveness are complied with.
KEY POINTS:
- A GDPR infringement alone does NOT give rise to a claim for damages; actual damage is required
- But: NO seriousness threshold for non-material damage (any demonstrable damage suffices)
- 3 cumulative conditions under Art. 82: infringement + damage + causal link
- "Damage" must be interpreted broadly (recital 146 GDPR)
- No punitive damages; compensatory function only (full and effective compensation)
- National courts apply national law to determine the amount (procedural autonomy)
- The principles of equivalence and effectiveness must be observed
- Unpleasant feelings can constitute non-material damage (no de minimis threshold)
RELEVANT PROVISIONS:
- Art. 82 GDPR (right to compensation and liability)
- Art. 83 GDPR (administrative fines; they complement damages but are independent of them)
- Art. 84 GDPR (penalties)
- Recital 146 GDPR (broad interpretation of the concept of damage)
- Recitals 75, 85 GDPR (possible damage)
@@ -0,0 +1,44 @@
JUDGMENT OF THE COURT (Grand Chamber)
16 July 2020
Case C-311/18: Data Protection Commissioner v Facebook Ireland Ltd, Maximillian Schrems
OPERATIVE PART:
1. Article 2(1) and (2) of Regulation (EU) 2016/679 (GDPR) must be interpreted as meaning that that regulation applies to the transfer of personal data for commercial purposes by an economic operator established in a Member State to another economic operator established in a third country, irrespective of whether, at the time of that transfer or thereafter, the data may be processed by the authorities of the third country in question for the purposes of public security, defence and State security.
2. Article 46(1) and Article 46(2)(c) GDPR must be interpreted as meaning that the appropriate safeguards, enforceable rights and effective legal remedies required by those provisions must ensure that data subjects whose personal data are transferred to a third country on the basis of standard data protection clauses are afforded a level of protection essentially equivalent to that guaranteed within the EU by the GDPR, read in the light of the Charter.
3. Article 58(2)(f) and (j) GDPR must be interpreted as meaning that the competent supervisory authority is required to suspend or prohibit a transfer of personal data to a third country based on standard data protection clauses if those clauses are not or cannot be complied with in that third country and the protection required by EU law cannot be ensured by other means.
4. Examination of Decision 2010/87/EU (standard contractual clauses) in the light of Articles 7, 8 and 47 of the Charter has disclosed nothing to affect its validity.
5. Implementing Decision (EU) 2016/1250 (EU-US Privacy Shield) is INVALID.
KEY POINTS:
- The EU-US Privacy Shield is invalid
- US surveillance programmes (PRISM, UPSTREAM under Section 702 FISA + E.O. 12333) violate EU fundamental rights
- Neither Section 702 FISA nor E.O. 12333 satisfies the principle of proportionality
- PPD-28 does not grant affected EU citizens enforceable rights
- The Privacy Shield Ombudsperson is NOT an independent tribunal within the meaning of Art. 47 of the Charter
- Standard contractual clauses (SCCs) remain valid, BUT:
  - The controller must check BEFORE the transfer whether the third country offers adequate protection
  - Supplementary measures must be taken where necessary
  - If adequate protection is not possible: suspend/prohibit the transfer
- Supervisory authorities are OBLIGED to prohibit transfers if protection is not ensured
- The GDPR applies even if third-country authorities could use the data for national security purposes
RELEVANT PROVISIONS:
- Art. 44-49 GDPR (transfers to third countries)
- Art. 45 GDPR (adequacy decision)
- Art. 46 GDPR (appropriate safeguards / standard contractual clauses)
- Art. 58(2) GDPR (powers of the supervisory authorities)
- Art. 7, 8, 47 EU Charter of Fundamental Rights
- Art. 52(1) EU Charter of Fundamental Rights (proportionality)
- Section 702 FISA (US foreign intelligence surveillance)
- Executive Order 12333 (US intelligence agencies)
- PPD-28 (Presidential Policy Directive)
IMPACT:
- Every data transfer to the USA must be assessed individually (Transfer Impact Assessment)
- Additional technical measures (e.g. encryption) required
- Successor: EU-US Data Privacy Framework (2023)
@@ -0,0 +1,30 @@
JUDGMENT OF THE COURT (First Chamber)
7 December 2023
Reference for a preliminary ruling; Protection of natural persons with regard to the processing of personal data; Regulation (EU) 2016/679; Article 22; Automated individual decision-making; Credit information agencies; Automated establishment of a probability value concerning a person's ability to meet future payment commitments (scoring); Use of that probability value by third parties
In Case C-634/21: OQ v Land Hessen, intervening party: SCHUFA Holding AG
OPERATIVE PART:
Article 22(1) of Regulation (EU) 2016/679 (GDPR) must be interpreted as meaning that the automated establishment, by a credit information agency, of a probability value based on personal data relating to a person and concerning that person's ability to meet future payment commitments constitutes automated individual decision-making within the meaning of that provision, where a third party to which that probability value is transmitted draws strongly on it to establish, implement or terminate a contractual relationship with that person.
KEY POINTS:
- SCHUFA scoring is automated individual decision-making under Art. 22 GDPR
- The score value itself is already the "decision" (not only the third party's subsequent action)
- Art. 22 GDPR lays down a general PROHIBITION of automated decisions
- Exceptions only under Art. 22(2) GDPR (contract, legal provision, consent)
- Data subjects have a right of access to information about the logic involved (Art. 15(1)(h))
- National rules (such as § 31 BDSG) must comply with Art. 5, 6 and 22 GDPR
- A narrow interpretation would create a gap in legal protection (three-party problem)
- Suitable measures: right to human intervention, to express one's point of view, and to contest the decision
RELEVANT PROVISIONS:
- Art. 22 GDPR (automated individual decision-making)
- Art. 4(4) GDPR (definition of profiling)
- Art. 15(1)(h) GDPR (right of access for automated decisions)
- Art. 13(2)(f) GDPR (information obligation)
- Art. 5 GDPR (principles of processing)
- Art. 6 GDPR (lawfulness)
- § 31 BDSG (scoring; compatibility with EU law doubtful)
@@ -0,0 +1,19 @@
JUDGMENT OF THE COURT (Grand Chamber)
1 October 2019
Reference for a preliminary ruling; Directive 95/46/EC; Directive 2002/58/EC; Regulation (EU) 2016/679; Processing of personal data and protection of privacy in the electronic communications sector; Cookies; Concept of consent of the data subject; Declaration of consent by means of a pre-ticked checkbox
In Case C-673/17
Bundesverband der Verbraucherzentralen und Verbraucherverbände - Verbraucherzentrale Bundesverband e. V. v Planet49 GmbH
OPERATIVE PART:
1. Article 2(f) and Article 5(3) of Directive 2002/58/EC, read in conjunction with Article 2(h) of Directive 95/46/EC and with Article 4(11) and Article 6(1)(a) of Regulation 2016/679, must be interpreted as meaning that the consent referred to in those provisions is not validly constituted if the storage of information, or access to information already stored in the terminal equipment of a website user, by means of cookies is permitted through a pre-ticked checkbox which the user must deselect to refuse consent.
2. Article 2(f) and Article 5(3) of Directive 2002/58 are not to be interpreted differently according to whether or not the information stored or accessed in the terminal equipment of a website user is personal data.
3. Article 5(3) of Directive 2002/58 must be interpreted as meaning that the information which the service provider must give to a website user includes the duration of the operation of the cookies and whether or not third parties may have access to those cookies.
Delivered in open court in Luxembourg on 1 October 2019.
@@ -0,0 +1,29 @@
# LG München I: Google Fonts Ruling
# Case no.: 3 O 17493/20 (20.01.2022)
# Source: gesetze-bayern.de
## Operative Part (Decision)
1. The defendant is ordered to refrain from disclosing the claimant's dynamic IP address to Google when the claimant visits the defendant's website, unless the claimant has consented to the disclosure. Threatened penalty: a fine of up to EUR 250,000 or detention of up to 6 months.
2. The defendant is ordered to inform the claimant which personal data concerning the claimant are being processed.
3. The defendant is ordered to pay EUR 100 in damages for non-material harm, plus interest.
## Core Reasoning
**GDPR infringement through IP transmission:** The court held that the automatic transmission of dynamic IP addresses to Google when Google Fonts are loaded violates the right to informational self-determination (§ 823 BGB) and Art. 6(1) GDPR.
**IP addresses = personal data:** Dynamic IP addresses are personal data because the website operator has the abstract legal means (via authorities and the access provider) to have the person identified.
**No legitimate interest:** The defendant cannot rely on a legitimate interest because, according to the court, Google Fonts can also be used without a connection to Google servers being established when the page is loaded and without the visitors' IP addresses being transmitted (local hosting is possible).
## Compliance Requirement
Website operators must host Google Fonts locally or use alternatives that do not cause automatic IP transmission to external servers without explicit consent.
## Scanner Checkpoints
- Check HTML for: fonts.googleapis.com, fonts.gstatic.com
- Check CSS for: @import url('https://fonts.googleapis.com/...')
- Check JS for: WebFont.load with the google provider
- If found: FAIL (external Google Fonts embedding without consent)
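
The checkpoints above can be sketched as a static text scan. This is a minimal illustration under stated assumptions, not the project's scanner: the names `scan_google_fonts` and `verdict` are hypothetical, and the regular expression for the `WebFont.load` call is a rough heuristic. A static scan also cannot tell whether consent was obtained before the fonts were loaded, so a FAIL here only means that a reference was found.

```python
import re

# Host names taken from the scanner checkpoints.
GOOGLE_FONT_HOSTS = ("fonts.googleapis.com", "fonts.gstatic.com")

# Heuristic (assumption): a WebFont.load({...}) call whose options contain a `google:` key.
WEBFONT_GOOGLE = re.compile(r"WebFont\.load\s*\(\s*\{[^}]*\bgoogle\s*:", re.S)


def scan_google_fonts(source: str) -> list[str]:
    """Return human-readable findings for one HTML, CSS, or JS source text."""
    findings = []
    lowered = source.lower()
    for host in GOOGLE_FONT_HOSTS:
        if host in lowered:
            findings.append(f"reference to {host}")
    if WEBFONT_GOOGLE.search(source):
        findings.append("WebFont.load with google provider")
    return findings


def verdict(source: str) -> str:
    """Map findings to the FAIL/PASS outcome used in the checkpoints."""
    return "FAIL" if scan_google_fonts(source) else "PASS"
```

Running `verdict` over each fetched HTML, CSS, and JS resource of a page and reporting the individual findings would make the result easier to review than a single flag.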
@@ -3,9 +3,11 @@ from fastapi import APIRouter
from api.collections import router as collections_router
from api.documents import router as documents_router
from api.search import router as search_router
from api.tenant_documents import router as tenant_documents_router
router = APIRouter()
router.include_router(collections_router, tags=["Collections"])
router.include_router(documents_router, tags=["Documents"])
router.include_router(tenant_documents_router, tags=["Tenant Documents"])
router.include_router(search_router, tags=["Search"])
@@ -0,0 +1,289 @@
"""
Tenant-isolated document upload, listing, and deletion.

Each tenant gets their own Qdrant collection (bp_docs_tenant_{short_id}).
Documents are stored in MinIO under tenant-specific paths.
No data crosses tenant boundaries.

Endpoints:
    POST   /api/v1/tenant/documents                  - Upload + process PDF
    GET    /api/v1/tenant/documents                  - List tenant's documents
    DELETE /api/v1/tenant/documents/{doc_id}         - Delete document + vectors
    GET    /api/v1/tenant/documents/{doc_id}/status  - Processing status
"""
import json
import logging
import uuid
from typing import Optional
from fastapi import APIRouter, File, Form, HTTPException, Header, Request, UploadFile
from pydantic import BaseModel
from api.auth import optional_jwt_auth
from embedding_client import embedding_client
from html_utils import decode_html_bytes, looks_like_html, strip_html
from minio_client_wrapper import minio_wrapper
from qdrant_client_wrapper import qdrant_wrapper
logger = logging.getLogger("rag-service.api.tenant-documents")
router = APIRouter(prefix="/api/v1/tenant/documents")
VECTOR_DIM = 1024 # bge-m3 dimension
MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
ALLOWED_TYPES = {"application/pdf", "text/html", "text/plain"}
PDF_MAGIC = b"%PDF"


def _collection_name(tenant_id: str) -> str:
    """Derive tenant-specific Qdrant collection name."""
    short = tenant_id.replace("-", "")[:12]
    return f"bp_docs_tenant_{short}"


def _storage_path(tenant_id: str, document_id: str, filename: str) -> str:
    """Derive tenant-isolated storage path."""
    short = tenant_id.replace("-", "")[:12]
    return f"tenant_docs/{short}/{document_id}/{filename}"


def _extract_tenant_id(
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
) -> str:
    """Extract tenant ID from header. Required for all tenant endpoints."""
    tid = x_tenant_id or request.headers.get("x-tenant-id", "")
    if not tid:
        raise HTTPException(status_code=400, detail="X-Tenant-ID header required")
    return tid
# ── Response models ────────────────────────────────────────────────

class DocumentResponse(BaseModel):
    id: str
    filename: str
    file_size: int
    status: str
    chunk_count: int
    collection: str
    created_at: Optional[str] = None


class DocumentListResponse(BaseModel):
    documents: list[DocumentResponse]
    total: int
# ── Endpoints ──────────────────────────────────────────────────────
@router.post("", response_model=DocumentResponse)
async def upload_tenant_document(
    request: Request,
    file: UploadFile = File(...),
    x_tenant_id: Optional[str] = Header(None),
    chunk_size: int = Form(default=512),
    chunk_overlap: int = Form(default=50),
    metadata_json: Optional[str] = Form(default=None),
):
    """Upload a document, process it, and index in tenant-specific collection."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)
    # Read + validate
    file_bytes = await file.read()
    if len(file_bytes) == 0:
        raise HTTPException(status_code=400, detail="Empty file")
    if len(file_bytes) > MAX_FILE_SIZE:
        raise HTTPException(status_code=413, detail=f"File too large (max {MAX_FILE_SIZE // 1024 // 1024} MB)")
    # Keep only the final path component so a client-supplied name cannot
    # add extra segments (e.g. "../") to the tenant storage path.
    filename = (file.filename or "").replace("\\", "/").rsplit("/", 1)[-1] or "document.pdf"
    content_type = file.content_type or "application/octet-stream"
    # PDF magic bytes check
    if filename.lower().endswith(".pdf") and not file_bytes[:4].startswith(PDF_MAGIC):
        raise HTTPException(status_code=400, detail="File claims to be PDF but magic bytes don't match")
    document_id = str(uuid.uuid4())
    collection = _collection_name(tenant_id)
    object_name = _storage_path(tenant_id, document_id, filename)
    # Ensure collection exists
    await qdrant_wrapper.create_collection(collection, VECTOR_DIM)
    # Store in MinIO
    try:
        await minio_wrapper.upload_document(
            object_name=object_name,
            data=file_bytes,
            content_type=content_type,
            metadata={"document_id": document_id, "tenant_id": tenant_id},
        )
    except Exception as exc:
        logger.error("MinIO upload failed for tenant %s: %s", tenant_id, exc)
        raise HTTPException(status_code=500, detail="Storage failed")
    # Extract text
    try:
        text = await _extract_text(file_bytes, filename, content_type)
    except Exception as exc:
        logger.error("Text extraction failed: %s", exc)
        raise HTTPException(status_code=500, detail=f"Text extraction failed: {exc}")
    if not text or not text.strip():
        raise HTTPException(status_code=400, detail="No text could be extracted")
    # Chunk
    chunk_result = await embedding_client.chunk_text(
        text=text, strategy="recursive",
        chunk_size=chunk_size, overlap=chunk_overlap,
    )
    chunks = chunk_result.chunks
    chunks_meta = chunk_result.chunks_with_metadata
    if not chunks:
        raise HTTPException(status_code=400, detail="Chunking produced zero chunks")
    # Embed
    embeddings = await embedding_client.generate_embeddings(chunks)
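    # Parse extra metadata (used only if it decodes to a JSON object)
    extra_metadata = {}
    if metadata_json:
        try:
            parsed = json.loads(metadata_json)
        except json.JSONDecodeError:
            logger.warning("Ignoring invalid metadata_json for tenant %s", tenant_id[:8])
        else:
            if isinstance(parsed, dict):
                extra_metadata = parsed
    # Build payloads with tenant isolation
    _STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
    payloads = []
    for i, chunk in enumerate(chunks):
        # Caller metadata is spread first so it cannot override tenant_id,
        # document_id, or the other reserved fields set below.
        payload = {
            **extra_metadata,
            "document_id": document_id,
            "tenant_id": tenant_id,
            "filename": filename,
            # Read back by get_unique_documents when listing documents.
            "file_size": len(file_bytes),
            "chunk_index": i,
            "chunk_text": chunk,
        }
        if i < len(chunks_meta):
            for field in _STRUCT_FIELDS:
                value = chunks_meta[i].get(field)
                if value is not None and value != "":
                    payload[field] = value
        payloads.append(payload)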
    # Index in tenant collection
    indexed = await qdrant_wrapper.index_documents(
        collection=collection, vectors=embeddings, payloads=payloads,
    )
    logger.info(
        "Tenant %s: uploaded %s (%d chunks, %d vectors) to %s",
        tenant_id[:8], filename, len(chunks), indexed, collection,
    )
    return DocumentResponse(
        id=document_id, filename=filename,
        file_size=len(file_bytes), status="indexed",
        chunk_count=len(chunks), collection=collection,
    )


@router.get("", response_model=DocumentListResponse)
async def list_tenant_documents(
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
):
    """List all documents for this tenant."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)
    collection = _collection_name(tenant_id)
    try:
        # Get unique document_ids from Qdrant
        docs = await qdrant_wrapper.get_unique_documents(collection)
    except Exception:
        # Collection doesn't exist yet → no documents
        docs = []
    return DocumentListResponse(documents=docs, total=len(docs))


@router.delete("/{doc_id}")
async def delete_tenant_document(
    doc_id: str,
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
):
    """Delete a document and all its vectors from tenant collection."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)
    collection = _collection_name(tenant_id)
    errors = []
    # Delete vectors from Qdrant
    try:
        await qdrant_wrapper.delete_by_filter(
            collection=collection,
            filter_conditions={"document_id": doc_id},
        )
    except Exception as exc:
        errors.append(f"Qdrant: {exc}")
    # Delete file from MinIO
    try:
        prefix = f"tenant_docs/{tenant_id.replace('-', '')[:12]}/{doc_id}/"
        await minio_wrapper.delete_by_prefix(prefix)
    except Exception as exc:
        errors.append(f"MinIO: {exc}")
    if errors:
        logger.warning("Partial delete for %s/%s: %s", tenant_id[:8], doc_id[:8], errors)
        return {"deleted": True, "warnings": errors}
    logger.info("Tenant %s: deleted document %s", tenant_id[:8], doc_id[:8])
    return {"deleted": True, "document_id": doc_id}


@router.get("/{doc_id}/status")
async def document_status(
    doc_id: str,
    request: Request,
    x_tenant_id: Optional[str] = Header(None),
):
    """Get processing status for a document."""
    optional_jwt_auth(request)
    tenant_id = _extract_tenant_id(request, x_tenant_id)
    collection = _collection_name(tenant_id)
    try:
        count = await qdrant_wrapper.count_by_filter(
            collection=collection,
            filter_conditions={"document_id": doc_id},
        )
        status = "indexed" if count > 0 else "not_found"
    except Exception:
        count = 0
        status = "not_found"
    return {"document_id": doc_id, "status": status, "chunk_count": count}


# ── Helpers ────────────────────────────────────────────────────────
async def _extract_text(file_bytes: bytes, filename: str, content_type: str) -> str:
    """Extract text from PDF, HTML, or plain text."""
    if content_type == "application/pdf" or filename.lower().endswith(".pdf"):
        return await embedding_client.extract_pdf(file_bytes)
    if filename.lower().endswith((".html", ".htm")):
        text = decode_html_bytes(file_bytes)
        return strip_html(text)
    text = file_bytes.decode("utf-8", errors="replace")
    if looks_like_html(text):
        return strip_html(text)
    return text
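
The naming helpers in this file keep only the first 12 hex characters of the tenant ID. As a minimal standalone sketch (the two tenant IDs below are hypothetical, and the functions are copies of `_collection_name` and `_storage_path` without the leading underscore), this shows the derived names and what happens when two IDs share that prefix:

```python
def collection_name(tenant_id: str) -> str:
    """Copy of _collection_name: first 12 hex characters of the tenant ID."""
    short = tenant_id.replace("-", "")[:12]
    return f"bp_docs_tenant_{short}"


def storage_path(tenant_id: str, document_id: str, filename: str) -> str:
    """Copy of _storage_path: same 12-character prefix, then document and file."""
    short = tenant_id.replace("-", "")[:12]
    return f"tenant_docs/{short}/{document_id}/{filename}"


# Hypothetical tenant IDs that differ only after the 12th hex character.
tenant_a = "3f2b8c1e-9d4a-4c77-8e21-5a6b7c8d9e0f"
tenant_b = "3f2b8c1e-9d4a-4fff-0000-000000000000"

print(collection_name(tenant_a))
print(storage_path(tenant_a, "doc-1", "manual.pdf"))

# Both IDs map to the same collection, so their vectors would be stored together.
print(collection_name(tenant_a) == collection_name(tenant_b))
```

With random UUIDs a shared 12-character prefix is unlikely, but ID schemes that begin with a timestamp or a fixed prefix could produce it, so the truncation length is worth checking against the actual tenant ID format.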
@@ -122,6 +122,16 @@ class MinioClientWrapper:
            logger.error("Failed to delete '%s': %s", object_name, exc)
            raise

    async def delete_by_prefix(self, prefix: str) -> int:
        """Remove all objects under a prefix."""
        objects = self.client.list_objects(settings.MINIO_BUCKET, prefix=prefix, recursive=True)
        count = 0
        for obj in objects:
            self.client.remove_object(settings.MINIO_BUCKET, obj.object_name)
            count += 1
        logger.info("Deleted %d objects with prefix '%s'", count, prefix)
        return count

    # ------------------------------------------------------------------
    # Presigned URL
    # ------------------------------------------------------------------
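
`delete_by_prefix` above is declared `async` but calls the client synchronously, so the event loop waits while objects are listed and removed. One possible variation, sketched here under assumptions, moves the blocking loop into a worker thread with `asyncio.to_thread`. `FakeMinio` and `FakeObject` are hypothetical in-memory stand-ins that only imitate the `list_objects` and `remove_object` calls used in the hunk; the bucket name and object keys are made up.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeObject:
    object_name: str


class FakeMinio:
    """Hypothetical in-memory stand-in for the two client calls used above."""

    def __init__(self, names):
        self.names = set(names)

    def list_objects(self, bucket, prefix="", recursive=False):
        # Return a snapshot list so removal during iteration is safe.
        return [FakeObject(n) for n in sorted(self.names) if n.startswith(prefix)]

    def remove_object(self, bucket, object_name):
        self.names.discard(object_name)


def _delete_sync(client, bucket: str, prefix: str) -> int:
    """Blocking loop: remove every object whose key starts with the prefix."""
    count = 0
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        client.remove_object(bucket, obj.object_name)
        count += 1
    return count


async def delete_by_prefix(client, bucket: str, prefix: str) -> int:
    # Run the blocking loop in a worker thread so the event loop stays free.
    return await asyncio.to_thread(_delete_sync, client, bucket, prefix)


client = FakeMinio([
    "tenant_docs/aaa/doc-1/a.pdf",
    "tenant_docs/aaa/doc-1/b.txt",
    "tenant_docs/aaa/doc-2/c.pdf",
])
deleted = asyncio.run(delete_by_prefix(client, "bucket", "tenant_docs/aaa/doc-1/"))
print(deleted, sorted(client.names))
```

The trailing slash in the prefix matters: without it, `tenant_docs/aaa/doc-1` would also match keys under `doc-10/`. The delete endpoint in this commit already builds its prefix with the trailing slash.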
@@ -235,6 +235,74 @@ class QdrantClientWrapper:
        logger.info("Deleted points from '%s' with filter %s", collection, filter_conditions)
        return True

    # ------------------------------------------------------------------
    # Tenant document helpers
    # ------------------------------------------------------------------
    async def get_unique_documents(self, collection: str) -> list[dict]:
        """Get unique documents from a collection by scrolling and grouping."""
        try:
            self.client.get_collection(collection)
        except Exception:
            return []
        docs: dict[str, dict] = {}
        offset = None
        while True:
            result = self.client.scroll(
                collection_name=collection,
                scroll_filter=None,
                limit=100,
                offset=offset,
                with_payload=True,
                with_vectors=False,
            )
            points, next_offset = result
            for pt in points:
                payload = pt.payload or {}
                doc_id = payload.get("document_id", "")
                if doc_id and doc_id not in docs:
                    docs[doc_id] = {
                        "id": doc_id,
                        "filename": payload.get("filename", ""),
                        "file_size": payload.get("file_size", 0),
                        "status": "indexed",
                        "chunk_count": 0,
                        "collection": collection,
                    }
                if doc_id:
                    docs[doc_id]["chunk_count"] += 1
            if next_offset is None:
                break
            offset = next_offset
        return list(docs.values())

    async def count_by_filter(
        self, collection: str, filter_conditions: dict[str, Any]
    ) -> int:
        """Count points matching filter."""
        try:
            self.client.get_collection(collection)
        except Exception:
            return 0
        must_conditions = []
        for key, value in filter_conditions.items():
            must_conditions.append(
                qmodels.FieldCondition(
                    key=key, match=qmodels.MatchValue(value=value)
                )
            )
        result = self.client.count(
            collection_name=collection,
            count_filter=qmodels.Filter(must=must_conditions),
            exact=True,
        )
        return result.count

    # ------------------------------------------------------------------
    # Info
    # ------------------------------------------------------------------
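
The scroll-and-group loop in `get_unique_documents` can be exercised without a running Qdrant instance. The sketch below is an illustration under assumptions, not the wrapper itself: `group_documents`, `fake_scroll`, and the two-page `PAGES` data are hypothetical, and the page function returns plain payload dicts instead of Qdrant point objects. It follows the same pattern of paging until the next offset is `None` and counting chunks per `document_id`.

```python
from typing import Callable, Optional

Page = tuple[list[dict], Optional[int]]


def group_documents(scroll_page: Callable[[Optional[int]], Page]) -> list[dict]:
    """Page through payloads and count chunks per document_id."""
    docs: dict[str, dict] = {}
    offset: Optional[int] = None
    while True:
        payloads, next_offset = scroll_page(offset)
        for payload in payloads:
            doc_id = payload.get("document_id", "")
            if not doc_id:
                continue  # points without a document_id are skipped
            entry = docs.setdefault(
                doc_id,
                {"id": doc_id, "filename": payload.get("filename", ""), "chunk_count": 0},
            )
            entry["chunk_count"] += 1
        if next_offset is None:
            break
        offset = next_offset
    return list(docs.values())


# Hypothetical two-page result set keyed by the offset that requests it.
PAGES: dict[Optional[int], Page] = {
    None: ([{"document_id": "d1", "filename": "a.pdf"},
            {"document_id": "d1", "filename": "a.pdf"}], 2),
    2: ([{"document_id": "d2", "filename": "b.txt"},
         {"filename": "orphan"}], None),
}


def fake_scroll(offset: Optional[int]) -> Page:
    return PAGES[offset]


result = group_documents(fake_scroll)
print(result)
```

Because every point in the collection is visited, the cost of listing grows with the number of chunks rather than the number of documents, which may matter for tenants with large collections.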