Compare commits
137 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 870cdc871e | |||
| f398088fbb | |||
| 0c5f1fd7a4 | |||
| de542633e2 | |||
| ff4a743558 | |||
| dac2a9f685 | |||
| adb7c6802c | |||
| dbfe7347b1 | |||
| a7850a0296 | |||
| ec3b0e26fd | |||
| 19d1a56df4 | |||
| 3934bdf814 | |||
| dbd44ecc20 | |||
| 93687a32fe | |||
| 2d9fec3a6d | |||
| a6f4ca88a4 | |||
| 297eff949e | |||
| 01e2e0fc4b | |||
| b4043b20b2 | |||
| ad61fd3779 | |||
| d1b55cd65b | |||
| cb46372e52 | |||
| f1814fe8ec | |||
| 12a9fe1810 | |||
| 8b5b9905a7 | |||
| cd23ebc3ba | |||
| f30ac73b79 | |||
| bb85ee2e27 | |||
| 0d5ebcd27a | |||
| 7d721a6787 | |||
| 9a1ad87acd | |||
| 911697bab4 | |||
| 9783657da3 | |||
| 47d7beeb52 | |||
| 63b195c0aa | |||
| 77993d0ea0 | |||
| 9382d2a7a4 | |||
| b727f14011 | |||
| 084beed348 | |||
| 5510689710 | |||
| 49e594bf38 | |||
| 583e54fabc | |||
| 7f4b7da098 | |||
| f3e54180f0 | |||
| ae937a35d7 | |||
| edac3aca6c | |||
| fc4d5d8c56 | |||
| f5d4e3bd95 | |||
| 9e3604fe31 | |||
| 0c09b960b9 | |||
| cf18b1074a | |||
| 2e8cbfff3f | |||
| f6489e7748 | |||
| 519cc274bb | |||
| 79810f4eb8 | |||
| 5f193c8a72 | |||
| d13f4511cb | |||
| 937eca6b77 | |||
| 0c1561d6cc | |||
| 0bb9726ddd | |||
| 8510af46eb | |||
| 81db904b3e | |||
| 572052285c | |||
| 1ef22e6f95 | |||
| d291af0e33 | |||
| 76aad8b1d1 | |||
| 54f0919b73 | |||
| ec7eee8e3d | |||
| b0d273d3ab | |||
| 17b9006b88 | |||
| e013702a02 | |||
| f022b489e2 | |||
| 0092c4fe47 | |||
| d5bcd0bd5b | |||
| c398e74d5e | |||
| e82f99b8cb | |||
| 66a70ab31c | |||
| ad24835940 | |||
| e683701a44 | |||
| 0bad74a3bd | |||
| 22257a7ed8 | |||
| a20de0b52b | |||
| 775d8b52f3 | |||
| f0a84e79ab | |||
| 64f45be63a | |||
| 404963db77 | |||
| 0acbf25956 | |||
| 2bd9b015eb | |||
| be126a7a39 | |||
| 30a9165497 | |||
| f2184be02f | |||
| 06014d57b3 | |||
| 6c022d1a79 | |||
| e869cabc81 | |||
| 652e3a65a3 | |||
| aab8eeb335 | |||
| 9437e029d0 | |||
| 4fd2bfefcd | |||
| fac9280716 | |||
| 118be3540d | |||
| a9671a572b | |||
| 2f4a3f2ea2 | |||
| 0b0eed27b0 | |||
| 97a7f6f264 | |||
| ff21bc258a | |||
| 3009f3d13a | |||
| 5a6e588641 | |||
| 41183ff93d | |||
| 75dda9ac92 | |||
| a459636bc4 | |||
| ddad58f607 | |||
| f130c45ca8 | |||
| 93099b2770 | |||
| 370143b643 | |||
| 07039cc408 | |||
| af83e41494 | |||
| 9888b1b5d7 | |||
| da21339e76 | |||
| 6ab10415d8 | |||
| 1bf1411c66 | |||
| 5946aa47d5 | |||
| d9c16fb914 | |||
| 6f58fdbaa5 | |||
| b8ff4e9290 | |||
| f2104768a0 | |||
| 2f861cd6d7 | |||
| 23b233bda3 | |||
| adfff6cfe4 | |||
| 269464943e | |||
| e8df15c0f8 | |||
| 7c5592b50e | |||
| e8f018f2c6 | |||
| b151951448 | |||
| 2e2e81b3e1 | |||
| b873c0e4ae | |||
| 9dc16674e2 | |||
| e6e2688b56 |
@@ -25,6 +25,7 @@ voice-service/bqas/** | owner=pipeline | reason=RAG Quality Assessment, produkti
|
||||
# Seed/Helper Scripts (no service logic)
scripts/seed-demo-and-screenshot.py | owner=infra | reason=One-off seed script, no service code | review=permanent
pitch-deck/scripts/import-finanzplan.py | owner=pitch-deck | reason=583 LOC, one-off Excel import script (9 sheet importers), hard-coded row/col mappings for one Finanzplan .xlsm file, no reusable logic | review=2027-01
pitch-deck/scripts/export-finanzplan-excel.ts | owner=pitch-deck | reason=1254 LOC, Excel export script, analogous to import-finanzplan.py: 9 sheets, ~80% cell-formatting/styling boilerplate, no reusable logic | review=2027-01

# PDF Templates (purely static HTML/CSS strings, no logic)
backend-core/services/pdf_templates.py | owner=all | reason=519 LOC, purely static Jinja2 HTML templates + CSS, no logic | review=2026-07
|
||||
@@ -33,3 +34,6 @@ backend-core/services/pdf_templates.py | owner=all | reason=519 LOC, rein statis
|
||||
pitch-deck/lib/presenter/presenter-faq.ts | owner=pitch-deck | reason=973 LOC, pure static FAQ array (questions/answers/keywords), no logic | review=2027-01
pitch-deck/lib/presenter/presenter-script.ts | owner=pitch-deck | reason=608 LOC, pure static presenter script data + 3 trivial lookup functions | review=2027-01
pitch-deck/lib/i18n.ts | owner=pitch-deck | reason=620 LOC, pure DE/EN translation dictionaries + 3 small format helpers | review=2027-01

# Marketing Website - adapted from pitch-deck USP slide (complex SVG animation, inline styles, no logic to split)
marketing-website/components/sections/PlatformBridgeSection.tsx | owner=marketing | reason=816 LOC, adapted 1:1 from pitch-deck USPSlide with SVG animations, CSS keyframes, inline styles - splitting would break animation coherence | review=2027-01
|
||||
|
||||
@@ -1,297 +0,0 @@
|
||||
# Night Scheduler - Developer Documentation

**Status:** Production
**Last updated:** 2026-02-09
**URL:** https://macmini:3002/infrastructure/night-mode
**API:** http://macmini:8096

---

## Overview

The Night Scheduler automatically shuts down the Docker services overnight:
- Scheduled shutdown (default: 22:00)
- Scheduled startup (default: 06:00)
- Manual immediate actions (start/stop)
- Dashboard UI for configuration

---
|
||||
|
||||
## Architecture

```
┌────────────────────────────────────────────────────┐
│  Admin Dashboard (Port 3002)                       │
│  /infrastructure/night-mode                        │
└────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────┐
│  API Proxy: /api/admin/night-mode                  │
│  - GET: fetch status                               │
│  - POST: save configuration                        │
│  - POST /execute: immediate action (start/stop)    │
│  - GET /services: list of services                 │
└────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────┐
│  night-scheduler (Port 8096)                       │
│  - Python/FastAPI container                        │
│  - Checks every minute whether an action is due    │
│  - Runs docker compose start/stop                  │
│  - Stores its config in /config/night-mode.json    │
└────────────────────────────────────────────────────┘
```

---
|
||||
|
||||
## Files

| Path | Description |
|------|-------------|
| `night-scheduler/scheduler.py` | Python scheduler with FastAPI |
| `night-scheduler/Dockerfile` | Container with Docker CLI |
| `night-scheduler/requirements.txt` | Dependencies |
| `night-scheduler/config/night-mode.json` | Configuration file |
| `night-scheduler/tests/test_scheduler.py` | Unit tests |
| `admin-v2/app/api/admin/night-mode/route.ts` | API proxy |
| `admin-v2/app/api/admin/night-mode/execute/route.ts` | Execute endpoint |
| `admin-v2/app/api/admin/night-mode/services/route.ts` | Services endpoint |
| `admin-v2/app/(admin)/infrastructure/night-mode/page.tsx` | UI page |

---
|
||||
|
||||
## API Endpoints

### GET /api/night-mode
Fetch status and configuration.

**Response:**
```json
{
  "config": {
    "enabled": true,
    "shutdown_time": "22:00",
    "startup_time": "06:00",
    "last_action": "startup",
    "last_action_time": "2026-02-09T06:00:00",
    "excluded_services": ["night-scheduler", "nginx"]
  },
  "current_time": "14:30:00",
  "next_action": "shutdown",
  "next_action_time": "22:00",
  "time_until_next_action": "7h 30min",
  "services_status": {
    "backend": "running",
    "postgres": "running"
  }
}
```
|
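The `next_action`, `next_action_time`, and `time_until_next_action` fields in the response above follow from the current time and the two configured times. The sketch below shows one way to derive them. It is not the scheduler's actual code: the function name `next_action` is hypothetical, and it assumes both times are `HH:MM` strings that repeat daily.

```python
from datetime import datetime, timedelta


def next_action(now: datetime, shutdown_time: str, startup_time: str):
    """Return (action, "HH:MM", "Xh Ymin") for the next scheduled event.

    Hypothetical helper: assumes both times are "HH:MM" strings on a
    daily cycle, as in the config shown above.
    """
    candidates = []
    for action, hhmm in (("shutdown", shutdown_time), ("startup", startup_time)):
        hour, minute = map(int, hhmm.split(":"))
        when = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if when <= now:  # already passed today, so it is due tomorrow
            when += timedelta(days=1)
        candidates.append((when, action, hhmm))

    when, action, hhmm = min(candidates)  # the earliest upcoming event wins
    total_minutes = int((when - now).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return action, hhmm, f"{hours}h {minutes}min"


# Same situation as the sample response: 14:30, shutdown at 22:00.
print(next_action(datetime(2026, 2, 9, 14, 30), "22:00", "06:00"))
# ('shutdown', '22:00', '7h 30min')
```

Rolling a time that has already passed over to the next day keeps the comparison correct across midnight, for example at 23:15, when the next event is the 06:00 startup.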
||||
|
||||
### POST /api/night-mode
Update the configuration.

**Request:**
```json
{
  "enabled": true,
  "shutdown_time": "23:00",
  "startup_time": "07:00",
  "excluded_services": ["night-scheduler", "nginx", "vault"]
}
```

### POST /api/night-mode/execute
Run an action immediately.

**Request:**
```json
{
  "action": "stop" // or "start"
}
```

**Response:**
```json
{
  "success": true,
  "message": "Aktion 'stop' erfolgreich ausgefuehrt fuer 25 Services"
}
```

The `message` string is returned in German; it reads "Action 'stop' executed successfully for 25 services".

### GET /api/night-mode/services
Fetch the list of all services.

**Response:**
```json
{
  "all_services": ["backend", "postgres", "valkey", ...],
  "excluded_services": ["night-scheduler", "nginx"],
  "status": {
    "backend": "running",
    "postgres": "running"
  }
}
```

---
|
||||
|
||||
## Configuration

### Config format (night-mode.json)

```json
{
  "enabled": true,
  "shutdown_time": "22:00",
  "startup_time": "06:00",
  "last_action": "startup",
  "last_action_time": "2026-02-09T06:00:00",
  "excluded_services": ["night-scheduler", "nginx"]
}
```

### Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `COMPOSE_PROJECT_NAME` | `breakpilot-pwa` | Docker Compose project name |

---

## Excluded Services

These services are NOT stopped:

1. **night-scheduler** - must keep running so it can start the other services again
2. **nginx** - optional, for HTTPS access

Additional services can be excluded via the configuration.

---
|
||||
|
||||
## Docker Compose Integration

```yaml
night-scheduler:
  build: ./night-scheduler
  container_name: breakpilot-pwa-night-scheduler
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./night-scheduler/config:/config
    - ./docker-compose.yml:/app/docker-compose.yml:ro
  environment:
    - COMPOSE_PROJECT_NAME=breakpilot-pwa
  ports:
    - "8096:8096"
  networks:
    - breakpilot-pwa-network
  restart: unless-stopped
```

---
|
||||
|
||||
## Running the Tests

```bash
# Inside the container
docker exec -it breakpilot-pwa-night-scheduler pytest -v

# Locally (with dependencies installed)
cd night-scheduler
pip install -r requirements.txt
pytest -v tests/
```

---

## Deployment

```bash
# 1. Sync the files
rsync -avz night-scheduler/ macmini:.../night-scheduler/

# 2. Build the container
ssh macmini "docker compose -f .../docker-compose.yml build --no-cache night-scheduler"

# 3. Start the container
ssh macmini "docker compose -f .../docker-compose.yml up -d night-scheduler"

# 4. Test
curl http://macmini:8096/health
curl http://macmini:8096/api/night-mode
```

---
|
||||
|
||||
## Troubleshooting

### Problem: services are not stopped/started

1. Check that the Docker socket is mounted:
   ```bash
   docker exec breakpilot-pwa-night-scheduler ls -la /var/run/docker.sock
   ```

2. Check that the docker compose CLI is available:
   ```bash
   docker exec breakpilot-pwa-night-scheduler docker compose version
   ```

3. Check the logs:
   ```bash
   docker logs breakpilot-pwa-night-scheduler
   ```

### Problem: configuration is not saved

1. Check that /config is writable:
   ```bash
   docker exec breakpilot-pwa-night-scheduler touch /config/test
   ```

2. Check the volume mount in docker-compose.yml

### Problem: API not reachable

1. Check the container status:
   ```bash
   docker ps | grep night-scheduler
   ```

2. Check the health endpoint:
   ```bash
   curl http://localhost:8096/health
   ```

---
|
||||
|
||||
## Security Notes

- The container needs access to the Docker socket
- Only internal services can be stopped/started
- No authentication (internal network only)
- No sensitive data in the configuration

---

## Dependencies (SBOM)

| Package | Version | License |
|---------|---------|---------|
| FastAPI | 0.109.0 | MIT |
| Uvicorn | 0.27.0 | BSD-3-Clause |
| Pydantic | 2.5.3 | MIT |
| pytest | 8.0.0 | MIT |
| pytest-asyncio | 0.23.0 | Apache-2.0 |
| httpx | 0.26.0 | BSD-3-Clause |

---

## Change History

| Date | Change |
|------|--------|
| 2026-02-09 | Initial implementation |
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
 #
 # Services:
 # Go: consent-service
-# Python: backend-core, voice-service (+ BQAS), embedding-service, night-scheduler
+# Python: backend-core, voice-service (+ BQAS), embedding-service
 # Node.js: admin-core
 
 name: CI
|
||||
@@ -46,7 +46,7 @@ jobs:
|
||||
 - name: Lint Python services
   run: |
     pip install --quiet ruff
-    for svc in backend-core voice-service night-scheduler embedding-service; do
+    for svc in backend-core voice-service embedding-service; do
       if [ -d "$svc" ]; then
         echo "=== Linting $svc ==="
         ruff check "$svc/" --output-format=github || true
|
||||
|
||||
@@ -0,0 +1,36 @@
|
||||
# Daily GDPR data cleanup for the pitch deck.
# Calls /api/admin/cleanup which runs runDataCleanup():
# - anonymizes investors inactive 30+ days
# - anonymizes never-activated invites after 90 days
# - deletes sessions + magic links older than 30 days
# - anonymizes IPs in audit logs older than 30 days
#
# Requires Gitea Actions secret: PITCH_ADMIN_SECRET

name: Pitch deck — GDPR cleanup

on:
  schedule:
    - cron: '0 2 * * *'

jobs:
  cleanup:
    runs-on: docker
    container:
      image: alpine:3.19
    steps:
      - name: Run data cleanup
        env:
          PITCH_ADMIN_SECRET: ${{ secrets.PITCH_ADMIN_SECRET }}
        run: |
          apk add --no-cache curl
          RESPONSE=$(curl -sSf -w "\n%{http_code}" -X POST \
            -H "Authorization: Bearer $PITCH_ADMIN_SECRET" \
            -H "Content-Type: application/json" \
            https://pitch.breakpilot.com/api/admin/cleanup) \
            || { echo "Cleanup request failed"; exit 1; }
          HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
          BODY=$(echo "$RESPONSE" | head -n-1)
          echo "Response: $BODY"
          [ "$HTTP_CODE" = "200" ] || { echo "Unexpected status $HTTP_CODE"; exit 1; }
          echo "GDPR cleanup completed successfully"
|
||||
@@ -41,6 +41,11 @@ backups/*.backup
|
||||
*.mp3
*.wav

# Cloned external legal-source repos (gitignored; pulled fresh at ingest time)
legal-sources/bsi-quaidal/
legal-sources/bsi-quaidal-src/
legal-sources/bsi-grundschutz-plus/

# Compiled binaries
billing-service/billing-service
consent-service/server
|
||||
@@ -62,3 +67,7 @@ consent-service/server
|
||||
# Coverage
coverage/
*.coverage
controls_backup_*.dump

# Allow Finanzplan exports (generated by pitch-deck/scripts/export-finanzplan.sh)
!pitch-deck/exports/*.xlsx
|
||||
|
||||
Generated file, +2948 lines. File diff suppressed because it is too large.
@@ -10,7 +10,7 @@
|
||||
   },
   "dependencies": {
     "lucide-react": "^0.468.0",
-    "next": "^15.1.0",
+    "next": "^15.5.16",
     "react": "^18.3.1",
     "react-dom": "^18.3.1",
     "reactflow": "^11.11.4",
|
||||
|
||||
@@ -0,0 +1,158 @@
|
||||
# Using the Controls: Guide for Other Sessions

**As of:** 2026-05-07, updated continuously
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)

---

## What are the controls?

The database holds 174,497 atomic compliance controls. Each control is a **single verifiable requirement** from a legal source (GDPR, NIS2, NIST, AI Act, etc.).

### Example

```
Control-ID: AUTH-2956-A14
Title:      "Implementierung von Multi-Faktor-Authentifizierung prüfen"
            (Verify the implementation of multi-factor authentication)
Objective:  "Sicherstellen, dass MFA korrekt implementiert ist..."
            (Ensure that MFA is implemented correctly...)
Merge-Key:  "verify:multi_factor_auth:testing"
Severity:   high
```
|
||||
|
||||
## Where are the controls stored?

### Database (PostgreSQL on the Mac Mini)

```sql
-- Query all controls
SELECT id, control_id, title, objective, severity,
       source_citation,  -- legal source (JSON)
       generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE release_state NOT IN ('deprecated', 'rejected');
```

**Connection:**
```bash
# From the MacBook:
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db"

# Or via the control-pipeline container:
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline curl -sf http://127.0.0.1:8098/..."
```

### API (port 8098, reachable only via docker exec)

```bash
# List master controls
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
  curl -sf 'http://127.0.0.1:8098/v1/master-controls?limit=50&sort=total_controls'"

# Master control detail with all members
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
  curl -sf 'http://127.0.0.1:8098/v1/master-controls/MC-8292'"
```
|
||||
|
||||
## Structure of the controls

### merge_group_hint (key field!)

Every control has a `merge_group_hint` in the format `action:object:phase`:

```
implement:encryption:implementation
define:access_control:definition
monitor:network_security:monitoring
report:supervisory_authority:reporting
```
|
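Code that consumes these hints usually has to split them into their three parts first. The following is a minimal sketch of such a parser, not code from the repository: the names `MergeHint` and `parse_merge_hint` are hypothetical, and it assumes every hint has exactly three colon-separated, non-empty segments.

```python
from typing import NamedTuple


class MergeHint(NamedTuple):
    """The three segments of an 'action:object:phase' hint."""
    action: str
    object_token: str
    phase: str


def parse_merge_hint(hint: str) -> MergeHint:
    """Split a hint into its parts; reject anything not shaped a:b:c."""
    parts = hint.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected action:object:phase, got {hint!r}")
    return MergeHint(*parts)


hint = parse_merge_hint("implement:encryption:implementation")
print(hint.object_token)  # encryption
```

Failing loudly on malformed input is a deliberate choice here: a silently mis-split hint would put a control under the wrong object token.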
||||
|
||||
**74 canonical object tokens** (as of 2026-05-07):

| Category | Tokens |
|----------|--------|
| **Security** | multi_factor_auth, password_policy, credentials, session_management, privileged_access, access_control, encryption, transport_encryption, key_management, certificate_management, network_security, network_segmentation, firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting, compliance_audit, vulnerability, patch_management, backup, disaster_recovery, physical_security, secure_development, api_security, input_validation, container_security, logging_configuration |
| **Data Protection** | personal_data, sensitive_data, health_data, consent, data_subject_rights, data_retention, data_transfer, data_breach_notification, dpia, data_processing_agreement, privacy_by_design, data_processing_register, data_classification, cookie_consent, video_surveillance |
| **Governance** | policy, procedure, process, training, awareness, incident, risk_management, third_party_management, change_management, documentation, records_management, compliance_reporting, asset_management, human_resources_security |
| **Regulatory** | supervisory_authority, certification, product_safety, ai_system, financial_reporting, aml, whistleblowing, consumer_protection, ecommerce, telecommunications, medical_device, payment_services, critical_infrastructure, supply_chain_due_diligence, sustainability_reporting |

### Legal sources (source_citation)

Only the **parent controls** (not the atomic ones!) carry a `source_citation`:

```sql
-- Find controls together with their legal source
SELECT cc.control_id, cc.title,
       pc.source_citation->>'source' AS regulation,
       pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation IS NOT NULL
  AND pc.source_citation->>'source' LIKE '%DSGVO%';
```

148 distinct legal sources (GDPR/DSGVO, NIS2, NIST, OWASP, BSI, TKG, etc.)
|
||||
|
||||
## Filtering controls (use cases)

### Example: all GDPR Art. 13 controls (for the DSI review)

```sql
SELECT cc.control_id, cc.title, cc.objective,
       cc.generation_metadata->>'merge_group_hint' AS merge_key,
       pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation->>'source' = 'DSGVO (EU) 2016/679'
  AND pc.source_citation->>'article' LIKE '%13%'
  AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.control_id;
```
|
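One caveat about the `LIKE '%13%'` condition in the Art. 13 query: it matches any article string that contains the digits 13, so values such as `Art. 113` or `Art. 130` would also pass if they occur in the data. A post-filter on the result rows can narrow this down. The sketch below assumes the `article` field holds free text such as `Art. 13` or `Art. 13 Abs. 1`; that format is a guess and is not confirmed here. The helper name `cites_article` is hypothetical.

```python
import re


def cites_article(article_text: str, number: int) -> bool:
    """True if the text contains this number as a whole number.

    The lookarounds reject longer numbers that merely contain the digits.
    """
    pattern = rf"(?<!\d){number}(?!\d)"
    return re.search(pattern, article_text) is not None


rows = ["Art. 13", "Art. 13 Abs. 1 lit. c", "Art. 113", "Art. 130"]
print([row for row in rows if cites_article(row, 13)])
# ['Art. 13', 'Art. 13 Abs. 1 lit. c']
```

This check still accepts a standalone 13 anywhere in the text (for instance a paragraph number), so it tightens the filter rather than parsing the citation.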
||||
|
||||
### Example: all encryption controls

```sql
SELECT control_id, title, objective
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:encryption:%'
  AND release_state NOT IN ('deprecated', 'rejected');
```
|
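The colons in the pattern `'%:encryption:%'` matter: they restrict the match to the object segment of the hint, so `transport_encryption` is not caught by a search for `encryption`. The same check can be applied client-side to hints that are already loaded. This is an illustrative sketch; `has_object_token` is a hypothetical helper, not part of the pipeline.

```python
def has_object_token(hint: str, token: str) -> bool:
    """Client-side counterpart of: hint LIKE '%:<token>:%'."""
    return f":{token}:" in hint


hints = [
    "implement:encryption:implementation",
    "implement:transport_encryption:implementation",
    "define:key_management:definition",
]
print([h for h in hints if has_object_token(h, "encryption")])
# ['implement:encryption:implementation']
```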
||||
|
||||
### Example: filtering controls by object token

```sql
-- All controls on a given topic
SELECT control_id, title,
       generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:data_retention:%'
  AND release_state NOT IN ('deprecated', 'rejected');
```
|
||||
|
||||
## Important tables

| Table | Rows | Description |
|-------|------|-------------|
| `compliance.canonical_controls` | ~294K | All controls (rich + atomic) |
| `compliance.master_controls` | ~5,329 | Grouped master controls |
| `compliance.master_control_members` | ~172K | Mapping control → MC |
| `compliance.object_ontology` | 74 | Canonical object definitions |
| `compliance.regulation_registry` | 223 | Register of legal sources |

## What is happening right now (2026-05-07)

**Phase 2 is running:** all 174K controls are being re-classified with Claude Haiku. The `merge_group_hint` values are being normalized from free-form LLM objects to the 74 canonical tokens. After that:
- Phase 3: re-clustering (gpre1 with K=20000)
- Phase 4: new master controls (gpre2)
- Phase 5: regulation-source split (gpre3)

**DO NOT MODIFY:** the `canonical_controls`, `master_controls`, and `object_ontology` tables are being actively worked on.

## DB access quick reference

```bash
# Quick query (one line)
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db -c \"SELECT count(*) FROM compliance.canonical_controls\""

# Interactive session
ssh macmini "/usr/local/bin/docker exec -it bp-core-postgres psql -U breakpilot -d breakpilot_db"
```
|
||||
@@ -0,0 +1,117 @@
|
||||
# Session Handover: MC Quality + Gap Analysis + RAG Ingestion

**Date:** 2026-05-07 to 2026-05-11 (5-day marathon)
**Repo:** breakpilot-core + breakpilot-compliance

---

## DONE

### Master Control Quality Overhaul (Core)
- **74.5% → 92.8% accuracy** (13,588 MCs, 83,073 members)
- Phase 0: quality audit with Claude Sonnet ($3)
- Phase 1: ontology 31 → 74 tokens + LLM prompt fix
- Phase 2: 174K controls re-classified via Haiku (10 batches, ~$50)
- Phase 2b: generic tokens fixed (documentation/procedure → actual topics, $7.54)
- Phase 2c: L2 sub-topics (2 rounds, 172K controls, ~$32)
- Phase 2d: bad subtopics fixed (stakeholder_*, $0.50)
- Phase 3: re-clustering with K=18704
- Phase 4: gpre2 Direct MC (13,588 MCs)
- Phase 6: golden dataset (20 controls) + 8 quality tests (all green)
- **Production sync:** MCs + members + hints + doc_check_controls
|
||||
|
||||
### doc_check_controls (Core → Production)
- **1,874 controls** across 8 document types (DSE, Cookie, Impressum, AGB, Widerruf, DSFA, AVV, Löschkonzept)
- Each has a check_question, pass_criteria, and fail_criteria
- Table `compliance.doc_check_controls` exists locally and in production

### RAG Ingestion (Core)
- **126 BAuA PDFs** (TRBS/TRGS/ASR): 27,664 chunks → `bp_compliance_ce`
- **OSHA Technical Manual** (23 chapters): 7,241 chunks → `bp_compliance_ce`
- **OSHA 1910 Subpart O** (full text): 745 chunks
- **EuGH C-588/21 P**: 216 chunks
- **EU 2018/1725**: 842 chunks → `bp_compliance`
- **CE obligations extracted:** 6,141 obligations → `/tmp/ce_obligations_v2.json`
- Built Playwright crawlers for BAuA + OSHA
|
||||
|
||||
### Gap Analysis Engine (Compliance)
- **12 regulations** classified automatically (CRA, AI Act, NIS2, GDPR, MiCA, PSD2, AML, etc.)
- **Current-state (IST-Zustand) assessment:** CE marking, applied standards, existing processes, IACE project link
- **Standard→control mapping:** 20 standards → MC topic coverage
- **Priority engine:** Severity × Deadline × Dependency
- **5 industry templates:** IoT, Exchange, Cobot, SaaS, Medical
- **Frontend:** 2-step wizard (product + current state) + dashboard with traffic-light status
- **API:** 8 endpoints under `/sdk/v1/gap/`
- **Persistent projects:** save and reopen
- **Tested:** SmartFactory Gateway → 5 regulations, 500 gaps
|
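The handover only names the three factors of the priority engine (Severity × Deadline × Dependency) and does not say how they are computed. The sketch below shows one plausible way to multiply them into a ranking score. Everything in it is an assumption made for illustration: the severity weights, the urgency formula, the `Gap` fields, and the sample gaps are all hypothetical and do not describe the real engine.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative weights only; the real engine's values are not documented here.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Gap:
    name: str
    severity: str
    deadline: date
    blocks: int  # number of other gaps that depend on this one


def priority(gap: Gap, today: date) -> float:
    """Multiply the three factors: severity x deadline urgency x dependency."""
    days_left = max((gap.deadline - today).days, 1)  # clamp overdue gaps to 1 day
    urgency = 365 / days_left      # closer deadline -> larger factor
    dependency = 1 + gap.blocks    # gaps that unblock others rank higher
    return SEVERITY_WEIGHT[gap.severity] * urgency * dependency


today = date(2026, 5, 11)
gaps = [
    Gap("cookie consent", "medium", date(2027, 5, 11), blocks=0),
    Gap("vulnerability handling", "high", date(2026, 9, 11), blocks=2),
]
for gap in sorted(gaps, key=lambda g: priority(g, today), reverse=True):
    print(gap.name, round(priority(gap, today), 1))
```

A product of factors means a gap needs to score on all three dimensions to rank highly; a weighted sum would be the alternative if a single extreme factor should be able to dominate.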
||||
|
||||
### Tenant Document Upload API (Core)
- `POST/GET/DELETE /api/v1/tenant/documents`
- Tenant-isolated Qdrant collections
- Code complete, not deployed (requires a RAG service rebuild)

### Master Controls Browser (Compliance)
- **New page** `/sdk/master-controls`, reusing the Control Library UI
- Sidebar entry between Control Library and Provenance
- 13,588 MCs with all filters, pagination, and a click-through detail view
- Connects to the production DB

---
|
||||
|
||||
## DB tables (new/changed)

| Table | Repo | Rows (local) | Rows (production) |
|-------|------|--------------|-------------------|
| compliance.master_controls | Core | 13,588 | 13,588 |
| compliance.master_control_members | Core | 83,073 | 83,073 |
| compliance.object_ontology | Core | 74 | 74 |
| compliance.object_groups | Core | 16,683 | - |
| compliance.doc_check_controls | Core | 1,874 | 1,874 |
| compliance.gap_projects | Compliance | 1 | 0 |

---
|
||||
|
||||
## OPEN / NEXT SESSION

1. **Orca deploy fix:** production does not deploy automatically (webhook + docker pull problem)
2. **Gap analysis v2 current state:** frontend step 2 deployed, backend deployed, but blocked by Orca
3. **Tenant Document Upload:** deploy it (RAG service rebuild)
4. **Push the compliance repo to gitea:** currently "Everything up-to-date"; Orca has to be redeployed manually
5. **Extend the MC browser:** improve the detail view with member controls

---

## BACKUPS (on the MacBook)

| File | Contents |
|------|----------|
| `backup_pre_gpre3_20260510.dump` | Before the gpre3 live run (171 MB) |
| `backup_session_end_20260511.dump` | End of session |
| `production_backup_20260508.dump` | Production after Phase 2 |
| `gpre0_checkpoints_backup_20260508/` | 10 corrections JSON files |
|
||||
|
||||
---
|
||||
|
||||
## API costs (Anthropic)

| Phase | Model | Cost |
|-------|-------|------|
| Phase 0: Quality Audit | Sonnet | $2.92 |
| Phase 0b: Quality Audit v2 | Sonnet | $5.93 |
| Phase 2: 174K re-classification | Haiku | ~$50 |
| Phase 2b: Generic Token Fix | Haiku | $7.54 |
| Phase 2c: Subtopics R1 | Haiku | $20.22 |
| Phase 2c: Subtopics R2 | Haiku | $12.03 |
| Phase 2d: Bad Subtopics | Haiku | ~$0.50 |
| 5K Test-Run | Sonnet | $5.32 |
| doc_check_controls | Haiku | ~$5 |
| **Total** | | **~$110** |
|
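The stated total of about $110 can be checked against the individual rows. The short script below adds them up per model and overall; the approximate (`~`) values are taken at face value, so the result is itself approximate.

```python
from collections import defaultdict
from decimal import Decimal

# (phase, model, cost in USD), copied from the cost table.
COSTS = [
    ("Phase 0: Quality Audit", "Sonnet", "2.92"),
    ("Phase 0b: Quality Audit v2", "Sonnet", "5.93"),
    ("Phase 2: 174K re-classification", "Haiku", "50"),
    ("Phase 2b: Generic Token Fix", "Haiku", "7.54"),
    ("Phase 2c: Subtopics R1", "Haiku", "20.22"),
    ("Phase 2c: Subtopics R2", "Haiku", "12.03"),
    ("Phase 2d: Bad Subtopics", "Haiku", "0.50"),
    ("5K Test-Run", "Sonnet", "5.32"),
    ("doc_check_controls", "Haiku", "5"),
]

by_model = defaultdict(Decimal)  # Decimal avoids float rounding noise
for _phase, model, cost in COSTS:
    by_model[model] += Decimal(cost)

total = sum(by_model.values())
print(by_model["Sonnet"], by_model["Haiku"])  # 14.17 95.29
print(total)  # 109.46
```

The rows sum to $109.46, which is consistent with the rounded total in the table.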
||||
|
||||
---
|
||||
|
||||
## STRATEGIC DECISIONS (in memory)

1. **3 use cases:** gap analysis (priority 1), vendor risk (priority 2), Web3/crypto as a vertical (priority 3)
2. **No reproduction of standards:** obligation extraction instead of ISO texts (legally safe)
3. **Regulatory ingestion engine:** BAuA/OSHA crawlers as a template for automated source feeds
4. **CE-compliance crossover:** IACE × Master Controls for trigger-based compliance hints
|
||||
@@ -0,0 +1,335 @@
|
||||
# Instruction: Test Strategy Block C

**Repo:** `/Users/benjaminadmin/Projekte/breakpilot-core/`
**Directory:** `control-pipeline/tests/`
**Created:** 2026-05-01
**Estimated effort:** 2-3 days

## Starting point

- 221 existing tests in 7 files (do NOT modify!)
- 40 golden test cases (golden_controls.yaml)
- 24 demo cases (demo_cases.yaml)
- All tests are pure Python; no database required
- Pipeline v1 complete: 151,675 unique controls, 15,291 dependencies
|
||||
|
||||
## Task 1: Real-World Benchmarks (C1)

### What to do

Manually review 10 real German e-commerce websites and create a ground-truth YAML file for each.

### Directory

```
control-pipeline/tests/benchmarks/
├── amazon_de.yaml
├── zalando_de.yaml
├── otto_de.yaml
├── lidl_de.yaml
├── check24_de.yaml
├── booking_de.yaml
├── thomann_de.yaml
├── aboutyou_de.yaml
├── mytheresa_com.yaml
└── kleiner_shop.yaml
```
|
||||
|
||||
### Format per website

```yaml
website: amazon.de
url: https://www.amazon.de
checked_at: "2026-05-XX"
checked_by: "Name"

ground_truth:
  impressum:
    present: true/false
    complete: true/false  # name, address, email, commercial register number, VAT ID
    within_2_clicks: true/false
    missing_fields: []  # e.g. ["USt-ID", "Handelsregister"]

  datenschutzerklaerung:
    present: true/false
    art13_complete: true/false
    missing_art13_fields: []  # e.g. ["Speicherdauer", "Empfaenger"]
    rechtsgrundlagen_korrekt: true/false
    wrong_legal_bases: []  # e.g. ["Analytics auf lit. f statt lit. a"]

  cookie_banner:
    present: true/false
    reject_equally_easy: true/false  # CNIL: rejecting must be as prominent as accepting
    cookies_before_consent: true/false  # Planet49: cookies set BEFORE consent?
    dark_patterns: []  # e.g. ["Ablehnen-Button kleiner", "Ablehnen hinter Einstellungen"]

  widerrufsbelehrung:
    present: true/false
    matches_legal_template: true/false  # statutory template

  agb:
    present: true/false
    checkout_button_text: "..."  # e.g. "Jetzt kaufen" (correct) vs "Weiter" (wrong)

  google_fonts_external: true/false
  google_analytics: true/false

  third_party_services:
    - name: "Google Analytics"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
    - name: "Facebook Pixel"
      detected: true
      consent_required: true
      consent_obtained_before_load: false

expected_findings:
  - "Cookie-Banner: Ablehnen nicht gleichwertig"
  - "Google Analytics ohne vorherige Einwilligung"
  - "DSE: Rechtsgrundlage fuer Analytics falsch"

expected_no_findings:
  - "Impressum fehlt"  # it is present, so it must not be flagged
```
|
||||
|
||||
### Test runner

```python
# control-pipeline/tests/test_benchmarks.py
"""
Real-world benchmark tests: compare agent findings with manual ground truth.
Requires: the Compliance Agent must be running (https://macmini:3007/sdk/agent)
"""

import os

import pytest
import yaml

BENCHMARK_DIR = os.path.join(os.path.dirname(__file__), "benchmarks")


def load_benchmarks():
    """Load every *.yaml ground-truth file; return [] if the directory is missing."""
    if not os.path.isdir(BENCHMARK_DIR):
        return []
    cases = []
    for f in sorted(os.listdir(BENCHMARK_DIR)):
        if f.endswith(".yaml"):
            with open(os.path.join(BENCHMARK_DIR, f), encoding="utf-8") as fh:
                cases.append(yaml.safe_load(fh))
    return cases


class TestBenchmarks:
    """Measure precision/recall against the ground truth."""

    @pytest.mark.parametrize("case", load_benchmarks(), ids=lambda c: c["website"])
    def test_benchmark(self, case):
        # TODO: run the agent against the website
        # TODO: compare findings with expected_findings
        # TODO: compute precision + recall
        pass
```
|
||||
|
||||
### How the ground truth is created

1. Open the website in a browser
2. Check the Impressum (all mandatory fields under § 5 DDG)
3. Read the privacy policy (Datenschutzerklaerung; Art. 13 GDPR checklist)
4. Test the cookie banner (is rejecting equally easy? cookies before consent?)
5. Check the withdrawal notice (Widerrufsbelehrung) against the statutory model
6. Browser DevTools: Network tab → external requests before consent?
7. Document everything in YAML

**Target metrics:**
- Precision > 80% (few false positives)
- Recall > 70% (finds most real problems)
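
The two target metrics can be computed from the agent's reported findings and the `expected_findings` list. The following is a minimal sketch, assuming findings are compared as exact strings; the function name `precision_recall` and the exact-match rule are illustrative assumptions and not part of the existing test runner.

```python
def precision_recall(reported, expected):
    """Return (precision, recall) for two collections of finding labels.

    Assumption: two findings match only when their strings are identical.
    """
    reported_set, expected_set = set(reported), set(expected)
    true_positives = len(reported_set & expected_set)
    # Convention chosen for this sketch: an empty denominator yields 1.0.
    precision = true_positives / len(reported_set) if reported_set else 1.0
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return precision, recall


expected = [
    "Cookie-Banner: Ablehnen nicht gleichwertig",
    "Google Analytics ohne vorherige Einwilligung",
    "DSE: Rechtsgrundlage fuer Analytics falsch",
]
reported = [
    "Cookie-Banner: Ablehnen nicht gleichwertig",
    "Google Analytics ohne vorherige Einwilligung",
    "Impressum fehlt",  # false positive: listed under expected_no_findings
]
precision, recall = precision_recall(reported, expected)
```

In practice the agent's wording will rarely match the ground truth character for character, so a real comparison would probably need a normalisation or fuzzy-matching step before the set intersection.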
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Adversarial Tests (C2)

### What to do

Create 30 tricky test cases that challenge the agent/controls.

### File

`control-pipeline/tests/adversarial_cases.yaml`

### Categories

**A. Wrong legal basis (8 cases):**
- Analytics on lit. f instead of lit. a
- Marketing emails on lit. b instead of lit. a
- Employee tracking on lit. f instead of a works agreement (Betriebsvereinbarung)
- Biometric data on lit. f instead of Art. 9
- Profiling on lit. f instead of Art. 22
- Newsletter on lit. b instead of lit. a
- Social login on lit. b instead of lit. a
- Credit scoring on lit. f instead of lit. a + Art. 22

**B. Dark patterns (6 cases):**
- Reject button exists but is 3px in size and grey
- "Accept all" is prominent, with "Settings" offered instead of "Reject"
- Cookie wall: content is visible only after consent
- Pre-ticked checkboxes (Planet49)
- Confirm-shaming: "Nein, ich moechte keine sichere Verbindung" ("No, I don't want a secure connection")
- Rejecting takes 3 clicks, accepting only 1

**C. Almost-complete documents (6 cases):**
- Impressum complete except for the VAT ID (USt-ID)
- DSE (privacy policy) without retention period
- DSE without DPO contact
- Withdrawal notice with the wrong start of the withdrawal period
- AGB (terms and conditions) without place of jurisdiction
- Cookie policy without a list of all cookies

**D. Semantically similar but different (5 cases):**
- "Admin-MFA" vs "User-MFA" (different scopes!)
- "Delete data after contract termination" vs "Delete data after the retention period"
- "Rate Limiting API" vs "Rate Limiting Login"
- "Encryption at rest" vs "Encryption in transit"
- "Incident Response Plan" vs "Business Continuity Plan"

**E. Semantically different but identical-sounding (5 cases):**
- "Einwilligung" (consent, GDPR) vs "Einwilligung" (consent, advertising)
- "Verarbeitung" (processing, data) vs "Verarbeitung" (processing, food)
- "Risikobewertung" (risk assessment, GDPR DPIA) vs "Risikobewertung" (risk assessment, financial risk)
- "Audit" (data protection) vs "Audit" (finance)
- "Zertifizierung" (certification, ISO 27001) vs "Zertifizierung" (certification, CE marking)
|
||||
|
||||
### Format
|
||||
|
||||
```yaml
- id: ADV-LIT-001
  category: wrong_legal_basis
  input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
  context: "DSE-Abschnitt ueber Google Analytics"
  expected:
    finding: true
    finding_type: "wrong_legal_basis"
    correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
    reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
  difficulty: medium  # easy / medium / hard
```
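
Before running 30 hand-written cases, it may help to check each one for structural mistakes. The sketch below validates a single case dict that has the shape of the example above. The function `validate_case`, the set of required keys, and the assumption that `difficulty` is a top-level field are all illustrative choices derived from that one example, not an existing part of the pipeline.

```python
REQUIRED_KEYS = {"id", "category", "input", "expected", "difficulty"}
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}


def validate_case(case):
    """Return a list of problems found in one adversarial case (empty list = valid)."""
    problems = []
    missing = REQUIRED_KEYS - set(case)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if case.get("difficulty") not in ALLOWED_DIFFICULTIES:
        problems.append(f"invalid difficulty: {case.get('difficulty')!r}")
    expected = case.get("expected")
    if not isinstance(expected, dict) or not isinstance(expected.get("finding"), bool):
        problems.append("expected.finding must be a boolean")
    return problems


good_case = {
    "id": "ADV-LIT-001",
    "category": "wrong_legal_basis",
    "input": "Wir verarbeiten Ihre Daten fuer Webanalyse ...",
    "expected": {"finding": True, "finding_type": "wrong_legal_basis"},
    "difficulty": "medium",
}
bad_case = {"id": "ADV-LIT-999", "difficulty": "extreme"}  # hypothetical broken case

good_problems = validate_case(good_case)
bad_problems = validate_case(bad_case)
```

Cases loaded with `yaml.safe_load` arrive as plain dicts, so the same function could be applied to every entry of `adversarial_cases.yaml` in a parametrized test that needs no database.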
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Regression Harness (C3)

### What to do

1. `conftest.py` with shared fixtures
2. `test_regression.py` with snapshot tests
3. CI/CD quality gate
|
||||
|
||||
### conftest.py
|
||||
|
||||
```python
# control-pipeline/tests/conftest.py
import os

import pytest


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests; skipped if DATABASE_URL is not set."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set")
    from db.session import SessionLocal
    db = SessionLocal()
    yield db
    db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load 100 random draft controls for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' as assertion
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
```
|
||||
|
||||
### test_regression.py
|
||||
|
||||
```python
# control-pipeline/tests/test_regression.py
"""
Regression tests: check whether pipeline updates change existing controls.
Requires: DATABASE_URL environment variable
"""


class TestControlStability:
    def test_draft_count_stable(self, db_session):
        """Draft count must stay stable (checked here against fixed bounds)."""
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count}"
        assert count < 200000, f"Draft count too high: {count}"

    def test_no_null_assertions(self, db_session):
        """Draft controls should have an assertion (fewer than 1000 may lack one)."""
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL OR generation_metadata->>'assertion' = '')"
        )).scalar()
        assert null_count < 1000, f"Too many controls without assertion: {null_count}"

    def test_dependency_graph_valid(self, db_session):
        """Sanity check on the dependency graph.

        NOTE: this only counts active dependencies; it does not detect cycles yet.
        """
        from sqlalchemy import text
        dep_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert dep_count > 10000, f"Too few dependencies: {dep_count}"


class TestQualityGates:
    def test_duplicate_rate(self, db_session):
        pass  # To implement: duplicate_rate < 5%

    def test_evidence_leak_rate(self, db_session):
        pass  # To implement: evidence_leak < 2%
```
|
||||
|
||||
### CI/CD Quality Gate
|
||||
|
||||
```yaml
# .gitea/workflows/quality-gate.yml
name: Control Pipeline Quality Gate
on:
  push:
    paths:
      - 'control-pipeline/**'

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Tests
        run: |
          cd control-pipeline
          pip install -r requirements.txt pytest pyyaml
          PYTHONPATH=. pytest tests/ -v --tb=short -x
      - name: Quality Metrics
        run: |
          # Only if the container is running
          curl -sf http://127.0.0.1:8098/v1/canonical/generate/quality-metrics || echo "Pipeline not running, skip metrics"
```
|
||||
|
||||
---
|
||||
|
||||
## IMPORTANT

- Do NOT change the existing 221 tests
- Do NOT deploy (do not restart containers)
- All new tests must run without a DB (except test_regression.py, which uses a skip marker)
- Create the ground truth YAML manually (no LLM for the reference data!)
- If you have questions: read the memory under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`
|
||||
@@ -0,0 +1,132 @@
|
||||
# Lessons Learned: MC `check_type` classification (CRITICAL for CRA + all new frameworks)

Date: 2026-05-17
Trigger: the BMW compliance check returned 0/381 cookie MCs, 3/75 Impressum MCs and 43/571 DSE MCs; every doc type was under 20%.

## TL;DR

**Today's master controls (MCs) mix three structurally different classes of checks in a single table (`compliance.doc_check_controls`). Only one of the three classes can be matched against document text. The other two are wrongly counted as "failed" because they cannot be checked via text matching at all.**

For the CRA MC generation (currently running in Pass 0a with Haiku), every MC **MUST** get a **`check_type` field** before it goes into the database. Otherwise the problem repeats.

## The three classes

| `check_type` | Check-question pattern | Example | How is it checked? |
|---|---|---|---|
| **`text`** | "Enthaelt das Dokument..." (does the document contain...), "Wird im X die Y benannt?" (is Y named in X?), "Ist im Text aufgelistet..." (is it listed in the text...) | "Wird im Impressum die Aufsichtsbehoerde benannt?" (is the supervisory authority named in the Impressum?) | Regex / embedding match against doc text |
| **`process`** | "Ist sichergestellt..." (is it ensured...), "Ist implementiert..." (is it implemented...), "Wird durchgefuehrt..." (is it carried out...) | "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" (is it ensured that cookies are stored only after consent?) | Evidence/TOM check; no doc text available |
| **`review`** | "Sind ALLE / Werden ALLE / Stimmt X mit Y ueberein?" (are ALL / do X and Y agree?) | "Sind alle Verarbeitungszwecke vollstaendig erfasst?" (are all processing purposes fully recorded?) | Human (DPO) checklist, not automatic |
|
||||
|
||||
## Findings from the BMW data

| Doc type | TEXT (matchable) | PROCESS | UNCLEAR/REVIEW | Total | % TEXT |
|---|---|---|---|---|---|
| cookie | 30 | 49 | 302 | 381 | **8%** |
| dse | 72 | 139 | 359 | 571 | **13%** |
| impressum | 14 | 14 | 47 | 75 | **19%** |
| agb | 24 | 20 | 69 | 113 | 21% |
| widerruf | 29 | 26 | 96 | 153 | 19% |
| loeschkonzept | 38 | 39 | 232 | 309 | 12% |

**Even with perfect matching, the upper bound for doc_check is roughly 8-21%**, because about 79-92% of the MCs cannot be checked via text matching. They are not "bad MCs"; they are simply filed in the wrong drawer.
|
||||
|
||||
## Consequences for CRA generation (Pass 0a)

### 1. Prompt change (main measure)

The Pass 0a prompt for Haiku/Sonnet MUST enforce a `check_type` field for every generated control. Proposal:

```jsonc
{
  "control_id": "CRA-...-A01",
  "title": "...",
  "check_question": "...",
  "check_type": "text" | "process" | "review",  // REQUIRED
  "rationale_for_check_type": "..."
}
```

Classification rule in the prompt:

> If your `check_question` starts with "Enthaelt", "Wird … genannt/aufgelistet/erwaehnt", "Steht im Text" -> `text`.
> If it starts with "Ist sichergestellt", "Ist implementiert", "Wird durchgefuehrt", "Existiert ein Prozess" -> `process`.
> If it starts with "Sind ALLE", "Werden ALLE", "Stimmt X mit Y ueberein" -> `review`.
> When in doubt: prefer `review` over `text`.
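
The wording rule above can also serve as a cheap pre-check on generated controls. The following is a rough sketch of that rule as a prefix classifier. The function name `classify_check_type` is hypothetical, and adding "benannt" to the text verbs is an assumption taken from the example pattern in the class table; anything that matches no rule falls back to `review`.

```python
PROCESS_PREFIXES = ("ist sichergestellt", "ist implementiert",
                    "wird durchgefuehrt", "existiert ein prozess")
REVIEW_PREFIXES = ("sind alle", "werden alle", "stimmt ")
TEXT_PREFIXES = ("enthaelt", "steht im text")
# Verbs for the "Wird ... genannt/aufgelistet/erwaehnt" pattern ("benannt" is an assumption).
TEXT_VERBS = ("genannt", "benannt", "aufgelistet", "erwaehnt")


def classify_check_type(check_question):
    """Map a German check question to 'text', 'process' or 'review' by its wording."""
    q = check_question.strip().lower()
    if q.startswith(REVIEW_PREFIXES):
        return "review"
    if q.startswith(PROCESS_PREFIXES):
        return "process"
    if q.startswith(TEXT_PREFIXES):
        return "text"
    if q.startswith("wird ") and any(verb in q for verb in TEXT_VERBS):
        return "text"
    return "review"  # when in doubt: prefer review over text


examples = {
    "Wird im Impressum die Aufsichtsbehoerde benannt?": None,
    "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?": None,
    "Sind alle Verarbeitungszwecke vollstaendig erfasst?": None,
    "Wird die SBOM regelmaessig (mindestens monatlich) aktualisiert?": None,
}
for question in examples:
    examples[question] = classify_check_type(question)
```

A prefix heuristic like this cannot replace the LLM's own classification, but it could flag controls where the generated `check_type` disagrees with the wording of the question.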
|
||||
|
||||
### 2. Critically validate the doc-type assignment

Many of today's MCs are assigned to the wrong doc type (e.g. "Bestellbestätigung implementieren" (implement order confirmation) lands in the `impressum` doc_type but belongs to AGB/Widerruf). For CRA:

- **`doc_type` may only take values from an explicit list**, defined per regulation.
- For CRA, e.g.: `produkt_konformitaetserklaerung`, `risiko_management_dossier`, `sbom`, `cra_dse`, `meldepflichten_doku`.
- Explicitly forbid wrong assignment in the prompt: "If the control does not clearly fit exactly ONE of these doc types, set `doc_type: 'unassigned'` and `check_type: 'review'`."
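
The whitelist rule and the fallback can be enforced in an output validator that runs before anything is written to the database. This is a minimal sketch under stated assumptions: the function `validate_control` and the control IDs are hypothetical, the whitelist is the example list given above, and rejecting a control with a missing `check_type` is modelled as a `ValueError`.

```python
CRA_DOC_TYPES = frozenset({
    "produkt_konformitaetserklaerung", "risiko_management_dossier",
    "sbom", "cra_dse", "meldepflichten_doku",
})
CHECK_TYPES = frozenset({"text", "process", "review"})


def validate_control(control, allowed_doc_types=CRA_DOC_TYPES):
    """Reject controls without a valid check_type; downgrade unknown doc types.

    Returns a new dict and leaves the input untouched.
    """
    if control.get("check_type") not in CHECK_TYPES:
        raise ValueError(f"control {control.get('control_id')!r} has no valid check_type")
    result = dict(control)
    if result.get("doc_type") not in allowed_doc_types:
        # Not clearly one of the whitelisted doc types: route to manual review.
        result["doc_type"] = "unassigned"
        result["check_type"] = "review"
    return result


accepted = validate_control(
    {"control_id": "CRA-DEMO-A01", "doc_type": "sbom", "check_type": "text"})
downgraded = validate_control(
    {"control_id": "CRA-DEMO-A02", "doc_type": "impressum", "check_type": "text"})
try:
    validate_control({"control_id": "CRA-DEMO-A03", "doc_type": "sbom"})
    rejected = False
except ValueError:
    rejected = True
```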
|
||||
|
||||
### 3. Separate tables instead of one

Current architecture:
- `compliance.doc_check_controls` <- all 1874 MCs (everything mixed together)

Recommended for CRA + refactor:
- `compliance.text_check_controls` <- only `check_type='text'`
- `compliance.process_check_controls` <- only `check_type='process'`, checked via evidence/TOM
- `compliance.review_checklist_controls` <- only `check_type='review'`, checked via the DPO workflow

If a schema change is not possible (CLAUDE.md: DB is frozen), use a sidecar SQLite database `mc_classification.db` or a new column as an add-only migration.
|
||||
|
||||
### 4. Respect the dedupe phase

In Pass 0b (dedup), `check_type` must be a **mandatory dedupe key**:
- Two MCs with the same statement but a different `check_type` are **not** duplicates; they check different things ("is named in the text" vs "is technically implemented").
- Today such MCs are wrongly merged → even more mixing.
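
To illustrate why `check_type` belongs in the dedupe key, the sketch below groups controls by a key built from a normalised statement plus the `check_type`. The field name `statement`, the control IDs and the exact-match normalisation are assumptions made for this example; the real Pass 0b dedup presumably works on similarity scores rather than exact strings.

```python
def dedupe_key(control):
    """Build a dedupe key from the normalised statement AND the check_type."""
    normalized = " ".join(control["statement"].lower().split())
    return (normalized, control["check_type"])


def group_duplicates(controls):
    """Group control IDs that share the same dedupe key."""
    groups = {}
    for control in controls:
        groups.setdefault(dedupe_key(control), []).append(control["control_id"])
    return groups


controls = [
    {"control_id": "MC-1", "statement": "Cookies erst nach Einwilligung", "check_type": "text"},
    {"control_id": "MC-2", "statement": "cookies  erst nach einwilligung", "check_type": "text"},
    {"control_id": "MC-3", "statement": "Cookies erst nach Einwilligung", "check_type": "process"},
]
groups = group_duplicates(controls)
```

Here MC-1 and MC-2 collapse into one group, while MC-3 stays separate because it checks the technical implementation rather than the document text.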
|
||||
|
||||
### 5. Rebuild the matching engine afterwards

The actual doc-check matching system then only needs to process `check_type='text'` MCs. The others are routed to their own modules:

- `text` MCs -> `rag_document_checker` (regex, later embeddings)
- `process` MCs -> new `evidence_check_runner` (customer uploads evidence/TOMs)
- `review` MCs -> new `review_checklist_ui` (DPO answers manually)
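
The routing described above amounts to a lookup from `check_type` to a target module. A small sketch follows, in which the three module names are used only as labels; how these modules are actually invoked is not specified in the source, and sending unknown values to the review checklist is an assumption that mirrors the "safe fallback" idea.

```python
ROUTES = {
    "text": "rag_document_checker",
    "process": "evidence_check_runner",
    "review": "review_checklist_ui",
}


def route_controls(controls):
    """Return {module_label: [control_id, ...]} based on each control's check_type."""
    routed = {label: [] for label in ROUTES.values()}
    for control in controls:
        # Unknown or missing check_type goes to manual review (assumed fallback).
        label = ROUTES.get(control.get("check_type"), "review_checklist_ui")
        routed[label].append(control["control_id"])
    return routed


routed = route_controls([
    {"control_id": "MC-1", "check_type": "text"},
    {"control_id": "MC-2", "check_type": "process"},
    {"control_id": "MC-3", "check_type": "review"},
    {"control_id": "MC-4"},  # no check_type at all
])
```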
|
||||
|
||||
## Checklist for the CRA session

- [ ] Pass 0a prompt extended with a mandatory `check_type` field (wording rule + examples)
- [ ] Pass 0a prompt forces `doc_type` from an explicit whitelist
- [ ] Pass 0b dedup key includes `check_type`
- [ ] Output validator rejects MCs without `check_type`
- [ ] DB schema (or sidecar) has a `check_type` column with default `review` (safe fallback)
- [ ] Sample of 50 generated CRA MCs before the bulk run: the TEXT share should be 30-50% (higher than for the old DSGVO MCs, because CRA is strongly document-focused).
|
||||
|
||||
## Update 2026-05-17: findings from the parallel CRA session

The running CRA generation has a field `verification_method` (document/tool/hybrid/code_review/empty) that is **NOT identical** to `check_type`:

- `verification_method` asks: **WHAT are you looking at?** (document, tool output, code)
- `check_type` asks: **CAN this be checked via text match?** (text/process/review)

A control can have `verification_method=document` AND still be `check_type=process`. Example: "Wird die SBOM regelmaessig (mindestens monatlich) aktualisiert?" (is the SBOM updated regularly, at least monthly?). You look into the SBOM history document, but you are checking a process. Text matching will never find that.

**Mapping heuristic (good enough for 80% of cases, LLM for the rest):**

| `verification_method` | Auto-mapped `check_type` | LLM needed? |
|---|---|---|
| `tool` | `process` | no |
| `code_review` | `process` | no |
| empty/null | `review` (safe default) | no |
| `document` | `text` for now, verify by sampling | 10-20% sampling |
| `hybrid` | classify with LLM | yes, all |
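
The mapping table translates directly into a small function. The sketch below returns the provisional `check_type` together with the follow-up action; the function name `map_check_type`, the action labels (`none`, `sample`, `llm`) and the handling of unknown values are assumptions made for illustration.

```python
def map_check_type(verification_method):
    """Map verification_method to (check_type or None, follow_up_action)."""
    method = (verification_method or "").strip().lower()
    if method in ("tool", "code_review"):
        return "process", "none"
    if method == "":
        return "review", "none"      # empty/null: safe default
    if method == "document":
        return "text", "sample"      # provisional; spot-check 10-20%
    if method == "hybrid":
        return None, "llm"           # must be classified by the LLM
    return "review", "none"          # unknown value: assumed safe default


mapped = {m: map_check_type(m) for m in ("tool", "code_review", "", "document", "hybrid")}
```

Only the `document` sample and the `hybrid` controls would then need an LLM call, which is where the cost saving of the backfill comes from.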
|
||||
|
||||
**Ideal case (for all FUTURE Pass 0a generations, including CRA if it is regenerated):** generate both fields at the same time rather than deriving one from the other.
|
||||
|
||||
## Backfill workflow for the running CRA generation

1. Let the current Haiku job finish (no restart, no loss)
2. After the job ends: auto-mapping for the unambiguous buckets (tool/code_review/empty)
3. Sonnet classification only for the `document` + `hybrid` subset (~62 calls for 1500 controls, ~$0.05 instead of $2)
4. Reuse `breakpilot-compliance/backend-compliance/scripts/classify_mc_check_type.py`; only the DB query needs adapting (source table + WHERE filter)
5. Validation: the TEXT share for CRA should be 40-60% (CRA is more document-centric than the DSGVO cookie checks)

## Cross-references

- BMW run findings: `breakpilot-compliance` email of 2026-05-17, check_id `08bcc9dd`
- Existing classification script for retrofitting the old 1874 MCs: `backend-compliance/scripts/classify_mc_check_type.py`
- Doc-type audit query: same file, at the end
|
||||
@@ -4,9 +4,21 @@ from api.control_generator_routes import router as generator_router
|
||||
from api.canonical_control_routes import router as canonical_router
|
||||
from api.document_compliance_routes import router as document_router
|
||||
from api.dependency_routes import router as dependency_router
|
||||
from api.master_control_routes import router as master_control_router
|
||||
from api.decision_trace_routes import router as decision_trace_router
|
||||
from api.decision_trace_routes import full_trace_router
|
||||
from api.compliance_commit_routes import router as compliance_commit_router
|
||||
from api.decision_event_routes import router as decision_event_router
|
||||
from api.deployment_check_routes import router as deployment_check_router
|
||||
|
||||
router = APIRouter()
|
||||
router.include_router(generator_router)
|
||||
router.include_router(canonical_router)
|
||||
router.include_router(document_router)
|
||||
router.include_router(dependency_router)
|
||||
router.include_router(master_control_router)
|
||||
router.include_router(decision_trace_router)
|
||||
router.include_router(full_trace_router)
|
||||
router.include_router(compliance_commit_router)
|
||||
router.include_router(decision_event_router)
|
||||
router.include_router(deployment_check_router)
|
||||
|
||||
@@ -0,0 +1,255 @@
|
||||
"""Compliance Commit Ledger API — G2.
|
||||
|
||||
Tracks code commits and their compliance impact. SDK reports each commit
|
||||
with affected controls, building an audit trail for code↔compliance mapping.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import uuid
|
||||
from typing import Optional
|
||||
|
||||
from fastapi import APIRouter, HTTPException, Query
|
||||
from pydantic import BaseModel
|
||||
from sqlalchemy import text
|
||||
|
||||
from db.session import SessionLocal
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
router = APIRouter(prefix="/v1/compliance-commits", tags=["compliance-commits"])
|
||||
|
||||
|
||||
class CreateCommitRequest(BaseModel):
    tenant_id: str
    project_id: Optional[str] = None
    commit_hash: str
    commit_message: Optional[str] = None
    commit_author: Optional[str] = None
    commit_date: Optional[str] = None
    branch: Optional[str] = None
    repo_url: Optional[str] = None
    affected_control_ids: list[str] = []
    affected_files: list[str] = []
    risk_level: str = "low"
    analysis_summary: Optional[str] = None
    analysis_metadata: dict = {}
|
||||
|
||||
|
||||
@router.post("")
async def register_commit(req: CreateCommitRequest):
    """Register a code commit with its compliance impact."""
    db = SessionLocal()
    try:
        cid = str(uuid.uuid4())
        # List/dict fields are serialized to JSON and cast to jsonb in SQL.
        db.execute(text("""
            INSERT INTO compliance_commits
                (id, tenant_id, project_id, commit_hash, commit_message,
                 commit_author, commit_date, branch, repo_url,
                 affected_control_ids, affected_files,
                 risk_level, analysis_summary, analysis_metadata)
            VALUES
                (CAST(:id AS uuid), CAST(:tenant_id AS uuid), :project_id,
                 :commit_hash, :commit_message, :commit_author,
                 :commit_date, :branch, :repo_url,
                 CAST(:control_ids AS jsonb), CAST(:files AS jsonb),
                 :risk_level, :analysis_summary, CAST(:metadata AS jsonb))
        """), {
            "id": cid,
            "tenant_id": req.tenant_id,
            "project_id": req.project_id,
            "commit_hash": req.commit_hash,
            "commit_message": req.commit_message,
            "commit_author": req.commit_author,
            "commit_date": req.commit_date,
            "branch": req.branch,
            "repo_url": req.repo_url,
            "control_ids": json.dumps(req.affected_control_ids),
            "files": json.dumps(req.affected_files),
            "risk_level": req.risk_level,
            "analysis_summary": req.analysis_summary,
            "metadata": json.dumps(req.analysis_metadata),
        })
        db.commit()
        return {
            "id": cid,
            "status": "registered",
            "affected_controls": len(req.affected_control_ids),
            "risk_level": req.risk_level,
        }
    finally:
        db.close()
|
||||
|
||||
|
||||
@router.get("")
async def list_commits(
    tenant_id: Optional[str] = None,
    control_id: Optional[str] = None,
    risk_level: Optional[str] = None,
    branch: Optional[str] = None,
    since: Optional[str] = None,
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
):
    """List compliance commits with filters."""
    db = SessionLocal()
    try:
        # Build the WHERE clause from the filters that were actually passed.
        clauses = []
        params: dict = {"limit": limit, "offset": offset}

        if tenant_id:
            clauses.append("tenant_id = CAST(:tenant_id AS uuid)")
            params["tenant_id"] = tenant_id
        if control_id:
            clauses.append("affected_control_ids @> CAST(:cid_json AS jsonb)")
            params["cid_json"] = json.dumps([control_id])
        if risk_level:
            clauses.append("risk_level = :risk")
            params["risk"] = risk_level
        if branch:
            clauses.append("branch = :branch")
            params["branch"] = branch
        if since:
            clauses.append("commit_date >= CAST(:since AS timestamptz)")
            params["since"] = since

        where = ("WHERE " + " AND ".join(clauses)) if clauses else ""

        rows = db.execute(text(f"""
            SELECT id, commit_hash, commit_message, commit_author, commit_date,
                   branch, affected_control_ids, affected_files, risk_level
            FROM compliance_commits
            {where}
            ORDER BY commit_date DESC NULLS LAST
            LIMIT :limit OFFSET :offset
        """), params).fetchall()

        total = db.execute(text(f"""
            SELECT count(*) FROM compliance_commits {where}
        """), params).scalar()

        return {
            "total": total,
            "commits": [
                {
                    "id": str(r[0]),
                    "commit_hash": r[1],
                    "message": r[2],
                    "author": r[3],
                    "date": str(r[4]) if r[4] else None,
                    "branch": r[5],
                    "affected_control_ids": r[6],
                    "affected_files": r[7],
                    "risk_level": r[8],
                }
                for r in rows
            ],
        }
    finally:
        db.close()
|
||||
|
||||
|
||||
@router.get("/stats")
async def commit_stats(tenant_id: Optional[str] = None):
    """Dashboard stats for compliance commits."""
    db = SessionLocal()
    try:
        # Optional tenant filter, reused by both queries below.
        tf = ""
        params: dict = {}
        if tenant_id:
            tf = "WHERE tenant_id = CAST(:tid AS uuid)"
            params["tid"] = tenant_id

        risk = db.execute(text(f"""
            SELECT risk_level, count(*) FROM compliance_commits {tf}
            GROUP BY risk_level
        """), params).fetchall()

        recent = db.execute(text(f"""
            SELECT count(*) FROM compliance_commits
            {tf + ' AND' if tf else 'WHERE'} commit_date > NOW() - interval '7 days'
        """), params).scalar()

        total = sum(r[1] for r in risk)

        return {
            "total_commits": total,
            "last_7_days": recent,
            "by_risk_level": {r[0]: r[1] for r in risk},
        }
    finally:
        db.close()
|
||||
|
||||
|
||||
@router.get("/by-control/{control_id}")
async def commits_by_control(
    control_id: str,
    limit: int = Query(50, ge=1, le=200),
):
    """Get all commits that affect a specific control."""
    db = SessionLocal()
    try:
        # jsonb containment: matches rows whose ID array contains control_id.
        rows = db.execute(text("""
            SELECT id, commit_hash, commit_message, commit_author, commit_date,
                   branch, repo_url, affected_files, risk_level
            FROM compliance_commits
            WHERE affected_control_ids @> CAST(:cid_json AS jsonb)
            ORDER BY commit_date DESC NULLS LAST
            LIMIT :limit
        """), {
            "cid_json": json.dumps([control_id]),
            "limit": limit,
        }).fetchall()

        return {
            "control_id": control_id,
            "total_commits": len(rows),
            "commits": [
                {
                    "id": str(r[0]),
                    "commit_hash": r[1],
                    "message": r[2],
                    "author": r[3],
                    "date": str(r[4]) if r[4] else None,
                    "branch": r[5],
                    "repo_url": r[6],
                    "affected_files": r[7],
                    "risk_level": r[8],
                }
                for r in rows
            ],
        }
    finally:
        db.close()
|
||||
|
||||
|
||||
@router.get("/{commit_id}")
async def get_commit(commit_id: str):
    """Get details of a single compliance commit."""
    db = SessionLocal()
    try:
        row = db.execute(text("""
            SELECT * FROM compliance_commits WHERE id = CAST(:id AS uuid)
        """), {"id": commit_id}).fetchone()

        if not row:
            raise HTTPException(status_code=404, detail="Commit not found")

        return {
            "id": str(row.id),
            "tenant_id": str(row.tenant_id),
            "project_id": str(row.project_id) if row.project_id else None,
            "commit_hash": row.commit_hash,
            "commit_message": row.commit_message,
            "commit_author": row.commit_author,
            "commit_date": str(row.commit_date) if row.commit_date else None,
            "branch": row.branch,
            "repo_url": row.repo_url,
            "affected_control_ids": row.affected_control_ids,
            "affected_files": row.affected_files,
            "risk_level": row.risk_level,
            "analysis_summary": row.analysis_summary,
            "analysis_metadata": row.analysis_metadata,
        }
    finally:
        db.close()
|
||||
@@ -1553,6 +1553,7 @@ async def get_repair_backfill_status(backfill_id: str):
|
||||
class BatchDedupRequest(BaseModel):
    dry_run: bool = True
    hint_filter: Optional[str] = None  # Only process groups matching this hint prefix
    since: Optional[str] = None  # ISO datetime; scope to controls created at/after this
|
||||
|
||||
|
||||
_batch_dedup_status: dict = {}
|
||||
@@ -1567,7 +1568,15 @@ async def _run_batch_dedup(req: BatchDedupRequest, dedup_id: str):
|
||||
runner = BatchDedupRunner(db)
|
||||
_batch_dedup_status[dedup_id] = {"status": "running", "phase": "starting"}
|
||||
|
||||
stats = await runner.run(dry_run=req.dry_run, hint_filter=req.hint_filter)
|
||||
since_dt = None
|
||||
if req.since:
|
||||
from datetime import datetime
|
||||
since_dt = datetime.fromisoformat(req.since.replace("Z", "+00:00"))
|
||||
stats = await runner.run(
|
||||
dry_run=req.dry_run,
|
||||
hint_filter=req.hint_filter,
|
||||
since=since_dt,
|
||||
)
|
||||
|
||||
_batch_dedup_status[dedup_id] = {
|
||||
"status": "completed",
|
||||
@@ -2293,18 +2302,95 @@ async def get_batch_process_status(job_id: str):
|
||||
return status
|
||||
|
||||
|
||||
class RunPass0aRequest(BaseModel):
    limit: int = 0  # 0 = no limit
    batch_size: int = 5
    use_anthropic: bool = True
    category_filter: Optional[str] = None
    source_filter: Optional[str] = None


# In-memory job registry: job_id -> status dict (lost on process restart).
_pass0a_status: dict = {}


async def _run_pass0a_background(req: RunPass0aRequest, job_id: str):
    """Run Pass 0a in background with own DB session."""
    from services.decomposition_pass import DecompositionPass
    db = SessionLocal()
    try:
        _pass0a_status[job_id] = {"status": "running"}
        dp = DecompositionPass(db)
        result = await dp.run_pass0a(
            limit=req.limit,
            batch_size=req.batch_size,
            use_anthropic=req.use_anthropic,
            category_filter=req.category_filter,
            source_filter=req.source_filter,
        )
        _pass0a_status[job_id] = {"status": "completed", **result}
        logger.info("Pass 0a job %s completed: %s", job_id, result)
    except Exception as e:
        logger.error("Pass 0a job %s failed: %s", job_id, e)
        _pass0a_status[job_id] = {"status": "failed", "error": str(e)}
    finally:
        db.close()


@router.post("/generate/run-pass0a")
async def run_pass0a(req: RunPass0aRequest):
    """Run Pass 0a (Obligation Extraction) on undecomposed controls.

    Extracts individual normative obligations from rich controls using LLM.
    Runs in background; poll status via GET /generate/pass0a-status/{job_id}.
    """
    import uuid
    job_id = str(uuid.uuid4())[:8]
    _pass0a_status[job_id] = {"status": "starting"}
    asyncio.create_task(_run_pass0a_background(req, job_id))
    return {
        "status": "running",
        "job_id": job_id,
        "message": f"Pass 0a started. Poll /generate/pass0a-status/{job_id}",
    }


@router.get("/generate/pass0a-status/{job_id}")
async def get_pass0a_status(job_id: str):
    """Get status of a Pass 0a job."""
    status = _pass0a_status.get(job_id)
    if not status:
        raise HTTPException(status_code=404, detail="Pass 0a job not found")
    return status
|
||||
|
||||
|
||||
class SubmitPass0bRequest(BaseModel):
    limit: int = 10
    batch_size: int = 5


_last_submit_batch_id: str = ""
_last_submit_time: float = 0


@router.post("/generate/submit-pass0b")
async def submit_pass0b(req: SubmitPass0bRequest):
    """Submit Pass 0b batch to Anthropic Batch API.

    Loads unprocessed obligations, applies pre-LLM filter, submits batch.
    Returns batch_id for status polling and later result processing.
    SAFETY: Refuses to submit if a batch was submitted in the last 10 minutes.
    This prevents duplicate batches from curl retries or timeouts.
    """
    import time
    global _last_submit_batch_id, _last_submit_time

    # Idempotency guard: refuse if last submit was <10 min ago
    # NOTE: the message mentions force=true, but SubmitPass0bRequest as shown
    # in this diff defines no such field.
    elapsed = time.time() - _last_submit_time
    if elapsed < 600 and _last_submit_batch_id:
        return {
            "status": "blocked",
            "reason": f"Batch {_last_submit_batch_id} was submitted {int(elapsed)}s ago. Wait {int(600 - elapsed)}s or use force=true.",
            "last_batch_id": _last_submit_batch_id,
        }
|
||||
|
||||
from services.decomposition_pass import DecompositionPass
|
||||
db = SessionLocal()
|
||||
try:
|
||||
@@ -2313,6 +2399,12 @@ async def submit_pass0b(req: SubmitPass0bRequest):
|
||||
limit=req.limit,
|
||||
batch_size=req.batch_size,
|
||||
)
|
||||
# Record successful submit
|
||||
batch_id = result.get("batch_id", "")
|
||||
if batch_id:
|
||||
_last_submit_batch_id = batch_id
|
||||
_last_submit_time = time.time()
|
||||
logger.info("Submit guard: recorded batch %s", batch_id)
|
||||
return result
|
||||
except Exception as e:
|
||||
logger.error("Submit Pass 0b failed: %s", e)
|
||||
@@ -2693,3 +2785,199 @@ async def get_quality_metrics(
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# REVIEW CANDIDATE VERIFICATION (Block B — LLM decides DUPLIKAT/VERSCHIEDEN)
|
||||
# =============================================================================
|
||||
|
||||
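# System prompt (kept in German because the parser matches the literal
# DUPLIKAT/VERSCHIEDEN labels). English gloss: "You compare pairs of compliance
# controls and decide whether they are duplicates. Answer ONLY with a JSON array,
# one object per pair. DUPLIKAT = same requirement, merely worded differently.
# VERSCHIEDEN = different requirements, even if similar words occur."
_REVIEW_VERIFY_SYSTEM = """Du vergleichst Paare von Compliance Controls und entscheidest ob sie Duplikate sind.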
|
||||
Antworte NUR mit einem JSON-Array. Fuer jedes Paar ein Objekt:
|
||||
{"pair_id": "...", "decision": "DUPLIKAT" oder "VERSCHIEDEN", "reason": "kurze Begruendung"}
|
||||
DUPLIKAT = gleiche Anforderung, nur anders formuliert.
|
||||
VERSCHIEDEN = unterschiedliche Anforderungen, auch wenn aehnliche Woerter vorkommen."""
|
||||
|
||||
|
||||
class ReviewVerifyRequest(BaseModel):
    limit: int = 0
    batch_size: int = 10
    dry_run: bool = True
|
||||
|
||||
|
||||
_review_verify_status: dict = {}
|
||||
|
||||
|
||||
async def _run_review_verify(req: ReviewVerifyRequest, job_id: str):
|
||||
from services.decomposition_pass import (
|
||||
create_anthropic_batch, fetch_batch_results, check_batch_status,
|
||||
)
|
||||
import asyncio as aio
|
||||
db = SessionLocal()
|
||||
try:
|
||||
_review_verify_status[job_id] = {"status": "loading"}
|
||||
|
||||
query = """
|
||||
SELECT r.id::text, r.candidate_control_id, r.candidate_title,
|
||||
r.matched_control_id, c2.title as matched_title,
|
||||
r.similarity_score
|
||||
FROM control_dedup_reviews r
|
||||
LEFT JOIN canonical_controls c2 ON c2.id = r.matched_control_uuid
|
||||
WHERE r.review_status = 'pending'
|
||||
ORDER BY r.similarity_score DESC
|
||||
"""
|
||||
if req.limit > 0:
|
||||
query += f" LIMIT {req.limit}"
|
||||
|
||||
rows = db.execute(text(query)).fetchall()
|
||||
total = len(rows)
|
||||
_review_verify_status[job_id] = {"status": "preparing", "total": total}
|
||||
|
||||
if total == 0:
|
||||
_review_verify_status[job_id] = {
|
||||
"status": "completed", "total": 0, "message": "No pending reviews",
|
||||
}
|
||||
return
|
||||
|
||||
if req.dry_run:
|
||||
_review_verify_status[job_id] = {
|
||||
"status": "dry_run", "total": total,
|
||||
"estimated_requests": (total + req.batch_size - 1) // req.batch_size,
|
||||
}
|
||||
return
|
||||
|
||||
# Build batch requests
|
||||
api_requests = []
|
||||
pair_map = {}
|
||||
for i in range(0, total, req.batch_size):
|
||||
batch = rows[i:i + req.batch_size]
|
||||
prompt = "Vergleiche diese Control-Paare:\n\n"
|
||||
batch_pairs = []
|
||||
for r in batch:
|
||||
pair_id = r[0][:8]
|
||||
prompt += (
|
||||
f"Paar {pair_id}:\n"
|
||||
f" A: {r[1]} — {r[2]}\n"
|
||||
f" B: {r[3]} — {r[4]}\n"
|
||||
f" Similarity: {r[5]:.3f}\n\n"
|
||||
)
|
||||
batch_pairs.append({"review_id": r[0], "candidate_id": r[1]})
|
||||
|
||||
batch_idx = i // req.batch_size
|
||||
custom_id = f"rv_b{batch_idx:05d}"
|
||||
pair_map[custom_id] = batch_pairs
|
||||
api_requests.append({
|
||||
"custom_id": custom_id,
|
||||
"params": {
|
||||
"model": "claude-haiku-4-5-20251001",
|
||||
"max_tokens": max(1024, len(batch) * 150),
|
||||
"system": [{
|
||||
"type": "text",
|
||||
"text": _REVIEW_VERIFY_SYSTEM,
|
||||
"cache_control": {"type": "ephemeral"},
|
||||
}],
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
},
|
||||
})
|
||||
|
||||
_review_verify_status[job_id] = {
|
||||
"status": "submitting", "total": total, "requests": len(api_requests),
|
||||
}
|
||||
batch_result = await create_anthropic_batch(api_requests)
|
||||
batch_id = batch_result.get("id", "")
|
||||
_review_verify_status[job_id] = {
|
||||
"status": "batch_submitted", "batch_id": batch_id,
|
||||
"total": total, "requests": len(api_requests),
|
||||
}
|
||||
|
||||
# Poll for completion
|
||||
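        for _ in range(720):
            await aio.sleep(10)
            status = await check_batch_status(batch_id)
            if status.get("processing_status") == "ended":
                break
        else:
            # 720 polls x 10 s = 2 h without completion: do not fetch results
            raise TimeoutError(f"Batch {batch_id} did not finish within 2 hours")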
|
||||
|
||||
# Process results
|
||||
results = await fetch_batch_results(batch_id)
|
||||
duplicates = 0
|
||||
different = 0
|
||||
errors = 0
|
||||
|
||||
for result in results:
|
||||
custom_id = result.get("custom_id", "")
|
||||
result_data = result.get("result", {})
|
||||
if result_data.get("type") != "succeeded":
|
||||
errors += 1
|
||||
continue
|
||||
|
||||
content = result_data.get("message", {}).get("content", [])
|
||||
text_content = content[0].get("text", "") if content else ""
|
||||
|
||||
try:
|
||||
import json as jmod
|
||||
import re
|
||||
json_matches = re.findall(r'\{[^}]+\}', text_content)
|
||||
pairs = pair_map.get(custom_id, [])
|
||||
|
||||
for j, match_str in enumerate(json_matches):
|
||||
try:
|
||||
parsed = jmod.loads(match_str)
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
decision = parsed.get("decision", "").upper()
|
||||
if j < len(pairs):
|
||||
review_id = pairs[j]["review_id"]
|
||||
if "DUPLIKAT" in decision:
|
||||
db.execute(text("""
|
||||
UPDATE control_dedup_reviews
|
||||
SET review_status = 'duplicate', review_notes = :notes
|
||||
WHERE id = CAST(:rid AS uuid)
|
||||
"""), {"rid": review_id, "notes": parsed.get("reason", "")})
|
||||
duplicates += 1
|
||||
else:
|
||||
db.execute(text("""
|
||||
UPDATE control_dedup_reviews
|
||||
SET review_status = 'different', review_notes = :notes
|
||||
WHERE id = CAST(:rid AS uuid)
|
||||
"""), {"rid": review_id, "notes": parsed.get("reason", "")})
|
||||
different += 1
|
||||
|
||||
db.commit()
|
||||
except Exception as e:
|
||||
logger.error("Review verify parse error: %s", e)
|
||||
errors += 1
|
||||
try:
|
||||
db.rollback()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
_review_verify_status[job_id] = {
|
||||
"status": "completed", "batch_id": batch_id, "total": total,
|
||||
"duplicates": duplicates, "different": different, "errors": errors,
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error("Review verify %s failed: %s", job_id, e)
|
||||
_review_verify_status[job_id] = {"status": "failed", "error": str(e)}
|
||||
finally:
|
||||
db.close()
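The result handling in `_run_review_verify` pairs each JSON object in the reply with a review row by position. Since the system prompt already asks the model to echo a `pair_id`, one alternative is to key the decisions by that id. The following is a minimal sketch under that assumption; `parse_decisions` is a hypothetical helper, not part of this diff, and it mirrors the diff's rule that anything not containing `DUPLIKAT` counts as `VERSCHIEDEN`.

```python
import json
import re
from typing import Dict


def parse_decisions(reply: str) -> Dict[str, dict]:
    """Return {pair_id: {"decision": ..., "reason": ...}} from a model reply."""
    parsed = None
    # First try the outermost [...] span as the JSON array the prompt asks for.
    start, end = reply.find("["), reply.rfind("]")
    if start != -1 and end > start:
        try:
            parsed = json.loads(reply[start:end + 1])
        except json.JSONDecodeError:
            parsed = None
    if not isinstance(parsed, list):
        # Fallback: pick out flat {...} objects one by one, skipping bad ones.
        parsed = []
        for chunk in re.findall(r"\{[^{}]+\}", reply):
            try:
                parsed.append(json.loads(chunk))
            except json.JSONDecodeError:
                continue
    out: Dict[str, dict] = {}
    for item in parsed:
        if not isinstance(item, dict) or "pair_id" not in item:
            continue
        decision = str(item.get("decision", "")).upper()
        out[str(item["pair_id"])] = {
            "decision": "DUPLIKAT" if "DUPLIKAT" in decision else "VERSCHIEDEN",
            "reason": str(item.get("reason", "")),
        }
    return out
```

With a mapping like this, a caller could look up each stored pair by the 8-character prefix it put into the prompt and leave reviews untouched when the model skipped or reordered a pair.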
|
||||
|
||||
|
||||
@router.post("/generate/review-verify")
async def start_review_verify(req: ReviewVerifyRequest):
    """LLM-verify review candidates (DUPLIKAT/VERSCHIEDEN) via Haiku Batch."""
    import uuid as uuid_mod
    job_id = str(uuid_mod.uuid4())[:8]
    _review_verify_status[job_id] = {"status": "starting"}
    asyncio.create_task(_run_review_verify(req, job_id))
    return {
        "status": "running", "job_id": job_id,
        "message": f"Poll /generate/review-verify-status/{job_id}",
    }
|
||||
|
||||
|
||||
@router.get("/generate/review-verify-status/{job_id}")
|
||||
async def get_review_verify_status(job_id: str):
|
||||
status = _review_verify_status.get(job_id)
|
||||
if not status:
|
||||
raise HTTPException(status_code=404, detail="Review verify job not found")
|
||||
return status
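The background job behind these endpoints waits for the batch by polling its status at a fixed interval. A bounded polling loop of that kind might be factored out roughly as follows; `wait_until_ended` and its `check` callback are hypothetical stand-ins for illustration (in the diff the status comes from `check_batch_status(batch_id)`), and raising on timeout is this sketch's choice.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_ended(
    check: Callable[[], Awaitable[dict]],
    interval_s: float = 10.0,
    max_polls: int = 720,
) -> dict:
    """Poll `check` until processing_status == 'ended'; raise on timeout."""
    for _ in range(max_polls):
        status = await check()
        if status.get("processing_status") == "ended":
            return status
        await asyncio.sleep(interval_s)
    raise TimeoutError(f"batch not finished after {max_polls} polls")
```

Passing the status check in as a callback keeps the helper independent of any particular batch client and makes the timeout path easy to exercise in a test.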
|
||||
|
||||
@@ -0,0 +1,224 @@
|
||||
"""Decision Events API — G3 Full Decision Memory.
|
||||
|
||||
Event-stream for each control's compliance lifecycle:
|
||||
assessment → decision → fix → verification → (failure → new cycle)
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import uuid
|
||||
from typing import Optional
|
||||
|
||||
from fastapi import APIRouter, HTTPException, Query
|
||||
from pydantic import BaseModel
|
||||
from sqlalchemy import text
|
||||
|
||||
from db.session import SessionLocal
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
router = APIRouter(prefix="/v1/decision-events", tags=["decision-events"])
|
||||
|
||||
|
||||
class CreateEventRequest(BaseModel):
    control_uuid: str
    decision_trace_id: Optional[str] = None
    tenant_id: Optional[str] = None
    event_type: str
    input_state: dict = {}
    output_state: dict = {}
    summary: Optional[str] = None
    actor: Optional[str] = None
    evidence_ids: list[str] = []
    metadata: dict = {}
|
||||
|
||||
|
||||
@router.post("")
|
||||
async def create_event(req: CreateEventRequest):
|
||||
"""Record a decision event in the compliance lifecycle."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
eid = str(uuid.uuid4())
|
||||
db.execute(text("""
|
||||
INSERT INTO decision_events
|
||||
(id, decision_trace_id, control_uuid, tenant_id,
|
||||
event_type, input_state, output_state,
|
||||
summary, actor, evidence_ids, metadata)
|
||||
VALUES
|
||||
(CAST(:id AS uuid),
|
||||
CASE WHEN :trace_id IS NOT NULL THEN CAST(:trace_id AS uuid) ELSE NULL END,
|
||||
CAST(:control_uuid AS uuid),
|
||||
CASE WHEN :tenant_id IS NOT NULL THEN CAST(:tenant_id AS uuid) ELSE NULL END,
|
||||
:event_type, CAST(:input AS jsonb), CAST(:output AS jsonb),
|
||||
:summary, :actor, CAST(:evidence AS jsonb), CAST(:meta AS jsonb))
|
||||
"""), {
|
||||
"id": eid,
|
||||
"trace_id": req.decision_trace_id,
|
||||
"control_uuid": req.control_uuid,
|
||||
"tenant_id": req.tenant_id,
|
||||
"event_type": req.event_type,
|
||||
"input": json.dumps(req.input_state),
|
||||
"output": json.dumps(req.output_state),
|
||||
"summary": req.summary,
|
||||
"actor": req.actor,
|
||||
"evidence": json.dumps(req.evidence_ids),
|
||||
"meta": json.dumps(req.metadata),
|
||||
})
|
||||
db.commit()
|
||||
return {"id": eid, "event_type": req.event_type, "status": "recorded"}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("")
|
||||
async def list_events(
|
||||
control_uuid: Optional[str] = None,
|
||||
tenant_id: Optional[str] = None,
|
||||
event_type: Optional[str] = None,
|
||||
limit: int = Query(100, ge=1, le=1000),
|
||||
offset: int = Query(0, ge=0),
|
||||
):
|
||||
"""List decision events with filters."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
clauses = []
|
||||
params: dict = {"limit": limit, "offset": offset}
|
||||
|
||||
if control_uuid:
|
||||
clauses.append("de.control_uuid = CAST(:cuuid AS uuid)")
|
||||
params["cuuid"] = control_uuid
|
||||
if tenant_id:
|
||||
clauses.append("de.tenant_id = CAST(:tid AS uuid)")
|
||||
params["tid"] = tenant_id
|
||||
if event_type:
|
||||
clauses.append("de.event_type = :etype")
|
||||
params["etype"] = event_type
|
||||
|
||||
where = "WHERE " + " AND ".join(clauses) if clauses else ""
|
||||
|
||||
rows = db.execute(text(f"""
|
||||
SELECT de.id, de.control_uuid, cc.control_id,
|
||||
de.event_type, de.summary, de.actor,
|
||||
de.input_state, de.output_state,
|
||||
de.evidence_ids, de.created_at
|
||||
FROM decision_events de
|
||||
LEFT JOIN canonical_controls cc ON cc.id = de.control_uuid
|
||||
{where}
|
||||
ORDER BY de.created_at DESC
|
||||
LIMIT :limit OFFSET :offset
|
||||
"""), params).fetchall()
|
||||
|
||||
return {
|
||||
"total": len(rows),
|
||||
"events": [
|
||||
{
|
||||
"id": str(r[0]),
|
||||
"control_uuid": str(r[1]),
|
||||
"control_id": r[2],
|
||||
"event_type": r[3],
|
||||
"summary": r[4],
|
||||
"actor": r[5],
|
||||
"input_state": r[6],
|
||||
"output_state": r[7],
|
||||
"evidence_ids": r[8],
|
||||
"created_at": str(r[9]),
|
||||
}
|
||||
for r in rows
|
||||
],
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/stats")
|
||||
async def event_stats(tenant_id: Optional[str] = None):
|
||||
"""Lifecycle statistics: cycle times, failure rates."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
tf = ""
|
||||
params: dict = {}
|
||||
if tenant_id:
|
||||
tf = "WHERE tenant_id = CAST(:tid AS uuid)"
|
||||
params["tid"] = tenant_id
|
||||
|
||||
by_type = db.execute(text(f"""
|
||||
SELECT event_type, count(*) FROM decision_events {tf}
|
||||
GROUP BY event_type ORDER BY count(*) DESC
|
||||
"""), params).fetchall()
|
||||
|
||||
total = sum(r[1] for r in by_type)
|
||||
failures = next((r[1] for r in by_type if r[0] == "failure"), 0)
|
||||
verifications = next((r[1] for r in by_type if r[0] == "verification"), 0)
|
||||
|
||||
return {
|
||||
"total_events": total,
|
||||
"by_event_type": {r[0]: r[1] for r in by_type},
|
||||
"failure_rate": round(failures / total * 100, 1) if total > 0 else 0,
|
||||
"verification_rate": round(verifications / total * 100, 1) if total > 0 else 0,
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/timeline/{control_id}")
|
||||
async def get_timeline(control_id: str):
|
||||
"""Full chronological timeline for a control's compliance lifecycle."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
# Resolve control_id to UUID
|
||||
ctrl = db.execute(text("""
|
||||
SELECT id, control_id, title FROM canonical_controls
|
||||
WHERE control_id = :cid
|
||||
"""), {"cid": control_id}).fetchone()
|
||||
|
||||
if not ctrl:
|
||||
raise HTTPException(status_code=404, detail="Control not found")
|
||||
|
||||
events = db.execute(text("""
|
||||
SELECT id, event_type, summary, actor,
|
||||
input_state, output_state, evidence_ids, created_at
|
||||
FROM decision_events
|
||||
WHERE control_uuid = CAST(:uuid AS uuid)
|
||||
ORDER BY created_at ASC
|
||||
"""), {"uuid": str(ctrl[0])}).fetchall()
|
||||
|
||||
# Determine current state from latest event
|
||||
current_state = "unknown"
|
||||
if events:
|
||||
last = events[-1]
|
||||
output = last[5] or {}
|
||||
current_state = output.get("status", last[1])
|
||||
|
||||
# Calculate avg fix time (assessment → fix_completed)
|
||||
fix_times = []
|
||||
assessment_at = None
|
||||
for e in events:
|
||||
if e[1] == "assessment":
|
||||
assessment_at = e[7]
|
||||
elif e[1] == "fix_completed" and assessment_at:
|
||||
delta = (e[7] - assessment_at).total_seconds() / 3600
|
||||
fix_times.append(delta)
|
||||
assessment_at = None
|
||||
|
||||
return {
|
||||
"control_id": ctrl[1],
|
||||
"control_title": ctrl[2],
|
||||
"current_state": current_state,
|
||||
"total_events": len(events),
|
||||
"time_to_fix_avg_hours": round(sum(fix_times) / len(fix_times), 1) if fix_times else None,
|
||||
"events": [
|
||||
{
|
||||
"id": str(e[0]),
|
||||
"type": e[1],
|
||||
"summary": e[2],
|
||||
"actor": e[3],
|
||||
"input_state": e[4],
|
||||
"output_state": e[5],
|
||||
"evidence_count": len(e[6]) if e[6] else 0,
|
||||
"at": str(e[7]),
|
||||
}
|
||||
for e in events
|
||||
],
|
||||
}
|
||||
finally:
|
||||
db.close()
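The average fix time in `get_timeline` pairs each `assessment` event with the next `fix_completed` event. Isolated from the database access, that pairing can be sketched as a pure function; `avg_fix_hours` and the `(event_type, timestamp)` tuple shape are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def avg_fix_hours(events: List[Tuple[str, datetime]]) -> Optional[float]:
    """Average hours from each 'assessment' to the next 'fix_completed'."""
    durations = []
    assessment_at = None
    for event_type, at in events:  # events must be sorted oldest first
        if event_type == "assessment":
            assessment_at = at  # a later assessment restarts the clock
        elif event_type == "fix_completed" and assessment_at is not None:
            durations.append((at - assessment_at).total_seconds() / 3600)
            assessment_at = None  # wait for the next assessment
    return round(sum(durations) / len(durations), 1) if durations else None


t0 = datetime(2024, 1, 1)
example = [
    ("assessment", t0),
    ("decision", t0 + timedelta(hours=1)),
    ("fix_completed", t0 + timedelta(hours=6)),
    ("failure", t0 + timedelta(hours=10)),
    ("assessment", t0 + timedelta(hours=12)),
    ("fix_completed", t0 + timedelta(hours=15)),
]
print(avg_fix_hours(example))  # two cycles of 6 h and 3 h
```

A `fix_completed` event with no preceding assessment is ignored, which matches the `and assessment_at` condition in the endpoint.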
|
||||
@@ -0,0 +1,404 @@
|
||||
"""Decision Trace API — G1 Compliance Execution Layer.
|
||||
|
||||
Tracks compliance decisions per control: who decided, when, why,
|
||||
what evidence supports it, and what's the remediation plan.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import uuid
|
||||
from datetime import datetime
|
||||
from typing import Optional
|
||||
|
||||
from fastapi import APIRouter, HTTPException, Query
|
||||
from pydantic import BaseModel
|
||||
from sqlalchemy import text
|
||||
|
||||
from db.session import SessionLocal
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
router = APIRouter(prefix="/v1/decision-traces", tags=["decision-traces"])
|
||||
|
||||
|
||||
# ── Request/Response Models ──────────────────────────────────────────
|
||||
|
||||
|
||||
class CreateDecisionRequest(BaseModel):
|
||||
control_uuid: str
|
||||
regulation_id: Optional[str] = None
|
||||
obligation_id: Optional[str] = None
|
||||
status: str = "not_assessed"
|
||||
decision_reason: Optional[str] = None
|
||||
decided_by: Optional[str] = None
|
||||
fix_strategy: Optional[str] = None
|
||||
fix_owner: Optional[str] = None
|
||||
fix_target_date: Optional[str] = None
|
||||
evidence_ids: list[str] = []
|
||||
confidence: float = 0.0
|
||||
tenant_id: Optional[str] = None
|
||||
project_id: Optional[str] = None
|
||||
metadata: dict = {}
|
||||
|
||||
|
||||
class UpdateDecisionRequest(BaseModel):
|
||||
status: Optional[str] = None
|
||||
decision_reason: Optional[str] = None
|
||||
decided_by: Optional[str] = None
|
||||
fix_strategy: Optional[str] = None
|
||||
fix_owner: Optional[str] = None
|
||||
fix_target_date: Optional[str] = None
|
||||
fix_completed_date: Optional[str] = None
|
||||
evidence_ids: Optional[list[str]] = None
|
||||
confidence: Optional[float] = None
|
||||
metadata: Optional[dict] = None
|
||||
|
||||
|
||||
# ── Endpoints ────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
@router.post("")
|
||||
async def create_decision(req: CreateDecisionRequest):
|
||||
"""Record a new compliance decision for a control."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
trace_id = str(uuid.uuid4())
|
||||
db.execute(text("""
|
||||
INSERT INTO decision_traces
|
||||
(id, control_uuid, regulation_id, obligation_id,
|
||||
status, decision_reason, decided_by, decided_at,
|
||||
fix_strategy, fix_owner, fix_target_date,
|
||||
evidence_ids, confidence, tenant_id, project_id, metadata)
|
||||
VALUES
|
||||
(CAST(:id AS uuid), CAST(:control_uuid AS uuid), :regulation_id, :obligation_id,
|
||||
:status, :decision_reason, :decided_by, NOW(),
|
||||
:fix_strategy, :fix_owner, :fix_target_date,
|
||||
CAST(:evidence_ids AS jsonb), :confidence,
|
||||
:tenant_id, :project_id, CAST(:metadata AS jsonb))
|
||||
"""), {
|
||||
"id": trace_id,
|
||||
"control_uuid": req.control_uuid,
|
||||
"regulation_id": req.regulation_id,
|
||||
"obligation_id": req.obligation_id,
|
||||
"status": req.status,
|
||||
"decision_reason": req.decision_reason,
|
||||
"decided_by": req.decided_by,
|
||||
"fix_strategy": req.fix_strategy,
|
||||
"fix_owner": req.fix_owner,
|
||||
"fix_target_date": req.fix_target_date,
|
||||
"evidence_ids": json.dumps(req.evidence_ids),
|
||||
"confidence": req.confidence,
|
||||
"tenant_id": req.tenant_id,
|
||||
"project_id": req.project_id,
|
||||
"metadata": json.dumps(req.metadata),
|
||||
})
|
||||
db.commit()
|
||||
return {"id": trace_id, "status": "created"}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("")
|
||||
async def list_decisions(
|
||||
control_uuid: Optional[str] = None,
|
||||
status: Optional[str] = None,
|
||||
tenant_id: Optional[str] = None,
|
||||
limit: int = Query(50, ge=1, le=500),
|
||||
offset: int = Query(0, ge=0),
|
||||
):
|
||||
"""List decision traces with optional filters."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
clauses = []
|
||||
params: dict = {"limit": limit, "offset": offset}
|
||||
|
||||
if control_uuid:
|
||||
clauses.append("dt.control_uuid = CAST(:control_uuid AS uuid)")
|
||||
params["control_uuid"] = control_uuid
|
||||
if status:
|
||||
clauses.append("dt.status = :status")
|
||||
params["status"] = status
|
||||
if tenant_id:
|
||||
clauses.append("dt.tenant_id = CAST(:tenant_id AS uuid)")
|
||||
params["tenant_id"] = tenant_id
|
||||
|
||||
where = "WHERE " + " AND ".join(clauses) if clauses else ""
|
||||
|
||||
rows = db.execute(text(f"""
|
||||
SELECT dt.id, dt.control_uuid, cc.control_id, cc.title,
|
||||
dt.status, dt.decision_reason, dt.decided_by, dt.decided_at,
|
||||
dt.fix_strategy, dt.fix_owner, dt.fix_target_date, dt.fix_completed_date,
|
||||
dt.evidence_ids, dt.confidence, dt.regulation_id
|
||||
FROM decision_traces dt
|
||||
LEFT JOIN canonical_controls cc ON cc.id = dt.control_uuid
|
||||
{where}
|
||||
ORDER BY dt.decided_at DESC NULLS LAST
|
||||
LIMIT :limit OFFSET :offset
|
||||
"""), params).fetchall()
|
||||
|
||||
total = db.execute(text(f"""
|
||||
SELECT count(*) FROM decision_traces dt {where}
|
||||
"""), params).scalar()
|
||||
|
||||
return {
|
||||
"total": total,
|
||||
"decisions": [
|
||||
{
|
||||
"id": str(r[0]),
|
||||
"control_uuid": str(r[1]),
|
||||
"control_id": r[2],
|
||||
"control_title": r[3],
|
||||
"status": r[4],
|
||||
"decision_reason": r[5],
|
||||
"decided_by": r[6],
|
||||
"decided_at": str(r[7]) if r[7] else None,
|
||||
"fix_strategy": r[8],
|
||||
"fix_owner": r[9],
|
||||
"fix_target_date": str(r[10]) if r[10] else None,
|
||||
"fix_completed_date": str(r[11]) if r[11] else None,
|
||||
"evidence_ids": r[12],
|
||||
"confidence": float(r[13]) if r[13] else 0,
|
||||
"regulation_id": r[14],
|
||||
}
|
||||
for r in rows
|
||||
],
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/stats")
|
||||
async def decision_stats(tenant_id: Optional[str] = None):
|
||||
"""Dashboard statistics for compliance decisions."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
tenant_filter = ""
|
||||
params: dict = {}
|
||||
if tenant_id:
|
||||
tenant_filter = "WHERE tenant_id = CAST(:tenant_id AS uuid)"
|
||||
params["tenant_id"] = tenant_id
|
||||
|
||||
stats = db.execute(text(f"""
|
||||
SELECT status, count(*) FROM decision_traces
|
||||
{tenant_filter}
|
||||
GROUP BY status
|
||||
"""), params).fetchall()
|
||||
|
||||
total = sum(r[1] for r in stats)
|
||||
by_status = {r[0]: r[1] for r in stats}
|
||||
|
||||
return {
|
||||
"total_decisions": total,
|
||||
"by_status": by_status,
|
||||
"compliance_rate": round(
|
||||
by_status.get("compliant", 0) / total * 100, 1
|
||||
) if total > 0 else 0,
|
||||
"pending_remediation": by_status.get("under_remediation", 0),
|
||||
"not_assessed": by_status.get("not_assessed", 0),
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/{trace_id}")
|
||||
async def get_decision(trace_id: str):
|
||||
"""Get a single decision trace."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
row = db.execute(text("""
|
||||
SELECT dt.*, cc.control_id, cc.title, cc.source_citation
|
||||
FROM decision_traces dt
|
||||
LEFT JOIN canonical_controls cc ON cc.id = dt.control_uuid
|
||||
WHERE dt.id = CAST(:id AS uuid)
|
||||
"""), {"id": trace_id}).fetchone()
|
||||
|
||||
if not row:
|
||||
raise HTTPException(status_code=404, detail="Decision trace not found")
|
||||
|
||||
return {
|
||||
"id": str(row.id),
|
||||
"control_uuid": str(row.control_uuid),
|
||||
"control_id": row.control_id,
|
||||
"control_title": row.title,
|
||||
"regulation_id": row.regulation_id,
|
||||
"obligation_id": row.obligation_id,
|
||||
"status": row.status,
|
||||
"decision_reason": row.decision_reason,
|
||||
"decided_by": row.decided_by,
|
||||
"decided_at": str(row.decided_at) if row.decided_at else None,
|
||||
"fix_strategy": row.fix_strategy,
|
||||
"fix_owner": row.fix_owner,
|
||||
"fix_target_date": str(row.fix_target_date) if row.fix_target_date else None,
|
||||
"fix_completed_date": str(row.fix_completed_date) if row.fix_completed_date else None,
|
||||
"evidence_ids": row.evidence_ids,
|
||||
"confidence": float(row.confidence) if row.confidence else 0,
|
||||
"source_citation": row.source_citation,
|
||||
"metadata": row.metadata,
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.put("/{trace_id}")
|
||||
async def update_decision(trace_id: str, req: UpdateDecisionRequest):
|
||||
"""Update a decision trace (status, fix progress, evidence)."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
updates = []
|
||||
params: dict = {"id": trace_id}
|
||||
|
||||
if req.status is not None:
|
||||
updates.append("status = :status")
|
||||
params["status"] = req.status
|
||||
if req.decision_reason is not None:
|
||||
updates.append("decision_reason = :reason")
|
||||
params["reason"] = req.decision_reason
|
||||
if req.decided_by is not None:
|
||||
updates.append("decided_by = :decided_by")
|
||||
params["decided_by"] = req.decided_by
|
||||
if req.fix_strategy is not None:
|
||||
updates.append("fix_strategy = :fix_strategy")
|
||||
params["fix_strategy"] = req.fix_strategy
|
||||
if req.fix_owner is not None:
|
||||
updates.append("fix_owner = :fix_owner")
|
||||
params["fix_owner"] = req.fix_owner
|
||||
if req.fix_target_date is not None:
|
||||
updates.append("fix_target_date = :fix_target")
|
||||
params["fix_target"] = req.fix_target_date
|
||||
if req.fix_completed_date is not None:
|
||||
updates.append("fix_completed_date = :fix_completed")
|
||||
params["fix_completed"] = req.fix_completed_date
|
||||
if req.evidence_ids is not None:
|
||||
updates.append("evidence_ids = CAST(:evidence AS jsonb)")
|
||||
params["evidence"] = json.dumps(req.evidence_ids)
|
||||
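        if req.confidence is not None:
            updates.append("confidence = :confidence")
            params["confidence"] = req.confidence
        # UpdateDecisionRequest also carries metadata; persist it like evidence_ids
        if req.metadata is not None:
            updates.append("metadata = CAST(:metadata AS jsonb)")
            params["metadata"] = json.dumps(req.metadata)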
|
||||
|
||||
if not updates:
|
||||
raise HTTPException(status_code=400, detail="No fields to update")
|
||||
|
||||
result = db.execute(text(f"""
|
||||
UPDATE decision_traces SET {', '.join(updates)}
|
||||
WHERE id = CAST(:id AS uuid)
|
||||
"""), params)
|
||||
db.commit()
|
||||
|
||||
if result.rowcount == 0:
|
||||
raise HTTPException(status_code=404, detail="Decision trace not found")
|
||||
|
||||
return {"status": "updated", "id": trace_id}
|
||||
finally:
|
||||
db.close()
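`update_decision` builds its `SET` clause from whichever request fields are not `None`. The same pattern can be written table-driven, so column names only ever come from a fixed mapping and values only ever travel as bound parameters. This is a sketch under assumptions: `UPDATABLE` and `build_update` are hypothetical names, and the mapping lists only a few of the columns for brevity.

```python
from typing import Any, Dict, Tuple

# Request field -> column name; anything not listed here is never written.
UPDATABLE = {
    "status": "status",
    "decision_reason": "decision_reason",
    "fix_owner": "fix_owner",
    "confidence": "confidence",
}


def build_update(fields: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (SET clause, bind params) for the non-None whitelisted fields."""
    sets, params = [], {}
    for name, column in UPDATABLE.items():
        value = fields.get(name)
        if value is None:
            continue  # None means "leave this column unchanged"
        sets.append(f"{column} = :{name}")
        params[name] = value
    if not sets:
        raise ValueError("No fields to update")
    return ", ".join(sets), params
```

The returned clause would be interpolated into `UPDATE decision_traces SET ... WHERE id = CAST(:id AS uuid)` and the params merged with the `id` binding, as the endpoint already does by hand.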
|
||||
|
||||
|
||||
# ── Full Trace Endpoint ──────────────────────────────────────────────
|
||||
|
||||
|
||||
full_trace_router = APIRouter(prefix="/v1/controls", tags=["decision-traces"])
|
||||
|
||||
|
||||
@full_trace_router.get("/{control_id}/full-trace")
|
||||
async def get_full_trace(control_id: str):
|
||||
"""Get the complete Decision Trace chain for a control.
|
||||
|
||||
Returns: Regulation → Obligation → Control → Master Control → Decision → Evidence
|
||||
"""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
# 1. Control
|
||||
ctrl = db.execute(text("""
|
||||
SELECT id, control_id, title, objective, severity,
|
||||
source_citation, source_original_text,
|
||||
verification_method, category,
|
||||
generation_metadata->>'merge_group_hint' AS merge_hint
|
||||
FROM canonical_controls
|
||||
WHERE control_id = :cid
|
||||
"""), {"cid": control_id}).fetchone()
|
||||
|
||||
if not ctrl:
|
||||
raise HTTPException(status_code=404, detail="Control not found")
|
||||
|
||||
# 2. Regulation (from source_citation)
|
||||
citation = ctrl.source_citation or {}
|
||||
regulation = {
|
||||
"source": citation.get("source"),
|
||||
"article": citation.get("article"),
|
||||
"paragraph": citation.get("paragraph"),
|
||||
"source_type": citation.get("source_type"),
|
||||
"license": citation.get("license"),
|
||||
}
|
||||
|
||||
# 3. Obligation (from parent links)
|
||||
obligations = db.execute(text("""
|
||||
SELECT oc.candidate_id, oc.obligation_text, oc.action,
|
||||
oc.object, oc.normative_strength
|
||||
FROM obligation_candidates oc
|
||||
WHERE oc.parent_control_uuid = CAST(:uuid AS uuid)
|
||||
ORDER BY oc.candidate_id
|
||||
LIMIT 10
|
||||
"""), {"uuid": str(ctrl.id)}).fetchall()
|
||||
|
||||
# 4. Master Control (if member)
|
||||
master = db.execute(text("""
|
||||
SELECT mc.master_control_id, mc.canonical_name, mc.phases_covered
|
||||
FROM master_control_members mcm
|
||||
JOIN master_controls mc ON mc.id = mcm.master_control_uuid
|
||||
WHERE mcm.control_uuid = CAST(:uuid AS uuid)
|
||||
LIMIT 1
|
||||
"""), {"uuid": str(ctrl.id)}).fetchone()
|
||||
|
||||
# 5. Decision Traces
|
||||
decisions = db.execute(text("""
|
||||
SELECT id, status, decision_reason, decided_by, decided_at,
|
||||
fix_strategy, fix_owner, evidence_ids, confidence
|
||||
FROM decision_traces
|
||||
WHERE control_uuid = CAST(:uuid AS uuid)
|
||||
ORDER BY decided_at DESC NULLS LAST
|
||||
"""), {"uuid": str(ctrl.id)}).fetchall()
|
||||
|
||||
return {
|
||||
"control": {
|
||||
"id": ctrl.control_id,
|
||||
"uuid": str(ctrl.id),
|
||||
"title": ctrl.title,
|
||||
"objective": ctrl.objective,
|
||||
"severity": ctrl.severity,
|
||||
"category": ctrl.category,
|
||||
"verification_method": ctrl.verification_method,
|
||||
},
|
||||
"regulation": regulation,
|
||||
"original_text": ctrl.source_original_text[:500] if ctrl.source_original_text else None,
|
||||
"obligations": [
|
||||
{
|
||||
"id": o.candidate_id,
|
||||
"text": o.obligation_text,
|
||||
"action": o.action,
|
||||
"object": o.object,
|
||||
"strength": o.normative_strength,
|
||||
}
|
||||
for o in obligations
|
||||
],
|
||||
"master_control": {
|
||||
"id": master.master_control_id,
|
||||
"name": master.canonical_name,
|
||||
"phases": master.phases_covered,
|
||||
} if master else None,
|
||||
"decisions": [
|
||||
{
|
||||
"id": str(d.id),
|
||||
"status": d.status,
|
||||
"reason": d.decision_reason,
|
||||
"decided_by": d.decided_by,
|
||||
"decided_at": str(d.decided_at) if d.decided_at else None,
|
||||
"fix_strategy": d.fix_strategy,
|
||||
"fix_owner": d.fix_owner,
|
||||
"evidence_count": len(d.evidence_ids) if d.evidence_ids else 0,
|
||||
"confidence": float(d.confidence) if d.confidence else 0,
|
||||
}
|
||||
for d in decisions
|
||||
],
|
||||
"latest_status": decisions[0].status if decisions else "not_assessed",
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
@@ -0,0 +1,258 @@
|
||||
"""Pre-Deployment Enforcement API — G4.
|
||||
|
||||
CI/CD gate: checks if a deployment is safe by evaluating the compliance
|
||||
status of all affected controls. Blocks deploys with non-compliant controls.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import uuid
|
||||
from typing import Optional
|
||||
|
||||
from fastapi import APIRouter, HTTPException, Query
|
||||
from pydantic import BaseModel
|
||||
from sqlalchemy import text
|
||||
|
||||
from db.session import SessionLocal
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
router = APIRouter(prefix="/v1/deployment-checks", tags=["deployment-checks"])
|
||||
|
||||
SEVERITY_WEIGHT = {
    "critical": 4.0,
    "high": 3.0,
    "medium": 2.0,
    "low": 1.0,
}
|
||||
|
||||
|
||||
class DeployCheckRequest(BaseModel):
    tenant_id: str
    commit_hash: str
    branch: Optional[str] = None
    environment: str = "production"
    affected_control_ids: list[str] = []
    metadata: dict = {}


class OverrideRequest(BaseModel):
    override_by: str
    override_reason: str
|
||||
|
||||
|
||||
@router.post("")
|
||||
async def check_deployment(req: DeployCheckRequest):
|
||||
"""Check if a deployment is safe. Returns verdict: approved/blocked."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
check_id = str(uuid.uuid4())
|
||||
|
||||
blocking = []
|
||||
warnings = []
|
||||
risk_score = 0.0
|
||||
|
||||
if req.affected_control_ids:
|
||||
# Look up latest decision status for each affected control
|
||||
for ctrl_id in req.affected_control_ids:
|
||||
row = db.execute(text("""
|
||||
SELECT dt.status, dt.decision_reason, dt.fix_strategy,
|
||||
cc.control_id, cc.title, cc.severity
|
||||
FROM decision_traces dt
|
||||
JOIN canonical_controls cc ON cc.id = dt.control_uuid
|
||||
WHERE cc.control_id = :cid
|
||||
ORDER BY dt.decided_at DESC NULLS LAST
|
||||
LIMIT 1
|
||||
"""), {"cid": ctrl_id}).fetchone()
|
||||
|
||||
if not row:
|
||||
# No decision → treat as not_assessed (warning)
|
||||
warnings.append({
|
||||
"control_id": ctrl_id,
|
||||
"status": "not_assessed",
|
||||
"reason": "No compliance decision recorded",
|
||||
})
|
||||
continue
|
||||
|
||||
status = row[0]
|
||||
severity = row[5] or "medium"
|
||||
weight = SEVERITY_WEIGHT.get(severity, 2.0)
|
||||
|
||||
if status in ("not_compliant", "under_remediation"):
|
||||
blocking.append({
|
||||
"control_id": row[3],
|
||||
"title": row[4],
|
||||
"status": status,
|
||||
"reason": row[1],
|
||||
"fix_strategy": row[2],
|
||||
"severity": severity,
|
||||
})
|
||||
risk_score += weight
|
||||
elif status == "partially_compliant":
|
||||
warnings.append({
|
||||
"control_id": row[3],
|
||||
"title": row[4],
|
||||
"status": status,
|
||||
"reason": row[1],
|
||||
"severity": severity,
|
||||
})
|
||||
risk_score += weight * 0.5
|
||||
|
||||
# Also check for open failure events (G3)
|
||||
        if req.affected_control_ids:
            # Bind each control ID as a parameter instead of formatting it into the SQL
            cid_params = {f"cid_{i}": c for i, c in enumerate(req.affected_control_ids)}
            placeholders = ", ".join(f":{name}" for name in cid_params)
            open_failures = db.execute(text(f"""
                SELECT cc.control_id, de.summary
                FROM decision_events de
                JOIN canonical_controls cc ON cc.id = de.control_uuid
                WHERE cc.control_id IN ({placeholders})
                  AND de.event_type = 'failure'
                  AND de.created_at > NOW() - interval '30 days'
                  AND NOT EXISTS (
                      SELECT 1 FROM decision_events de2
                      WHERE de2.control_uuid = de.control_uuid
                        AND de2.event_type = 'verification'
                        AND de2.created_at > de.created_at
                  )
            """), cid_params).fetchall()
|
||||
|
||||
for f in open_failures:
|
||||
if not any(b["control_id"] == f[0] for b in blocking):
|
||||
blocking.append({
|
||||
"control_id": f[0],
|
||||
"status": "open_failure",
|
||||
"reason": f[1] or "Unresolved failure event",
|
||||
"severity": "high",
|
||||
})
|
||||
risk_score += 3.0
|
||||
|
||||
verdict = "approved" if not blocking else "blocked"
|
||||
summary = (
|
||||
f"{len(blocking)} blocking, {len(warnings)} warnings. "
|
||||
+ ("Deploy approved." if verdict == "approved"
|
||||
else f"Fix {', '.join(b['control_id'] for b in blocking)} before deploying.")
|
||||
)
|
||||
|
||||
# Store check result
|
||||
db.execute(text("""
|
||||
INSERT INTO deployment_checks
|
||||
(id, tenant_id, commit_hash, branch, environment,
|
||||
verdict, affected_control_ids, blocking_controls,
|
||||
warning_controls, risk_score, summary, metadata)
|
||||
VALUES
|
||||
(CAST(:id AS uuid), CAST(:tid AS uuid), :hash, :branch, :env,
|
||||
:verdict, CAST(:affected AS jsonb), CAST(:blocking AS jsonb),
|
||||
CAST(:warnings AS jsonb), :risk, :summary, CAST(:meta AS jsonb))
|
||||
"""), {
|
||||
"id": check_id,
|
||||
"tid": req.tenant_id,
|
||||
"hash": req.commit_hash,
|
||||
"branch": req.branch,
|
||||
"env": req.environment,
|
||||
"verdict": verdict,
|
||||
"affected": json.dumps(req.affected_control_ids),
|
||||
"blocking": json.dumps(blocking),
|
||||
"warnings": json.dumps(warnings),
|
||||
"risk": risk_score,
|
||||
"summary": summary,
|
||||
"meta": json.dumps(req.metadata),
|
||||
})
|
||||
db.commit()
|
||||
|
||||
return {
|
||||
"id": check_id,
|
||||
"verdict": verdict,
|
||||
"risk_score": risk_score,
|
||||
"blocking_controls": blocking,
|
||||
"warning_controls": warnings,
|
||||
"summary": summary,
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/stats")
|
||||
async def check_stats(tenant_id: Optional[str] = None):
|
||||
"""Deployment check statistics."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
tf = ""
|
||||
params: dict = {}
|
||||
if tenant_id:
|
||||
tf = "WHERE tenant_id = CAST(:tid AS uuid)"
|
||||
params["tid"] = tenant_id
|
||||
|
||||
by_verdict = db.execute(text(f"""
|
||||
SELECT verdict, count(*) FROM deployment_checks {tf}
|
||||
GROUP BY verdict
|
||||
"""), params).fetchall()
|
||||
|
||||
total = sum(r[1] for r in by_verdict)
|
||||
verdicts = {r[0]: r[1] for r in by_verdict}
|
||||
|
||||
return {
|
||||
"total_checks": total,
|
||||
"by_verdict": verdicts,
|
||||
"approval_rate": round(
|
||||
verdicts.get("approved", 0) / total * 100, 1
|
||||
) if total > 0 else 0,
|
||||
"override_count": verdicts.get("override", 0),
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.post("/{check_id}/override")
|
||||
async def override_check(check_id: str, req: OverrideRequest):
|
||||
"""Override a blocked deployment (with justification)."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
result = db.execute(text("""
|
||||
UPDATE deployment_checks
|
||||
SET verdict = 'override', override_by = :by, override_reason = :reason
|
||||
WHERE id = CAST(:id AS uuid) AND verdict = 'blocked'
|
||||
"""), {
|
||||
"id": check_id,
|
||||
"by": req.override_by,
|
||||
"reason": req.override_reason,
|
||||
})
|
||||
db.commit()
|
||||
|
||||
if result.rowcount == 0:
|
||||
raise HTTPException(status_code=404, detail="Check not found or not blocked")
|
||||
|
||||
return {"id": check_id, "verdict": "override", "override_by": req.override_by}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/{check_id}")
|
||||
async def get_check(check_id: str):
|
||||
"""Get details of a deployment check."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
row = db.execute(text("""
|
||||
SELECT * FROM deployment_checks WHERE id = CAST(:id AS uuid)
|
||||
"""), {"id": check_id}).fetchone()
|
||||
|
||||
if not row:
|
||||
raise HTTPException(status_code=404, detail="Check not found")
|
||||
|
||||
return {
|
||||
"id": str(row.id),
|
||||
"tenant_id": str(row.tenant_id),
|
||||
"commit_hash": row.commit_hash,
|
||||
"branch": row.branch,
|
||||
"environment": row.environment,
|
||||
"verdict": row.verdict,
|
||||
"affected_control_ids": row.affected_control_ids,
|
||||
"blocking_controls": row.blocking_controls,
|
||||
"warning_controls": row.warning_controls,
|
||||
"risk_score": float(row.risk_score),
|
||||
"override_by": row.override_by,
|
||||
"override_reason": row.override_reason,
|
||||
"summary": row.summary,
|
||||
"created_at": str(row.created_at),
|
||||
}
|
||||
finally:
|
||||
db.close()
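Review note: the verdict and risk-score rules in the deployment check above are easier to unit-test once they are separated from the database calls. The following is a minimal sketch under stated assumptions, not the module's actual code. The `evaluate` function and the values in `SEVERITY_WEIGHT` are hypothetical (the real weight table is defined outside this excerpt). The 2.0 fallback weight, the 0.5 factor for `partially_compliant`, and the flat 3.0 for open failures are taken from the diff.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical weights for illustration; the real table is not shown in this diff.
SEVERITY_WEIGHT: Dict[str, float] = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 5.0}

BLOCKING_STATUSES = ("not_compliant", "under_remediation")

# control_id -> (status, severity), or None when no decision has been recorded
Decisions = Dict[str, Optional[Tuple[str, Optional[str]]]]


def evaluate(decisions: Decisions, open_failures: List[str]) -> Tuple[str, float, List[str], List[str]]:
    """Return (verdict, risk_score, blocking_ids, warning_ids)."""
    blocking: List[str] = []
    warnings: List[str] = []
    risk = 0.0

    for ctrl_id, decision in decisions.items():
        if decision is None:
            # No decision recorded: warn, but add nothing to the score.
            warnings.append(ctrl_id)
            continue
        status, severity = decision
        weight = SEVERITY_WEIGHT.get(severity or "medium", 2.0)
        if status in BLOCKING_STATUSES:
            blocking.append(ctrl_id)
            risk += weight
        elif status == "partially_compliant":
            warnings.append(ctrl_id)
            risk += weight * 0.5

    # An unverified failure event blocks a control that is not already blocking.
    for ctrl_id in open_failures:
        if ctrl_id not in blocking:
            blocking.append(ctrl_id)
            risk += 3.0

    verdict = "approved" if not blocking else "blocked"
    return verdict, risk, blocking, warnings


result = evaluate(
    {
        "CTRL-A": ("not_compliant", "high"),
        "CTRL-B": ("partially_compliant", None),
        "CTRL-C": None,
        "CTRL-D": ("compliant", "low"),
    },
    open_failures=["CTRL-A", "CTRL-D"],
)
```

With rules in this shape, the endpoint would only fetch rows and persist the result, and the scoring could be tested without a database.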
|
||||
@@ -0,0 +1,178 @@
|
||||
"""Master Control API — G-pre3.
|
||||
|
||||
Provides read access to Master Controls (lifecycle-grouped atomic controls).
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
from typing import Optional
|
||||
|
||||
from fastapi import APIRouter, HTTPException, Query
|
||||
from sqlalchemy import text
|
||||
|
||||
from db.session import SessionLocal
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
router = APIRouter(prefix="/v1/master-controls", tags=["master-controls"])
|
||||
|
||||
|
||||
@router.get("")
|
||||
async def list_master_controls(
|
||||
limit: int = Query(50, ge=1, le=500),
|
||||
offset: int = Query(0, ge=0),
|
||||
search: Optional[str] = None,
|
||||
min_phases: Optional[int] = None,
|
||||
min_controls: Optional[int] = None,
|
||||
sort: str = Query("total_controls", regex="^(total_controls|phases|name|created_at)$"),
|
||||
):
|
||||
"""List Master Controls with optional filtering."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
where_clauses = []
|
||||
params: dict = {"limit": limit, "offset": offset}
|
||||
|
||||
if search:
|
||||
where_clauses.append("mc.canonical_name ILIKE :search")
|
||||
params["search"] = f"%{search}%"
|
||||
if min_phases:
|
||||
where_clauses.append("jsonb_array_length(mc.phases_covered) >= :min_phases")
|
||||
params["min_phases"] = min_phases
|
||||
if min_controls:
|
||||
where_clauses.append("mc.total_controls >= :min_controls")
|
||||
params["min_controls"] = min_controls
|
||||
|
||||
where = ("WHERE " + " AND ".join(where_clauses)) if where_clauses else ""
|
||||
|
||||
sort_map = {
|
||||
"total_controls": "mc.total_controls DESC",
|
||||
"phases": "jsonb_array_length(mc.phases_covered) DESC",
|
||||
"name": "mc.canonical_name ASC",
|
||||
"created_at": "mc.created_at DESC",
|
||||
}
|
||||
order = sort_map.get(sort, "mc.total_controls DESC")
|
||||
|
||||
rows = db.execute(text(f"""
|
||||
SELECT mc.id, mc.master_control_id, mc.object_group_id,
|
||||
mc.canonical_name, mc.phases_covered,
|
||||
mc.phase_control_count, mc.total_controls,
|
||||
mc.created_at
|
||||
FROM master_controls mc
|
||||
{where}
|
||||
ORDER BY {order}
|
||||
LIMIT :limit OFFSET :offset
|
||||
"""), params).fetchall()
|
||||
|
||||
total = db.execute(text(f"""
|
||||
SELECT count(*) FROM master_controls mc {where}
|
||||
"""), params).scalar()
|
||||
|
||||
return {
|
||||
"total": total,
|
||||
"limit": limit,
|
||||
"offset": offset,
|
||||
"master_controls": [
|
||||
{
|
||||
"id": str(r[0]),
|
||||
"master_control_id": r[1],
|
||||
"object_group_id": r[2],
|
||||
"canonical_name": r[3],
|
||||
"phases_covered": r[4],
|
||||
"phase_control_count": r[5],
|
||||
"total_controls": r[6],
|
||||
"created_at": str(r[7]),
|
||||
}
|
||||
for r in rows
|
||||
],
|
||||
}
|
||||
finally:
|
||||
db.close()
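Review note: `list_master_controls` interpolates `{where}` and `{order}` into the SQL text. That is safe only because both strings are assembled from fixed fragments, while every user value travels as a bound parameter. The sketch below illustrates the pattern with the standard-library `sqlite3` module so it runs without a PostgreSQL instance. It is an illustration, not the module's code: `build_query`, the reduced table, and the sample rows are hypothetical, and SQLite's `LIKE` stands in for PostgreSQL's `ILIKE`.

```python
import sqlite3
from typing import Any, Dict, List, Optional, Tuple

# Whitelist: user-facing sort keys map to fixed SQL fragments.
SORT_MAP = {
    "total_controls": "total_controls DESC",
    "name": "canonical_name ASC",
}


def build_query(search: Optional[str] = None,
                min_controls: Optional[int] = None,
                sort: str = "total_controls") -> Tuple[str, Dict[str, Any]]:
    """Build SQL from fixed fragments; user values go only into params."""
    clauses: List[str] = []
    params: Dict[str, Any] = {}
    if search:
        clauses.append("canonical_name LIKE :search")
        params["search"] = f"%{search}%"
    if min_controls:
        clauses.append("total_controls >= :min_controls")
        params["min_controls"] = min_controls
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    # Unknown sort keys fall back to the default instead of reaching the SQL text.
    order = SORT_MAP.get(sort, SORT_MAP["total_controls"])
    sql = f"SELECT canonical_name FROM master_controls {where} ORDER BY {order}"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master_controls (canonical_name TEXT, total_controls INTEGER)")
conn.executemany(
    "INSERT INTO master_controls VALUES (?, ?)",
    [("Access Review", 12), ("Backup", 3), ("Access Logging", 7)],
)

# A hostile sort value never reaches the statement; the default order is used.
sql, params = build_query(search="access", min_controls=5, sort="name; DROP TABLE x")
names = [row[0] for row in conn.execute(sql, params)]
```

The endpoint additionally validates `sort` with a regular expression in `Query`, so the whitelist lookup acts as a second line of defense.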
|
||||
|
||||
|
||||
@router.get("/stats")
|
||||
async def master_control_stats():
|
||||
"""Aggregate statistics about Master Controls."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
stats = db.execute(text("""
|
||||
SELECT
|
||||
count(*) AS total_master_controls,
|
||||
sum(total_controls) AS total_member_controls,
|
||||
avg(total_controls)::int AS avg_controls_per_mc,
|
||||
max(total_controls) AS max_controls,
|
||||
avg(jsonb_array_length(phases_covered))::numeric(3,1) AS avg_phases,
|
||||
max(jsonb_array_length(phases_covered)) AS max_phases
|
||||
FROM master_controls
|
||||
""")).fetchone()
|
||||
|
||||
phase_dist = db.execute(text("""
|
||||
SELECT phase, count(*) AS control_count
|
||||
FROM master_control_members
|
||||
GROUP BY phase
|
||||
ORDER BY control_count DESC
|
||||
""")).fetchall()
|
||||
|
||||
return {
|
||||
"total_master_controls": stats[0],
|
||||
"total_member_controls": stats[1],
|
||||
"avg_controls_per_mc": stats[2],
|
||||
"max_controls": stats[3],
|
||||
"avg_phases": float(stats[4]) if stats[4] else 0,
|
||||
"max_phases": stats[5],
|
||||
"phase_distribution": {r[0]: r[1] for r in phase_dist},
|
||||
}
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@router.get("/{mc_id}")
|
||||
async def get_master_control(mc_id: str):
|
||||
"""Get a single Master Control with all phase-controls."""
|
||||
db = SessionLocal()
|
||||
try:
|
||||
mc = db.execute(text("""
|
||||
SELECT mc.id, mc.master_control_id, mc.object_group_id,
|
||||
mc.canonical_name, mc.phases_covered,
|
||||
mc.phase_control_count, mc.total_controls
|
||||
FROM master_controls mc
|
||||
WHERE mc.master_control_id = :mc_id
|
||||
"""), {"mc_id": mc_id}).fetchone()
|
||||
|
||||
if not mc:
|
||||
raise HTTPException(status_code=404, detail="Master Control not found")
|
||||
|
||||
members = db.execute(text("""
|
||||
SELECT mcm.phase, mcm.action,
|
||||
cc.control_id, cc.title, cc.severity,
|
||||
cc.source_citation->>'source' AS source
|
||||
FROM master_control_members mcm
|
||||
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
|
||||
WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
|
||||
ORDER BY mcm.phase, cc.control_id
|
||||
"""), {"mc_uuid": str(mc[0])}).fetchall()
|
||||
|
||||
# Group by phase
|
||||
phases = {}
|
||||
for phase, action, ctrl_id, title, severity, source in members:
|
||||
if phase not in phases:
|
||||
phases[phase] = []
|
||||
phases[phase].append({
|
||||
"control_id": ctrl_id,
|
||||
"title": title,
|
||||
"action": action,
|
||||
"severity": severity,
|
||||
"source": source,
|
||||
})
|
||||
|
||||
return {
|
||||
"id": str(mc[0]),
|
||||
"master_control_id": mc[1],
|
||||
"object_group_id": mc[2],
|
||||
"canonical_name": mc[3],
|
||||
"phases_covered": mc[4],
|
||||
"phase_control_count": mc[5],
|
||||
"total_controls": mc[6],
|
||||
"phases": phases,
|
||||
}
|
||||
finally:
|
||||
db.close()
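Review note: the phase grouping in `get_master_control` can be written a little more compactly with `collections.defaultdict`. A minimal sketch, assuming rows arrive already sorted by phase and control ID as in the query above. The helper `group_by_phase`, the reduced three-column row shape, and the sample phase names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Reduced row shape for illustration: (phase, action, control_id)
Member = Tuple[str, str, str]


def group_by_phase(members: List[Member]) -> Dict[str, List[dict]]:
    """Group member rows by phase, preserving the incoming row order."""
    phases: Dict[str, List[dict]] = defaultdict(list)
    for phase, action, control_id in members:
        phases[phase].append({"control_id": control_id, "action": action})
    # Return a plain dict so the result serializes like the original code's.
    return dict(phases)


grouped = group_by_phase([
    ("design", "define", "CTRL-001"),
    ("design", "review", "CTRL-002"),
    ("operate", "monitor", "CTRL-003"),
])
```

The behavior matches the explicit `if phase not in phases` check; the choice between the two is a matter of style.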
|
||||
@@ -0,0 +1,430 @@
|
||||
source: Derived from BSI QUAIDAL (Clean-Room)
|
||||
source_url: https://github.com/BSI-Bund/QUAIDAL
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
plagiarism_limit_4gram: 0.2
|
||||
generated_by_model: qwen3.5:35b-a3b
|
||||
controls:
|
||||
- id: AC-AI-DATA-QB-01-syntaktische-genauigkeit
|
||||
canonical_name: Syntaktische Genauigkeit
|
||||
description: The AI training set must be syntactically consistent, strictly following
  all defined grammar and structure rules. An error-free data structure is mandatory
  to ensure correct processing by parsers or language models. Formal correctness
  must be validated before every training run to rule out processing errors.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-01
|
||||
- MA-02
|
||||
- MA-03
|
||||
- MA-04
|
||||
- MA-05
|
||||
- MA-27
|
||||
external_refs:
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-01
|
||||
title_original_de: QB-01 Syntaktische Genauigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-01_Syntactic%20Accuracy.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-02-semantische-genauigkeit
|
||||
canonical_name: Semantische Genauigkeit
|
||||
description: The AI training data must be correct in content, so that assigned values
  match the actual facts and are not merely formally valid. Semantic mappings must
  not contain logical errors, such as classifying animals as technical devices.
  A check must verify that the meaning of each data point can be interpreted
  unambiguously and without error in the context of the application.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-05
|
||||
- MA-06
|
||||
- MA-07
|
||||
- MA-27
|
||||
external_refs:
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-02
|
||||
title_original_de: QB-02 Semantische Genauigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-02_Semantic%20Accuracy.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-03-vielfalt
|
||||
canonical_name: Vielfalt
|
||||
description: The AI training dataset must exhibit maximal variance in the relevant
  features to ensure heterogeneous input values. The range of values it contains
  must be broad enough to fully cover the variation potential of the target group.
  The data distribution must be checked before training to rule out insufficient
  diversity.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-08
|
||||
- MA-09
|
||||
- MA-10
|
||||
- MA-12
|
||||
- MA-27
|
||||
- MA-28
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-03
|
||||
title_original_de: QB-03 Vielfalt
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-03_Diversity.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0204
|
||||
- id: AC-AI-DATA-QB-04-ausgewogenheit
|
||||
canonical_name: Ausgewogenheit
|
||||
description: The training dataset must be designed so that the distribution of all
  relevant classes is proportional to the target reality, avoiding one-sided dominance
  of individual categories. No group may be systematically under- or overrepresented,
  so that distortions in model behavior are ruled out. Data quality must be ensured
  through balanced variance across all features in order to effectively prevent
  overfitting and bias.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-08
|
||||
- MA-09
|
||||
- MA-10
|
||||
- MA-12
|
||||
- MA-14
|
||||
- MA-27
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-04
|
||||
title_original_de: QB-04 Ausgewogenheit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-04_Balance.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0182
|
||||
- id: AC-AI-DATA-QB-05-umfang
|
||||
canonical_name: Umfang
|
||||
description: The training dataset must contain a quantitatively sufficient number
  of data points to capture statistically significant patterns and minimize the
  risk of overfitting. The size of the data basis must be dimensioned so that it
  supports a robust analysis of the underlying distributions and stabilizes the
  generalization ability of the model. A check must confirm that the sheer quantitative
  size provides the necessary basis for robust model building.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-11
|
||||
- MA-12
|
||||
- MA-15
|
||||
- MA-27
|
||||
external_refs:
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-05
|
||||
title_original_de: QB-05 Umfang
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-05_Size.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0161
|
||||
- id: AC-AI-DATA-QB-06-verzerrung
|
||||
canonical_name: Verzerrung
|
||||
description: Before productive use, the AI system must be examined for systematic
  bias in the training data and in the resulting predictions. Latent unequal treatment
  must be quantitatively recorded and documented to enable a transparent assessment
  of fairness. The examination covers identifying deviations attributable to unbalanced
  data distributions before the model is released for real-world applications.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-01
|
||||
- MA-02
|
||||
- MA-03
|
||||
- MA-04
|
||||
- MA-06
|
||||
- MA-07
|
||||
- MA-08
|
||||
- MA-09
|
||||
- MA-10
|
||||
- MA-11
|
||||
- MA-12
|
||||
- MA-13
|
||||
- MA-14
|
||||
- MA-15
|
||||
- MA-16
|
||||
- MA-17
|
||||
- MA-18
|
||||
- MA-20
|
||||
- MA-23
|
||||
- MA-24
|
||||
- MA-27
|
||||
- MA-28
|
||||
- QB-15
|
||||
- QM-11
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-06
|
||||
title_original_de: QB-06 Verzerrung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-06_Bias-Detektion.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-07-gesamtheit
|
||||
canonical_name: Gesamtheit
|
||||
description: The training dataset must fully contain all attributes and entity instances
  defined for the specific application scenario in order to meet the totality requirement.
  This completeness must be verifiably checked at the level of the entire dataset,
  individual columns, or individual data points. Data quality is always assessed
  in context, taking the respective purposes of use into account.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-12
|
||||
- MA-13
|
||||
- MA-27
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-07
|
||||
title_original_de: QB-07 Gesamtheit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-07_Totality.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-08-konsistenzsicherung
|
||||
canonical_name: Konsistenzsicherung
|
||||
description: The consistency of the AI training data must be ensured across the entire
  lifecycle through standardized data types and formatted attributes. Automated
  checks must identify deviations in data values and in temporal sequences early,
  so that traceable transformation or imputation measures can be initiated. A uniform
  data structure is mandatory to ensure the integrity of the training basis for
  valid model decisions.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-01
|
||||
- MA-02
|
||||
- MA-03
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-08
|
||||
title_original_de: QB-08 Konsistenzsicherung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-08_ConsistencyAssurance.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-09-quellenmanagement
|
||||
canonical_name: Quellenmanagement
|
||||
description: The organization must implement an end-to-end mechanism that fully documents
  the origin and processing path of every unit of training data. Every data point
  must remain linked to its origin and to all subsequent transformation steps to
  ensure the integrity of the AI data basis. In addition, all accesses and modifications
  must be recorded chronologically in an immutable log to create a complete audit
  trail for compliance audits.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-18
|
||||
- MA-19
|
||||
- MA-20
|
||||
- MA-22
|
||||
external_refs:
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
- framework: AI Act
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-09
|
||||
title_original_de: QB-09 Quellenmanagement
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-09_Sourcemanagement.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0167
|
||||
- id: AC-AI-DATA-QB-10-datenpruefung
|
||||
canonical_name: Datenprüfung
|
||||
description: Before the training process is initialized, the input data must be systematically
  validated for completeness, consistency, and integrity. Irregularities such as
  missing values, format inconsistencies, or statistical outliers must be identified
  and cleaned. The system must ensure that no biased or faulty records impair model
  training and that data quality meets the defined quality standards.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-05
|
||||
- MA-20
|
||||
- MA-26
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-10
|
||||
title_original_de: QB-10_Datenprüfung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-10_DataChecks.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0204
|
||||
- id: AC-AI-DATA-QB-11-prozesse
|
||||
canonical_name: Prozesse
|
||||
description: Every step of data preparation and processing for AI training purposes
  must be logged without gaps to ensure full traceability of data provenance and
  all transformations. This documentation must be structured so that it enables
  valid reproducibility of the models and well-founded quality assurance of the
  underlying datasets. Recording all change events verifies the integrity of the
  training data across the entire lifecycle.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-18
|
||||
- MA-21
|
||||
external_refs:
|
||||
- framework: BSI Grundschutz
|
||||
citation: null
|
||||
- framework: ISO/IEC 23894
|
||||
citation: null
|
||||
- framework: ISO/IEC 42001
|
||||
citation: null
|
||||
- framework: AI Act
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-11
|
||||
title_original_de: QB-11 Prozesse
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-11_Processes.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-12-merkmalsentwicklung
|
||||
canonical_name: Merkmalsentwicklung
|
||||
description: The creation and selection of input features for AI models must be designed
  so that they show significant correlations with the target variable and eliminate
  redundant information. The transformed data must be generalizable and offer high
  information density for new, unseen datasets. A validation must demonstrate that
  the derived features support the interpretability of the model and do not cause
  unnecessary complexity.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-01
|
||||
- MA-02
|
||||
- MA-03
|
||||
- MA-06
|
||||
- MA-12
|
||||
- MA-14
|
||||
- MA-17
|
||||
- MA-23
|
||||
- MA-24
|
||||
- MA-27
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-12
|
||||
title_original_de: QB-12 Merkmalsentwicklung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-12_FeatureEngineering.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-13-datenvorbereitung
|
||||
canonical_name: Datenvorbereitung
|
||||
description: Before the training process is initialized, all raw data must be converted
  through defined transformations into a quality-checked structure that the model
  can process. Every data preparation step applied must preserve the integrity of
  the training set and let no unvalidated artifacts enter the learning system. The
  feasibility of these steps must be demonstrated by systematic test procedures
  before model convergence starts.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-02
|
||||
- MA-03
|
||||
- MA-04
|
||||
- MA-13
|
||||
- MA-14
|
||||
- MA-16
|
||||
- MA-17
|
||||
- MA-23
|
||||
- MA-24
|
||||
- MA-25
|
||||
- MA-27
|
||||
- MA-29
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-13
|
||||
title_original_de: QB-13 Datenvorbereitung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-13_DataPreparation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-14-expertanalysis
|
||||
canonical_name: Expertanalysis
|
||||
description: The quality of the AI training data must be validated through an independent,
  manual review by qualified specialist staff. Several reviewers must work independently
  of one another to rule out subjective bias and group-conformity effects in the
  assessment. The results of this expert analysis must be merged in anonymized form
  to ensure an objective assessment of dataset quality.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-06
|
||||
- MA-10
|
||||
- MA-14
|
||||
- MA-15
|
||||
- MA-21
|
||||
- MA-22
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-14
|
||||
title_original_de: QB-14_Expertanalysis
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-14_Expertanalysis.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: AC-AI-DATA-QB-15-bias-mitigation
|
||||
canonical_name: Bias-Mitigation
|
||||
description: The system must implement technical mechanisms to identify and compensate
  for systematic bias in the training data or during the learning process. These
  measures apply regardless of the development stage and may take the form of data
  adjustments before training, regularization methods during learning, or corrections
  of the outputs after training. The fairness criteria must be checked before the
  model is released to ensure that no discriminatory patterns remain in the results.
|
||||
kind: building_block
|
||||
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-30
|
||||
- QM-57
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QB-15
|
||||
title_original_de: QB-15 Bias-Mitigation
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-15_Bias-Mitigation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applicable; share:true in the frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
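Review note: the catalog header sets `plagiarism_limit_4gram: 0.2`, and each control records a `plagiarism_score_at_generation`, but the diff does not show how that score is computed. One plausible reading, offered purely as an assumption, is the share of word 4-grams in the generated description that also occur in the source text. The sketch below implements that reading. The names `word_ngrams`, `overlap_score`, and `within_limit` are hypothetical, and the real pipeline may tokenize or normalize differently.

```python
from typing import Set, Tuple

PLAGIARISM_LIMIT_4GRAM = 0.2  # value taken from the catalog header


def word_ngrams(text: str, n: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(generated: str, source: str, n: int = 4) -> float:
    """Fraction of the generated text's n-grams that also appear in the source."""
    generated_grams = word_ngrams(generated, n)
    if not generated_grams:
        # Texts shorter than n words have no n-grams and cannot overlap.
        return 0.0
    shared = generated_grams & word_ngrams(source, n)
    return len(shared) / len(generated_grams)


def within_limit(generated: str, source: str, limit: float = PLAGIARISM_LIMIT_4GRAM) -> bool:
    """True when the overlap score does not exceed the configured limit."""
    return overlap_score(generated, source) <= limit


score = overlap_score("a b c d e f", "x a b c d y")
```

In this toy input, one of the three generated 4-grams also appears in the source, so the score is one third and the text would exceed a 0.2 limit.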
|
||||
@@ -0,0 +1,280 @@
|
||||
source: Derived from BSI QUAIDAL (Clean-Room)
|
||||
source_url: https://github.com/BSI-Bund/QUAIDAL
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
plagiarism_limit_4gram: 0.2
|
||||
generated_by_model: qwen3.5:35b-a3b
|
||||
controls:
|
||||
- id: MC-AI-DATA-QKB-01-repraesentativitaet
|
||||
canonical_name: Repräsentativität
|
||||
description: The training dataset must exactly reflect the statistical distribution
  of the target population to avoid systematic bias in the model. All relevant feature
  values must be present with sufficient frequency and without over- or underrepresentation.
  The amount of data must be dimensioned so that robust generalization is ensured
  for all subgroups of the overall population. Sample quality must be checked before
  training.
|
||||
kind: criterion
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QB-03
|
||||
- QB-04
|
||||
- QB-05
|
||||
- QB-06
|
||||
- QB-15
|
||||
external_refs:
|
||||
- framework: AI Act
|
||||
citation: Article 10
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-01
|
||||
title_original_de: QKB-01 Repräsentativität
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-01_Representativity.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-02-vollstaendigkeit
|
||||
canonical_name: Vollständigkeit
|
||||
description: The dataset must contain, without gaps, all attributes and feature values
expected for the specific AI model. No entity instances may be missing, and every defined
feature must be populated with a value. A check for missing values or incomplete attribute
sets is mandatory before training in order to avoid bias.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-07
|
||||
- QB-09
|
||||
external_refs:
|
||||
- framework: AI Act
|
||||
citation: Article 10
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
- framework: ISO/IEC 25024
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-02
|
||||
title_original_de: QKB-02 Vollständigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-02_Completeness.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-03-genauigkeit
|
||||
canonical_name: Genauigkeit
|
||||
description: The integrity of AI training data requires that every individual data element
value shows a defined numerical or symbolic agreement with the referenced target value.
Deviations must remain within specified tolerance limits for rounding, formatting, and
measurement resolution. Compliance with this specification must be verified by automated
checks before every training run.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-01
|
||||
- QB-02
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-03
|
||||
title_original_de: QKB-03 Genauigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-03_Accuracy.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-04-konsistenz
|
||||
canonical_name: Konsistenz
|
||||
description: The system must ensure that all input data for AI training is logically coherent
and free of internal contradictions. Uniform encodings for categories and consistent
formatting are mandatory so that the model can generalize without error. Any deviation from
the defined data standards must be identified and blocked by automated checks.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-02
|
||||
- QB-07
|
||||
- QB-08
|
||||
- QB-10
|
||||
- QB-11
|
||||
- QB-12
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-04
|
||||
title_original_de: QKB-04 Konsistenz
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-04_Consistency.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-05-korrektheit
|
||||
canonical_name: Korrektheit
|
||||
description: The AI model must be trained exclusively on datasets whose content is free of
errors and corresponds exactly to the actual circumstances or to defined reference standards.
Every annotated piece of information must correctly reflect the state regarded as true in the
application context. The training data must be validated before the learning process begins
to ensure that no incorrect values impair model performance.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-09
|
||||
- QB-10
|
||||
- QB-12
|
||||
- QB-14
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
- framework: AI Act
|
||||
citation: Article 10
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-05
|
||||
title_original_de: QKB-05 Korrektheit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-05_Correctness.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-06-einheitlichkeit
|
||||
canonical_name: Einheitlichkeit
|
||||
description: The consistency of AI training data must be ensured through strict adherence to
defined syntax rules and data structures. Every data element must be formatted according to
specified standards before processing in order to rule out structural deviations. Formal
uniformity must be checked independently of whether the values are correct in content.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-02
|
||||
- QB-08
|
||||
- QB-10
|
||||
- QB-12
|
||||
- QB-14
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-06
|
||||
title_original_de: QKB-06 Einheitlichkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-06_Uniformity.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-07-gueltigkeit
|
||||
canonical_name: Gültigkeit
|
||||
description: The system must ensure that the data used for AI training captures exactly the
intended target construct and not merely superficial correlations. It must be checked whether
the recorded features meet the theoretical requirements of the object of measurement, so that
inferences rest on a valid basis. A discrepancy between the measured content and the defined
target concept must be classified as an error condition and must be ruled out.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-02
|
||||
- QB-05
|
||||
- QB-09
|
||||
- QB-10
|
||||
- QB-14
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-07
|
||||
title_original_de: QKB-07 Gültigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-07_Validity.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-08-eindeutigkeit
|
||||
canonical_name: Eindeutigkeit
|
||||
description: Every record in the training corpus must have a unique identity so that
redundant instances cannot arise. No duplicate or ambiguous entries may be present, since
these impair model generalization and can lead to overfitting. Validation must demonstrate
that every data unit is uniquely identifiable and remains logically distinguishable from
the others.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-05
|
||||
- QB-10
|
||||
- QB-13
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-08
|
||||
title_original_de: QKB-08 Eindeutigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-08_Uniqueness.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-09-sichere-quellen
|
||||
canonical_name: Sichere Quellen
|
||||
description: End-to-end provenance documentation must be established for AI training data,
making every processing step traceable from collection to final use. All transformations and
origin information must be fully recorded so that data integrity and data quality can be
verified continuously. This metadata must be verifiable so that potential quality defects or
manipulation in the training data can be identified early.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-09
|
||||
- QB-11
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
- framework: BSI AIC4
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-09
|
||||
title_original_de: QKB-09 Sichere Quellen
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-09_SecureSource.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MC-AI-DATA-QKB-10-daten-mit-personenbezug
|
||||
canonical_name: Daten mit Personenbezug
|
||||
description: Before training data is used, the system must run an automated check to
identify personal information. If such data is part of the input data, its complete and
verifiable removal must be ensured before any model training is started. The integrity of
the remaining records must be protected by technical measures against unintended reuse.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QB-09
|
||||
- QB-10
|
||||
- QB-11
|
||||
- QB-14
|
||||
external_refs:
|
||||
- framework: EU GDPR
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: QKB-10
|
||||
title_original_de: QKB-10 Daten mit Personenbezug
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-10_PersonalDataCheck.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
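Two of the criteria in the file above, QKB-02 (completeness) and QKB-08 (uniqueness), call for checks that run before training. The sketch below shows one possible shape for such checks on records held as plain dictionaries. The helper names `missing_value_report` and `duplicate_keys`, the record layout, and the rule that `None` and the empty string count as missing are all assumptions made for illustration, not part of QUAIDAL.

```python
from collections import Counter


def missing_value_report(records, required_fields):
    """Count, per required field, the records where it is absent, None, or ''.

    Hypothetical helper: what counts as "missing" is an assumption and would
    need to be defined per dataset.
    """
    report = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            value = record.get(field)
            if value is None or value == "":
                report[field] += 1
    return report


def duplicate_keys(records, key_fields):
    """Return the key tuples that occur more than once, in first-seen order."""
    counts = Counter(
        tuple(record.get(field) for field in key_fields) for record in records
    )
    return [key for key, count in counts.items() if count > 1]
```

A pipeline could refuse to start a training run while `missing_value_report` contains a non-zero count or `duplicate_keys` returns a non-empty list. Whether to block, impute, or deduplicate is a policy decision that the criteria leave to the measures catalog.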
File diff suppressed because it is too large
@@ -0,0 +1,753 @@
|
||||
source: Derived from BSI QUAIDAL (Clean-Room)
|
||||
source_url: https://github.com/BSI-Bund/QUAIDAL
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
plagiarism_limit_4gram: 0.2
|
||||
generated_by_model: qwen3.5:35b-a3b
|
||||
controls:
|
||||
- id: MIT-AI-DATA-MA-01-datentyp-validierung
|
||||
canonical_name: Datentyp Validierung
|
||||
description: All input data and training datasets must be checked for conformity with the
model's defined schemas and data types before processing. Deviations from the expected
formats must be identified automatically and must be either cleaned or excluded in order to
prevent inference errors. This validation must be implemented as an automated step in the
data pipelines to safeguard the integrity of the AI systems.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-32
|
||||
- QM-34
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-01
|
||||
title_original_de: MA-01 Datentyp Validierung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-01_Datatype%20Validation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-02-format-pruefung
|
||||
canonical_name: Format Prüfung
|
||||
description: Input data for AI training must be validated for structural correctness before
processing; data types such as timestamps or text fields must match the defined schemas
exactly. Enforcing uniform formatting prevents regional variations or inconsistent
representations from causing misinterpretation in the model. Conformity must be checked
automatically to ensure that no non-conforming records enter the learning process.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-32
|
||||
- QM-34
|
||||
- QM-43
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-02
|
||||
title_original_de: MA-02 Format Prüfung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-02_Format%20Check.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-03-bereichspruefung
|
||||
canonical_name: Bereichsprüfung
|
||||
description: Before AI training, the system must automatically validate all input features
in order to identify values outside defined physical or logical limits. In particular,
inconsistent data types, incorrect units of measurement, and statistically implausible
outliers must be detected and isolated. The integrity of the training dataset is ensured only
once all non-conforming entries have been excluded or corrected before the learning process
starts.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-51
|
||||
- QM-52
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-03
|
||||
title_original_de: MA-03 Bereichsprüfung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-03_Range%20Check.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-04-over-undersampling
|
||||
canonical_name: Over-Undersampling
|
||||
description: The dataset for AI training must be checked for a balanced class ratio; rare
categories may be artificially augmented through synthetic generation or duplication.
Alternatively, the data points of the majority class must be reduced according to defined
criteria in order to avoid biasing the model. The method used to achieve this balance must be
documented and reproducible.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-34
|
||||
- QM-38
|
||||
- QM-57
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-04
|
||||
title_original_de: MA-04 Over-Undersampling
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-04_Over-Undersampling.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-05-automatisierte-aufgaben
|
||||
canonical_name: Automatisierte Aufgaben
|
||||
description: Recurring data preprocessing and quality checking processes in the AI lifecycle
must be implemented through automated mechanisms. These tasks must be configured so that
consistent result quality is ensured across all runs. It must be verified that the automation
tools used reliably apply specific validation rules for training data.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-02
|
||||
- MA-03
|
||||
- QM-10
|
||||
- QM-34
|
||||
- QM-64
|
||||
external_refs:
|
||||
- framework: AI Act
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-05
|
||||
title_original_de: MA-05 Automatisierte Aufgaben
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-05_Automated%20Tasks.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-06-experten-auswertung
|
||||
canonical_name: Experten Auswertung
|
||||
description: Validating AI training data requires a manual review by qualified domain
experts. These experts must systematically evaluate the validity, relevance, and correctness
of the datasets on the basis of domain-specific knowledge. The outcome of this review serves
to identify methodological errors or quality defects early and to derive concrete
data-cleaning measures.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-16
|
||||
- QM-30
|
||||
- QM-43
|
||||
- QM-45
|
||||
- QM-59
|
||||
- QM-70
|
||||
external_refs:
|
||||
- framework: ISO/IEC 25012
|
||||
citation: null
|
||||
- framework: ISO/IEC 25024
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-06
|
||||
title_original_de: MA-06 Experten Auswertung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-06_Expert%20Evaluation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0204
|
||||
- id: MIT-AI-DATA-MA-07-massenbeteiligung
|
||||
canonical_name: Massenbeteiligung
|
||||
description: The system must implement mechanisms that assure the quality of training data
through decentralized validation by a heterogeneous group of external reviewers. The results
of this collective review must be compared against internal quality standards in order to
identify systematic errors in the annotated datasets. The integrity of the AI models is
ensured only if this scalable review procedure is applied routinely to critical data volumes.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-06
|
||||
- QM-03
|
||||
- QM-16
|
||||
- QM-43
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-07
|
||||
title_original_de: MA-07 Massenbeteiligung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-07_Crowdsourcing.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-08-verteilungsanalyse
|
||||
canonical_name: Verteilungsanalyse
|
||||
description: The distribution of the training data across all relevant classes and feature
ranges must be checked systematically for statistical bias and anomalies. This analysis must
demonstrate that the model was trained on a representative and balanced data basis, so that
the predictions generalize. The results of the distribution check must be documented before
training begins, and corrective measures must be initiated if significant deviations are
found.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-06
|
||||
- QM-10
|
||||
- QM-11
|
||||
- QM-51
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-08
|
||||
title_original_de: MA-08 Verteilungsanalyse
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-08_DistributionAnalysis.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0339
|
||||
- id: MIT-AI-DATA-MA-09-vergleichgrundgesamtheit
|
||||
canonical_name: VergleichGrundgesamtheit
|
||||
description: The system must provide a representative reference sample from the target
distribution in order to verify the validity of AI training data. This reference data must
serve as the gold standard for quantifying deviations between the training set and the actual
population. Agreement must be checked by an automated comparison against the predefined
distribution parameters.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-09
|
||||
- QM-51
|
||||
- QM-52
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-09
|
||||
title_original_de: MA-09 VergleichGrundgesamtheit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-09_CompareGroundtruth.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-10-gewichtung-der-daten
|
||||
canonical_name: Gewichtung der Daten
|
||||
description: For AI training datasets, manual weighting of the individual features is
mandatory in order to minimize systematic bias. This measure ensures a balanced
representation of the data and improves the model's ability to generalize to specific use
cases. The weighting factors must be assigned before training and must be documented so that
data quality remains traceable.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-10
|
||||
- QM-18
|
||||
- QM-28
|
||||
- QM-29
|
||||
- QM-37
|
||||
- QM-38
|
||||
- QM-39
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-10
|
||||
title_original_de: MA-10 Gewichtung der Daten
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-10_ManualWeights.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-11-stichprobengroesse
|
||||
canonical_name: Stichprobengröße
|
||||
description: The amount of data used for training must be sized so that statistically
significant results are ensured at a defined confidence level and with acceptable error
variance. The dataset size must be adjusted iteratively, systematically taking into account
both the total size of the underlying population and the specific type of data augmentation.
Data quality must be validated in order to rule out bias introduced by different scaling
methods.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-08
|
||||
- QM-09
|
||||
- QM-39
|
||||
- QM-41
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-11
|
||||
title_original_de: MA-11 Stichprobengröße
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-11_Trainingsdataset%20Size.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-12-abdeckung-relevanter-merkmale
|
||||
canonical_name: Abdeckung relevanter Merkmale
|
||||
description: The training dataset must contain all input variables essential to the specific
problem so that feature coverage is complete. No critical influencing factors may be missing,
since the model cannot otherwise learn reliable correlations. The completeness of the feature
space must be verified by a formal check before the training process begins.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- MA-06
|
||||
- MA-14
|
||||
- QM-10
|
||||
- QM-11
|
||||
- QM-13
|
||||
- QM-25
|
||||
- QM-26
|
||||
- QM-27
|
||||
- QM-28
|
||||
- QM-29
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-12
|
||||
title_original_de: MA-12 Abdeckung relevanter Merkmale
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-12_RelevantFeatureCoverage.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-13-vollstaendige-information-in-datensaetze
|
||||
canonical_name: Vollständige Information in Datensätzen
|
||||
description: When validating AI training data, it must be ensured that all attributes
required for the analysis are fully present and that no unintended gaps exist. If data errors
are found, their cause must be determined so that the appropriate imputation method can be
selected on the basis of the specific error pattern. An insufficient data basis must not be
used for modeling until the integrity of the relevant information has been restored by
suitable measures.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-12
|
||||
- QM-40
|
||||
- QM-53
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-13
|
||||
title_original_de: MA-13 Vollständige Information in Datensätzen
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-13_CompleteInformation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG applies; share:true in frontmatter; clean-room derivation.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-14-eda-explorative-daten-analyse
|
||||
canonical_name: EDA-Explorative Daten Analyse
|
||||
description: Before model training begins, an exploratory data analysis must be carried out
to identify data distributions, correlations, outliers, and structural anomalies without
predefined hypotheses. The findings must be documented systematically in order to validate
the quality of the training data and to support well-founded decisions on any necessary
cleaning or augmentation steps. On the basis of this analysis, the dataset must be adjusted
so that it provides the representativeness and integrity required for the target function.
kind: measure
regulation_anchor: EU AI Act Art. 10 (data quality for high-risk AI)
|
||||
related_quaidal_ids:
|
||||
- QM-10
|
||||
- QM-12
|
||||
- QM-24
|
||||
- QM-25
|
||||
- QM-26
|
||||
- QM-27
|
||||
- QM-28
|
||||
- QM-29
|
||||
- QM-36
|
||||
- QM-42
|
||||
- QM-54
|
||||
- QM-57
|
||||
- QM-61
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-14
|
||||
title_original_de: MA-14 EDA-Explorative Daten Analyse
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-14_EDA-ExplorativeDataAnalysis.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-15-empirische-evidenz
|
||||
canonical_name: Empirische Evidenz
|
||||
description: Es ist sicherzustellen, dass die Wirksamkeit von Schutzmaßnahmen gegen
|
||||
KI-gestützte Angriffe durch den systematischen Vergleich mit historischen Einsatzszenarien
|
||||
empirisch validiert wird. Dabei sind Leistungsdaten aus vergleichbaren Anwendungsfällen
|
||||
heranzuziehen, um die Angemessenheit der eingesetzten Trainingsdatensätze und
|
||||
Methoden für den spezifischen Kontext nachzuweisen. Die Analyse muss belegen,
|
||||
dass die gewählten Maßnahmen die identifizierten Risiken in der Praxis effektiv
|
||||
reduzieren und die Datenqualität den aktuellen Bedrohungsmodellen entspricht.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-16
|
||||
- QM-30
|
||||
- QM-61
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-15
|
||||
title_original_de: MA-15 Empirische Evidenz
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-15_EmpiricEvidence.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-16-daten-imputation
|
||||
canonical_name: Daten Imputation
|
||||
description: Für KI-Trainingsdatensätze ist eine systematische Analyse der Ursachen
|
||||
für fehlende Werte zwingend erforderlich, bevor eine Rekonstruktion erfolgt. Das
|
||||
gewählte Verfahren zur Datenergänzung muss sich strikt an den identifizierten
|
||||
Entstehungsgründen orientieren, um die statistische Integrität des Modells zu
|
||||
wahren. Eine unkritische Imputation ohne Ursachenanalyse ist unzulässig, da sie
|
||||
das Lernverhalten des Algorithmus verfälschen kann.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- MA-13
|
||||
- QM-10
|
||||
- QM-22
|
||||
- QM-44
|
||||
- QM-53
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-16
|
||||
title_original_de: MA-16 Daten Imputation
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-16_DataImputation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-17-metadatenverwaltung
|
||||
canonical_name: Metadatenverwaltung
|
||||
description: Für den KI-Trainingsprozess ist eine vollständige Dokumentation der
|
||||
Datenherkunft, der Qualitätsmetriken sowie der rechtlichen Klassifizierung jeder
|
||||
einzelnen Trainingsinstanz sicherzustellen. Diese strukturellen Begleitinformationen
|
||||
müssen maschinenlesbar vorliegen, um eine automatisierte Validierung der Datenintegrität
|
||||
und eine nachvollziehbare Auditierung des Datensatzes zu ermöglichen. Die Erfassung
|
||||
dieser Attribute ist zwingend erforderlich, um die Eignung der Daten für den spezifischen
|
||||
Trainingszweck zu gewährleisten und regulatorische Vorgaben einzuhalten.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-59
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-17
|
||||
title_original_de: MA-17 Metadatenverwaltung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-17_MetadataManagement.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-18-provenienztracking
|
||||
canonical_name: ProvenienzTracking
|
||||
description: Die Herkunft und der Verarbeitungsweg von KI-Trainingsdaten sind lückenlos
|
||||
zu dokumentieren, um deren Integrität und Nachvollziehbarkeit sicherzustellen.
|
||||
Für jeden Datensatz ist eine eindeutige Identifikation des Ursprungs sowie aller
|
||||
Transformationsschritte im Lebenszyklus zu führen. Diese Metadaten müssen so strukturiert
|
||||
sein, dass eine Rückverfolgung zur ursprünglichen Quelle jederzeit möglich ist,
|
||||
ohne dass Datenverluste oder Manipulationen unentdeckt bleiben.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-59
|
||||
- QM-60
|
||||
- QM-61
|
||||
- QM-65
|
||||
- QM-67
|
||||
- QM-70
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-18
|
||||
title_original_de: MA-18 ProvenienzTracking
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-18_ProvenienzTracking.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-19-audit-trails
|
||||
canonical_name: Audit Trails
|
||||
description: Für die Nachvollziehbarkeit von KI-Trainingsprozessen ist ein lückenloses
|
||||
Protokollierungssystem zu implementieren, das alle Datenmanipulationen und Modellupdates
|
||||
zeitgestempelt erfasst. Jeder Zugriff auf Trainingsdatensätze sowie jede Änderung
|
||||
der Modellparameter muss mit eindeutigen Benutzeridentitäten verknüpft werden.
|
||||
Die gespeicherten Logs müssen so strukturiert sein, dass sie eine vollständige
|
||||
Rekonstruktion des Datenflusses und eine Rückführung auf frühere Datenqualitätszustände
|
||||
ermöglichen.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- MA-22
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-19
|
||||
title_original_de: MA-19 Audit Trails
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-19_AuditTrails.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-20-prozess-dokumentation
|
||||
canonical_name: Prozess Dokumentation
|
||||
description: Für die Sicherstellung der Datenqualität im KI-Trainingsprozess ist
|
||||
eine vollständige Dokumentation aller Phasen der Datenerstellung und -aufbereitung
|
||||
zwingend erforderlich. Diese Spezifikation muss verbindlich festlegen, welche
|
||||
Aktivitäten auszuführen sind, wer hierfür verantwortlich zeichnet, welche Ressourcen
|
||||
notwendig sind und welche qualitativen Ergebnisse zu erzielen sind. Insbesondere
|
||||
ist die Nachverfolgbarkeit der Datenherkunft innerhalb des Dokumentationsprozesses
|
||||
lückenlos zu gewährleisten, um die Integrität der Trainingsdaten zu validieren.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-15
|
||||
- QM-31
|
||||
- QM-62
|
||||
- QM-65
|
||||
external_refs:
|
||||
- framework: ISO/IEC 42001
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-20
|
||||
title_original_de: MA-20 Prozess Dokumentation
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-20_ProcessDocumentation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-21-compliance
|
||||
canonical_name: Compliance
|
||||
description: Der Einsatz von KI-Modellen erfordert eine zwingende Prüfung der Trainingsdatensätze
|
||||
auf rechtliche Konformität und ethische Integrität, bevor diese zur Modellgenerierung
|
||||
verwendet werden. Es ist sicherzustellen, dass alle verarbeiteten Informationen
|
||||
die Vorgaben der DSGVO sowie branchenspezifische Regularien vollständig erfüllen
|
||||
und keine unrechtmäßig beschafften oder personenbezogenen Daten ohne explizite
|
||||
Einwilligung enthalten. Die Validierung dieser Datenqualität muss vor jedem Trainingslauf
|
||||
durch einen automatisierten oder manuellen Compliance-Check nachgewiesen werden.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-12
|
||||
- QM-15
|
||||
external_refs:
|
||||
- framework: EU GDPR
|
||||
citation: null
|
||||
- framework: AI Act
|
||||
citation: null
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-21
|
||||
title_original_de: MA-21 Compliance
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-21_Compliance.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-22-vertrauenswuerdigkeit
|
||||
canonical_name: Vertrauenswürdigkeit
|
||||
description: Die Integrität und Zuverlässigkeit der für das KI-Training verwendeten
|
||||
Datensätze ist im jeweiligen Anwendungskontext nachweislich zu verifizieren. Es
|
||||
ist sicherzustellen, dass potenzielle Manipulationen oder unbeabsichtigte Korruptionen
|
||||
des Datenflusses durch technische Prüfmechanismen ausgeschlossen werden. Bei der
|
||||
Anwendung von Korrekturverfahren zur Datenbereinigung muss die ursprüngliche Glaubwürdigkeit
|
||||
der Informationen gewahrt bleiben und darf nicht durch die Maßnahme beeinträchtigt
|
||||
werden.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-15
|
||||
- QM-43
|
||||
- QM-65
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-22
|
||||
title_original_de: MA-22 Vertrauenswürdigkeit
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-22_Credibility.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-23-merkmalsskalierung
|
||||
canonical_name: Merkmalsskalierung
|
||||
description: Für KI-Trainingsdatensätze ist eine Normalisierung der Merkmalswerte
|
||||
auf einen einheitlichen Wertebereich zwingend erforderlich, um Dominanzeffekte
|
||||
durch unterschiedliche Größenordnungen zu vermeiden. Diese Maßnahme stellt sicher,
|
||||
dass Algorithmen, die auf Distanzberechnungen oder Gradientenverfahren basieren,
|
||||
nicht durch skalenbedingte Verzerrungen beeinträchtigt werden. Die Wirksamkeit
|
||||
der Skalierung ist vor dem Training systematisch zu prüfen, um die Vorhersagegenauigkeit
|
||||
des Modells zu garantieren.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-10
|
||||
- QM-56
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-23
|
||||
title_original_de: MA-23 Merkmalsskalierung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-23_FeatureScaling.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-24-merkmalserstellung
|
||||
canonical_name: Merkmalserstellung
|
||||
description: Es ist sicherzustellen, dass bei der Erstellung neuer Eingangsmerkmale
|
||||
für KI-Modelle ausschließlich validierte Transformationsverfahren angewendet werden,
|
||||
um die Datenqualität zu gewährleisten. Die Generierung neuer Features muss auf
|
||||
nachvollziehbaren Algorithmen basieren, die eine signifikante Verbesserung der
|
||||
Modellleistung gegenüber den Rohdaten nachweisen. Jede angewandte Methode zur
|
||||
Datenanreicherung oder -bereinigung ist vor dem Training auf ihre Eignung zur
|
||||
Mustererkennung und Vorhersagegenauigkeit zu prüfen.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-11
|
||||
- QM-25
|
||||
- QM-26
|
||||
- QM-27
|
||||
- QM-28
|
||||
- QM-51
|
||||
- QM-71
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-24
|
||||
title_original_de: MA-24 Merkmalserstellung
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-24_FeatureCreation.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-25-differential-privacy
|
||||
canonical_name: Differential Privacy
|
||||
description: Das System muss bei der Verarbeitung von KI-Trainingsdaten differenzielle
|
||||
Privatsphäre implementieren, indem statistisch signifikante, zufällige Störgrößen
|
||||
zu den Ergebnissen hinzugefügt werden. Es ist sicherzustellen, dass die An- oder
|
||||
Abwesenheit einzelner Datensätze im Trainingsset das Ausgabeergebnis nur marginal
|
||||
beeinflusst. Durch diese Maßnahme ist zu prüfen, ob keine Rückschlüsse auf spezifische
|
||||
Personen aus den generierten Analysen gezogen werden können, während die allgemeine
|
||||
Datenqualität für das Modelltraining erhalten bleibt.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-58
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-25
|
||||
title_original_de: MA-25 Differential Privacy
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-25_Differential%20Privacy.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0625
|
||||
- id: MIT-AI-DATA-MA-26-federated-learning
|
||||
canonical_name: Federated Learning
|
||||
description: Für KI-Systeme, die auf verteilten Datenquellen basieren, ist ein Federated-Learning-Ansatz
|
||||
zwingend vorzusehen, um die Rohdaten dezentral zu belassen. Die lokalen Modelle
|
||||
müssen ausschließlich aggregierte Parameter an eine zentrale Instanz übermitteln,
|
||||
während die ursprünglichen Trainingsdaten niemals die lokale Umgebung verlassen.
|
||||
Eine Prüfung ist sicherzustellen, dass durch diese Architektur keine sensiblen
|
||||
Informationen während des Lernprozesses zentralisiert oder übertragen werden.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-63
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-26
|
||||
title_original_de: MA-26 Federated Learning
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-26_Federated%20Learning%20Approach.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-27-statistische-grundlagenthemen
|
||||
canonical_name: Statistische Grundlagenthemen
|
||||
description: Für die Sicherstellung der Datenqualität im KI-Lebenszyklus sind statistische
|
||||
Basisverfahren systematisch zu implementieren und kontinuierlich zu validieren.
|
||||
Es ist sicherzustellen, dass alle relevanten Metriken zur Verteilungsanalyse und
|
||||
Datenintegrität konsistent in die Berechnungspipelines integriert werden. Diese
|
||||
fundamentalen Analysen müssen unabhängig von spezifischen Bausteinen als übergeordnete
|
||||
Prüfkriterien für die Modellgüte dienen.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-01
|
||||
- QM-02
|
||||
- QM-03
|
||||
- QM-04
|
||||
- QM-06
|
||||
- QM-07
|
||||
- QM-09
|
||||
- QM-23
|
||||
- QM-51
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-27
|
||||
title_original_de: MA-27 Statistische Grundlagenthemen
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-27_StatisticalBasis.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0213
|
||||
- id: MIT-AI-DATA-MA-28-diversitaetsindizes
|
||||
canonical_name: Diversitätsindizes
|
||||
description: Das System muss quantitative Metriken zur Erfassung der Heterogenität
|
||||
von KI-Trainingsdaten implementieren, um die Verteilung verschiedener Kategorien
|
||||
zu messen. Es ist sicherzustellen, dass diese Kennzahlen sowohl die Anzahl vorhandener
|
||||
Klassen als auch deren Gleichverteilung abbilden. Die Validierung der Datenqualität
|
||||
erfolgt durch die Berechnung von Diversitätsindizes, die statistische Unsicherheit
|
||||
oder Kollisionswahrscheinlichkeiten quantifizieren.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-68
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-28
|
||||
title_original_de: MA-28 Diversitätsindizes
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-28_Diversity-Indices.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-29-data-splitting
|
||||
canonical_name: Data-Splitting
|
||||
description: Die Aufteilung von KI-Trainingsdaten in disjunkte Teilmengen ist zwingend
|
||||
erforderlich, um eine unvoreingenommene Validierung der Modellgüte zu gewährleisten.
|
||||
Dabei müssen mindestens drei voneinander getrennte Bereiche für das Training,
|
||||
die Hyperparameter-Optimierung sowie die abschließende Leistungsbewertung definiert
|
||||
werden. Eine zufällige oder stratifizierte Trennung ist sicherzustellen, um Datenlecks
|
||||
zwischen den Phasen auszuschließen und die Generalisierungsfähigkeit des Systems
|
||||
nachweisbar zu prüfen.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-69
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-29
|
||||
title_original_de: MA-29 Data-Splitting
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-29_Data%20Splitting.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
- id: MIT-AI-DATA-MA-30-fairness
|
||||
canonical_name: Fairness
|
||||
description: Das System muss sicherstellen, dass KI-Trainingsdaten keine systematischen
|
||||
Verzerrungen bezüglich sensibler demografischer Merkmale aufweisen, um diskriminierende
|
||||
Vorhersagen zu vermeiden. Bei unzureichender Repräsentation von Teilgruppen sind
|
||||
präventive Aufbereitungsverfahren oder algorithmische Transformationsmethoden
|
||||
zur Bias-Korrektur zwingend anzuwenden. Die Wirksamkeit dieser Maßnahmen ist vor
|
||||
der Modellbereitstellung durch quantitative Prüfverfahren auf Gleichbehandlungsgrundsätze
|
||||
zu validieren.
|
||||
kind: measure
|
||||
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
|
||||
related_quaidal_ids:
|
||||
- QM-57
|
||||
external_refs: []
|
||||
source:
|
||||
framework: BSI QUAIDAL
|
||||
section: MA-30
|
||||
title_original_de: MA-30 Fairness
|
||||
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-30_Fairness.md
|
||||
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
|
||||
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
|
||||
plagiarism_score_at_generation: 0.0
|
||||
File diff suppressed because it is too large
@@ -165,21 +165,29 @@ def classify_source_regulation(source_regulation: str) -> str:
|
||||
    """
    Classifies a source_regulation as law, guideline or framework.

    Uses exact matching against the map. For unknown sources
    the type is guessed from keywords; the fallback is 'framework'
    (the most conservative result).
    Delegates to DB-backed RegulationRegistry (with 5min cache).
    Falls back to SOURCE_REGULATION_CLASSIFICATION dict + heuristic
    if DB is unavailable.
    """
    if not source_regulation:
        return SOURCE_TYPE_FRAMEWORK

    # Exact match
    # Try DB-backed registry first
    try:
        from services.regulation_registry import classify_source_regulation as _db_classify
        result = _db_classify(source_regulation)
        if result:
            return result
    except Exception:
        pass

    # Fallback: local dict
    if source_regulation in SOURCE_REGULATION_CLASSIFICATION:
        return SOURCE_REGULATION_CLASSIFICATION[source_regulation]

    # Heuristic for unknown sources
    lower = source_regulation.lower()

    # Detect laws
    law_indicators = [
        "verordnung", "richtlinie", "gesetz", "directive", "regulation",
        "(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
@@ -187,19 +195,16 @@ def classify_source_regulation(source_regulation: str) -> str:
|
||||
    if any(ind in lower for ind in law_indicators):
        return SOURCE_TYPE_LAW

    # Detect guidelines
    guideline_indicators = [
        "edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
    ]
    if any(ind in lower for ind in guideline_indicators):
        return SOURCE_TYPE_GUIDELINE

    # Detect frameworks
    framework_indicators = [
        "enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
    ]
    if any(ind in lower for ind in framework_indicators):
        return SOURCE_TYPE_FRAMEWORK

    # Conservative: unknown = framework (least binding)
    return SOURCE_TYPE_FRAMEWORK

@@ -0,0 +1,83 @@
|
||||
# License Rules of the Control Pipeline

> **As of:** 2026-05-21. Mapping finalized after DB inspection and IACE audit.
>
> The pipeline assigns every regulation (and therefore every chunk extracted
> from it and every atomic_control) to one of **three license rules**. The rule
> determines whether the full text may be retained and which attribution is
> mandatory in the output renderer.

## The three rules

| Rule | Meaning | Store full text? | Attribution required? | Examples |
|------|---------|------------------|-----------------------|----------|
| **1** | Verbatim: sovereign law / public domain | ✓ | no (recommended for audit) | EU law (EUR-Lex), German federal law, statutory regulations (DGUV UVV), TRBS, TRGS, ASR, US Federal Code (OSHA), NIST SP, EU guidance documents |
| **2** | Verbatim with attribution: free licenses | ✓ | **yes** | OWASP (CC-BY-SA-4.0), OECD AI Principles (OECD_PUBLIC), ENISA documents (CC-BY-4.0), Apache-2.0 works |
| **3** | Cite only: proprietary standards | ✗ | not applicable (no full text) | DIN, EN, ISO, ANSI, UL, IEC, IEEE, DGUV Regeln/Informationen/Grundsätze, Bitkom guides, BSI Bausteine (copyrighted) |

**Important clarification:** Rule 3 means "cite only the identifier/section", **not** "rephrase". The original label "neu formulieren" ("reformulate") was misleading. Correct behavior: for Rule 3 sources the pipeline must not store the full text; it keeps only the source reference (regulation_id + article/paragraph), and the output renderer shows that reference in the frontend/PDF.

## Mapping `license_type` → `license_rule`

| license_type | license_rule | Explanation |
|---|---|---|
| `EU_LAW`, `EU_PUBLIC` | 1 | EU regulations, directives, OJ publications, EU guidance documents |
| `DE_LAW`, `DE_PUBLIC` | 1 | German federal laws, TRBS, TRGS, ASR, DGUV-UVV (statutory law) |
| `AT_LAW`, `CH_LAW`, `FR_LAW`, `IT_LAW`, `ES_LAW`, `NL_LAW`, `HU_LAW` | 1 | National law of other European countries (EU member states, plus Switzerland, which is not an EU member) |
| `US_GOV_PUBLIC`, `NIST_PUBLIC_DOMAIN`, `OSHA_PUBLIC` | 1 | US Federal Code (public domain under 17 U.S.C. §105) |
| `CC-BY-4.0`, `CC-BY-SA-4.0`, `CC-BY-3.0`, `CC-BY-SA-3.0` | 2 | Creative Commons licenses with an attribution requirement |
| `Apache-2.0`, `MIT` | 2 | Permissive OSS licenses with a notice requirement |
| `OECD_PUBLIC`, `ENISA_CC_BY_4.0` | 2 | Publications by public bodies with an attribution condition |
| `DIN_COPYRIGHT`, `ISO_COPYRIGHT`, `ANSI_COPYRIGHT`, `UL_COPYRIGHT`, `IEC_COPYRIGHT` | 3 | Standards organizations: identifier citation only |
| `DGUV_COPYRIGHT` | 3 | DGUV Regeln/Informationen/Grundsätze (not UVV) |
| `BITKOM_COPYRIGHT`, `BSI_COPYRIGHT`, `VDMA_COPYRIGHT` | 3 | Association and agency publications with their own copyright |
| `OWN_WORK` | 3 | BreakPilot's own texts (templates, own patterns): no external license risk, but no public-domain status either |

**Special case DGUV:** The class splits by publication type:
- DGUV **Vorschriften / UVV** → `DE_LAW` → Rule 1
- DGUV **Regeln, Informationen, Grundsätze** → `DGUV_COPYRIGHT` → Rule 3

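The mapping above can be sketched as a plain lookup table. This is a minimal illustration and not the code in `services/regulation_registry.py`: the names `LICENSE_RULE_BY_TYPE` and `derive_license_rule` are hypothetical, and raising on an unknown `license_type` is an assumption that mirrors the audit step of adding a missing type to the table first.

```python
# Hypothetical sketch: deterministic license_type -> license_rule lookup.
# Keys and rule numbers are copied from the mapping table; the names
# LICENSE_RULE_BY_TYPE and derive_license_rule are illustrative assumptions.

_RULE_1 = (
    "EU_LAW", "EU_PUBLIC", "DE_LAW", "DE_PUBLIC",
    "AT_LAW", "CH_LAW", "FR_LAW", "IT_LAW", "ES_LAW", "NL_LAW", "HU_LAW",
    "US_GOV_PUBLIC", "NIST_PUBLIC_DOMAIN", "OSHA_PUBLIC",
)
_RULE_2 = (
    "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-3.0", "CC-BY-SA-3.0",
    "Apache-2.0", "MIT", "OECD_PUBLIC", "ENISA_CC_BY_4.0",
)
_RULE_3 = (
    "DIN_COPYRIGHT", "ISO_COPYRIGHT", "ANSI_COPYRIGHT", "UL_COPYRIGHT",
    "IEC_COPYRIGHT", "DGUV_COPYRIGHT", "BITKOM_COPYRIGHT", "BSI_COPYRIGHT",
    "VDMA_COPYRIGHT", "OWN_WORK",
)

LICENSE_RULE_BY_TYPE = {
    **{t: 1 for t in _RULE_1},
    **{t: 2 for t in _RULE_2},
    **{t: 3 for t in _RULE_3},
}


def derive_license_rule(license_type: str) -> int:
    """Return 1, 2 or 3 for a known license_type; raise for unknown ones."""
    try:
        return LICENSE_RULE_BY_TYPE[license_type]
    except KeyError:
        raise ValueError(f"unclassified license_type: {license_type!r}") from None
```

Because the rule is a pure function of `license_type`, the DGUV special case needs no extra branch: UVV documents are registered as `DE_LAW`, everything else from DGUV as `DGUV_COPYRIGHT`.
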
## Effect per pipeline stage

| Stage | Behavior under Rule 1 | Rule 2 | Rule 3 |
|---|---|---|---|
| Stage 6 ControlCompose (`pipeline_adapter.py:147`) | stores `chunk_text` | stores `chunk_text` | stores `chunk_text = None` |
| Atomic control creation | full text as source | full text + attribution note | only regulation_id + article |
| Output renderer (frontend/PDF) | optional source note | **mandatory attribution in footer + inline** | render identifier only |
| Tech file appendix | name the source | source + license URL | identifier list |

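The Stage 6 row of the table can be illustrated with a small helper. This is a sketch under assumptions, not the code at `pipeline_adapter.py:147`: the function name `chunk_text_for_storage` is hypothetical, and only the behavior stated in the table (keep the text for Rules 1 and 2, store `None` for Rule 3) is taken from the document.

```python
from typing import Optional


def chunk_text_for_storage(license_rule: int, chunk_text: str) -> Optional[str]:
    """Return the text to persist: verbatim for rules 1 and 2, None for rule 3.

    Hypothetical helper; the real ControlCompose stage may be structured differently.
    """
    if license_rule in (1, 2):
        return chunk_text
    if license_rule == 3:
        # Cite-only source: the reference is kept elsewhere, the full text is dropped.
        return None
    raise ValueError(f"license_rule must be 1, 2 or 3, got {license_rule!r}")
```

Dropping the text at the storage stage, rather than hiding it in the renderer, means a Rule 3 full text cannot leak through a later stage that forgets to check the rule.
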
## Sources without classification

Currently **232 regulations** are classified in `regulation_registry` (as of 2026-05-21). The following still need to be added (Task #20 covers the DGUV ingest):

| Source | Rule | Rationale |
|---|---|---|
| TRBS family (24 PDFs in the RAG) | 1 | Technische Regeln Betriebssicherheit (technical rules for operational safety), BAuA Bundesarbeitsblatt |
| TRGS family (all full-text chunks) | 1 | Technische Regeln Gefahrstoffe (technical rules for hazardous substances), BAuA |
| ASR family (17 PDFs) | 1 | Arbeitsstättenregeln (workplace rules), BAuA |
| OSHA 29 CFR 1910 Subpart O + Technical Manual | 1 | US federal public domain (17 U.S.C. §105) |
| DGUV Vorschrift 1 + UVV family (once ingested) | 1 | Statutory law of the BG |
| DGUV Regel 100-500 + Information 209-072/074/073 | 3 | DGUV copyright, identifier only |
| DIN identifier table (without full text) | 3 | DIN/Beuth copyright |
| ANSI B11.0 + RIA R15.06 + UL 508A identifiers | 3 | ANSI/UL copyright |
| ISO 12100/13849/13857 identifiers | 3 | ISO copyright |

## Audit obligation

Before every ingest of new sources:
1. Check the license (publikationen.dguv.de, EUR-Lex, etc.)
2. Choose the license_type from the table above; if it is missing, add it here
3. license_rule is derived deterministically from it
4. The attribution text is a mandatory field for Rule 2

Before every output:
- If an atomic_control comes from a Rule 3 source: verify that ONLY the identifier is shown, never the full text
- If it comes from a Rule 2 source: the attribution must be present in the PDF footer and in the frontend tooltip
- If it comes from a Rule 1 source: naming the source is recommended for auditability

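The pre-output checks can be sketched as a small validation function. This is a minimal illustration under assumptions: the `RenderedControl` dataclass, its field names, and `audit_output` are hypothetical and do not come from the repository. Only the three rule-specific checks are taken from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RenderedControl:
    """Hypothetical view of what the renderer is about to show for one control."""
    license_rule: int
    identifier: str             # e.g. regulation_id + article
    body_text: Optional[str]    # full text shown to the user, if any
    attribution: Optional[str]  # attribution string, if any


def audit_output(item: RenderedControl) -> List[str]:
    """Return a list of violations; an empty list means the item passes."""
    problems: List[str] = []
    if item.license_rule == 3 and item.body_text:
        problems.append("rule 3: full text must not be rendered")
    if item.license_rule == 2 and not item.attribution:
        problems.append("rule 2: attribution is missing")
    # Rule 1: naming the source is only recommended, so nothing is enforced here.
    return problems
```
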
## References

- Schema: `migrations/002_regulation_registry.sql`
- Code: `services/regulation_registry.py`, `services/pipeline_adapter.py`
- Seed script: `scripts/f1_migrate_regulation_registry.py`
- Tests: `tests/test_regulation_registry.py` (assert: rule IN (1,2,3))

@@ -0,0 +1,101 @@
|
||||
# Incremental BatchDedup for subsequently added documents

Introduced on 2026-05-18. Pattern for all future single-document ingestions.

## Problem

The default BatchDedup runner ran against ALL `pass0b` atomics without a scoping filter
(`WHERE decomposition_method = 'pass0b' AND release_state NOT IN ('deprecated','duplicate')`).
For us that is ~172k controls. At a pace of ~5k/h this means a runtime of 25-40h. Every
added document triggered the same full run, even if the new document produced
only 1-2k atomics.

Additional risk: Phase 1 writes master_controls only at the end. A
container crash in the middle of the run (e.g. caused by a Qdrant timeout) discards 100%
of the in-memory progress.

## Solution: the `since` parameter

`POST /v1/canonical/generate/batch-dedup` now accepts:

```json
{
  "dry_run": false,
  "since": "2026-05-18T02:53:00+00:00"
}
```

Effect:
- Phase 1 (intra-group dedup) loads only controls with `created_at >= since`
- Phase 2 (cross-group dedup) likewise filters on `created_at >= since`
- The Phase 2 checkpoint is deleted before the run starts (otherwise a stale
  `last_control_id` would skip newly created atomics whose control_id sorts
  alphabetically before it)

Phase 2 still searches the **full** Qdrant index `atomic_controls_dedup`,
so it finds matches to old master controls and links them correctly.

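The scoping described above amounts to one optional predicate on the existing query. The following is a minimal sketch, not the code in `services/batch_dedup_runner.py`: the function name `build_atomics_query`, the table name `canonical_controls`, and the `:since` named-parameter style are assumptions. Only the two fixed `WHERE` conditions and the `created_at >= since` filter come from the document.

```python
from datetime import datetime
from typing import Optional, Tuple


def build_atomics_query(since: Optional[datetime] = None) -> Tuple[str, dict]:
    """Build the SELECT for pass0b atomics, optionally scoped by created_at.

    Hypothetical helper: table name and parameter style are assumed.
    """
    sql = (
        "SELECT control_id FROM canonical_controls "  # table name is an assumption
        "WHERE decomposition_method = 'pass0b' "
        "AND release_state NOT IN ('deprecated','duplicate')"
    )
    params: dict = {}
    if since is not None:
        # Scoped run: only atomics created at or after the cutoff are loaded.
        sql += " AND created_at >= :since"
        params["since"] = since
    return sql, params
```

Calling it without `since` yields the unscoped query, which matches the stated backwards compatibility; passing the cutoff as a bound parameter rather than formatting it into the string avoids quoting problems with the timestamp.
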
## When to use

| Scenario | Recommendation |
|---|---|
| Single new document ingested + Pass 0a + Pass 0b completed | set `since` to a time before Pass 0b |
| Several small updates since the last full dedup | set `since` to a time after the last full dedup |
| Initial setup or major pipeline update | NO `since`: full run |
| Suspected drift / quality regression | NO `since`: full run |

## Workflow after a single-document ingestion

```bash
# 1. Pass 0a on new controls (extract obligations)
curl -X POST .../v1/canonical/generate/run-pass0a -d '{...}'

# 2. Pass 0b decomposition submit (create atomics)
curl -X POST .../v1/canonical/generate/submit-pass0b -d '{...}'

# 3. Once the Anthropic batch has finished: process-batch
curl -X POST .../v1/canonical/generate/process-batch -d '{
  "batch_id": "msgbatch_...",
  "pass_type": "0b"
}'

# 4. Dedup incrementally (NEW, instead of a 25h full run)
curl -X POST .../v1/canonical/generate/batch-dedup -d '{
  "dry_run": false,
  "since": "<ISO datetime shortly before Pass 0b start>"
}'
```

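For step 4, the `since` value has to lie shortly before the Pass 0b start. A minimal sketch of how the request body could be derived from a recorded start time follows. The helper name `dedup_payload` and the 5-minute safety margin are assumptions, not values from the pipeline.

```python
import json
from datetime import datetime, timedelta, timezone


def dedup_payload(pass0b_start: datetime, margin_minutes: int = 5) -> str:
    """Return the JSON body for batch-dedup with `since` shortly before Pass 0b.

    `margin_minutes` is an assumed safety margin, not a value from the pipeline.
    """
    if pass0b_start.tzinfo is None:
        # A naive timestamp would be ambiguous when compared with created_at.
        raise ValueError("pass0b_start must be timezone-aware")
    since = pass0b_start.astimezone(timezone.utc) - timedelta(minutes=margin_minutes)
    return json.dumps({"dry_run": False, "since": since.isoformat(timespec="seconds")})


body = dedup_payload(datetime(2026, 5, 18, 2, 58, tzinfo=timezone.utc))
print(body)
```

Rejecting naive datetimes avoids an accidental offset of several hours, which would either skip new atomics or pull in far more controls than intended.
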
## Pace observation (CRA run 2026-05-18)

- Total new atomics: 19,423
- Phase 1 multi-groups: 568 (the remaining 18,101 are singletons → directly master)
- Phase 2 cross-group: ~3-4h expected
- Comparison: a full run would have taken 25-40h, so the scoped run is 6-13x faster.

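The 6-13x figure can be reproduced from the numbers in this document, assuming the scoped runtime is dominated by the expected 3-4h of Phase 2 (scoped Phase 1 takes only minutes). The variable names are illustrative.

```python
full_run_hours = (25, 40)   # observed range for a full run
scoped_hours = (3, 4)       # expected Phase 2 range for the scoped run

# Nominal full-run estimate: ~172k controls at ~5k controls per hour.
nominal = 172_000 / 5_000

# Worst case: fastest full run vs. slowest scoped run; best case: the reverse.
lowest = full_run_hours[0] / scoped_hours[1]
highest = full_run_hours[1] / scoped_hours[0]

print(f"nominal full run: {nominal:.1f}h, speedup: {lowest:.2f}x to {highest:.2f}x")
```
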
## Implementation details (for maintenance)

Changed files:
- `services/batch_dedup_runner.py`: `run()` + `_load_merge_groups()` +
  `_run_cross_group_pass()` SQL queries
- `api/control_generator_routes.py`: `BatchDedupRequest.since` field,
  passed through by the handler

Backwards compatible: without `since`, the behavior is equivalent to the old behavior.

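To illustrate the backwards-compatible field, here is a stand-in for the request model built on a standard-library dataclass. The real `BatchDedupRequest` in `api/control_generator_routes.py` may be defined differently; the class name, the `from_json` helper, and the `dry_run` default are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BatchDedupRequestSketch:
    """Stand-in for the request model; the real `BatchDedupRequest` may differ."""

    dry_run: bool = True  # assumed default, not confirmed by the source
    since: Optional[datetime] = None  # None means: full run, old behavior

    @classmethod
    def from_json(cls, data: dict) -> "BatchDedupRequestSketch":
        raw = data.get("since")
        return cls(
            dry_run=bool(data.get("dry_run", True)),
            since=datetime.fromisoformat(raw) if raw else None,
        )


old_style = BatchDedupRequestSketch.from_json({"dry_run": False})
scoped = BatchDedupRequestSketch.from_json(
    {"dry_run": False, "since": "2026-05-18T02:53:00+00:00"}
)
print(old_style, scoped)
```
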
## Known limits

1. **The Phase 2 checkpoint is deleted in a scoped run.** If a full run
   has to be squeezed in during a `since` run (this should not happen),
   it must start from scratch.
2. **Phase 1 commit granularity is untouched.** A crash in the middle of
   Phase 1 without `since` loses just as much as before. However, a scoped Phase 1
   is so short (minutes) that this hardly matters in practice.
3. **Singleton atomics become masters directly, without a cross-check.** If a
   new singleton atomic is semantically identical to an old master,
   only Phase 2 catches it (via Qdrant). This works as long as Phase 2
   is not skipped (dry_run=false is mandatory).

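The checkpoint deletion exists because of how a resume checkpoint interacts with ordering. A toy illustration follows, assuming Phase 2 iterates control IDs in sorted order and resumes strictly after `last_control_id`. The IDs and the function name `pending_controls` are made up.

```python
from typing import Optional


def pending_controls(control_ids: list[str], last_control_id: Optional[str]) -> list[str]:
    """Return the IDs a resumable pass would still process, in sorted order."""
    ordered = sorted(control_ids)
    if last_control_id is None:
        return ordered
    # Resume strictly after the checkpoint; anything sorting before it is skipped.
    return [cid for cid in ordered if cid > last_control_id]


new_atomics = ["AUTH-0042-A01", "SEC-9001-A01"]  # made-up IDs
stale = pending_controls(new_atomics, last_control_id="NET-0500-A03")
fresh = pending_controls(new_atomics, last_control_id=None)
print(stale, fresh)
```

With the stale checkpoint, the new atomic that sorts before it is silently dropped. After the checkpoint is cleared, both are processed.
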
## Memory entry

See `~/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/feedback_incremental_dedup.md`
@@ -0,0 +1,72 @@
-- Migration 002: Regulation Registry (Block F1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/002_regulation_registry.sql

SET search_path TO compliance, public;

-- ========================================
-- regulation_registry
-- ========================================
-- Central registry for all regulations, laws, guidelines, and frameworks
-- referenced by the control pipeline. Replaces hardcoded Python dicts
-- (REGULATION_LICENSE_MAP, SOURCE_REGULATION_CLASSIFICATION).

CREATE TABLE IF NOT EXISTS regulation_registry (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- regulation_id: machine key (e.g. "eu_2016_679", "nist_sp_800_53")
    regulation_id VARCHAR(100) UNIQUE NOT NULL,

    -- Display names
    regulation_name_de TEXT,
    regulation_name_en TEXT,
    regulation_short VARCHAR(50),

    -- License classification (3-rule system)
    license_rule INTEGER NOT NULL DEFAULT 1
        CHECK (license_rule IN (1, 2, 3)),
    license_type VARCHAR(50), -- EU_LAW, DE_LAW, CC-BY-SA-4.0, etc.
    attribution TEXT, -- Required for Rule 2 (CC-BY)

    -- Source classification
    source_type VARCHAR(20) NOT NULL DEFAULT 'law'
        CHECK (source_type IN ('law', 'guideline', 'standard', 'framework', 'restricted')),

    -- Metadata
    jurisdiction VARCHAR(10), -- DE, EU, AT, CH, US, FR, ES, NL, IT, HU, INT
    category VARCHAR(50),
    celex VARCHAR(30), -- EU CELEX number if applicable
    url TEXT,

    -- Lifecycle
    status VARCHAR(20) NOT NULL DEFAULT 'active'
        CHECK (status IN ('active', 'needs_review', 'deprecated')),

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_reg_registry_status
    ON regulation_registry(status);
CREATE INDEX IF NOT EXISTS idx_reg_registry_jurisdiction
    ON regulation_registry(jurisdiction);
CREATE INDEX IF NOT EXISTS idx_reg_registry_source_type
    ON regulation_registry(source_type);
CREATE INDEX IF NOT EXISTS idx_reg_registry_license_rule
    ON regulation_registry(license_rule);

-- Updated-at trigger
CREATE OR REPLACE FUNCTION update_regulation_registry_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_regulation_registry_updated_at ON regulation_registry;
CREATE TRIGGER trg_regulation_registry_updated_at
    BEFORE UPDATE ON regulation_registry
    FOR EACH ROW
    EXECUTE FUNCTION update_regulation_registry_updated_at();
@@ -0,0 +1,58 @@
-- Migration 003: Action & Object Ontology (Block F2+F3)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/003_action_object_ontology.sql

SET search_path TO compliance, public;

-- ========================================
-- action_types: 34 canonical action verbs
-- ========================================

CREATE TABLE IF NOT EXISTS action_types (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    canonical_name VARCHAR(50) UNIQUE NOT NULL,
    phase VARCHAR(30) NOT NULL,
    description_de TEXT,
    description_en TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_action_types_phase ON action_types(phase);

-- ========================================
-- action_synonyms: German aliases + negative patterns
-- ========================================

CREATE TABLE IF NOT EXISTS action_synonyms (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    canonical_action VARCHAR(50) NOT NULL REFERENCES action_types(canonical_name),
    synonym VARCHAR(100) NOT NULL,
    language VARCHAR(5) NOT NULL DEFAULT 'de',
    source VARCHAR(20) NOT NULL DEFAULT 'manual'
        CHECK (source IN ('manual', 'llm', 'migration')),
    pattern_type VARCHAR(20) NOT NULL DEFAULT 'alias'
        CHECK (pattern_type IN ('alias', 'negative_pattern')),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(synonym, language, pattern_type)
);

CREATE INDEX IF NOT EXISTS idx_action_synonyms_canonical ON action_synonyms(canonical_action);
CREATE INDEX IF NOT EXISTS idx_action_synonyms_pattern_type ON action_synonyms(pattern_type);

-- ========================================
-- object_synonyms: normalized object tokens
-- ========================================

CREATE TABLE IF NOT EXISTS object_synonyms (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    canonical_token VARCHAR(100) NOT NULL,
    synonym VARCHAR(200) NOT NULL,
    language VARCHAR(5) NOT NULL DEFAULT 'de',
    source VARCHAR(20) NOT NULL DEFAULT 'manual'
        CHECK (source IN ('manual', 'llm', 'migration')),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(synonym, language)
);

CREATE INDEX IF NOT EXISTS idx_object_synonyms_canonical ON object_synonyms(canonical_token);
@@ -0,0 +1,18 @@
-- Migration 004: Object Groups (G-pre1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/004_object_groups.sql

SET search_path TO compliance, public;

CREATE TABLE IF NOT EXISTS object_groups (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    group_id INTEGER NOT NULL,
    canonical_name VARCHAR(200) NOT NULL,
    member_count INTEGER DEFAULT 0,
    members JSONB DEFAULT '[]',
    top_controls_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_object_groups_group_id ON object_groups(group_id);
CREATE INDEX IF NOT EXISTS idx_object_groups_canonical ON object_groups(canonical_name);
@@ -0,0 +1,30 @@
-- Migration 005: Master Controls (G-pre2)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/005_master_controls.sql

SET search_path TO compliance, public;

CREATE TABLE IF NOT EXISTS master_controls (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    master_control_id VARCHAR(50) UNIQUE NOT NULL,
    object_group_id INTEGER NOT NULL,
    canonical_name VARCHAR(200) NOT NULL,
    phases_covered JSONB NOT NULL DEFAULT '[]',
    phase_control_count JSONB NOT NULL DEFAULT '{}',
    total_controls INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_master_controls_group ON master_controls(object_group_id);

CREATE TABLE IF NOT EXISTS master_control_members (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    master_control_uuid UUID NOT NULL REFERENCES master_controls(id) ON DELETE CASCADE,
    control_uuid UUID NOT NULL,
    phase VARCHAR(50) NOT NULL,
    action VARCHAR(50) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_mc_members_master ON master_control_members(master_control_uuid);
CREATE INDEX IF NOT EXISTS idx_mc_members_control ON master_control_members(control_uuid);
@@ -0,0 +1,58 @@
-- Migration 006: Decision Traces (G1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/006_decision_traces.sql

SET search_path TO compliance, public;

CREATE TABLE IF NOT EXISTS decision_traces (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    control_uuid UUID NOT NULL,
    regulation_id VARCHAR(100),
    obligation_id VARCHAR(100),

    -- Decision
    status VARCHAR(30) NOT NULL DEFAULT 'not_assessed'
        CHECK (status IN ('not_assessed', 'compliant', 'partially_compliant',
                          'not_compliant', 'not_applicable', 'under_remediation')),
    decision_reason TEXT,
    decided_by VARCHAR(100),
    decided_at TIMESTAMPTZ,

    -- Fix/Remediation
    fix_strategy TEXT,
    fix_owner VARCHAR(100),
    fix_target_date DATE,
    fix_completed_date DATE,

    -- Evidence
    evidence_ids JSONB DEFAULT '[]',
    confidence NUMERIC(3,2) DEFAULT 0.0,

    -- Multi-tenant
    tenant_id UUID,
    project_id UUID,
    metadata JSONB DEFAULT '{}',

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_dt_control ON decision_traces(control_uuid);
CREATE INDEX IF NOT EXISTS idx_dt_status ON decision_traces(status);
CREATE INDEX IF NOT EXISTS idx_dt_tenant ON decision_traces(tenant_id);
CREATE INDEX IF NOT EXISTS idx_dt_decided_at ON decision_traces(decided_at);

-- Updated-at trigger
CREATE OR REPLACE FUNCTION update_decision_traces_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_decision_traces_updated_at ON decision_traces;
CREATE TRIGGER trg_decision_traces_updated_at
    BEFORE UPDATE ON decision_traces
    FOR EACH ROW
    EXECUTE FUNCTION update_decision_traces_updated_at();
@@ -0,0 +1,38 @@
-- Migration 007: Compliance Commit Ledger (G2)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/007_compliance_commits.sql

SET search_path TO compliance, public;

CREATE TABLE IF NOT EXISTS compliance_commits (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    project_id UUID,

    -- Git Info
    commit_hash VARCHAR(64) NOT NULL,
    commit_message TEXT,
    commit_author VARCHAR(200),
    commit_date TIMESTAMPTZ,
    branch VARCHAR(200),
    repo_url TEXT,

    -- Affected Controls
    affected_control_ids JSONB NOT NULL DEFAULT '[]',
    affected_files JSONB DEFAULT '[]',

    -- Analysis
    risk_level VARCHAR(20) DEFAULT 'low'
        CHECK (risk_level IN ('low', 'medium', 'high', 'critical')),
    analysis_summary TEXT,
    analysis_metadata JSONB DEFAULT '{}',

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_cc_tenant ON compliance_commits(tenant_id);
CREATE INDEX IF NOT EXISTS idx_cc_hash ON compliance_commits(commit_hash);
CREATE INDEX IF NOT EXISTS idx_cc_date ON compliance_commits(commit_date);
CREATE INDEX IF NOT EXISTS idx_cc_risk ON compliance_commits(risk_level);
-- GIN index for JSONB array containment queries (@>)
CREATE INDEX IF NOT EXISTS idx_cc_control_ids ON compliance_commits USING GIN (affected_control_ids);
@@ -0,0 +1,37 @@
-- Migration 008: Decision Events / Full Decision Memory (G3)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/008_decision_events.sql

SET search_path TO compliance, public;

CREATE TABLE IF NOT EXISTS decision_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    decision_trace_id UUID REFERENCES decision_traces(id) ON DELETE SET NULL,
    control_uuid UUID NOT NULL,
    tenant_id UUID,

    -- Event type
    event_type VARCHAR(30) NOT NULL
        CHECK (event_type IN (
            'assessment', 'decision', 'fix_planned', 'fix_started',
            'fix_completed', 'verification', 'failure', 'exception', 'escalation'
        )),

    -- State before/after
    input_state JSONB DEFAULT '{}',
    output_state JSONB DEFAULT '{}',

    -- Details
    summary TEXT,
    actor VARCHAR(200),
    evidence_ids JSONB DEFAULT '[]',
    metadata JSONB DEFAULT '{}',

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_de_control ON decision_events(control_uuid);
CREATE INDEX IF NOT EXISTS idx_de_trace ON decision_events(decision_trace_id);
CREATE INDEX IF NOT EXISTS idx_de_tenant ON decision_events(tenant_id);
CREATE INDEX IF NOT EXISTS idx_de_type ON decision_events(event_type);
CREATE INDEX IF NOT EXISTS idx_de_created ON decision_events(created_at);
@@ -0,0 +1,38 @@
-- Migration 009: Deployment Checks / Pre-Deployment Enforcement (G4)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/009_deployment_checks.sql

SET search_path TO compliance, public;

CREATE TABLE IF NOT EXISTS deployment_checks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,

    -- Deploy Info
    commit_hash VARCHAR(64) NOT NULL,
    branch VARCHAR(200),
    environment VARCHAR(50) DEFAULT 'production',

    -- Result
    verdict VARCHAR(20) NOT NULL DEFAULT 'pending'
        CHECK (verdict IN ('pending', 'approved', 'blocked', 'override')),

    -- Impact
    affected_control_ids JSONB DEFAULT '[]',
    blocking_controls JSONB DEFAULT '[]',
    warning_controls JSONB DEFAULT '[]',
    risk_score NUMERIC(5,2) DEFAULT 0.0,

    -- Override
    override_by VARCHAR(200),
    override_reason TEXT,

    summary TEXT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_dc_tenant ON deployment_checks(tenant_id);
CREATE INDEX IF NOT EXISTS idx_dc_hash ON deployment_checks(commit_hash);
CREATE INDEX IF NOT EXISTS idx_dc_verdict ON deployment_checks(verdict);
CREATE INDEX IF NOT EXISTS idx_dc_created ON deployment_checks(created_at);
@@ -0,0 +1,162 @@
-- Migration 010: Expanded Object Ontology
-- Expands from 31 to ~180 canonical object tokens with clear semantic boundaries.
-- Each token has a description to prevent ambiguous classification.
--
-- IMPORTANT: This migration ADDS new tokens. Existing synonyms are preserved.

SET search_path TO compliance, public;

-- Add description column to object_synonyms if not exists
DO $$ BEGIN
    ALTER TABLE object_synonyms ADD COLUMN IF NOT EXISTS description TEXT;
EXCEPTION WHEN duplicate_column THEN NULL;
END $$;

-- New table: canonical object definitions with clear boundaries
CREATE TABLE IF NOT EXISTS object_ontology (
    canonical_token VARCHAR(100) PRIMARY KEY,
    category VARCHAR(50) NOT NULL, -- security, data_protection, governance, regulatory, technical
    description_de TEXT NOT NULL, -- German description for LLM prompts
    description_en TEXT NOT NULL, -- English description
    NOT_confused_with TEXT, -- Explicit disambiguation
    examples TEXT, -- Example controls that belong here
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- ═══════════════════════════════════════════════════════════════
-- SECURITY & TECHNICAL
-- ═══════════════════════════════════════════════════════════════

-- Authentication & Identity
INSERT INTO object_ontology VALUES
('multi_factor_auth', 'security', 'Multi-Faktor-Authentifizierung (2FA/MFA)', 'Multi-factor authentication', 'NOT password_policy (Passwortregeln) oder session_management (Sitzungen)', 'MFA implementieren, 2FA-Pflicht, Authentifizierungsfaktoren'),
('password_policy', 'security', 'Passwortrichtlinien und -komplexität', 'Password policies and complexity', 'NOT credentials (allg. Zugangsdaten) oder multi_factor_auth (MFA)', 'Passwortlänge, Komplexität, Rotation, Passwort-Historie'),
('credentials', 'security', 'Zugangsdaten-Verwaltung (Tokens, API-Keys, Secrets)', 'Credential management', 'NOT password_policy (Passwortregeln) oder key_management (kryptografisch)', 'API-Key-Rotation, Token-Verwaltung, Secret Storage'),
('session_management', 'security', 'Sitzungsverwaltung (Session Timeout, Token-Lifecycle)', 'Session management', 'NOT multi_factor_auth (Login) oder access_control (Berechtigungen)', 'Session Timeout, Token-Invalidierung, Concurrent Sessions'),
('privileged_access', 'security', 'Verwaltung privilegierter Zugriffe (Admin, Root)', 'Privileged access management', 'NOT access_control (allg. Zugriffskontrolle)', 'Admin-Konten, Root-Zugriff, PAM, Just-in-Time-Access'),
('access_control', 'security', 'Allgemeine Zugriffskontrolle (RBAC, Berechtigungen)', 'Access control (RBAC, permissions)', 'NOT privileged_access (Admin) oder authentication (Login)', 'Rollenbasierte Zugriffskontrolle, Berechtigungsvergabe, Least Privilege')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Encryption & Cryptography
INSERT INTO object_ontology VALUES
('encryption', 'security', 'Verschlüsselung at-rest (Datenverschlüsselung)', 'Encryption at rest', 'NOT transport_encryption (in-transit) oder key_management (Schlüssel)', 'AES-256, Festplattenverschlüsselung, DB-Verschlüsselung'),
('transport_encryption', 'security', 'Transportverschlüsselung (TLS, HTTPS)', 'Transport encryption (TLS)', 'NOT encryption (at-rest)', 'TLS 1.3, HTTPS, mTLS, Zertifikats-Pinning'),
('key_management', 'security', 'Kryptografische Schlüsselverwaltung', 'Cryptographic key management', 'NOT credentials (API-Keys) oder certificate_management (Zertifikate)', 'Key Rotation, HSM, Key Escrow, Schlüsselerzeugung'),
('certificate_management', 'security', 'Zertifikatsverwaltung (PKI, X.509)', 'Certificate management (PKI)', 'NOT key_management (Schlüssel) oder encryption (Verschlüsselung)', 'X.509-Zertifikate, PKI, Zertifikatsrückruf, CA-Verwaltung')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Network Security
INSERT INTO object_ontology VALUES
('network_security', 'security', 'Allgemeine Netzwerksicherheit', 'General network security', 'NOT network_segmentation (Segmentierung) oder firewall (Regeln)', 'Netzwerk-Hardening, Port-Management, DNS-Sicherheit'),
('network_segmentation', 'security', 'Netzwerksegmentierung (VLANs, Zonen)', 'Network segmentation', 'NOT network_security (allg.) oder firewall (Regeln)', 'VLANs, DMZ, Micro-Segmentation, Zero Trust Network'),
('firewall', 'security', 'Firewall-Regeln und -Verwaltung', 'Firewall rules and management', 'NOT network_security (allg.)', 'WAF, Firewall-Regeln, Ingress/Egress, Whitelist'),
('vpn', 'security', 'VPN-Konfiguration und -Verwaltung', 'VPN configuration', NULL, 'IPSec, WireGuard, Site-to-Site VPN'),
('remote_access', 'security', 'Fernzugriff und Remote-Arbeit', 'Remote access', 'NOT vpn (Technologie)', 'Remote Desktop, Bastion Hosts, Jump Server')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Monitoring & Logging (CRITICAL: clear boundaries!)
INSERT INTO object_ontology VALUES
('monitoring', 'security', 'Kontinuierliche Echtzeit-Überwachung von Systemen/Metriken', 'Continuous real-time monitoring of systems', 'NOT audit_logging (Protokollierung), NOT training (Schulung), NOT procedure (Verfahren), NOT risk_assessment (Bewertung)', 'System-Health-Monitoring, Verfügbarkeitsüberwachung, Performance-Monitoring, Anomalie-Erkennung in Echtzeit'),
('audit_logging', 'security', 'Protokollierung und Audit-Trail (Nachvollziehbarkeit)', 'Audit logging and trail', 'NOT monitoring (Echtzeit-Überwachung), NOT compliance_audit (Prüfungen)', 'Log-Aufzeichnung, Audit Trail, Zeitstempel, Nachvollziehbarkeit, Protokollierung von Zugriffen'),
('siem', 'security', 'Security Information and Event Management', 'SIEM', 'NOT monitoring (allg.) oder audit_logging (Protokollierung)', 'SIEM-Korrelation, Security Events, Log-Aggregation'),
('alerting', 'security', 'Benachrichtigungen und Meldepflichten bei Sicherheitsereignissen', 'Security alerting and notification obligations', 'NOT monitoring (Überwachung) oder incident (Vorfallsbehandlung)', 'Sicherheitsmeldungen, Breach Notification, Benachrichtigungspflichten'),
('compliance_audit', 'governance', 'Compliance-Prüfungen und externe Audits', 'Compliance audits and external reviews', 'NOT audit_logging (technische Protokollierung), NOT monitoring (Überwachung)', 'Externe Prüfung, Jahresabschlussprüfung, Zertifizierungsaudit, Lieferanten-Audit')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Vulnerability & Patch Management
INSERT INTO object_ontology VALUES
('vulnerability', 'security', 'Schwachstellenmanagement und -scanning', 'Vulnerability management', 'NOT patch_management (Updates)', 'Vulnerability Scanning, CVE-Tracking, Penetration Testing'),
('patch_management', 'security', 'Software-Updates und Patch-Verwaltung', 'Patch management', 'NOT vulnerability (Scanning)', 'Patch-Zyklus, Update-Policy, Hotfix-Prozess')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- Backup & Recovery
INSERT INTO object_ontology VALUES
('backup', 'security', 'Datensicherung und Backup-Strategien', 'Backup strategies', 'NOT disaster_recovery (Wiederherstellung)', 'Backup-Rotation, Offsite-Backup, Backup-Verschlüsselung'),
('disaster_recovery', 'security', 'Notfallwiederherstellung und Business Continuity', 'Disaster recovery', 'NOT backup (Datensicherung) oder incident (Vorfälle)', 'DR-Plan, RTO/RPO, Failover, Business Continuity')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- DATA PROTECTION (CRITICAL: clear boundaries!)
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
('personal_data', 'data_protection', 'Verarbeitung personenbezogener Daten (DSGVO-Grundsätze)', 'Personal data processing principles', 'NOT sensitive_data (besondere Kategorien), NOT data_subject_rights (Betroffenenrechte), NOT consent (Einwilligung)', 'Datenminimierung, Zweckbindung, Speicherbegrenzung, Rechtmäßigkeit der Verarbeitung'),
('sensitive_data', 'data_protection', 'Besondere Kategorien personenbezogener Daten (Art. 9 DSGVO)', 'Special categories of personal data', 'NOT personal_data (allg.), NOT health_data (Gesundheit)', 'Biometrische Daten, ethnische Herkunft, politische Meinungen, Gewerkschaftszugehörigkeit'),
('health_data', 'data_protection', 'Gesundheitsdaten und Medizindaten', 'Health and medical data', 'NOT sensitive_data (allg. besondere Kategorien)', 'Patientendaten, Medizinprodukte-Daten, klinische Daten'),
('consent', 'data_protection', 'Einwilligungsmanagement', 'Consent management', 'NOT data_subject_rights (andere Betroffenenrechte)', 'Einwilligung einholen, Widerruf, Opt-In, Consent-Banner'),
('data_subject_rights', 'data_protection', 'Betroffenenrechte (Auskunft, Löschung, Portabilität)', 'Data subject rights (access, erasure, portability)', 'NOT consent (Einwilligung), NOT personal_data (Verarbeitung)', 'Auskunftsrecht, Recht auf Löschung, Datenportabilität, Widerspruchsrecht'),
('data_retention', 'data_protection', 'Aufbewahrungsfristen und Löschkonzept', 'Data retention and deletion', 'NOT backup (technische Sicherung)', 'Löschfristen, Aufbewahrungspflichten, Löschkonzept, Archivierung'),
('data_transfer', 'data_protection', 'Internationale Datenübermittlung (Drittländer, SCC)', 'International data transfer', 'NOT data_processing (Verarbeitung)', 'Drittlandtransfer, Standardvertragsklauseln, Angemessenheitsbeschluss, BCR'),
('data_breach_notification', 'data_protection', 'Meldung von Datenschutzverletzungen (Art. 33/34 DSGVO)', 'Data breach notification', 'NOT incident (allg. Sicherheitsvorfälle), NOT alerting (techn. Alerts)', 'Breach-Meldung an Aufsichtsbehörde, Benachrichtigung Betroffener, 72-Stunden-Frist'),
('dpia', 'data_protection', 'Datenschutz-Folgenabschätzung (Art. 35 DSGVO)', 'Data protection impact assessment', NULL, 'DSFA, Schwellwertanalyse, Risikobewertung für Betroffene'),
('data_processing_agreement', 'data_protection', 'Auftragsverarbeitung (Art. 28 DSGVO)', 'Data processing agreements', NULL, 'AVV, Auftragsverarbeiter, Sub-Auftragsverarbeiter, TOMs'),
('privacy_by_design', 'data_protection', 'Datenschutz durch Technikgestaltung (Art. 25 DSGVO)', 'Privacy by design and default', NULL, 'Privacy by Default, Datenminimierung in der Architektur'),
('data_processing_register', 'data_protection', 'Verzeichnis von Verarbeitungstätigkeiten (Art. 30 DSGVO)', 'Records of processing activities', NULL, 'VVT, Verarbeitungsverzeichnis')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- GOVERNANCE & ORGANIZATION
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
('policy', 'governance', 'Richtlinien und Leitlinien ERSTELLEN/DEFINIEREN', 'Creating/defining policies', 'NOT procedure (Verfahrensablauf), NOT compliance_audit (Prüfung)', 'Sicherheitsrichtlinie erstellen, Policy-Framework definieren, Leitlinie verabschieden'),
('procedure', 'governance', 'Verfahren und Prozessabläufe DEFINIEREN/DOKUMENTIEREN', 'Defining/documenting procedures', 'NOT incident (Vorfallsbehandlung), NOT process (laufender Betrieb)', 'Verfahrensanweisung, Ablaufbeschreibung, Standardprozess definieren'),
('process', 'governance', 'Laufende betriebliche Prozesse AUSFÜHREN', 'Executing operational processes', 'NOT procedure (Definition), NOT monitoring (Überwachung)', 'Betriebsprozess, Geschäftsprozess, Workflow-Ausführung'),
('training', 'governance', 'Schulung und Weiterbildung DURCHFÜHREN', 'Training and education', 'NOT awareness (Sensibilisierung), NOT monitoring (Überwachung!)', 'Mitarbeiterschulung, Zertifizierungskurs, Pflichtunterweisung'),
('awareness', 'governance', 'Sicherheitsbewusstsein und Sensibilisierung', 'Security awareness', 'NOT training (formale Schulung)', 'Phishing-Simulation, Awareness-Kampagne, Sicherheitskultur'),
('incident', 'governance', 'Sicherheitsvorfälle BEHANDELN (Incident Response)', 'Incident response and handling', 'NOT alerting (Benachrichtigung), NOT data_breach_notification (DSGVO-Meldung)', 'Incident Response Plan, Vorfallsanalyse, Containment, Recovery, Lessons Learned'),
('risk_management', 'governance', 'Risikomanagement und -bewertung', 'Risk management and assessment', 'NOT vulnerability (techn. Schwachstellen), NOT monitoring (Überwachung)', 'Risikobewertung, Risikobehandlung, Risikoakzeptanz, Risikomatrix'),
('third_party_management', 'governance', 'Lieferanten- und Drittanbieter-Management', 'Third-party and vendor management', 'NOT data_processing_agreement (AVV)', 'Lieferantenbewertung, Vendor Risk Assessment, Supply Chain Security'),
('change_management', 'governance', 'Änderungsmanagement', 'Change management', 'NOT patch_management (Updates)', 'Change Request, Change Advisory Board, Rollback-Verfahren'),
('documentation', 'governance', 'Allgemeine Dokumentationspflichten', 'General documentation requirements', 'NOT audit_logging (technische Logs), NOT data_processing_register (VVT)', 'Betriebshandbuch, Systemdokumentation, Verfahrensdokumentation'),
('records_management', 'governance', 'Akten- und Unterlagenverwaltung', 'Records management', 'NOT data_retention (Löschfristen)', 'Archivierung, Aktenführung, Aufbewahrungspflichten nach HGB/AO'),
('compliance_reporting', 'governance', 'Compliance-Berichterstattung', 'Compliance reporting', 'NOT alerting (techn. Alerts), NOT supervisory_authority (Behördenkommunikation)', 'Compliance-Bericht, Management-Reporting, KPI-Tracking'),
('asset_management', 'governance', 'IT-Asset-Verwaltung und Inventar', 'IT asset management', NULL, 'Asset-Inventar, CMDB, Hardware-Lifecycle, Software-Inventar'),
('physical_security', 'security', 'Physische Sicherheit und Zutrittskontrolle', 'Physical security and access', NULL, 'Zutrittskontrolle, Videoüberwachung (physisch), Serverraum-Sicherheit'),
('human_resources_security', 'governance', 'Personalsicherheit', 'HR security', 'NOT training (Schulung)', 'Background-Checks, Geheimhaltungsvereinbarungen, Onboarding/Offboarding')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;

-- ═══════════════════════════════════════════════════════════════
-- REGULATORY SPECIFIC
-- ═══════════════════════════════════════════════════════════════

INSERT INTO object_ontology VALUES
('supervisory_authority', 'regulatory', 'Kommunikation mit Aufsichtsbehörden', 'Supervisory authority communication', 'NOT compliance_reporting (interne Berichte)', 'Meldung an BaFin, Abstimmung mit DPA, behördliche Anfragen'),
('certification', 'regulatory', 'Zertifizierung und Konformitätsbewertung', 'Certification and conformity assessment', 'NOT compliance_audit (Prüfung), NOT personal_data (Datenschutz)', 'CE-Kennzeichnung, ISO-Zertifizierung, Konformitätserklärung'),
('product_safety', 'regulatory', 'Produktsicherheit und Marktüberwachung', 'Product safety and market surveillance', 'NOT certification (Zertifizierung)', 'Rückrufmanagement, Sicherheitsbewertung, RAPEX-Meldung'),
('ai_system', 'regulatory', 'KI-System-Regulierung (AI Act)', 'AI system regulation', NULL, 'KI-Risikobewertung, Hochrisiko-KI, Transparenzpflichten, FRIA'),
('financial_reporting', 'regulatory', 'Finanzberichterstattung und Rechnungslegung', 'Financial reporting and accounting', NULL, 'Jahresabschluss, HGB-Pflichten, IFRS, Buchführung'),
|
||||
('aml', 'regulatory', 'Geldwäscheprävention und KYC', 'Anti-money laundering and KYC', NULL, 'KYC, Verdachtsmeldung, PEP-Prüfung, Transaktionsmonitoring'),
|
||||
('whistleblowing', 'regulatory', 'Hinweisgeberschutz und Meldekanäle', 'Whistleblower protection', NULL, 'Hinweisgebersystem, Meldekanal, Hinweisgeberschutzgesetz'),
|
||||
('consumer_protection', 'regulatory', 'Verbraucherschutz und AGB', 'Consumer protection', NULL, 'AGB-Prüfung, Widerrufsrecht, Informationspflichten, Preistransparenz'),
|
||||
('ecommerce', 'regulatory', 'E-Commerce-Pflichten (Impressum, Fernabsatz)', 'E-commerce obligations', NULL, 'Impressumspflicht, Fernabsatzrecht, Online-Handel-Pflichten'),
|
||||
('telecommunications', 'regulatory', 'Telekommunikationsregulierung', 'Telecommunications regulation', NULL, 'TKG-Pflichten, Vorratsdatenspeicherung, Notruf'),
|
||||
('medical_device', 'regulatory', 'Medizinprodukte-Regulierung (MDR)', 'Medical device regulation', NULL, 'UDI, klinische Bewertung, Post-Market Surveillance'),
|
||||
('payment_services', 'regulatory', 'Zahlungsdienste-Regulierung (PSD2)', 'Payment services regulation', NULL, 'Starke Kundenauthentifizierung, PSD2-Compliance, Open Banking'),
|
||||
('critical_infrastructure', 'regulatory', 'KRITIS und NIS2-Pflichten', 'Critical infrastructure (NIS2)', NULL, 'KRITIS-Meldepflichten, NIS2-Maßnahmen, Mindeststandards'),
|
||||
('supply_chain_due_diligence', 'regulatory', 'Lieferkettensorgfaltspflicht (LkSG)', 'Supply chain due diligence', 'NOT third_party_management (allg. Lieferanten)', 'Menschenrechts-Due-Diligence, Umwelt-Sorgfaltspflicht, LkSG-Bericht'),
|
||||
('sustainability_reporting', 'regulatory', 'Nachhaltigkeitsberichterstattung (CSRD)', 'Sustainability reporting', NULL, 'ESG-Reporting, CSRD, Nachhaltigkeitsbericht'),
|
||||
('cookie_consent', 'regulatory', 'Cookie-Consent und Tracking (TDDDG/ePrivacy)', 'Cookie consent and tracking', 'NOT consent (allg. Einwilligung)', 'Cookie-Banner, Tracking-Einwilligung, TDDDG §25'),
|
||||
('video_surveillance', 'regulatory', 'Videoüberwachung (datenschutzrechtlich)', 'Video surveillance (data protection)', 'NOT physical_security (physische Sicherheit), NOT monitoring (IT-Monitoring)', 'Kamera-Überwachung, Speicherfristen, Kennzeichnungspflicht')
|
||||
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
|
||||
|
||||
-- ═══════════════════════════════════════════════════════════════
|
||||
-- APPLICATION SECURITY
|
||||
-- ═══════════════════════════════════════════════════════════════
|
||||
|
||||
INSERT INTO object_ontology VALUES
|
||||
('secure_development', 'technical', 'Sichere Softwareentwicklung (SDLC)', 'Secure software development lifecycle', NULL, 'Secure Coding, Code Review, SAST/DAST, DevSecOps'),
|
||||
('api_security', 'technical', 'API-Sicherheit', 'API security', NULL, 'API-Authentifizierung, Rate Limiting, Input Validation'),
|
||||
('input_validation', 'technical', 'Eingabevalidierung und Output Encoding', 'Input validation and output encoding', NULL, 'XSS-Prävention, SQL-Injection-Schutz, Parametervalidierung'),
|
||||
('container_security', 'technical', 'Container- und Cloud-Sicherheit', 'Container and cloud security', NULL, 'Docker-Hardening, Kubernetes-Security, Image-Scanning'),
|
||||
('logging_configuration', 'technical', 'Log-Konfiguration und -Format', 'Log configuration and format', 'NOT audit_logging (Nachvollziehbarkeit), NOT monitoring (Überwachung)', 'Log-Format, Log-Rotation, Log-Shipping, Structured Logging'),
|
||||
('data_classification', 'governance', 'Datenklassifizierung und -kennzeichnung', 'Data classification and labeling', 'NOT sensitive_data (besondere Kategorien)', 'Vertraulichkeitsstufen, Datenklassifizierung, Labeling')
|
||||
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
|
||||
|
||||
-- Count results
|
||||
DO $$
|
||||
DECLARE cnt INTEGER;
|
||||
BEGIN
|
||||
SELECT count(*) INTO cnt FROM object_ontology;
|
||||
RAISE NOTICE 'object_ontology: % canonical tokens defined', cnt;
|
||||
END $$;
|
||||
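The seed rows above encode their disambiguation hints as free text of the form `NOT token (gloss)`. A minimal sketch of how those hints could be checked for references to tokens that are not defined follows. The helper names and the `{canonical_token: hint}` mapping are hypothetical and not part of the migration; the regular expression assumes tokens are lowercase snake_case, as in the rows shown.

```python
import re

# Hint format as seen in the seed rows: "NOT token (gloss), NOT token (gloss)".
_HINT_RE = re.compile(r"NOT\s+([a-z0-9_]+)")


def confused_tokens(hint):
    """Return the canonical tokens named in a NOT_confused_with hint."""
    if not hint:
        return []
    return _HINT_RE.findall(hint)


def dangling_references(rows):
    """Map each token to the hinted tokens that are not defined in `rows`.

    `rows` is a hypothetical {canonical_token: hint_or_None} mapping.
    """
    known = set(rows)
    missing = {}
    for token, hint in rows.items():
        unknown = [t for t in confused_tokens(hint) if t not in known]
        if unknown:
            missing[token] = unknown
    return missing
```

Run against the full ontology, a check like this would flag hints that point at a token defined in no section, which is easy to miss when the rows are split across several INSERT statements.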
@@ -0,0 +1,58 @@
|
||||
-- Migration 011: Derived Controls Library (Clean-Room MCs from external sources)
|
||||
-- Schema: compliance
|
||||
--
|
||||
-- Holds master controls, atomic controls, mitigations, and metrics that were
-- derived clean-room from external regulatory sources (BSI QUAIDAL today;
-- Grundschutz++/CRA/NIST AI RMF next). Kept separate from the gpre2
-- master_controls table because:
--   1) The shape is different (no object_group/phase concepts).
--   2) Source-layer separation: derivations from external IP must be cleanly
--      separable from internally generated artifacts.
--   3) Each row carries the licence and provenance for due diligence.
|
||||
--
|
||||
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" \
|
||||
-- < control-pipeline/migrations/011_derived_controls.sql
|
||||
|
||||
SET search_path TO compliance, public;
|
||||
|
||||
CREATE TABLE IF NOT EXISTS derived_controls (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
derived_id VARCHAR(200) UNIQUE NOT NULL, -- e.g. MC-AI-DATA-QKB-01-repraesentativitaet
|
||||
kind VARCHAR(30) NOT NULL, -- criterion | building_block | measure | metric
|
||||
canonical_name VARCHAR(300) NOT NULL,
|
||||
description TEXT NOT NULL, -- our own wording, never the original
|
||||
regulation_anchor TEXT, -- e.g. "EU AI Act Art. 10"
|
||||
related_quaidal_ids JSONB NOT NULL DEFAULT '[]', -- ["QB-03", "QB-04", ...]
|
||||
external_refs JSONB NOT NULL DEFAULT '[]', -- [{framework, citation}, ...]
|
||||
source_framework VARCHAR(80) NOT NULL, -- "BSI QUAIDAL"
|
||||
source_section VARCHAR(80) NOT NULL, -- "QKB-01"
|
||||
source_url TEXT,
|
||||
source_commit_sha VARCHAR(80),
|
||||
source_title_original TEXT, -- original title (label, not protected)
|
||||
source_license_note TEXT,
|
||||
plagiarism_score_at_generation NUMERIC(5,4), -- 0..1; gate was 0.20
|
||||
generated_by_model VARCHAR(80),
|
||||
yaml_path TEXT, -- pointer back to source YAML
|
||||
created_at TIMESTAMPTZ DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ DEFAULT NOW()
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_derived_controls_kind ON derived_controls(kind);
|
||||
CREATE INDEX IF NOT EXISTS idx_derived_controls_source_framework ON derived_controls(source_framework);
|
||||
CREATE INDEX IF NOT EXISTS idx_derived_controls_source_section ON derived_controls(source_section);
|
||||
CREATE INDEX IF NOT EXISTS idx_derived_controls_related_quaidal_gin
|
||||
ON derived_controls USING GIN(related_quaidal_ids);
|
||||
|
||||
-- Trigger to keep updated_at fresh
|
||||
CREATE OR REPLACE FUNCTION trg_derived_controls_set_updated_at()
|
||||
RETURNS TRIGGER AS $$
|
||||
BEGIN
|
||||
NEW.updated_at = NOW();
|
||||
RETURN NEW;
|
||||
END;
|
||||
$$ LANGUAGE plpgsql;
|
||||
|
||||
DROP TRIGGER IF EXISTS derived_controls_updated_at ON derived_controls;
|
||||
CREATE TRIGGER derived_controls_updated_at
|
||||
BEFORE UPDATE ON derived_controls
|
||||
FOR EACH ROW EXECUTE FUNCTION trg_derived_controls_set_updated_at();
|
||||
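The GIN index on `related_quaidal_ids` is only useful for queries written with an operator it supports, such as JSONB containment (`@>`). A minimal sketch of a parameterised lookup follows. The function name and the selected columns are illustrative choices, and the `%s` placeholder assumes a psycopg-style driver.

```python
import json


def quaidal_lookup(quaidal_id):
    """Build a parameterised query for controls related to one QUAIDAL ID.

    The @> containment operator is one of the operators a default
    (jsonb_ops) GIN index can serve.
    """
    sql = (
        "SELECT derived_id, kind, canonical_name "
        "FROM compliance.derived_controls "
        "WHERE related_quaidal_ids @> %s::jsonb "
        "ORDER BY derived_id"
    )
    # The parameter is a one-element JSON array, e.g. '["QB-03"]'.
    return sql, (json.dumps([quaidal_id]),)
```

The returned pair can be passed straight to `cursor.execute(sql, params)`. A query that instead unpacks the array with `jsonb_array_elements_text` would typically not be able to use this index.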
@@ -0,0 +1,170 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Upsert derived QUAIDAL controls from YAML into compliance.derived_controls.
|
||||
|
||||
Reads:
|
||||
control-pipeline/data/quaidal/master_controls.yaml
|
||||
control-pipeline/data/quaidal/atomic_controls.yaml
|
||||
control-pipeline/data/quaidal/mitigations.yaml
|
||||
control-pipeline/data/quaidal/metrics.yaml
|
||||
|
||||
Writes: compliance.derived_controls (idempotent UPSERT by derived_id)
|
||||
|
||||
Usage:
|
||||
# Mac Mini direct:
|
||||
python3 control-pipeline/scripts/apply_quaidal_to_db.py
|
||||
|
||||
# Via SSH (locally, against macmini DB):
|
||||
DB_HOST=macmini python3 control-pipeline/scripts/apply_quaidal_to_db.py
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
import psycopg
|
||||
import yaml
|
||||
except ImportError as e:
|
||||
print(f"ERROR: missing dependency {e.name}. Install with: pip install psycopg[binary] pyyaml", file=sys.stderr)
|
||||
sys.exit(2)
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||||
DATA_DIR = REPO_ROOT / "control-pipeline" / "data" / "quaidal"
|
||||
|
||||
KIND_FILES = {
|
||||
"criterion": "master_controls.yaml",
|
||||
"building_block": "atomic_controls.yaml",
|
||||
"measure": "mitigations.yaml",
|
||||
"metric": "metrics.yaml",
|
||||
}
|
||||
|
||||
UPSERT_SQL = """
|
||||
INSERT INTO compliance.derived_controls (
|
||||
derived_id, kind, canonical_name, description, regulation_anchor,
|
||||
related_quaidal_ids, external_refs,
|
||||
source_framework, source_section, source_url, source_commit_sha,
|
||||
source_title_original, source_license_note,
|
||||
plagiarism_score_at_generation, generated_by_model, yaml_path
|
||||
) VALUES (
|
||||
%(derived_id)s, %(kind)s, %(canonical_name)s, %(description)s, %(regulation_anchor)s,
|
||||
%(related_quaidal_ids)s::jsonb, %(external_refs)s::jsonb,
|
||||
%(source_framework)s, %(source_section)s, %(source_url)s, %(source_commit_sha)s,
|
||||
%(source_title_original)s, %(source_license_note)s,
|
||||
%(plagiarism_score)s, %(generated_by_model)s, %(yaml_path)s
|
||||
)
|
||||
ON CONFLICT (derived_id) DO UPDATE SET
|
||||
kind = EXCLUDED.kind,
|
||||
canonical_name = EXCLUDED.canonical_name,
|
||||
description = EXCLUDED.description,
|
||||
regulation_anchor = EXCLUDED.regulation_anchor,
|
||||
related_quaidal_ids = EXCLUDED.related_quaidal_ids,
|
||||
external_refs = EXCLUDED.external_refs,
|
||||
source_framework = EXCLUDED.source_framework,
|
||||
source_section = EXCLUDED.source_section,
|
||||
source_url = EXCLUDED.source_url,
|
||||
source_commit_sha = EXCLUDED.source_commit_sha,
|
||||
source_title_original = EXCLUDED.source_title_original,
|
||||
source_license_note = EXCLUDED.source_license_note,
|
||||
plagiarism_score_at_generation = EXCLUDED.plagiarism_score_at_generation,
|
||||
generated_by_model = EXCLUDED.generated_by_model,
|
||||
yaml_path = EXCLUDED.yaml_path
|
||||
"""
|
||||
|
||||
|
||||
def load_yaml_records(yaml_path: Path) -> tuple[list[dict], str | None, str | None]:
    """Return (controls, commit_sha, generated_by_model) from one YAML file."""
    if not yaml_path.exists():
        return [], None, None
    # safe_load returns None for an empty file; fall back to an empty mapping.
    data = yaml.safe_load(yaml_path.read_text(encoding="utf-8")) or {}
    return data.get("controls") or [], data.get("commit_sha"), data.get("generated_by_model")
|
||||
|
||||
|
||||
def to_row(ctrl: dict, yaml_path: Path, default_model: str | None, default_commit: str | None) -> dict:
|
||||
source = ctrl.get("source") or {}
|
||||
return {
|
||||
"derived_id": ctrl["id"],
|
||||
"kind": ctrl["kind"],
|
||||
"canonical_name": ctrl["canonical_name"],
|
||||
"description": ctrl["description"],
|
||||
"regulation_anchor": ctrl.get("regulation_anchor"),
|
||||
"related_quaidal_ids": json.dumps(ctrl.get("related_quaidal_ids", []), ensure_ascii=False),
|
||||
"external_refs": json.dumps(ctrl.get("external_refs", []), ensure_ascii=False),
|
||||
"source_framework": source.get("framework", "BSI QUAIDAL"),
|
||||
"source_section": source.get("section", ""),
|
||||
"source_url": source.get("url"),
|
||||
"source_commit_sha": source.get("commit_sha") or default_commit,
|
||||
"source_title_original": source.get("title_original_de"),
|
||||
"source_license_note": source.get("license_note"),
|
||||
"plagiarism_score": ctrl.get("plagiarism_score_at_generation"),
|
||||
"generated_by_model": default_model,
|
||||
"yaml_path": str(yaml_path.relative_to(REPO_ROOT)),
|
||||
}
|
||||
|
||||
|
||||
def build_dsn(args: argparse.Namespace) -> str:
|
||||
if args.dsn:
|
||||
return args.dsn
|
||||
return (
|
||||
f"host={args.db_host} port={args.db_port} "
|
||||
f"dbname={args.db_name} user={args.db_user} password={args.db_password}"
|
||||
)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
ap = argparse.ArgumentParser(description=__doc__)
|
||||
ap.add_argument("--dsn", help="Full DSN; overrides individual flags")
|
||||
ap.add_argument("--db-host", default=os.environ.get("DB_HOST", "localhost"))
|
||||
ap.add_argument("--db-port", default=os.environ.get("DB_PORT", "5432"))
|
||||
ap.add_argument("--db-name", default=os.environ.get("DB_NAME", "breakpilot_db"))
|
||||
ap.add_argument("--db-user", default=os.environ.get("DB_USER", "breakpilot"))
|
||||
ap.add_argument("--db-password", default=os.environ.get("DB_PASSWORD", "breakpilot"))
|
||||
ap.add_argument("--dry-run", action="store_true")
|
||||
args = ap.parse_args()
|
||||
|
||||
total = 0
|
||||
rows: list[dict] = []
|
||||
for kind, fname in KIND_FILES.items():
|
||||
path = DATA_DIR / fname
|
||||
records, commit, model = load_yaml_records(path)
|
||||
for rec in records:
|
||||
rows.append(to_row(rec, path, model, commit))
|
||||
if records:
|
||||
print(f" {fname}: {len(records)} entries", file=sys.stderr)
|
||||
total += len(records)
|
||||
|
||||
if not rows:
|
||||
print("ERROR: no YAML records found; run derive_quaidal_mcs.py first", file=sys.stderr)
|
||||
return 2
|
||||
|
||||
print(f"Total rows: {total}", file=sys.stderr)
|
||||
if args.dry_run:
|
||||
print("Dry run — sample row:", file=sys.stderr)
|
||||
print(json.dumps({k: (v[:200] if isinstance(v, str) else v) for k, v in rows[0].items()}, indent=2, ensure_ascii=False))
|
||||
return 0
|
||||
|
||||
dsn = build_dsn(args)
|
||||
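    # With --dsn the individual host/port/name flags are not used, so do not print them.
    print("Connecting via --dsn" if args.dsn else f"Connecting to {args.db_host}:{args.db_port}/{args.db_name}", file=sys.stderr)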
|
||||
inserted = updated = 0
|
||||
with psycopg.connect(dsn) as conn:
|
||||
with conn.cursor() as cur:
|
||||
for row in rows:
|
||||
cur.execute(
|
||||
"SELECT 1 FROM compliance.derived_controls WHERE derived_id = %s",
|
||||
(row["derived_id"],),
|
||||
)
|
||||
existed = cur.fetchone() is not None
|
||||
cur.execute(UPSERT_SQL, row)
|
||||
if existed:
|
||||
updated += 1
|
||||
else:
|
||||
inserted += 1
|
||||
conn.commit()
|
||||
print(f"Inserted: {inserted}, Updated: {updated}", file=sys.stderr)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
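`build_dsn` above interpolates the password directly into a `key=value` connection string, which stops parsing correctly once a value contains a space, a single quote, or a backslash. A minimal sketch of the libpq quoting rule follows, using only the standard library. Both helper names are hypothetical. psycopg 3 also provides `psycopg.conninfo.make_conninfo`, which may be the better choice inside the script itself.

```python
def quote_conninfo_value(value):
    """Quote one value for a libpq key=value connection string."""
    text = str(value)
    if text and not any(ch.isspace() or ch in "'\\" for ch in text):
        return text  # plain values need no quoting
    # Backslashes and single quotes are escaped with a backslash,
    # then the whole value is wrapped in single quotes.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def make_dsn(**params):
    """Join keyword parameters into a connection string, skipping None."""
    return " ".join(
        f"{key}={quote_conninfo_value(val)}"
        for key, val in params.items()
        if val is not None
    )
```

For example, `make_dsn(host="localhost", port=5432, password="a b")` yields `host=localhost port=5432 password='a b'`.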
@@ -0,0 +1,148 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Inherit source_citation from parent to atom controls.
|
||||
|
||||
Background
|
||||
==========
|
||||
|
||||
citation_backfill.py fills source_citation on the *source-bearing* controls
|
||||
(those with source_original_text — ~2-7 %) by re-linking them to the
|
||||
re-ingested, article_label-bearing chunks. The remaining ~93 % are "atom"
|
||||
controls (decompositions) that carry a parent_control_uuid but no own citation.
|
||||
They cite the SAME norm as their parent, so the citation can be inherited —
|
||||
no re-matching needed.
|
||||
|
||||
Self-written controls (license_rule = 3) are skipped (no external source).
|
||||
|
||||
Runs in idempotent iterations (atom -> master -> grandmaster) and prints
|
||||
per-stage counts before any write. Safe to rerun — only fills rows whose
|
||||
source_citation lacks an 'article'.
|
||||
|
||||
Usage::
|
||||
|
||||
    python3 scripts/atom_citation_inheritance.py --dry-run \\
        --db-url postgresql://breakpilot:breakpilot123@100.80.114.48:5432/breakpilot_db
    python3 scripts/atom_citation_inheritance.py --apply \\
        --db-url postgresql://breakpilot:breakpilot123@100.80.114.48:5432/breakpilot_db
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
|
||||
|
||||
def _art(alias: str) -> str:
|
||||
"""SQL for source_citation->>'article' that works whether the column is jsonb
|
||||
(macmini) or text-containing-JSON (prod schema anomaly from the DB swap).
|
||||
pg_input_is_valid (PG16+) guards rows with invalid JSON so the cast never errors."""
|
||||
col = f"{alias}.source_citation"
|
||||
return (
|
||||
f"(CASE WHEN {col} IS NOT NULL AND pg_input_is_valid({col}::text, 'jsonb') "
|
||||
f"THEN ({col}::text)::jsonb->>'article' ELSE NULL END)"
|
||||
)
|
||||
|
||||
|
||||
# A row "needs" a citation when it has no article yet.
|
||||
_NEEDS = f"({_art('cc')} IS NULL OR {_art('cc')} = '')"
|
||||
# A parent can supply one when it carries a real article.
|
||||
_PARENT_HAS = f"({_art('p')} IS NOT NULL AND {_art('p')} <> '')"
|
||||
|
||||
SQL_REPORT = f"""
|
||||
SET search_path TO compliance, public;
|
||||
SELECT
|
||||
CASE WHEN cc.parent_control_uuid IS NULL THEN 'no_parent'
|
||||
         WHEN ({_art('p2')} IS NOT NULL AND {_art('p2')} <> '') THEN 'parent_has_article'
|
||||
ELSE 'parent_no_article' END AS bucket,
|
||||
COUNT(*) AS n
|
||||
FROM canonical_controls cc
|
||||
LEFT JOIN canonical_controls p2 ON cc.parent_control_uuid = p2.id
|
||||
WHERE {_NEEDS}
|
||||
AND cc.license_rule IS DISTINCT FROM 3
|
||||
GROUP BY 1 ORDER BY 2 DESC;
|
||||
"""
|
||||
|
||||
SQL_INHERIT = f"""
|
||||
SET search_path TO compliance, public;
|
||||
UPDATE canonical_controls cc
|
||||
SET source_citation = p.source_citation, updated_at = NOW()
|
||||
FROM canonical_controls p
|
||||
WHERE cc.parent_control_uuid = p.id
|
||||
AND {_NEEDS}
|
||||
AND {_PARENT_HAS}
|
||||
AND cc.license_rule IS DISTINCT FROM 3;
|
||||
"""
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--db-url", default=DB_URL,
|
||||
help="Postgres URL (default: $DATABASE_URL)")
|
||||
p.add_argument("--max-iterations", type=int, default=6,
|
||||
help="Cap on inheritance iterations to avoid loops")
|
||||
g = p.add_mutually_exclusive_group(required=True)
|
||||
g.add_argument("--dry-run", action="store_true")
|
||||
g.add_argument("--apply", action="store_true")
|
||||
return p.parse_args()
|
||||
|
||||
|
||||
def print_bucket(rows, label: str) -> None:
|
||||
print(f"\n## {label}")
|
||||
total = 0
|
||||
for bucket, n in rows:
|
||||
print(f" {bucket:20} {n:>8}")
|
||||
total += n
|
||||
print(f" {'TOTAL':20} {total:>8}")
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
try:
|
||||
import psycopg2
|
||||
except ImportError:
|
||||
print("error: psycopg2 not installed", file=sys.stderr)
|
||||
return 2
|
||||
|
||||
conn = psycopg2.connect(args.db_url)
|
||||
conn.autocommit = False
|
||||
cur = conn.cursor()
|
||||
|
||||
print("=" * 60)
|
||||
print(" Atom citation inheritance — source_citation via parent")
|
||||
print(f" Mode: {'DRY-RUN' if args.dry_run else 'APPLY'}")
|
||||
print("=" * 60)
|
||||
|
||||
cur.execute(SQL_REPORT)
|
||||
print_bucket(cur.fetchall(), "Controls without article (need citation)")
|
||||
|
||||
if args.dry_run:
|
||||
cur.execute(
|
||||
"SET search_path TO compliance, public; "
|
||||
f"SELECT COUNT(*) FROM canonical_controls cc "
|
||||
f"JOIN canonical_controls p ON cc.parent_control_uuid = p.id "
|
||||
f"WHERE {_NEEDS} AND {_PARENT_HAS} AND cc.license_rule IS DISTINCT FROM 3;"
|
||||
)
|
||||
print(f"\n## First inherit-pass would fill: {cur.fetchone()[0]} rows")
|
||||
print("\nNo writes performed. Use --apply to execute.")
|
||||
conn.rollback()
|
||||
return 0
|
||||
|
||||
total = 0
|
||||
for i in range(1, args.max_iterations + 1):
|
||||
cur.execute(SQL_INHERIT)
|
||||
updated = cur.rowcount
|
||||
total += updated
|
||||
print(f"\n iteration {i}: {updated} rows inherited")
|
||||
if updated == 0:
|
||||
break
|
||||
conn.commit()
|
||||
print(f"\n✓ Total atoms inherited: {total}")
|
||||
|
||||
cur.execute(SQL_REPORT)
|
||||
print_bucket(cur.fetchall(), "Remaining without article")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
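The apply loop in the script above repeats the same UPDATE because a single statement reads a snapshot of the table: a citation can move down only one level of the parent chain per pass. A minimal in-memory sketch of that fixed-point behaviour follows. The dictionaries stand in for `canonical_controls` and all names are hypothetical.

```python
def inherit_pass(citations, parents):
    """One pass: fill empty citations from parents that already have one.

    Reads from a snapshot so a value moves down one level per pass,
    like a single UPDATE ... FROM statement.
    """
    snapshot = dict(citations)
    filled = 0
    for child, parent in parents.items():
        if not snapshot.get(child) and snapshot.get(parent):
            citations[child] = snapshot[parent]
            filled += 1
    return filled


def inherit_until_stable(citations, parents, max_iterations=6):
    """Repeat passes until nothing changes or the cap is reached."""
    per_pass = []
    for _ in range(max_iterations):
        filled = inherit_pass(citations, parents)
        per_pass.append(filled)
        if filled == 0:
            break
    return per_pass
```

With a chain root -> mid -> leaf where only the root has a citation, the first pass fills `mid`, the second fills `leaf`, and the third fills nothing and stops. Rows whose ancestors never carry a citation stay empty, which is why the iteration cap matters only as a guard against cycles.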
@@ -0,0 +1,256 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Audit script for license classification gaps in the control pipeline.
|
||||
|
||||
Reports:
|
||||
|
||||
1. **regulation_registry coverage** — how many regulations are classified, by
|
||||
rule and license_type.
|
||||
2. **atomic_controls without license_rule** — how many controls reference a
|
||||
regulation_id that has no entry (or no license_rule) in the registry.
|
||||
3. **Qdrant payload consistency** — for each indexed collection, how many
|
||||
chunks carry both ``license`` and ``license_rule`` payload fields.
|
||||
|
||||
The goal is to surface every record where the engine could in principle
|
||||
extract or emit content but the license rule is unknown — those records are
|
||||
the highest-risk material in a license audit.
|
||||
|
||||
Usage::
|
||||
|
||||
python3 scripts/audit_license_classification.py --db-host 100.80.114.48
|
||||
|
||||
Add ``--check-qdrant`` to also probe ``http://<host>:6333`` collections.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from urllib import request as urllib_request
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
DEFAULT_HOST = "100.80.114.48"
|
||||
DEFAULT_PORT = 5432
|
||||
DEFAULT_USER = "breakpilot"
|
||||
DEFAULT_DB = "breakpilot_db"
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--db-host", default=DEFAULT_HOST)
|
||||
p.add_argument("--db-port", type=int, default=DEFAULT_PORT)
|
||||
p.add_argument("--db-user", default=DEFAULT_USER)
|
||||
p.add_argument("--db-name", default=DEFAULT_DB)
|
||||
p.add_argument("--db-password", default="")
|
||||
p.add_argument("--check-qdrant", action="store_true")
|
||||
p.add_argument("--qdrant-host", default="100.80.114.48")
|
||||
p.add_argument("--qdrant-port", type=int, default=6333)
|
||||
p.add_argument("--json", action="store_true", help="Emit JSON result on stdout")
|
||||
return p.parse_args()
|
||||
|
||||
|
||||
def audit_registry(conn) -> dict:
|
||||
"""Coverage of regulation_registry."""
|
||||
cur = conn.cursor()
|
||||
cur.execute(
|
||||
"SET search_path TO compliance, public; "
|
||||
"SELECT license_rule, license_type, COUNT(*) "
|
||||
"FROM regulation_registry GROUP BY license_rule, license_type "
|
||||
"ORDER BY license_rule, license_type;"
|
||||
)
|
||||
by_rule_and_type: list[tuple] = []
|
||||
by_rule: Counter = Counter()
|
||||
for rule, ltype, count in cur.fetchall():
|
||||
by_rule_and_type.append((rule, ltype or "(empty)", count))
|
||||
by_rule[rule] += count
|
||||
|
||||
cur.execute(
|
||||
"SELECT COUNT(*) FROM regulation_registry "
|
||||
"WHERE license_type IS NULL OR license_type = '';"
|
||||
)
|
||||
missing_type = cur.fetchone()[0]
|
||||
|
||||
cur.execute("SELECT COUNT(*) FROM regulation_registry;")
|
||||
total = cur.fetchone()[0]
|
||||
|
||||
return {
|
||||
"total": total,
|
||||
"by_rule": dict(by_rule),
|
||||
"by_rule_and_type": by_rule_and_type,
|
||||
"missing_license_type": missing_type,
|
||||
}
|
||||
|
||||
|
||||
def audit_atomic_controls(conn) -> dict:
|
||||
"""Controls whose source regulation has no license rule.
|
||||
|
||||
Important: the schema differs between core (bp-core) and customer
|
||||
deployments. We probe a handful of likely column names and skip if
|
||||
none are found.
|
||||
"""
|
||||
cur = conn.cursor()
|
||||
# Detect controls table
|
||||
cur.execute(
|
||||
"SELECT table_name FROM information_schema.tables "
|
||||
"WHERE table_schema='compliance' AND table_name IN "
|
||||
"('atomic_controls','atomic_controls_dedup','canonical_controls');"
|
||||
)
|
||||
tables = [r[0] for r in cur.fetchall()]
|
||||
if not tables:
|
||||
return {"skipped": True, "reason": "no controls table found"}
|
||||
|
||||
result: dict = {"tables": {}}
|
||||
for tbl in tables:
|
||||
cur.execute(
|
||||
f"SELECT column_name FROM information_schema.columns "
|
||||
f"WHERE table_schema='compliance' AND table_name='{tbl}';"
|
||||
)
|
||||
cols = {r[0] for r in cur.fetchall()}
|
||||
if "license_rule" not in cols:
|
||||
result["tables"][tbl] = {"skipped": True, "reason": "no license_rule column"}
|
||||
continue
|
||||
cur.execute(f"SELECT COUNT(*) FROM compliance.{tbl};")
|
||||
total = cur.fetchone()[0]
|
||||
cur.execute(
|
||||
f"SELECT license_rule, COUNT(*) FROM compliance.{tbl} "
|
||||
f"GROUP BY license_rule ORDER BY license_rule;"
|
||||
)
|
||||
by_rule = {str(r[0]): r[1] for r in cur.fetchall()}
|
||||
cur.execute(
|
||||
f"SELECT COUNT(*) FROM compliance.{tbl} WHERE license_rule IS NULL;"
|
||||
)
|
||||
missing = cur.fetchone()[0]
|
||||
result["tables"][tbl] = {
|
||||
"total": total,
|
||||
"by_rule": by_rule,
|
||||
"missing_license_rule": missing,
|
||||
}
|
||||
return result
|
||||
|
||||
|
||||
def audit_qdrant(host: str, port: int) -> dict:
|
||||
"""Probe Qdrant collections for license + license_rule payload coverage.
|
||||
|
||||
Samples 500 points per collection and reports how many have neither
|
||||
field populated.
|
||||
"""
|
||||
out: dict = {"collections": {}}
|
||||
base = f"http://{host}:{port}"
|
||||
try:
|
||||
with urllib_request.urlopen(f"{base}/collections", timeout=10) as r:
|
||||
colls = json.loads(r.read()).get("result", {}).get("collections", [])
|
||||
except Exception as e:
|
||||
return {"error": str(e)}
|
||||
|
||||
for c in colls:
|
||||
name = c["name"]
|
||||
if "compliance" not in name and "atomic_controls" not in name:
|
||||
continue
|
||||
payload = {"limit": 500, "with_payload": True, "with_vector": False}
|
||||
req = urllib_request.Request(
|
||||
f"{base}/collections/{name}/points/scroll",
|
||||
data=json.dumps(payload).encode(),
|
||||
headers={"Content-Type": "application/json"},
|
||||
)
|
||||
try:
|
||||
with urllib_request.urlopen(req, timeout=15) as r:
|
||||
points = json.loads(r.read()).get("result", {}).get("points", [])
|
||||
except Exception as e:
|
||||
out["collections"][name] = {"error": str(e)}
|
||||
continue
|
||||
sampled = len(points)
|
||||
both_set = 0
|
||||
only_license = 0
|
||||
only_rule = 0
|
||||
neither = 0
|
||||
for p in points:
|
||||
pl = p.get("payload", {}) or {}
|
||||
has_lic = bool(pl.get("license"))
|
||||
has_rule = pl.get("license_rule") is not None
|
||||
if has_lic and has_rule:
|
||||
both_set += 1
|
||||
elif has_lic:
|
||||
only_license += 1
|
||||
elif has_rule:
|
||||
only_rule += 1
|
||||
else:
|
||||
neither += 1
|
||||
out["collections"][name] = {
|
||||
"sampled": sampled,
|
||||
"both_set": both_set,
|
||||
"only_license_field": only_license,
|
||||
"only_license_rule_field": only_rule,
|
||||
"neither_set": neither,
|
||||
"neither_pct": round(neither / sampled * 100, 1) if sampled else 0,
|
||||
}
|
||||
return out
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
try:
|
||||
import psycopg2
|
||||
except ImportError:
|
||||
print("error: psycopg2 not installed (pip install psycopg2-binary)", file=sys.stderr)
|
||||
return 2
|
||||
|
||||
conn = psycopg2.connect(
|
||||
host=args.db_host,
|
||||
port=args.db_port,
|
||||
user=args.db_user,
|
||||
dbname=args.db_name,
|
||||
password=args.db_password or None,
|
||||
)
|
||||
try:
|
||||
registry = audit_registry(conn)
|
||||
controls = audit_atomic_controls(conn)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
qdrant: Optional[dict] = None
|
||||
if args.check_qdrant:
|
||||
qdrant = audit_qdrant(args.qdrant_host, args.qdrant_port)
|
||||
|
||||
result = {"registry": registry, "atomic_controls": controls, "qdrant": qdrant}
|
||||
|
||||
if args.json:
|
||||
print(json.dumps(result, indent=2, default=str))
|
||||
return 0
|
||||
|
||||
print("=" * 60)
|
||||
print(" Audit — License Classification")
|
||||
print("=" * 60)
|
||||
print()
|
||||
print(f"## regulation_registry ({registry['total']} rows)")
|
||||
print(f" By rule: {registry['by_rule']}")
|
||||
print(f" Missing license_type: {registry['missing_license_type']}")
|
||||
print()
|
||||
print("## atomic_controls")
|
||||
for tbl, info in controls.get("tables", {}).items():
|
||||
if info.get("skipped"):
|
||||
print(f" {tbl}: SKIPPED ({info['reason']})")
|
||||
continue
|
||||
print(f" {tbl}: {info['total']} rows")
|
||||
print(f" by_rule={info['by_rule']}")
|
||||
print(f" missing_license_rule={info['missing_license_rule']}")
|
||||
print()
|
||||
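    if qdrant:
        print("## qdrant")
        # audit_qdrant returns {"error": ...} when the collection list cannot be fetched.
        if "error" in qdrant:
            print(f"  ERROR {qdrant['error']}")
        for name, info in qdrant.get("collections", {}).items():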
|
||||
if "error" in info:
|
||||
print(f" {name}: ERROR {info['error']}")
|
||||
continue
|
||||
print(
|
||||
f" {name:30} sampled={info['sampled']:4} "
|
||||
f"both={info['both_set']:4} "
|
||||
f"neither={info['neither_set']:4} ({info['neither_pct']}%)"
|
||||
)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
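The per-collection tally in `audit_qdrant` can be read as a four-way classification of each payload. A minimal sketch of that classification as a standalone, testable function follows. The function names are hypothetical, and the sample payload values in the usage note are made up for illustration.

```python
from collections import Counter

BUCKETS = ("both_set", "only_license_field", "only_license_rule_field", "neither_set")


def payload_bucket(payload):
    """Classify one point payload by which license fields are present."""
    payload = payload or {}
    has_license = bool(payload.get("license"))
    # `is not None` keeps a rule of 0 from being counted as missing.
    has_rule = payload.get("license_rule") is not None
    if has_license and has_rule:
        return "both_set"
    if has_license:
        return "only_license_field"
    if has_rule:
        return "only_license_rule_field"
    return "neither_set"


def summarise(points):
    """Tally buckets over sampled points and add the neither-percentage."""
    counts = Counter(payload_bucket(p.get("payload")) for p in points)
    sampled = len(points)
    summary = {key: counts.get(key, 0) for key in BUCKETS}
    summary["sampled"] = sampled
    summary["neither_pct"] = round(summary["neither_set"] / sampled * 100, 1) if sampled else 0
    return summary
```

Calling `summarise([{"payload": {"license": "CC-BY", "license_rule": 1}}, {"payload": None}])` counts one point in `both_set` and one in `neither_set`. Keeping the classification separate from the HTTP calls makes the bucket rules easy to test without a running Qdrant instance.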
@@ -0,0 +1,184 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Backfill license_rule on canonical_controls by inheriting from parent.
|
||||
|
||||
Background
|
||||
==========
|
||||
|
||||
Audit (audit_license_classification.py) showed that 279,384 of 314,811 rows
|
||||
in compliance.canonical_controls have NULL license_rule. Drilling in:
|
||||
|
||||
- 261,980 of those (94%) have a parent_control_uuid whose parent already
|
||||
carries a non-NULL license_rule. The pass0b decomposition pipeline did
|
||||
not propagate the rule to its child controls — this is a clear inheritance
|
||||
bug, fixable without any classification decisions.
|
||||
- 16,617 have a parent that itself has no license_rule (transitive case).
|
||||
Inheriting iteratively converges to either rule-set or root-orphan.
|
||||
- 787 have no parent at all (decomposition roots). These need cluster-based
|
||||
manual classification (see Strategy Notes at the bottom of this file).
|
||||
|
||||
This script runs the inheritance fix in three idempotent stages and
|
||||
prints per-stage counts before any write happens.
|
||||
|
||||
Usage::
|
||||
|
||||
# Always dry-run first:
|
||||
python3 scripts/backfill_license_rule.py --db-host 100.80.114.48 \\
|
||||
--db-password breakpilot123 --dry-run
|
||||
|
||||
# If counts look right:
|
||||
python3 scripts/backfill_license_rule.py --db-host 100.80.114.48 \\
|
||||
--db-password breakpilot123 --apply
|
||||
|
||||
The script is safe to rerun — it only touches rows where license_rule
|
||||
IS NULL.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--db-host", default="100.80.114.48")
|
||||
p.add_argument("--db-port", type=int, default=5432)
|
||||
p.add_argument("--db-user", default="breakpilot")
|
||||
p.add_argument("--db-name", default="breakpilot_db")
|
||||
p.add_argument("--db-password", required=True)
|
||||
g = p.add_mutually_exclusive_group(required=True)
|
||||
g.add_argument("--dry-run", action="store_true")
|
||||
g.add_argument("--apply", action="store_true")
|
||||
p.add_argument("--max-iterations", type=int, default=5,
|
||||
help="Cap on inheritance iterations to avoid loops")
|
||||
return p.parse_args()
|
||||
|
||||
|
||||
# Stage 1: direct parent has license_rule (single UPDATE).
# Stage 2: iterative; the parent did not have it, but a grandparent does.
#          We loop until no more rows can be filled or max-iterations is hit.
# Stage 3: residual rows with no resolvable parent. Report them clustered
#          by category/pattern_id/generation_strategy so the user can
#          classify by family.
|
||||
SQL_REPORT_NULLS = """
SET search_path TO compliance, public;
SELECT
  CASE WHEN cc.parent_control_uuid IS NULL THEN 'no_parent'
       WHEN p.license_rule IS NULL THEN 'parent_null'
       ELSE 'parent_set' END AS bucket,
  COUNT(*) AS n
FROM canonical_controls cc
LEFT JOIN canonical_controls p ON cc.parent_control_uuid = p.id
WHERE cc.license_rule IS NULL
GROUP BY 1 ORDER BY 2 DESC;
"""

SQL_INHERIT_FROM_PARENT = """
SET search_path TO compliance, public;
UPDATE canonical_controls cc
SET license_rule = p.license_rule, updated_at = NOW()
FROM canonical_controls p
WHERE cc.parent_control_uuid = p.id
  AND cc.license_rule IS NULL
  AND p.license_rule IS NOT NULL;
"""

SQL_REPORT_ORPHAN_CLUSTERS = """
SET search_path TO compliance, public;
SELECT
  COALESCE(category, '(null)') AS category,
  COALESCE(pattern_id, '(null)') AS pattern_id,
  COALESCE(generation_strategy, '(null)') AS gen,
  COUNT(*) AS n
FROM canonical_controls
WHERE license_rule IS NULL AND parent_control_uuid IS NULL
GROUP BY 1, 2, 3 ORDER BY n DESC LIMIT 25;
"""
|
||||
|
||||
def print_bucket(rows, label: str) -> None:
|
||||
print(f"\n## {label}")
|
||||
total = 0
|
||||
for bucket, n in rows:
|
||||
print(f" {bucket:12} {n:>8}")
|
||||
total += n
|
||||
print(f" {'TOTAL':12} {total:>8}")
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
try:
|
||||
import psycopg2
|
||||
except ImportError:
|
||||
print("error: psycopg2 not installed", file=sys.stderr)
|
||||
return 2
|
||||
|
||||
conn = psycopg2.connect(
|
||||
host=args.db_host, port=args.db_port, user=args.db_user,
|
||||
dbname=args.db_name, password=args.db_password,
|
||||
)
|
||||
conn.autocommit = False
|
||||
cur = conn.cursor()
|
||||
|
||||
print("=" * 60)
|
||||
print(" Backfill — license_rule via parent inheritance")
|
||||
print(f" Mode: {'DRY-RUN' if args.dry_run else 'APPLY'}")
|
||||
print("=" * 60)
|
||||
|
||||
# Initial bucket report
|
||||
cur.execute(SQL_REPORT_NULLS)
|
||||
rows = cur.fetchall()
|
||||
print_bucket(rows, "Initial NULL distribution")
|
||||
|
||||
if args.dry_run:
|
||||
# Print what the FIRST inherit pass would resolve (without writing)
|
||||
cur.execute(
|
||||
"SET search_path TO compliance, public; "
|
||||
"SELECT p.license_rule, COUNT(*) "
|
||||
"FROM canonical_controls cc "
|
||||
"JOIN canonical_controls p ON cc.parent_control_uuid = p.id "
|
||||
"WHERE cc.license_rule IS NULL AND p.license_rule IS NOT NULL "
|
||||
"GROUP BY 1 ORDER BY 1;"
|
||||
)
|
||||
print("\n## First inherit-pass would fill:")
|
||||
for rule, n in cur.fetchall():
|
||||
print(f" rule={rule} {n:>8} rows")
|
||||
|
||||
# Show orphan clusters that would remain
|
||||
cur.execute(SQL_REPORT_ORPHAN_CLUSTERS)
|
||||
print("\n## Orphan clusters (no parent + no rule, top 25):")
|
||||
for cat, pid, gen, n in cur.fetchall():
|
||||
print(f" cat={cat[:20]:20} pat={pid[:20]:20} gen={gen[:20]:20} n={n}")
|
||||
print("\nNo writes performed. Use --apply to execute.")
|
||||
conn.rollback()
|
||||
return 0
|
||||
|
||||
# Apply mode — iterative inheritance
|
||||
total_updated = 0
|
||||
for i in range(1, args.max_iterations + 1):
|
||||
cur.execute(SQL_INHERIT_FROM_PARENT)
|
||||
updated = cur.rowcount
|
||||
total_updated += updated
|
||||
print(f"\n iteration {i}: {updated} rows updated")
|
||||
if updated == 0:
|
||||
break
|
||||
|
||||
conn.commit()
|
||||
print(f"\n✓ Total rows backfilled: {total_updated}")
|
||||
|
||||
# Final bucket report
|
||||
cur.execute(SQL_REPORT_NULLS)
|
||||
print_bucket(cur.fetchall(), "Remaining NULL distribution")
|
||||
|
||||
cur.execute(SQL_REPORT_ORPHAN_CLUSTERS)
|
||||
rows = cur.fetchall()
|
||||
if rows:
|
||||
print("\n## Orphan clusters still need classification:")
|
||||
for cat, pid, gen, n in rows:
|
||||
print(f" cat={cat[:20]:20} pat={pid[:20]:20} gen={gen[:20]:20} n={n}")
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
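The iterative inheritance loop in backfill_license_rule.py above can be illustrated without a database. The following is a minimal in-memory sketch, not the script's implementation: `inherit_rules`, `parent_of` and `rule_of` are hypothetical names, and each pass reads from a snapshot on the assumption that a single `UPDATE ... FROM` statement only sees rows as they were when the statement started, so a rule moves down one level per pass.

```python
from __future__ import annotations


def inherit_rules(
    parent_of: dict[str, str | None],
    rule_of: dict[str, int | None],
    max_iterations: int = 5,
) -> tuple[dict[str, int | None], list[int]]:
    """Copy each parent's rule onto children that lack one, pass by pass."""
    rules = dict(rule_of)  # work on a copy; the caller's dict stays untouched
    updated_per_pass: list[int] = []
    for _ in range(max_iterations):
        # Read from a snapshot so one pass moves rules down exactly one level.
        snapshot = dict(rules)
        updated = 0
        for child, parent in parent_of.items():
            if snapshot.get(child) is not None or parent is None:
                continue  # already classified, or a decomposition root
            parent_rule = snapshot.get(parent)
            if parent_rule is not None:
                rules[child] = parent_rule
                updated += 1
        updated_per_pass.append(updated)
        if updated == 0:
            break  # fixed point reached: nothing left to inherit
    return rules, updated_per_pass


# Hypothetical data: a three-level chain plus one parentless orphan.
parent_of = {"root": None, "child": "root", "grandchild": "child", "orphan": None}
rule_of = {"root": 2, "child": None, "grandchild": None, "orphan": None}
rules, passes = inherit_rules(parent_of, rule_of)
print(rules)
print(passes)
```

The grandchild is only filled on the second pass, which is why the script loops instead of issuing one UPDATE, and the orphan stays NULL because it has no parent to inherit from.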
@@ -0,0 +1,203 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Backfill ``license_rule`` payload field into Qdrant atomic_controls_dedup
|
||||
and related compliance collections, sourced from canonical_controls in Postgres.
|
||||
|
||||
The audit (audit_license_classification.py) surfaced that Qdrant collections
|
||||
holding canonical-control vectors (notably ``atomic_controls_dedup``) carry no
|
||||
license_rule payload at all, even though the underlying Postgres table is now
|
||||
fully classified. This script joins the two via ``control_uuid`` and patches the
|
||||
Qdrant payload in batches.
|
||||
|
||||
Usage::
|
||||
|
||||
python3 scripts/backfill_qdrant_license_payload.py \\
|
||||
--pg-host 100.80.114.48 --pg-password breakpilot123 \\
|
||||
--qdrant http://100.80.114.48:6333 \\
|
||||
--collection atomic_controls_dedup \\
|
||||
--dry-run
|
||||
|
||||
# apply
|
||||
python3 scripts/backfill_qdrant_license_payload.py ... --apply
|
||||
|
||||
Notes
|
||||
-----
|
||||
- ``control_uuid`` lives in the payload of atomic_controls_dedup. For other
|
||||
collections that key the canonical control by a different field, override with
|
||||
``--uuid-field``.
|
||||
- Qdrant ``set_payload`` is keyed by point id, not payload field. We resolve
  UUID → point id by a paginated scroll-and-filter pass, then issue
  set_payload requests grouped by license_rule and split into
  ``--batch-size`` chunks.
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from typing import Iterator
|
||||
from urllib import request as urllib_request
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--pg-host", default="100.80.114.48")
|
||||
p.add_argument("--pg-port", type=int, default=5432)
|
||||
p.add_argument("--pg-user", default="breakpilot")
|
||||
p.add_argument("--pg-name", default="breakpilot_db")
|
||||
p.add_argument("--pg-password", required=True)
|
||||
p.add_argument("--qdrant", default="http://100.80.114.48:6333")
|
||||
p.add_argument("--qdrant-api-key", default="",
|
||||
help="API key for managed Qdrant (Production)")
|
||||
p.add_argument("--collection", default="atomic_controls_dedup")
|
||||
p.add_argument("--uuid-field", default="control_uuid",
|
||||
help="Payload field used for lookup (control_uuid or regulation_id)")
|
||||
p.add_argument("--lookup", choices=["canonical_controls", "regulation_registry"],
|
||||
default="canonical_controls",
|
||||
help="Postgres table to resolve the lookup against")
|
||||
p.add_argument("--batch-size", type=int, default=500)
|
||||
g = p.add_mutually_exclusive_group(required=True)
|
||||
g.add_argument("--dry-run", action="store_true")
|
||||
g.add_argument("--apply", action="store_true")
|
||||
return p.parse_args()
|
||||
|
||||
|
||||
def fetch_rule_by_uuid(args) -> dict[str, int]:
|
||||
"""Pull lookup-key → license_rule mapping from Postgres.
|
||||
|
||||
Source table is chosen by ``--lookup``:
|
||||
- canonical_controls: id (UUID) → license_rule, for atomic_controls_dedup
|
||||
- regulation_registry: regulation_id → license_rule, for document chunks
|
||||
"""
|
||||
import psycopg2
|
||||
conn = psycopg2.connect(
|
||||
host=args.pg_host, port=args.pg_port, user=args.pg_user,
|
||||
dbname=args.pg_name, password=args.pg_password,
|
||||
)
|
||||
cur = conn.cursor()
|
||||
cur.execute("SET search_path TO compliance, public;")
|
||||
if args.lookup == "regulation_registry":
|
||||
cur.execute(
|
||||
"SELECT regulation_id, license_rule FROM regulation_registry "
|
||||
"WHERE license_rule IS NOT NULL"
|
||||
)
|
||||
else:
|
||||
cur.execute(
|
||||
"SELECT id::text, license_rule FROM canonical_controls "
|
||||
"WHERE license_rule IS NOT NULL"
|
||||
)
|
||||
mapping = {row[0]: int(row[1]) for row in cur.fetchall()}
|
||||
conn.close()
|
||||
return mapping
|
||||
|
||||
|
||||
def _headers(api_key: str = "") -> dict:
|
||||
h = {"Content-Type": "application/json"}
|
||||
if api_key:
|
||||
h["api-key"] = api_key
|
||||
return h
|
||||
|
||||
|
||||
def scroll_collection(qdrant: str, collection: str, uuid_field: str, api_key: str = "") -> Iterator[dict]:
    """Yield one dict per point with the keys ``id``, ``uuid`` and ``has_rule``."""
|
||||
next_offset = None
|
||||
while True:
|
||||
body = {"limit": 1000, "with_payload": True, "with_vector": False}
|
||||
if next_offset is not None:
|
||||
body["offset"] = next_offset
|
||||
req = urllib_request.Request(
|
||||
f"{qdrant}/collections/{collection}/points/scroll",
|
||||
data=json.dumps(body).encode(),
|
||||
headers=_headers(api_key),
|
||||
)
|
||||
with urllib_request.urlopen(req, timeout=60) as r:
|
||||
payload = json.loads(r.read())
|
||||
result = payload.get("result", {})
|
||||
for pt in result.get("points", []):
|
||||
pl = pt.get("payload", {}) or {}
|
||||
yield {
|
||||
"id": pt["id"],
|
||||
"uuid": pl.get(uuid_field),
|
||||
"has_rule": "license_rule" in pl,
|
||||
}
|
||||
next_offset = result.get("next_page_offset")
|
||||
if next_offset is None:
|
||||
break
|
||||
|
||||
|
||||
def set_payload_batch(qdrant: str, collection: str, point_ids: list, rule: int, api_key: str = "") -> int:
|
||||
"""POST set_payload for a batch of point IDs with a single license_rule."""
|
||||
body = {
|
||||
"payload": {"license_rule": rule},
|
||||
"points": point_ids,
|
||||
}
|
||||
req = urllib_request.Request(
|
||||
f"{qdrant}/collections/{collection}/points/payload?wait=true",
|
||||
data=json.dumps(body).encode(),
|
||||
headers=_headers(api_key),
|
||||
method="POST",
|
||||
)
|
||||
with urllib_request.urlopen(req, timeout=120) as r:
|
||||
resp = json.loads(r.read())
|
||||
if resp.get("status") != "ok":
|
||||
raise RuntimeError(f"set_payload failed: {resp}")
|
||||
return len(point_ids)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
print(f"Loading {args.lookup} → license_rule mapping…")
|
||||
rule_by_uuid = fetch_rule_by_uuid(args)
|
||||
print(f" Postgres returned {len(rule_by_uuid)} classified controls")
|
||||
|
||||
print(f"Scrolling Qdrant collection {args.collection!r}…")
|
||||
by_rule: dict[int, list] = {1: [], 2: [], 3: []}
|
||||
points_total = 0
|
||||
points_with_uuid = 0
|
||||
points_already_set = 0
|
||||
points_no_match = 0
|
||||
|
||||
for pt in scroll_collection(args.qdrant, args.collection, args.uuid_field, args.qdrant_api_key):
|
||||
points_total += 1
|
||||
uuid = pt["uuid"]
|
||||
if not uuid:
|
||||
continue
|
||||
points_with_uuid += 1
|
||||
if pt["has_rule"]:
|
||||
points_already_set += 1
|
||||
continue
|
||||
rule = rule_by_uuid.get(uuid)
|
||||
if rule is None:
|
||||
points_no_match += 1
|
||||
continue
|
||||
if rule not in by_rule:
|
||||
continue
|
||||
by_rule[rule].append(pt["id"])
|
||||
|
||||
print(f" total points scanned: {points_total}")
|
||||
print(f" with {args.uuid_field}: {points_with_uuid}")
|
||||
print(f" already had license_rule: {points_already_set}")
|
||||
print(f" uuid not found in Postgres: {points_no_match}")
|
||||
print(f" to set per rule: rule1={len(by_rule[1])} rule2={len(by_rule[2])} rule3={len(by_rule[3])}")
|
||||
|
||||
if args.dry_run:
|
||||
print("\nDRY-RUN: no writes performed. Use --apply to execute.")
|
||||
return 0
|
||||
|
||||
total_written = 0
|
||||
for rule, ids in by_rule.items():
|
||||
if not ids:
|
||||
continue
|
||||
print(f"\nWriting license_rule={rule} to {len(ids)} points (batch {args.batch_size})…")
|
||||
for i in range(0, len(ids), args.batch_size):
|
||||
chunk = ids[i:i + args.batch_size]
|
||||
n = set_payload_batch(args.qdrant, args.collection, chunk, rule, args.qdrant_api_key)
|
||||
total_written += n
|
||||
print(f" batch {i // args.batch_size + 1}: {n} points (cumulative {total_written})")
|
||||
time.sleep(0.05)
|
||||
print(f"\nWrote license_rule on {total_written} Qdrant points in {args.collection}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
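The grouping and batching step of backfill_qdrant_license_payload.py above can be separated from the HTTP calls. The sketch below is an illustration under assumptions, not the script's code: `plan_batches` is a hypothetical helper, the point dicts mimic what `scroll_collection` yields, and nothing is sent to Qdrant.

```python
from __future__ import annotations

from typing import Iterator


def plan_batches(
    points: list[dict],
    rule_by_uuid: dict[str, int],
    batch_size: int = 500,
) -> Iterator[tuple[int, list]]:
    """Yield (license_rule, point_ids) batches for points still lacking a rule."""
    by_rule: dict[int, list] = {}
    for pt in points:
        if pt.get("has_rule") or not pt.get("uuid"):
            continue  # already patched, or no lookup key in the payload
        rule = rule_by_uuid.get(pt["uuid"])
        if rule is None:
            continue  # no classified row in Postgres for this key
        by_rule.setdefault(rule, []).append(pt["id"])
    for rule in sorted(by_rule):
        ids = by_rule[rule]
        # One set_payload request would be issued per slice.
        for start in range(0, len(ids), batch_size):
            yield rule, ids[start:start + batch_size]


# Hypothetical scroll output and Postgres mapping.
points = [
    {"id": 1, "uuid": "a", "has_rule": False},
    {"id": 2, "uuid": "b", "has_rule": False},
    {"id": 3, "uuid": "c", "has_rule": True},
    {"id": 4, "uuid": "d", "has_rule": False},
    {"id": 5, "uuid": "a", "has_rule": False},
    {"id": 6, "uuid": None, "has_rule": False},
]
rule_by_uuid = {"a": 1, "b": 3, "c": 1}
batches = list(plan_batches(points, rule_by_uuid, batch_size=1))
print(batches)
```

Grouping by rule first means every request carries a single payload value, so the number of requests is driven by the batch size rather than by the number of distinct controls.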
@@ -0,0 +1,498 @@
|
||||
#!/usr/bin/env python3
|
||||
"""D6 Citation Backfill — update ~291k controls with section metadata from Qdrant chunks.
|
||||
|
||||
Archives old source_citation in generation_metadata.old_citation.
|
||||
Updates source_citation.article, .paragraph, .page from matched Qdrant chunks.
|
||||
|
||||
3-tier matching:
|
||||
Tier 1: sha256(source_original_text) → exact chunk text match
|
||||
Tier 2: Parse [section] prefix from source_original_text
|
||||
Tier 3: Best text overlap within same regulation_id
|
||||
|
||||
Usage:
|
||||
python3 control-pipeline/scripts/d6_citation_backfill.py --dry-run --limit 100
|
||||
python3 control-pipeline/scripts/d6_citation_backfill.py --batch-size 1000
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
import httpx
|
||||
import psycopg2
|
||||
import psycopg2.extras
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
||||
)
|
||||
logger = logging.getLogger("d6-backfill")
|
||||
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
|
||||
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
|
||||
|
||||
COLLECTIONS = [
|
||||
"bp_compliance_ce",
|
||||
"bp_compliance_gesetze",
|
||||
"bp_compliance_datenschutz",
|
||||
"bp_dsfa_corpus",
|
||||
"bp_legal_templates",
|
||||
]
|
||||
|
||||
# Parse [§ 312k Title] or [AC-1 POLICY] prefix from chunk text
|
||||
_SECTION_PREFIX_RE = re.compile(r'^\[([^\]]+)\]\s*')
|
||||
|
||||
|
||||
@dataclass
|
||||
class ChunkMeta:
|
||||
section: str
|
||||
section_title: str
|
||||
paragraph: str
|
||||
page: Optional[int]
|
||||
regulation_id: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class Stats:
|
||||
total: int = 0
|
||||
already_correct: int = 0
|
||||
matched_hash: int = 0
|
||||
matched_prefix: int = 0
|
||||
matched_overlap: int = 0
|
||||
unmatched: int = 0
|
||||
updated: int = 0
|
||||
errors: int = 0
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 1: Build Qdrant index
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def build_qdrant_index(qdrant_url: str) -> tuple[dict, dict]:
|
||||
"""Build hash index and regulation index from all Qdrant collections.
|
||||
|
||||
Returns:
|
||||
hash_index: {sha256(chunk_text) → ChunkMeta}
|
||||
reg_index: {regulation_id → [ChunkMeta with text snippets]}
|
||||
"""
|
||||
hash_index: dict[str, ChunkMeta] = {}
|
||||
reg_index: dict[str, list[tuple[str, ChunkMeta]]] = {}
|
||||
total_chunks = 0
|
||||
|
||||
for coll in COLLECTIONS:
|
||||
offset = None
|
||||
coll_count = 0
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
while True:
|
||||
body = {
|
||||
"limit": 250,
|
||||
"with_payload": [
|
||||
"chunk_text", "section", "section_title",
|
||||
"paragraph", "page", "regulation_id",
|
||||
],
|
||||
"with_vector": False,
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{coll}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
|
||||
for pt in data["points"]:
|
||||
p = pt.get("payload", {})
|
||||
chunk_text = p.get("chunk_text", "")
|
||||
if not chunk_text or len(chunk_text.strip()) < 30:
|
||||
continue
|
||||
|
||||
meta = ChunkMeta(
|
||||
section=p.get("section", "") or "",
|
||||
section_title=p.get("section_title", "") or "",
|
||||
paragraph=p.get("paragraph", "") or "",
|
||||
page=p.get("page"),
|
||||
regulation_id=p.get("regulation_id", "") or "",
|
||||
)
|
||||
|
||||
# Hash index
|
||||
h = hashlib.sha256(chunk_text.encode()).hexdigest()
|
||||
if meta.section: # only index chunks WITH section data
|
||||
hash_index[h] = meta
|
||||
|
||||
# Regulation index (for text overlap matching)
|
||||
if meta.regulation_id and meta.section:
|
||||
reg_index.setdefault(meta.regulation_id, []).append(
|
||||
(chunk_text[:500], meta)
|
||||
)
|
||||
|
||||
coll_count += 1
|
||||
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
|
||||
total_chunks += coll_count
|
||||
logger.info(" [%s] %d chunks indexed", coll, coll_count)
|
||||
|
||||
logger.info("Qdrant index: %d total chunks, %d with section (hash), %d regulations",
|
||||
total_chunks, len(hash_index), len(reg_index))
|
||||
return hash_index, reg_index
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 2: Load controls
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def load_controls(db_url: str, limit: int = 0) -> list[dict]:
    """Load controls with license_rule 1 or 2 and a non-NULL source_citation."""
|
||||
conn = psycopg2.connect(db_url)
|
||||
conn.set_session(autocommit=False)
|
||||
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
|
||||
|
||||
cur.execute("SET search_path TO compliance, core, public")
|
||||
|
||||
query = """
|
||||
SELECT id, control_id, source_citation, source_original_text,
|
||||
generation_metadata, license_rule
|
||||
FROM canonical_controls
|
||||
WHERE license_rule IN (1, 2)
|
||||
AND source_citation IS NOT NULL
|
||||
ORDER BY control_id
|
||||
"""
|
||||
if limit > 0:
|
||||
query += f" LIMIT {limit}"
|
||||
|
||||
cur.execute(query)
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
|
||||
controls = []
|
||||
for row in rows:
|
||||
ctrl = dict(row)
|
||||
ctrl["id"] = str(ctrl["id"])
|
||||
for jf in ("source_citation", "generation_metadata"):
|
||||
val = ctrl.get(jf)
|
||||
if isinstance(val, str):
|
||||
try:
|
||||
ctrl[jf] = json.loads(val)
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
ctrl[jf] = {}
|
||||
elif val is None:
|
||||
ctrl[jf] = {}
|
||||
controls.append(ctrl)
|
||||
|
||||
return controls
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 3: Matching
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def match_control(
|
||||
ctrl: dict,
|
||||
hash_index: dict[str, ChunkMeta],
|
||||
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
|
||||
) -> tuple[Optional[ChunkMeta], str]:
|
||||
"""Match a control to a Qdrant chunk. Returns (meta, method) or (None, '')."""
|
||||
source_text = ctrl.get("source_original_text", "") or ""
|
||||
|
||||
# Tier 1: Hash match
|
||||
if source_text:
|
||||
h = hashlib.sha256(source_text.encode()).hexdigest()
|
||||
meta = hash_index.get(h)
|
||||
if meta and meta.section:
|
||||
return meta, "hash"
|
||||
|
||||
# Tier 2: Parse [section] prefix from source_original_text
|
||||
if source_text:
|
||||
m = _SECTION_PREFIX_RE.match(source_text)
|
||||
if m:
|
||||
prefix = m.group(1).strip()
|
||||
parsed = _parse_section_from_prefix(prefix)
|
||||
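# A title-only prefix has no section and would be rejected by the caller,
# so fall through to Tier 3 instead of returning it.
if parsed and parsed.section: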
|
||||
return parsed, "prefix"
|
||||
|
||||
# Tier 3: Text overlap within same regulation
|
||||
gen_meta = ctrl.get("generation_metadata") or {}
|
||||
reg_id = gen_meta.get("source_regulation", "")
|
||||
if reg_id and source_text and reg_id in reg_index:
|
||||
best = _find_best_overlap(source_text, reg_index[reg_id])
|
||||
if best:
|
||||
return best, "overlap"
|
||||
|
||||
return None, ""
|
||||
|
||||
|
||||
def _parse_section_from_prefix(prefix: str) -> Optional[ChunkMeta]:
|
||||
"""Parse a section prefix like '§ 312k Kuendigungsbutton' or 'AC-1 POLICY'."""
|
||||
if not prefix:
|
||||
return None
|
||||
|
||||
# § pattern
|
||||
m = re.match(r'(§\s*\d+[a-z]*)\s*(.*)', prefix)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# Art./Artikel pattern
|
||||
m = re.match(r'(Art(?:ikel|\.)\s*\d+)\s*(.*)', prefix, re.IGNORECASE)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# NIST control pattern (AC-1, AU-2, etc.)
|
||||
m = re.match(r'([A-Z]{2,4}-\d+(?:\(\d+\))?)\s*(.*)', prefix)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# Numbered section (3.1 Title)
|
||||
m = re.match(r'(\d+(?:\.\d+)+)\s*(.*)', prefix)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# ALL-CAPS heading (fallback — use as section_title)
|
||||
if prefix == prefix.upper() and len(prefix) > 3:
|
||||
return ChunkMeta(
|
||||
section="", section_title=prefix,
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def _find_best_overlap(source_text: str, chunks: list[tuple[str, ChunkMeta]]) -> Optional[ChunkMeta]:
|
||||
"""Find chunk with best text overlap (simple word-set Jaccard)."""
|
||||
source_words = set(source_text.lower().split())
|
||||
if len(source_words) < 5:
|
||||
return None
|
||||
|
||||
best_score = 0.0
|
||||
best_meta = None
|
||||
|
||||
for chunk_text, meta in chunks:
|
||||
chunk_words = set(chunk_text.lower().split())
|
||||
if not chunk_words:
|
||||
continue
|
||||
intersection = len(source_words & chunk_words)
|
||||
union = len(source_words | chunk_words)
|
||||
jaccard = intersection / union if union > 0 else 0
|
||||
if jaccard > best_score and jaccard > 0.3: # 30% threshold
|
||||
best_score = jaccard
|
||||
best_meta = meta
|
||||
|
||||
return best_meta
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 4: Update controls
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def update_controls(
|
||||
db_url: str,
|
||||
controls: list[dict],
|
||||
hash_index: dict[str, ChunkMeta],
|
||||
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
|
||||
dry_run: bool = True,
|
||||
batch_size: int = 1000,
|
||||
) -> Stats:
|
||||
"""Match and update all controls."""
|
||||
stats = Stats(total=len(controls))
|
||||
|
||||
conn = psycopg2.connect(db_url)
|
||||
conn.set_session(autocommit=False)
|
||||
cur = conn.cursor()
|
||||
cur.execute("SET search_path TO compliance, core, public")
|
||||
|
||||
updates = []
|
||||
|
||||
for i, ctrl in enumerate(controls):
|
||||
if i > 0 and i % 5000 == 0:
|
||||
logger.info("Progress: %d/%d (hash=%d prefix=%d overlap=%d unmatched=%d)",
|
||||
i, stats.total, stats.matched_hash, stats.matched_prefix,
|
||||
stats.matched_overlap, stats.unmatched)
|
||||
|
||||
citation = ctrl.get("source_citation") or {}
|
||||
old_article = citation.get("article", "")
|
||||
gen_meta = ctrl.get("generation_metadata") or {}
|
||||
|
||||
# Match
|
||||
meta, method = match_control(ctrl, hash_index, reg_index)
|
||||
|
||||
if not meta or not meta.section:
|
||||
# No match: leave any existing article untouched and count it as already correct
|
||||
if old_article:
|
||||
stats.already_correct += 1
|
||||
else:
|
||||
stats.unmatched += 1
|
||||
continue
|
||||
|
||||
# Check if update is needed
|
||||
if old_article == meta.section:
|
||||
stats.already_correct += 1
|
||||
continue
|
||||
|
||||
# Track method
|
||||
if method == "hash":
|
||||
stats.matched_hash += 1
|
||||
elif method == "prefix":
|
||||
stats.matched_prefix += 1
|
||||
elif method == "overlap":
|
||||
stats.matched_overlap += 1
|
||||
|
||||
# Archive old citation
|
||||
if old_article or citation.get("paragraph"):
|
||||
gen_meta["old_citation"] = {
|
||||
"article": old_article,
|
||||
"paragraph": citation.get("paragraph", ""),
|
||||
"page": citation.get("page"),
|
||||
"archived_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
|
||||
}
|
||||
|
||||
# Update citation
|
||||
citation["article"] = meta.section
|
||||
if meta.paragraph:
|
||||
citation["paragraph"] = meta.paragraph
|
||||
if meta.page is not None:
|
||||
citation["page"] = meta.page
|
||||
|
||||
# Update generation_metadata
|
||||
gen_meta["source_article"] = meta.section
|
||||
if meta.paragraph:
|
||||
gen_meta["source_paragraph"] = meta.paragraph
|
||||
if meta.page is not None:
|
||||
gen_meta["source_page"] = meta.page
|
||||
gen_meta["backfill_method"] = method
|
||||
gen_meta["backfill_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
|
||||
updates.append((
|
||||
json.dumps(citation, ensure_ascii=False),
|
||||
json.dumps(gen_meta, ensure_ascii=False, default=str),
|
||||
ctrl["id"],
|
||||
))
|
||||
|
||||
# Batch commit
|
||||
if len(updates) >= batch_size and not dry_run:
|
||||
_execute_batch(cur, updates)
|
||||
conn.commit()
|
||||
stats.updated += len(updates)
|
||||
logger.info("Committed batch: %d updates (total %d)", len(updates), stats.updated)
|
||||
updates = []
|
||||
|
||||
# Final batch
|
||||
if updates and not dry_run:
|
||||
_execute_batch(cur, updates)
|
||||
conn.commit()
|
||||
stats.updated += len(updates)
|
||||
logger.info("Committed final batch: %d updates (total %d)", len(updates), stats.updated)
|
||||
elif updates and dry_run:
|
||||
stats.updated = len(updates) # would-be updates
|
||||
|
||||
conn.close()
|
||||
return stats
|
||||
|
||||
|
||||
def _execute_batch(cur, updates: list[tuple]):
|
||||
"""Execute batch UPDATE statements."""
|
||||
for citation_json, meta_json, ctrl_id in updates:
|
||||
cur.execute(
|
||||
"""UPDATE canonical_controls
|
||||
SET source_citation = %s::jsonb,
|
||||
generation_metadata = %s::jsonb,
|
||||
updated_at = NOW()
|
||||
WHERE id = %s::uuid""",
|
||||
(citation_json, meta_json, ctrl_id),
|
||||
)
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Main
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="D6 Citation Backfill")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Don't write to DB")
|
||||
parser.add_argument("--limit", type=int, default=0, help="Limit controls (0=all)")
|
||||
parser.add_argument("--batch-size", type=int, default=1000)
|
||||
parser.add_argument("--db-url", default=DB_URL)
|
||||
parser.add_argument("--qdrant-url", default=QDRANT_URL)
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("D6 Citation Backfill")
|
||||
logger.info(" DB: %s", args.db_url.split("@")[-1])
|
||||
logger.info(" Qdrant: %s", args.qdrant_url)
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info(" Limit: %s", args.limit or "ALL")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Phase 1: Build Qdrant index
|
||||
logger.info("\nPhase 1: Building Qdrant index...")
|
||||
t0 = time.time()
|
||||
hash_index, reg_index = build_qdrant_index(args.qdrant_url)
|
||||
logger.info("Index built in %.1fs", time.time() - t0)
|
||||
|
||||
# Phase 2: Load controls
|
||||
logger.info("\nPhase 2: Loading controls...")
|
||||
controls = load_controls(args.db_url, args.limit)
|
||||
logger.info("Loaded %d controls", len(controls))
|
||||
|
||||
if not controls:
|
||||
logger.info("No controls to process")
|
||||
return
|
||||
|
||||
# Phase 3+4: Match and update
|
||||
logger.info("\nPhase 3+4: Matching and updating...")
|
||||
t0 = time.time()
|
||||
stats = update_controls(
|
||||
args.db_url, controls, hash_index, reg_index,
|
||||
dry_run=args.dry_run, batch_size=args.batch_size,
|
||||
)
|
||||
elapsed = time.time() - t0
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
logger.info(" Total controls: %d", stats.total)
|
||||
logger.info(" Already correct: %d (%.1f%%)", stats.already_correct,
|
||||
stats.already_correct / max(stats.total, 1) * 100)
|
||||
logger.info(" Matched (hash): %d (%.1f%%)", stats.matched_hash,
|
||||
stats.matched_hash / max(stats.total, 1) * 100)
|
||||
logger.info(" Matched (prefix): %d (%.1f%%)", stats.matched_prefix,
|
||||
stats.matched_prefix / max(stats.total, 1) * 100)
|
||||
logger.info(" Matched (overlap): %d (%.1f%%)", stats.matched_overlap,
|
||||
stats.matched_overlap / max(stats.total, 1) * 100)
|
||||
logger.info(" Unmatched: %d (%.1f%%)", stats.unmatched,
|
||||
stats.unmatched / max(stats.total, 1) * 100)
|
||||
logger.info(" Updated: %d", stats.updated)
|
||||
logger.info(" Errors: %d", stats.errors)
|
||||
logger.info(" Time: %.1fs (%.0f controls/sec)", elapsed,
|
||||
stats.total / max(elapsed, 1))
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("\nDRY RUN — no changes written. Run without --dry-run to apply.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
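The tiered matching in d6_citation_backfill.py above can be tried on toy data. This is a simplified sketch under stated assumptions: it covers only Tier 1 (exact SHA-256 match) and Tier 3 (word-set Jaccard with the 0.3 threshold) and omits the Tier 2 prefix parser, `match_section` and `jaccard` are hypothetical names, and chunks are plain `(text, section)` pairs rather than `ChunkMeta` objects.

```python
from __future__ import annotations

import hashlib


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity of two strings, case-insensitive."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def match_section(
    source_text: str,
    chunks: list[tuple[str, str]],
    threshold: float = 0.3,
) -> tuple[str | None, str]:
    """Return (section, method), trying the exact hash first, then overlap."""
    hash_index = {hashlib.sha256(t.encode()).hexdigest(): s for t, s in chunks}
    digest = hashlib.sha256(source_text.encode()).hexdigest()
    if digest in hash_index:
        return hash_index[digest], "hash"
    best_section, best_score = None, threshold
    for text, section in chunks:
        score = jaccard(source_text, text)
        if score > best_score:  # must beat both the threshold and the best so far
            best_section, best_score = section, score
    return (best_section, "overlap") if best_section else (None, "")


# Hypothetical chunks; the section labels are illustrative only.
chunks = [
    ("the controller shall keep records of processing activities", "Art. 30"),
    ("the data subject has the right to erasure", "Art. 17"),
]
exact = match_section(chunks[0][0], chunks)
fuzzy = match_section(
    "the controller shall keep written records of all processing activities", chunks
)
miss = match_section("completely unrelated sentence about something else entirely", chunks)
print(exact, fuzzy, miss)
```

The exact hash is cheap and unambiguous, so it runs first; the overlap tier only applies when the stored source text differs from the chunk text, and the threshold keeps weak matches from overwriting a citation.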
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Derive doc_check_controls from existing Master Controls.

Filters MCs by document-relevant regulations, then uses Claude Haiku
to generate check_question + pass_criteria + fail_criteria per control.

Usage:
    python3 /app/scripts/derive_doc_check_controls.py --dry-run
    python3 /app/scripts/derive_doc_check_controls.py
"""

import argparse
import json
import logging
import os
import time
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("doc-check-derive")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"

# Document types and their regulation sources.
# "articles" and "extra_tokens" are SQL LIKE patterns. A pattern such as
# "%5%" matches every article string that contains a 5 (15, 25, 35, ...),
# not only Article 5.
DOC_TYPES = {
    "dse": {
        "name": "Datenschutzinformation",
        "sources": ["DSGVO (EU) 2016/679"],
        "articles": ["%13%", "%14%"],
        "extra_tokens": ["personal_data%", "data_subject_rights%", "consent%",
                         "data_processing_register%", "data_transfer%"],
    },
    "cookie": {
        "name": "Cookie-Richtlinie",
        "sources": ["TDDDG", "ePrivacy-Richtlinie"],
        "articles": ["%25%", "%5%"],
        "extra_tokens": ["cookie_consent%", "consent%"],
    },
    "impressum": {
        "name": "Impressum",
        "sources": ["TMG"],
        "articles": ["%5%"],
        "extra_tokens": ["ecommerce%"],
    },
    "widerruf": {
        "name": "Widerrufsbelehrung",
        "sources": ["BGB"],
        "articles": ["%355%", "%312%"],
        "extra_tokens": ["consumer_protection%"],
    },
    "agb": {
        "name": "AGB",
        "sources": ["BGB"],
        "articles": ["%305%", "%307%", "%308%", "%309%"],
        "extra_tokens": ["consumer_protection%"],
    },
    "dsfa": {
        "name": "Datenschutz-Folgenabschaetzung",
        "sources": ["DSGVO (EU) 2016/679"],
        "articles": ["%35%"],
        "extra_tokens": ["dpia%"],
    },
    "avv": {
        "name": "Auftragsverarbeitung",
        "sources": ["DSGVO (EU) 2016/679"],
        "articles": ["%28%"],
        "extra_tokens": ["data_processing_agreement%"],
    },
    "loeschkonzept": {
        "name": "Loeschkonzept",
        "sources": ["DSGVO (EU) 2016/679"],
        "articles": ["%5%", "%17%"],
        "extra_tokens": ["data_retention%"],
    },
}

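Review note: the `articles` patterns above are passed to `LIKE`, so their breadth is worth checking. The sketch below emulates `LIKE` with a regular expression to show what `"%5%"` would match. The article strings are hypothetical, since the actual format stored in `source_citation->>'article'` is not visible in this diff, and `like_to_regex` is an illustrative helper, not part of the script.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (only % and _ wildcards) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # % matches any run of characters
        elif ch == "_":
            parts.append(".")       # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


def like_matches(pattern: str, value: str) -> bool:
    return like_to_regex(pattern).match(value) is not None


# Hypothetical article strings, for illustration only.
samples = ["Art. 5", "Art. 15", "Art. 25", "Art. 35", "Art. 28"]
matched = [s for s in samples if like_matches("%5%", s)]
print(matched)
```

If the stored article values look anything like these samples, a tighter pattern or an exact match would probably select fewer unrelated controls.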
# German system prompt, kept verbatim because the generated criteria must be
# German. In short: for each control, produce a binary check_question, 3-5
# concrete pass_criteria, 2-3 typical fail_criteria and a severity, and
# answer as a JSON array.
SYSTEM_PROMPT = """Du erzeugst binäre Prüfkriterien für Compliance-Dokumente.

Für jeden Control erzeugst du:
1. check_question: Eine JA/NEIN Frage die ein LLM anhand eines Dokuments beantworten kann
2. pass_criteria: Konkrete Textinhalte die vorhanden sein MÜSSEN (3-5 Stück)
3. fail_criteria: Typische Fehler/Mängel (2-3 Stück)
4. severity: HIGH, MEDIUM oder LOW

REGELN:
- check_question muss BINÄR beantwortbar sein (nicht "wie gut")
- pass_criteria müssen KONKRET sein ("Name + Rechtsform + Anschrift", nicht "Angaben")
- fail_criteria müssen TYPISCHE Fehler beschreiben
- Alles auf Deutsch

Antworte als JSON-Array:
[{"id":"...","check_question":"...","pass_criteria":["..."],"fail_criteria":["..."],"severity":"HIGH"}]"""


def get_doc_controls(engine, doc_type: str, config: dict) -> list[dict]:
    """Get controls relevant for a document type."""
    controls = []

    # Strategy 1: By source + article
    for source in config["sources"]:
        for article in config["articles"]:
            with engine.connect() as c:
                rows = c.execute(text("""
                    SELECT cc.id, cc.control_id, cc.title,
                           COALESCE(cc.objective, '') as objective,
                           pc.source_citation->>'article' as article
                    FROM compliance.canonical_controls cc
                    JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
                    WHERE pc.source_citation->>'source' = :source
                      AND pc.source_citation->>'article' LIKE :article
                      AND cc.release_state NOT IN ('deprecated', 'rejected')
                    LIMIT 200
                """), {"source": source, "article": article}).fetchall()
            for r in rows:
                controls.append({
                    "uuid": str(r[0]), "control_id": r[1],
                    "title": r[2] or "", "objective": r[3] or "",
                    "article": r[4] or "", "doc_type": doc_type,
                })

    # Strategy 2: By MC canonical_name
    for token_pattern in config.get("extra_tokens", []):
        with engine.connect() as c:
            rows = c.execute(text("""
                SELECT cc.id, cc.control_id, cc.title,
                       COALESCE(cc.objective, '') as objective
                FROM compliance.master_controls mc
                JOIN compliance.master_control_members mcm ON mcm.master_control_uuid = mc.id
                JOIN compliance.canonical_controls cc ON cc.id = mcm.control_uuid
                WHERE mc.canonical_name LIKE :pattern
                  AND cc.release_state NOT IN ('deprecated', 'rejected')
                LIMIT 100
            """), {"pattern": token_pattern}).fetchall()
        for r in rows:
            controls.append({
                "uuid": str(r[0]), "control_id": r[1],
                "title": r[2] or "", "objective": r[3] or "",
                "article": "", "doc_type": doc_type,
            })

    # Deduplicate by control_id. The first occurrence wins, so a control found
    # by strategy 1 keeps its article reference.
    seen = set()
    unique = []
    for c in controls:
        if c["control_id"] not in seen:
            seen.add(c["control_id"])
            unique.append(c)

    return unique


def enrich_with_llm(controls: list[dict], doc_type_name: str) -> list[dict]:
    """Add check_question, pass/fail_criteria via Haiku."""
    enriched = []
    batch_size = 5

    for i in range(0, len(controls), batch_size):
        batch = controls[i:i + batch_size]
        items = [
            f'- id="{c["control_id"]}" doc="{doc_type_name}" '
            f't="{c["title"]}" o="{c["objective"][:100]}"'
            for c in batch
        ]

        prompt = (
            f"Dokumenttyp: {doc_type_name}\n"
            "Erzeuge Prüfkriterien:\n" + "\n".join(items)
        )

        try:
            resp = httpx.post(ANTHROPIC_URL, headers={
                "x-api-key": ANTHROPIC_API_KEY,
                "anthropic-version": "2023-06-01",
                "content-type": "application/json",
            }, json={
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 2000, "temperature": 0.1,
                "system": SYSTEM_PROMPT,
                "messages": [{"role": "user", "content": prompt}],
            }, timeout=45.0)
            resp.raise_for_status()
            content = resp.json().get("content", [{}])[0].get("text", "")
            # The reply may wrap the array in prose; keep the outermost [...] span.
            start = content.find("[")
            end = content.rfind("]") + 1
            if start >= 0 and end > start:
                results = json.loads(content[start:end])
                result_map = {r.get("id", ""): r for r in results}
                for ctrl in batch:
                    r = result_map.get(ctrl["control_id"], {})
                    # Controls without a check_question are dropped silently.
                    if r.get("check_question"):
                        ctrl["check_question"] = r["check_question"]
                        ctrl["pass_criteria"] = r.get("pass_criteria", [])
                        ctrl["fail_criteria"] = r.get("fail_criteria", [])
                        ctrl["severity"] = r.get("severity", "MEDIUM")
                        enriched.append(ctrl)
        except Exception as e:
            # i is the start offset of the batch, not a batch counter.
            logger.error("Batch %d failed: %s", i, e)

        time.sleep(0.5)

    return enriched


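Review note: the bracket-slicing step in `enrich_with_llm` can be isolated and tested without calling the API. The following is a minimal sketch of that idea, not the script's own code. `extract_json_array` is a hypothetical helper name, and the sample reply is invented for illustration.

```python
import json


def extract_json_array(reply: str) -> list:
    """Return the outermost JSON array found in an LLM reply, or [] if none parses."""
    start = reply.find("[")
    end = reply.rfind("]") + 1
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(reply[start:end])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Invented reply: prose before and after the array, as models sometimes produce.
reply = 'Hier ist das Ergebnis: [{"id": "C-1", "check_question": "Ist X genannt?"}] Ende.'
items = extract_json_array(reply)
by_id = {item.get("id", ""): item for item in items if isinstance(item, dict)}
print(by_id["C-1"]["check_question"])
```

Returning an empty list on malformed output keeps one bad batch from raising, at the cost of hiding the failure unless the caller logs it.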
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--doc-type", choices=list(DOC_TYPES.keys()),
                        help="Only one doc type")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Create table (this also runs in --dry-run mode)
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        c.execute(text("""
            CREATE TABLE IF NOT EXISTS doc_check_controls (
                id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
                control_id VARCHAR(500) NOT NULL,
                control_uuid UUID,
                doc_type VARCHAR(50) NOT NULL,
                title VARCHAR(500),
                regulation VARCHAR(200),
                article VARCHAR(100),
                check_question TEXT NOT NULL,
                pass_criteria JSONB DEFAULT '[]',
                fail_criteria JSONB DEFAULT '[]',
                severity VARCHAR(20) DEFAULT 'MEDIUM',
                created_at TIMESTAMPTZ DEFAULT NOW()
            )
        """))
        c.execute(text("""
            CREATE INDEX IF NOT EXISTS idx_doc_check_doc_type
            ON doc_check_controls(doc_type)
        """))

    doc_types = [args.doc_type] if args.doc_type else list(DOC_TYPES.keys())
    all_checks = []

    for dt in doc_types:
        config = DOC_TYPES[dt]
        logger.info("\n=== %s (%s) ===", dt, config["name"])

        controls = get_doc_controls(engine, dt, config)
        logger.info("Found %d relevant controls", len(controls))

        if not controls:
            continue

        enriched = enrich_with_llm(controls, config["name"])
        logger.info("Enriched %d with check criteria", len(enriched))
        all_checks.extend(enriched)

    logger.info("\nTotal: %d doc_check_controls across %d doc types",
                len(all_checks), len(doc_types))

    if args.dry_run:
        for dc in all_checks[:5]:
            logger.info(" [%s] %s: %s", dc["doc_type"], dc["control_id"],
                        dc.get("check_question", "?")[:80])
        logger.info("DRY RUN — not writing")
        return

    # Write to DB
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        # Replace only the doc types regenerated in this run, so that a
        # --doc-type run does not wipe the rows of other document types.
        for dt in sorted({dc["doc_type"] for dc in all_checks}):
            c.execute(
                text("DELETE FROM doc_check_controls WHERE doc_type = :dt"),
                {"dt": dt},
            )
        for dc in all_checks:
            c.execute(text("""
                INSERT INTO doc_check_controls
                    (control_id, control_uuid, doc_type, title,
                     check_question, pass_criteria, fail_criteria, severity)
                VALUES (:cid, CAST(:uuid AS uuid), :doc_type, :title,
                        :question, CAST(:pass AS jsonb),
                        CAST(:fail AS jsonb), :severity)
            """), {
                "cid": dc["control_id"],
                "uuid": dc["uuid"],
                "doc_type": dc["doc_type"],
                "title": dc["title"],
                "question": dc.get("check_question", ""),
                "pass": json.dumps(dc.get("pass_criteria", [])),
                "fail": json.dumps(dc.get("fail_criteria", [])),
                "severity": dc.get("severity", "MEDIUM"),
            })

    logger.info("Wrote %d doc_check_controls to DB", len(all_checks))

    # Save as JSON too
    Path("/tmp/doc_check_controls.json").write_text(
        json.dumps(all_checks, indent=2, ensure_ascii=False)
    )
    logger.info("Saved to /tmp/doc_check_controls.json")


if __name__ == "__main__":
    main()
@@ -0,0 +1,400 @@
#!/usr/bin/env python3
"""Clean-Room MC derivation from BSI QUAIDAL.

For each QUAIDAL entry in the parsed index, ask a local LLM to produce our own
wording for a Master Control / atomic control / mitigation / metric. Reject any
output whose 4-gram overlap with the BSI source text exceeds PLAGIARISM_LIMIT.

We never store the BSI prose; only our own derived wording plus structural
references (BSI section ID + URL + commit SHA).

Usage:
    # Single entry, prints to stdout for review:
    python3 control-pipeline/scripts/derive_quaidal_mcs.py --only QKB-01 --dry-run

    # Full run, writes YAML:
    python3 control-pipeline/scripts/derive_quaidal_mcs.py --ollama-host macmini

Output: control-pipeline/data/quaidal/{master_controls,atomic_controls,mitigations,metrics}.yaml
"""

from __future__ import annotations

import argparse
import json
import re
import sys
import time
from dataclasses import dataclass
from pathlib import Path

try:
    import httpx
    import yaml
except ImportError as e:
    print(f"ERROR: missing dependency {e.name}. Install with: pip install httpx pyyaml", file=sys.stderr)
    sys.exit(2)

REPO_ROOT = Path(__file__).resolve().parents[2]
SOURCE_ROOT = REPO_ROOT / "legal-sources" / "bsi-quaidal"
INDEX_FILE = REPO_ROOT / "control-pipeline" / "data" / "quaidal" / "quaidal_index.json"
OUTPUT_DIR = REPO_ROOT / "control-pipeline" / "data" / "quaidal"

PLAGIARISM_LIMIT = 0.20  # max share of 4-grams that may appear in BSI source
N_GRAM = 4
MAX_RETRIES = 3

DEFAULT_OLLAMA_URL = "http://macmini:11434"
OLLAMA_MODEL = "qwen3.5:35b-a3b"
QUAIDAL_REPO_URL = "https://github.com/BSI-Bund/QUAIDAL"

KIND_TO_PROMPT_ROLE = {
    "criterion": "Master Control",
    "building_block": "atomarer technischer Control",
    "measure": "Schutzmaßnahme",
    "metric": "messbarer Qualitäts-Indikator",
}

KIND_TO_OUTPUT_FILE = {
    "criterion": "master_controls.yaml",
    "building_block": "atomic_controls.yaml",
    "measure": "mitigations.yaml",
    "metric": "metrics.yaml",
}


# ---------------------------------------------------------------------------
# Source-side extraction (kept in memory, never written to disk)
# ---------------------------------------------------------------------------

FRONTMATTER_RE = re.compile(r"^---\s*\n.*?\n---\s*\n", re.DOTALL)
SECTION_RE = re.compile(r"^###?\s+(.+?)\s*$", re.MULTILINE)


def load_source_extract(rel_path: str) -> dict:
    """Load BSI source text for ONE entry. Used only for prompt + plagiarism check."""
    path = SOURCE_ROOT / rel_path
    text = path.read_text(encoding="utf-8")

    # Strip frontmatter; capture shortdesc separately for the prompt.
    fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, re.DOTALL)
    shortdesc = ""
    if fm_match:
        for line in fm_match.group(1).splitlines():
            if line.lower().startswith("shortdesc:"):
                shortdesc = line.split(":", 1)[1].strip()
                break
    body = FRONTMATTER_RE.sub("", text, count=1)

    # Pull the first 1-2 paragraphs under "Beschreibung"
    # (or from the first 1500 characters of the body if there is no such heading)
    desc_match = re.search(r"###?\s+Beschreibung\s*\n+(.+?)(?:\n###?\s|\Z)", body, re.DOTALL)
    description_excerpt = desc_match.group(1).strip() if desc_match else body[:1500].strip()
    paragraphs = [p.strip() for p in description_excerpt.split("\n\n") if p.strip()]
    description_excerpt = "\n\n".join(paragraphs[:2])

    return {
        "shortdesc": shortdesc,
        "description_excerpt": description_excerpt,
        "full_body": body,
    }


# ---------------------------------------------------------------------------
# Plagiarism gate
# ---------------------------------------------------------------------------

WORD_RE = re.compile(r"\b[\wäöüÄÖÜß]+\b", re.UNICODE)


def _tokenize(text: str) -> list[str]:
    return [w.lower() for w in WORD_RE.findall(text)]


def ngram_overlap(produced: str, source: str, n: int = N_GRAM) -> float:
    """Share of produced n-grams that also appear in source."""
    p_tokens = _tokenize(produced)
    s_tokens = _tokenize(source)
    if len(p_tokens) < n:
        return 0.0
    s_grams = {tuple(s_tokens[i : i + n]) for i in range(len(s_tokens) - n + 1)}
    if not s_grams:
        return 0.0
    p_grams = [tuple(p_tokens[i : i + n]) for i in range(len(p_tokens) - n + 1)]
    hits = sum(1 for g in p_grams if g in s_grams)
    return hits / len(p_grams)


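Review note: a small worked example helps when judging whether the 20% limit is strict enough. The sketch below re-implements the 4-gram share in a self-contained way (it is not imported from the script) and applies it to three invented German sentences.

```python
import re

WORD = re.compile(r"\w+")


def tokens(text: str) -> list[str]:
    return [w.lower() for w in WORD.findall(text)]


def ngrams(toks: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]


def overlap(produced: str, source: str, n: int = 4) -> float:
    """Share of the produced text's n-grams that also occur in the source."""
    produced_grams = ngrams(tokens(produced), n)
    source_grams = set(ngrams(tokens(source), n))
    if not produced_grams or not source_grams:
        return 0.0
    return sum(g in source_grams for g in produced_grams) / len(produced_grams)


# Invented sentences, for illustration only.
source = "Die Daten müssen vor dem Training geprüft werden"
copied = "Die Daten müssen vor dem Training validiert sein"
fresh = "Vor jedem Trainingslauf ist die Eingabequalität nachzuweisen"

print(overlap(copied, source))  # 3 of 5 four-grams are shared
print(overlap(fresh, source))   # no shared four-grams
```

The near-copy shares three of its five 4-grams with the source (60%), so it would be rejected, while the reworded sentence shares none. Note that outputs shorter than four tokens always score 0.0 under this measure.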
# ---------------------------------------------------------------------------
# LLM prompt + call
# ---------------------------------------------------------------------------

# German prompt, kept verbatim because the derived wording must be German.
# In short: write an independent requirement of 2-4 sentences (max 80 words)
# in a sober, normative style, reuse the idea of the source excerpt but copy
# none of its sentences or examples, and return plain text only.
PROMPT_TEMPLATE = """Du bist Compliance-Engineer bei BreakPilot. Schreibe eine eigenständige Anforderung im Stil einer technischen Kontroll-Spezifikation.

Quelle: BSI QUAIDAL Sektion {entry_id} ("{title_de}"). Die Quelle steht unter unklarer Lizenz (BSI-Veröffentlichung, § 5 UrhG anwendbar) — wir dürfen die Idee aufgreifen, aber NICHT abschreiben.

Aufgabe: Formuliere eine eigenständige Anforderung im Stil eines {role}. Anforderungen:
- Eigene Formulierung in deutscher Sprache. Kein Satz darf aus der Quelle übernommen werden, auch nicht teilweise. Synonyme verwenden, Satzbau ändern, Inhalt strukturell anders aufbauen.
- 2-4 Sätze (max 80 Wörter).
- Sprachstil: nüchtern, technisch, normativ ("muss", "ist sicherzustellen", "ist zu prüfen").
- Bezug auf KI-Trainingsdaten oder KI-Datenqualität, je nach Quelle.
- Nicht die wörtlichen BSI-Beispiele kopieren.

Quellauszug (NUR zur Orientierung, NICHT abschreiben):
---
shortdesc: {shortdesc}

{description_excerpt}
---

Antwort: Liefere AUSSCHLIESSLICH die fertige Beschreibung als reinen Text — kein JSON, keine Überschriften, keine Anführungszeichen, keine Quellenangabe."""


def call_ollama(prompt: str, ollama_url: str, model: str, retries: int = 2) -> str:
    """POST the prompt to Ollama's chat endpoint, retrying with exponential backoff."""
    last_err = None
    for attempt in range(retries + 1):
        try:
            resp = httpx.post(
                f"{ollama_url}/api/chat",
                json={
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "stream": False,
                    "options": {"temperature": 0.4},
                    "think": False,
                },
                timeout=180.0,
            )
            resp.raise_for_status()
            return resp.json()["message"]["content"].strip()
        except (httpx.HTTPError, KeyError, ValueError) as e:
            last_err = e
            if attempt < retries:
                # Wait 1s, 2s, 4s, ... between attempts.
                time.sleep(2 ** attempt)
    # Chain the last error so the original traceback stays visible.
    raise RuntimeError(f"Ollama call failed after {retries+1} attempts: {last_err}") from last_err


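Review note: the retry loop in `call_ollama` is hard to unit-test because it sleeps for real. One possible refactor, sketched here under the assumption that a generic helper would be acceptable, injects the sleep function so that tests can record the delays instead of waiting. `retry_with_backoff` and `flaky` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retries: int = 2,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to retries + 1 times, sleeping base * 2**attempt between failures."""
    last_err = None
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose in this sketch
            last_err = exc
            if attempt < retries:
                sleep(base * 2 ** attempt)
    raise RuntimeError(f"failed after {retries + 1} attempts: {last_err}") from last_err


calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    """Fails twice, then succeeds (stand-in for a flaky HTTP call)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("temporary")
    return "ok"


result = retry_with_backoff(flaky, retries=2, sleep=delays.append)
print(result, delays)
```

Passing `delays.append` as the sleep function makes the backoff schedule observable without slowing the test down.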
def strip_llm_artifacts(text: str) -> str:
    """Clean leading/trailing markdown and quotes from LLM output."""
    text = text.strip()
    # Strip surrounding code fences
    if text.startswith("```"):
        text = re.sub(r"^```[a-zA-Z]*\n?", "", text)
        text = re.sub(r"\n?```\s*$", "", text)
    # Strip surrounding quotes (straight, German low-9, and curly double quotes)
    text = text.strip('"„“”')
    # Drop a leading "Beschreibung:" or similar label
    text = re.sub(r"^(Beschreibung|Description|Anforderung|Control):\s*", "", text, flags=re.IGNORECASE)
    return text.strip()


# ---------------------------------------------------------------------------
# Derivation
# ---------------------------------------------------------------------------


@dataclass
class DerivedControl:
    derived_id: str
    source_id: str
    kind: str
    canonical_name: str
    description: str
    plagiarism_score: float
    related_quaidal_ids: list[str]
    external_refs: list[dict]
    source: dict


_ASCII_FOLD = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "ae", "Ö": "oe", "Ü": "ue", "ß": "ss"})


def slug(text: str) -> str:
    text = text.translate(_ASCII_FOLD).lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


def derived_id_for(entry: dict) -> str:
    prefix = {
        "criterion": "MC-AI-DATA",
        "building_block": "AC-AI-DATA",
        "measure": "MIT-AI-DATA",
        "metric": "MET-AI-DATA",
    }.get(entry["kind"], "X-AI-DATA")
    title = entry["title_de"]
    title = re.sub(r"^\s*(QKB|QB|MA|QM)-\d+[a-zA-Z]?\s*", "", title)
    return f"{prefix}-{entry['id']}-{slug(title)[:40]}".rstrip("-")


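Review note: to make the ID scheme easier to review, here is a self-contained trace of `slug` and the prefix stripping on one entry. The entry title is invented. The helper `strip_section_prefix` is only a name given here to the inline `re.sub` from `derived_id_for`.

```python
import re

ASCII_FOLD = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "ae", "Ö": "oe", "Ü": "ue", "ß": "ss"}
)


def slug(text: str) -> str:
    """Fold German umlauts, lowercase, and collapse other characters to hyphens."""
    text = text.translate(ASCII_FOLD).lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


def strip_section_prefix(title: str) -> str:
    """Drop a leading section ID such as 'QKB-01' from a title."""
    return re.sub(r"^\s*(QKB|QB|MA|QM)-\d+[a-zA-Z]?\s*", "", title)


# Hypothetical index entry, for illustration only.
entry = {"id": "QKB-01", "kind": "criterion", "title_de": "QKB-01 Qualität & Größe der Trainingsdaten"}

title_slug = slug(strip_section_prefix(entry["title_de"]))
derived_id = f"MC-AI-DATA-{entry['id']}-{title_slug[:40]}".rstrip("-")
print(derived_id)
```

The final `rstrip("-")` matters when the 40-character cut lands right after a hyphen or when the title is empty after prefix stripping.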
def derive_one(entry: dict, source_extract: dict, ollama_url: str, model: str, *, verbose: bool = False) -> DerivedControl:
    role = KIND_TO_PROMPT_ROLE.get(entry["kind"], "Control")
    prompt = PROMPT_TEMPLATE.format(
        entry_id=entry["id"],
        title_de=entry["title_de"],
        role=role,
        shortdesc=source_extract["shortdesc"] or "(keiner)",
        description_excerpt=source_extract["description_excerpt"] or "(keine Beschreibung)",
    )

    # The overlap check only covers the text that was shown to the model.
    source_corpus = "\n\n".join(filter(None, [source_extract["shortdesc"], source_extract["description_excerpt"]]))

    # Keep the lowest-overlap attempt; stop early once one passes the gate.
    best: tuple[str, float] | None = None
    for attempt in range(1, MAX_RETRIES + 1):
        output = call_ollama(prompt, ollama_url, model)
        output = strip_llm_artifacts(output)
        score = ngram_overlap(output, source_corpus)
        if verbose:
            print(f" attempt {attempt}: overlap={score:.2%} len={len(output)}", file=sys.stderr)
        if score < PLAGIARISM_LIMIT:
            best = (output, score)
            break
        if best is None or score < best[1]:
            best = (output, score)
        # Strengthen the next prompt by appending a reject notice
        prompt += f"\n\n(Vorheriger Versuch hatte {score:.0%} Wortdeckung mit der Quelle. Verwende völlig andere Begriffe und Satzstruktur.)"

    if best is None:
        raise RuntimeError(f"Could not derive {entry['id']}: no output")
    output, score = best
    if score >= PLAGIARISM_LIMIT:
        raise RuntimeError(
            f"Plagiarism gate failed for {entry['id']}: best overlap {score:.2%} >= limit {PLAGIARISM_LIMIT:.0%}.\n"
            f"Output:\n{output}"
        )

    title_de_clean = re.sub(r"^\s*(QKB|QB|MA|QM)-\d+[a-zA-Z]?\s*", "", entry["title_de"]).strip()
    return DerivedControl(
        derived_id=derived_id_for(entry),
        source_id=entry["id"],
        kind=entry["kind"],
        canonical_name=title_de_clean or entry["title_de"],
        description=output,
        plagiarism_score=round(score, 4),
        related_quaidal_ids=entry["referenced_ids"],
        external_refs=entry["external_refs"],
        source={
            "framework": "BSI QUAIDAL",
            "section": entry["id"],
            "title_original_de": entry["title_de"],
            "url": f"{QUAIDAL_REPO_URL}/blob/main/{entry['source_path'].replace(' ', '%20')}",
            "commit_sha": None,  # filled in by main()
            "license_note": "§ 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.",
        },
    )


# ---------------------------------------------------------------------------
# Output writers
# ---------------------------------------------------------------------------


def control_to_dict(c: DerivedControl) -> dict:
    d = {
        "id": c.derived_id,
        "canonical_name": c.canonical_name,
        "description": c.description,
        "kind": c.kind,
        "regulation_anchor": "EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)",
        "related_quaidal_ids": c.related_quaidal_ids,
        "external_refs": c.external_refs,
        "source": c.source,
        "plagiarism_score_at_generation": c.plagiarism_score,
    }
    return d


def write_yaml_per_kind(controls: list[DerivedControl], commit_sha: str | None) -> dict[str, Path]:
    # Group controls by output file; unknown kinds go to other.yaml.
    out: dict[str, list[dict]] = {}
    for c in controls:
        c.source["commit_sha"] = commit_sha
        fname = KIND_TO_OUTPUT_FILE.get(c.kind, "other.yaml")
        out.setdefault(fname, []).append(control_to_dict(c))

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    written: dict[str, Path] = {}
    for fname, items in out.items():
        path = OUTPUT_DIR / fname
        payload = {
            "source": "Derived from BSI QUAIDAL (Clean-Room)",
            "source_url": QUAIDAL_REPO_URL,
            "commit_sha": commit_sha,
            "plagiarism_limit_4gram": PLAGIARISM_LIMIT,
            # Records the module default; a --model override is not reflected here.
            "generated_by_model": OLLAMA_MODEL,
            "controls": items,
        }
        path.write_text(yaml.safe_dump(payload, allow_unicode=True, sort_keys=False), encoding="utf-8")
        written[fname] = path
    return written


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------


def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--only", help="Derive only this QUAIDAL ID (e.g. QKB-01)")
    ap.add_argument("--kind", help="Derive only entries of this kind (criterion/building_block/measure/metric)")
    ap.add_argument("--limit", type=int, help="Process at most N entries")
    ap.add_argument("--dry-run", action="store_true", help="Print derived controls instead of writing YAML")
    ap.add_argument("--ollama-host", default="macmini", help="Ollama host (default: macmini)")
    ap.add_argument("--model", default=OLLAMA_MODEL)
    ap.add_argument("--verbose", action="store_true")
    args = ap.parse_args()

    if not INDEX_FILE.exists():
        print(f"ERROR: missing index. Run ingest_bsi_quaidal.py first ({INDEX_FILE})", file=sys.stderr)
        return 2
    index = json.loads(INDEX_FILE.read_text(encoding="utf-8"))
    entries = index["entries"]
    if args.only:
        entries = [e for e in entries if e["id"].upper() == args.only.upper()]
    if args.kind:
        entries = [e for e in entries if e["kind"] == args.kind]
    if args.limit:
        entries = entries[: args.limit]

    if not entries:
        print("No entries match the filter.", file=sys.stderr)
        return 1

    ollama_url = args.ollama_host if "://" in args.ollama_host else f"http://{args.ollama_host}:11434"
    print(f"Derivation: {len(entries)} entries, model={args.model}, ollama={ollama_url}, limit={PLAGIARISM_LIMIT:.0%}", file=sys.stderr)

    derived: list[DerivedControl] = []
    failed: list[tuple[str, str]] = []
    for i, entry in enumerate(entries, 1):
        if args.verbose:
            print(f"[{i}/{len(entries)}] {entry['id']} ({entry['kind']}): {entry['title_de']}", file=sys.stderr)
        try:
            extract = load_source_extract(entry["source_path"])
            ctrl = derive_one(entry, extract, ollama_url, args.model, verbose=args.verbose)
            derived.append(ctrl)
        except Exception as exc:  # noqa: BLE001
            failed.append((entry["id"], str(exc)))
            print(f" FAILED {entry['id']}: {exc}", file=sys.stderr)

    print(f"\nDerived: {len(derived)} | Failed: {len(failed)}", file=sys.stderr)

    if args.dry_run:
        for c in derived:
            c.source["commit_sha"] = index.get("commit_sha")
            print(yaml.safe_dump(control_to_dict(c), allow_unicode=True, sort_keys=False))
            print("---")
        return 0 if not failed else 1

    written = write_yaml_per_kind(derived, index.get("commit_sha"))
    for fname, path in written.items():
        # Use the same fallback as write_yaml_per_kind so unknown kinds do not raise KeyError.
        count = sum(1 for c in derived if KIND_TO_OUTPUT_FILE.get(c.kind, "other.yaml") == fname)
        print(f"Wrote {path.relative_to(REPO_ROOT)} ({count} entries)", file=sys.stderr)

    if failed:
        print("\nFailures:", file=sys.stderr)
        for fid, msg in failed:
            print(f" - {fid}: {msg.splitlines()[0]}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,280 @@
#!/usr/bin/env python3
"""Extract large NIST PDFs locally, then upload as .txt to RAG service.

Workaround for embedding-service container crashing on large PDFs (>5 MB).
Runs pdfplumber + normalization locally, uploads extracted text as .txt.

Usage (on Mac Mini):
    python3 control-pipeline/scripts/extract_and_upload_nist.py
"""

import json
import os
import re
import sys
import tempfile
import unicodedata

import httpx
import pdfplumber

RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"

DOCS = [
    {
        "object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "NIST_SP_800_53r5.txt",
        "extra_metadata": {
            "regulation_id": "nist_sp800_53r5",
            "source_id": "nist",
            "doc_type": "controls_catalog",
            "guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
            "license": "public_domain_us_gov",
            "attribution": "NIST",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
        "collection": "bp_compliance_ce",
        "filename": "nist_sp_800_82r3.txt",
        "extra_metadata": {
            "regulation_id": "nist_sp_800_82r3",
            "regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
            "regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
            "regulation_short": "NIST SP 800-82",
            "category": "ot_security",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
        "collection": "bp_compliance_ce",
        "filename": "nist_sp_800_160v1r1.txt",
        "extra_metadata": {
            "regulation_id": "nist_sp_800_160v1r1",
            "regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
            "regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
            "regulation_short": "NIST SP 800-160",
            "category": "security_engineering",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "NIST_SP_800_207.txt",
        "extra_metadata": {
            "regulation_id": "nist_sp800_207",
            "source_id": "nist",
            "doc_type": "architecture",
            "guideline_name": "NIST SP 800-207 Zero Trust Architecture",
            "license": "public_domain_us_gov",
            "attribution": "NIST",
            "source": "nist.gov",
        },
    },
]


def normalize_pdf_text(text: str) -> str:
    """Fix broken spacing from multi-column PDF extraction."""
    text = unicodedata.normalize('NFKC', text)
    # Remove soft hyphens and zero-width spaces.
    text = text.replace('\u00ad', '').replace('\u200b', '')
    # Repeat until nothing changes: one pass can leave "3.1 . 4" behind.
    prev = None
    while prev != text:
        prev = text
        # "3 . 1" -> "3.1"
        text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
        # "AC - 2" -> "AC-2"
        text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
        # "PR . AC - 01" -> "PR.AC-01"
        text = re.sub(
            r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
        )
        # "( 1 )" -> "(1)"
        text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
        # Collapse runs of non-newline whitespace.
        text = re.sub(r'[^\S\n]{2,}', ' ', text)
    return text


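Review note: the `while prev != text` loop in `normalize_pdf_text` is easy to mistake for redundant. The sketch below copies a subset of its substitutions into a standalone function and runs it on an invented sample line to show why one pass is not enough. It assumes, as the flattened diff suggests but does not prove, that every substitution sits inside the loop.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Subset of the spacing fixes, applied until the text stops changing."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u00ad", "").replace("\u200b", "")
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)           # "3 . 1" -> "3.1"
        text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)  # "AC - 2" -> "AC-2"
        text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)               # "( 1 )" -> "(1)"
        text = re.sub(r"[^\S\n]{2,}", " ", text)                       # collapse spaces
    return text


# Invented sample line resembling multi-column extraction damage.
sample = "AC - 2 ( 1 )   see section 3 . 1 . 4"
print(normalize(sample))
```

On the first pass the dotted number becomes `3.1 . 4`, because the `1` is consumed by the first match. The second pass joins the remaining `1 . 4`, which is what the fixed-point loop is for.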
def extract_pdf_locally(pdf_bytes: bytes) -> str:
    """Extract text from PDF using pdfplumber with normalization."""
    import io
    text_parts = []
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        print(f" Pages: {len(pdf.pages)}")
        for i, page in enumerate(pdf.pages):
            text = page.extract_text(x_tolerance=3, y_tolerance=4)
            if text:
                text_parts.append(text)
            if (i + 1) % 50 == 0:
                print(f" Extracted {i + 1}/{len(pdf.pages)} pages...")
    raw = "\n\n".join(text_parts)
    return normalize_pdf_text(raw)


def download_from_minio(object_name: str) -> bytes:
    """Download file from MinIO via RAG service."""
    # Step 1: ask the RAG service for a download URL.
    # verify=False disables TLS certificate checks (RAG_URL points at localhost).
    with httpx.Client(timeout=60.0, verify=False) as c:
        resp = c.get(f"{RAG_URL}/api/v1/documents/download/{object_name}")
        resp.raise_for_status()
        url = resp.json()["url"]
    # Step 2: fetch the file itself, with a longer timeout for large PDFs.
    with httpx.Client(timeout=300.0, verify=False) as c:
        resp = c.get(url)
        resp.raise_for_status()
        return resp.content


def upload_text(
    text: str, filename: str, collection: str, extra_metadata: dict,
) -> dict:
    """Upload extracted text to RAG service as .txt."""
    form_data = {
        "collection": collection,
        "data_type": "compliance",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        "chunk_strategy": "recursive",
        "chunk_size": "1500",
        "chunk_overlap": "100",
        "metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
    }
    text_bytes = text.encode("utf-8")
    # Chunking and embedding happen server-side, hence the 30-minute timeout.
    with httpx.Client(timeout=1800.0, verify=False) as c:
        resp = c.post(
            f"{RAG_URL}/api/v1/documents/upload",
            files={"file": (filename, text_bytes, "text/plain")},
            data=form_data,
        )
        resp.raise_for_status()
        return resp.json()


def count_chunks(collection: str, regulation_id: str) -> int:
    """Count chunks for a regulation in Qdrant."""
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{QDRANT_URL}/collections/{collection}/points/count",
            json={
                "filter": {
                    "must": [{
                        "key": "regulation_id",
                        "match": {"value": regulation_id},
                    }]
                },
                "exact": True,
            },
        )
        resp.raise_for_status()
        return resp.json()["result"]["count"]


def check_section_rate(collection: str, regulation_id: str) -> tuple:
|
||||
"""Returns (total_chunks, chunks_with_section)."""
|
||||
total = 0
|
||||
with_sec = 0
|
||||
offset = None
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "regulation_id",
|
||||
"match": {"value": regulation_id},
|
||||
}]
|
||||
},
|
||||
"limit": 100,
|
||||
"with_payload": ["section"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
total += 1
|
||||
s = pt.get("payload", {}).get("section", "")
|
||||
if s and s.strip():
|
||||
with_sec += 1
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return total, with_sec
|
||||
|
||||
|
||||
def main():
|
||||
print("=" * 60)
|
||||
print("NIST PDF Local Extraction + Upload")
|
||||
print("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, doc in enumerate(DOCS, 1):
|
||||
reg_id = doc["extra_metadata"]["regulation_id"]
|
||||
print(f"\n[{i}/{len(DOCS)}] {doc['filename']} → {doc['collection']}")
|
||||
|
||||
# 1. Check current state
|
||||
existing = count_chunks(doc["collection"], reg_id)
|
||||
print(f" Existing chunks: {existing}")
|
||||
|
||||
# 2. Download PDF from MinIO
|
||||
print(f" Downloading from MinIO...")
|
||||
pdf_bytes = download_from_minio(doc["object_name"])
|
||||
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
|
||||
|
||||
# 3. Extract text locally with pdfplumber
|
||||
print(f" Extracting text locally...")
|
||||
text = extract_pdf_locally(pdf_bytes)
|
||||
print(f" Extracted {len(text):,} chars, {text.count(chr(10)):,} lines")
|
||||
|
||||
# 4. Save extracted text temporarily (for debugging)
|
||||
tmp_path = f"/tmp/nist_{reg_id}.txt"
|
||||
with open(tmp_path, "w", encoding="utf-8") as f:
|
||||
f.write(text)
|
||||
print(f" Saved to {tmp_path}")
|
||||
|
||||
# 5. Upload as .txt
|
||||
print(f" Uploading as .txt to RAG service...")
|
||||
result = upload_text(text, doc["filename"], doc["collection"],
|
||||
doc["extra_metadata"])
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
new_doc_id = result.get("document_id", "")
|
||||
print(f" Uploaded: {new_chunks} chunks (doc_id={new_doc_id})")
|
||||
|
||||
# 6. Check section rate
|
||||
if new_chunks > 0:
|
||||
total, with_sec = check_section_rate(doc["collection"], reg_id)
|
||||
pct = (with_sec / total * 100) if total > 0 else 0
|
||||
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
|
||||
else:
|
||||
pct = 0
|
||||
print(" WARNING: 0 chunks created!")
|
||||
|
||||
results.append({
|
||||
"file": doc["filename"],
|
||||
"old": existing,
|
||||
"new": new_chunks,
|
||||
"section_rate": round(pct, 1),
|
||||
})
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("RESULTS")
|
||||
print("=" * 60)
|
||||
for r in results:
|
||||
print(f" {r['file']:<40} old={r['old']} new={r['new']} sect={r['section_rate']}%")
|
||||
|
||||
total_new = sum(r["new"] for r in results)
|
||||
print(f"\nTotal new chunks: {total_new}")
|
||||
|
||||
if any(r["new"] == 0 for r in results):
|
||||
print("\nWARNING: Some documents produced 0 chunks!")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
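The section-rate check in this script reduces to counting points whose `section` payload is non-empty. A minimal offline sketch of that computation follows. The helper name `section_rate` and the sample points are hypothetical, and the Qdrant scroll loop is deliberately left out.

```python
def section_rate(points):
    """Return (total, with_section, percent) for a list of Qdrant-style points."""
    total = len(points)
    with_sec = 0
    for pt in points:
        # Payload or section may be missing or None; treat both as "no section".
        section = (pt.get("payload") or {}).get("section") or ""
        if section.strip():
            with_sec += 1
    pct = (with_sec / total * 100) if total else 0.0
    return total, with_sec, pct


# Hypothetical sample data, shaped like the "points" list of a scroll response.
sample = [
    {"payload": {"section": "3.1 Access Control"}},
    {"payload": {"section": "   "}},  # whitespace only, not counted
    {"payload": {}},                  # no section key
    {"payload": None},                # payload missing entirely
]
print(section_rate(sample))
```

Keeping the counting separate from the HTTP paging makes the rate easy to check without a running Qdrant instance.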
|
||||
@@ -0,0 +1,214 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Extract CE-relevant obligations from TRBS/TRGS/ASR/OSHA chunks in Qdrant.
|
||||
|
||||
Searches for MUSS/SOLL patterns in chunk texts and classifies them.
|
||||
Output: JSON file with structured obligations for the CE session.
|
||||
|
||||
Usage:
|
||||
python3 /app/scripts/extract_ce_obligations.py
|
||||
python3 /app/scripts/extract_ce_obligations.py --output /tmp/ce_obligations.json
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("ce-obligations")
|
||||
|
||||
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
|
||||
COLLECTION = "bp_compliance_ce"
|
||||
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
|
||||
LLM_MODEL = "qwen3.5:35b-a3b"
|
||||
|
||||
# Obligation patterns (DE + EN)
|
||||
OBLIGATION_PATTERNS = re.compile(
|
||||
r"(muss|müssen|hat\s+[\w\s]*zu\s|ist\s+[\w\s]*sicherzustellen|"
|
||||
r"ist\s+verpflichtet|sind\s+verpflichtet|darf\s+nicht|"
|
||||
r"shall|must|required\s+to|is\s+required|shall\s+not)",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
# CE relevance keywords
|
||||
CE_KEYWORDS = re.compile(
|
||||
r"(maschine|schutzeinrichtung|gefährdung|quetsch|scher|stoß|"
|
||||
r"schneid|fang|einzug|absturz|druck|explosion|brand|"
|
||||
r"elektrisch|spannung|erdung|schutzleiter|not-halt|"
|
||||
r"betriebsanleitung|kennzeichnung|prüfung|prüfpflicht|"
|
||||
r"instandhaltung|wartung|sicherheitsabstand|"
|
||||
r"schutzmaßnahme|persönliche schutzausrüstung|psa|"
|
||||
r"machine|guard|hazard|crush|shear|cut|entangle|"
|
||||
r"lockout|tagout|electrical|grounding|emergency stop|"
|
||||
r"safety distance|protective device|ppe|inspection)",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
HAZARD_CATEGORIES = {
|
||||
"quetsch|crush|squeeze": "mechanical_crushing",
|
||||
"schneid|cut": "mechanical_cutting",
|
||||
"fang|einzug|entangle|draw": "mechanical_entanglement",
|
||||
"absturz|fall": "fall_hazard",
|
||||
"explosion|ex-bereich|atex": "explosion_hazard",
|
||||
"brand|fire|feuer": "fire_hazard",
|
||||
"elektrisch|electrical|spannung|voltage": "electrical_hazard",
|
||||
"lärm|noise|schall": "noise_hazard",
|
||||
"gefahrstoff|hazardous substance|chemical": "chemical_hazard",
|
||||
"ergonomie|ergonomic|heben|lift": "ergonomic_hazard",
|
||||
"temperatur|heat|hitze|kälte|cold": "thermal_hazard",
|
||||
"strahlung|radiation|laser": "radiation_hazard",
|
||||
"not-halt|emergency stop|e-stop": "emergency_stop",
|
||||
"lockout|tagout|loto": "lockout_tagout",
|
||||
"kennzeichnung|label|marking|sign": "safety_marking",
|
||||
"prüfung|inspection|test": "inspection_requirement",
|
||||
"instandhaltung|maintenance|wartung": "maintenance",
|
||||
"schutzeinrichtung|guard|protective device": "protective_device",
|
||||
"betriebsanleitung|instruction|manual": "operating_instructions",
|
||||
"druck|pressure|behälter|vessel|kessel|boiler": "pressure_hazard",
|
||||
}
|
||||
|
||||
# Source-based overrides: TRGS docs about chemicals/storage
|
||||
# should never be classified as mechanical hazards
|
||||
_CHEMICAL_SOURCES = re.compile(
|
||||
r"trgs\s*(5[0-9]{2}|7[0-9]{2}|9[0-9]{2}|4[0-9]{2}|6[0-9]{2})",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
|
||||
def _classify_hazard(text: str, source: str) -> str:
|
||||
"""Classify hazard with source-aware overrides."""
|
||||
# TRGS sources → chemical/pressure/explosion, never mechanical
|
||||
if _CHEMICAL_SOURCES.search(source):
|
||||
if re.search(r"explosion|ex-bereich|atex|zündfähig", text, re.IGNORECASE):
|
||||
return "explosion_hazard"
|
||||
if re.search(r"druck|pressure|behälter|vessel", text, re.IGNORECASE):
|
||||
return "pressure_hazard"
|
||||
if re.search(r"brand|fire|feuer", text, re.IGNORECASE):
|
||||
return "fire_hazard"
|
||||
return "chemical_hazard"
|
||||
|
||||
# Standard pattern matching (order matters — specific first)
|
||||
for pattern, category in HAZARD_CATEGORIES.items():
|
||||
if re.search(pattern, text, re.IGNORECASE):
|
||||
return category
|
||||
return "general"
|
||||
|
||||
|
||||
def scroll_chunks(source_filter: str = None) -> list[dict]:
    """Scroll through Qdrant to get all relevant chunks.

    Note: source_filter is accepted but not used yet.
    """
    chunks = []
    offset = None
    batch = 100
    scanned = 0  # total points seen so far, used only for progress logging

    while True:
        scroll_body = {
            "limit": batch,
            "with_payload": True,
            "with_vector": False,
        }
        if offset is not None:
            scroll_body["offset"] = offset

        resp = httpx.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/scroll",
            json=scroll_body,
            timeout=30.0,
        )
        # Fail on HTTP errors instead of treating an error body as an empty page.
        resp.raise_for_status()
        data = resp.json()
        points = data.get("result", {}).get("points", [])
        next_offset = data.get("result", {}).get("next_page_offset")
        scanned += len(points)

        for pt in points:
            payload = pt.get("payload", {})
            source = payload.get("source", payload.get("filename", ""))
            text = payload.get("chunk_text", "")

            # Filter for TRBS/TRGS/ASR/OSHA
            source_lower = source.lower()
            is_relevant = any(k in source_lower for k in
                              ["trbs", "trgs", "asr", "osha"])
            if not is_relevant:
                continue

            # Check for obligation patterns
            if not OBLIGATION_PATTERNS.search(text):
                continue

            # Check CE relevance
            if not CE_KEYWORDS.search(text):
                continue

            # Classify hazard category (source-aware)
            hazard = _classify_hazard(text, source)

            # Determine obligation type
            if re.search(r"muss|müssen|shall|must|required", text, re.IGNORECASE):
                obl_type = "MUSS"
            elif re.search(r"soll|sollte|should", text, re.IGNORECASE):
                obl_type = "SOLL"
            else:
                obl_type = "MUSS"

            chunks.append({
                "source": source,
                "section": payload.get("section", ""),
                "paragraph": payload.get("paragraph", ""),
                "obligation_text": text.strip()[:500],
                "hazard_category": hazard,
                "obligation_type": obl_type,
                "ce_relevance": "high" if hazard != "general" else "medium",
                "filename": payload.get("filename", ""),
            })

        if next_offset is None or not points:
            break
        offset = next_offset

        # Log progress every 50 batches, based on points scanned rather than
        # on the number of matches (which is 0 for long stretches).
        if scanned % (batch * 50) == 0:
            logger.info(" Scanned %d points, %d obligations found so far",
                        scanned, len(chunks))

    return chunks
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="/tmp/ce_obligations.json")
    args = parser.parse_args()

    logger.info("Scanning %s for CE obligations...", COLLECTION)
    obligations = scroll_chunks()

    logger.info("Found %d CE-relevant obligations", len(obligations))

    # Stats
    by_source = {}
    by_hazard = {}
    for o in obligations:
        src = o["source"][:30]
        by_source[src] = by_source.get(src, 0) + 1
        by_hazard[o["hazard_category"]] = by_hazard.get(o["hazard_category"], 0) + 1

    logger.info("\nBy source:")
    for src, cnt in sorted(by_source.items(), key=lambda x: -x[1])[:20]:
        logger.info(" %4d %s", cnt, src)

    logger.info("\nBy hazard category:")
    for cat, cnt in sorted(by_hazard.items(), key=lambda x: -x[1]):
        logger.info(" %4d %s", cnt, cat)

    # Save with an explicit encoding: ensure_ascii=False keeps umlauts as-is,
    # so the platform default encoding must not be relied on.
    Path(args.output).write_text(
        json.dumps(obligations, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    logger.info("\nSaved to %s", args.output)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
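The hazard classification in this script returns the first category whose pattern occurs anywhere in the chunk text, so short tokens can match inside unrelated words. The sketch below illustrates that effect on a reduced, illustrative subset of the pattern table and compares it with a hypothetical variant that requires each alternative to start at a word boundary. `classify_word_start` is an assumption made for illustration, not part of the script.

```python
import re

# Reduced, illustrative subset of the pattern table (not the full HAZARD_CATEGORIES).
SAMPLE_CATEGORIES = {
    "schneid|cut": "mechanical_cutting",
    "prüfung|inspection|test": "inspection_requirement",
}


def classify_substring(text: str) -> str:
    """First-match classification by plain substring search, as in the script."""
    for pattern, category in SAMPLE_CATEGORIES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return category
    return "general"


def classify_word_start(text: str) -> str:
    """Hypothetical variant: every alternative must begin at a word boundary."""
    for pattern, category in SAMPLE_CATEGORIES.items():
        if re.search(rf"\b(?:{pattern})", text, re.IGNORECASE):
            return category
    return "general"


sentence = "The employer shall execute the plan"
print(classify_substring(sentence))   # "cut" is found inside "execute"
print(classify_word_start(sentence))  # no token starts a word here
```

The leading boundary is a trade-off rather than a drop-in fix: it would also stop matching German compounds in which the stem is not at the start of the word, such as "abschneiden".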
|
||||
@@ -0,0 +1,247 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
F1 Migration: Populate regulation_registry from hardcoded Python dicts.
|
||||
|
||||
Sources:
|
||||
- REGULATION_LICENSE_MAP (control_generator.py) — 135 entries keyed by regulation_id
|
||||
- SOURCE_REGULATION_CLASSIFICATION (source_type_classification.py) — 58 entries keyed by name
|
||||
|
||||
Usage:
|
||||
# Dry run (prints SQL, no DB write):
|
||||
python3 scripts/f1_migrate_regulation_registry.py --dry-run
|
||||
|
||||
# Against Mac Mini:
|
||||
python3 scripts/f1_migrate_regulation_registry.py --db-host macmini
|
||||
|
||||
# Against local Docker:
|
||||
python3 scripts/f1_migrate_regulation_registry.py --db-host localhost
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Add parent so we can import from services/data
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
from services.control_generator import REGULATION_LICENSE_MAP, _RULE2_PREFIXES, _RULE3_PREFIXES # noqa: E402
|
||||
from data.source_type_classification import SOURCE_REGULATION_CLASSIFICATION # noqa: E402
|
||||
|
||||
# Derive jurisdiction from license_type
|
||||
_LICENSE_TO_JURISDICTION = {
|
||||
"EU_LAW": "EU",
|
||||
"EU_PUBLIC": "EU",
|
||||
"DE_LAW": "DE",
|
||||
"DE_PUBLIC": "DE",
|
||||
"AT_LAW": "AT",
|
||||
"CH_LAW": "CH",
|
||||
"FR_LAW": "FR",
|
||||
"ES_LAW": "ES",
|
||||
"NL_LAW": "NL",
|
||||
"IT_LAW": "IT",
|
||||
"HU_LAW": "HU",
|
||||
"NIST_PUBLIC_DOMAIN": "US",
|
||||
"US_GOV_PUBLIC": "US",
|
||||
"CC-BY-SA-4.0": "INT",
|
||||
"CC-BY-4.0": "INT",
|
||||
"OECD_PUBLIC": "INT",
|
||||
}
|
||||
|
||||
|
||||
def _derive_jurisdiction(license_type: str) -> str:
|
||||
"""Map license_type to jurisdiction code."""
|
||||
return _LICENSE_TO_JURISDICTION.get(license_type, "INT")
|
||||
|
||||
|
||||
def build_rows() -> list[dict]:
|
||||
"""Merge REGULATION_LICENSE_MAP + SOURCE_REGULATION_CLASSIFICATION into rows."""
|
||||
rows = []
|
||||
# Track names we've seen (for dedup against SOURCE_REGULATION_CLASSIFICATION)
|
||||
seen_names: set[str] = set()
|
||||
|
||||
# 1) Primary source: REGULATION_LICENSE_MAP (has regulation_id as key)
|
||||
for reg_id, info in REGULATION_LICENSE_MAP.items():
|
||||
name = info.get("name", reg_id)
|
||||
seen_names.add(name)
|
||||
|
||||
rows.append({
|
||||
"regulation_id": reg_id.lower().strip(),
|
||||
"regulation_name_de": name,
|
||||
"license_rule": info["rule"],
|
||||
"license_type": info.get("license", ""),
|
||||
"attribution": info.get("attribution"),
|
||||
"source_type": info.get("source_type", "law"),
|
||||
"jurisdiction": _derive_jurisdiction(info.get("license", "")),
|
||||
"status": "active",
|
||||
})
|
||||
|
||||
# 2) Secondary: SOURCE_REGULATION_CLASSIFICATION entries not already covered
|
||||
# These are keyed by name, not by regulation_id. We create synthetic IDs.
|
||||
for name, source_type in SOURCE_REGULATION_CLASSIFICATION.items():
|
||||
if name in seen_names:
|
||||
continue
|
||||
# Generate a regulation_id from the name
|
||||
synthetic_id = (
|
||||
name.lower()
|
||||
.replace(" ", "_")
|
||||
.replace("(", "")
|
||||
.replace(")", "")
|
||||
.replace("/", "_")
|
||||
.replace("-", "_")
|
||||
.replace(".", "")
|
||||
.replace(",", "")
|
||||
.replace("ä", "ae")
|
||||
.replace("ö", "oe")
|
||||
.replace("ü", "ue")
|
||||
.replace("á", "a")
|
||||
.replace("é", "e")
|
||||
.replace("ó", "o")
|
||||
.strip("_")
|
||||
)[:100]
|
||||
|
||||
# Guess jurisdiction from name content
|
||||
jurisdiction = "INT"
|
||||
name_lower = name.lower()
|
||||
if any(x in name_lower for x in ["edpb", "edps", "(eu)", "eu ", "wp2"]):
|
||||
jurisdiction = "EU"
|
||||
elif any(x in name_lower for x in ["bsi", "bdsg", "bundes", "gwg"]):
|
||||
jurisdiction = "DE"
|
||||
elif "nist" in name_lower or "cisa" in name_lower:
|
||||
jurisdiction = "US"
|
||||
elif "österreich" in name_lower:
|
||||
jurisdiction = "AT"
|
||||
elif "schweiz" in name_lower:
|
||||
jurisdiction = "CH"
|
||||
elif "spanien" in name_lower:
|
||||
jurisdiction = "ES"
|
||||
elif "frankreich" in name_lower:
|
||||
jurisdiction = "FR"
|
||||
elif "ungarn" in name_lower:
|
||||
jurisdiction = "HU"
|
||||
|
||||
# Map source_type_classification's "framework" to our "standard"
|
||||
# (source_type_classification uses law/guideline/framework)
|
||||
mapped_source_type = source_type
|
||||
if source_type == "framework":
|
||||
mapped_source_type = "standard"
|
||||
|
||||
rows.append({
|
||||
"regulation_id": synthetic_id,
|
||||
"regulation_name_de": name,
|
||||
"license_rule": 1, # default: conservative
|
||||
"license_type": "",
|
||||
"attribution": None,
|
||||
"source_type": mapped_source_type,
|
||||
"jurisdiction": jurisdiction,
|
||||
"status": "needs_review", # needs manual review since we guessed
|
||||
})
|
||||
|
||||
return rows
|
||||
|
||||
|
||||
def generate_sql(rows: list[dict]) -> str:
    """Generate INSERT SQL for all rows."""
    lines = [
        "SET search_path TO compliance, public;",
        "",
        "-- Auto-generated by f1_migrate_regulation_registry.py",
        f"-- {len(rows)} rows total",
        "",
    ]

    for row in rows:
        # Escape free-text values so an embedded single quote cannot break
        # the generated statement.
        attr = f"'{_escape_sql(row['attribution'])}'" if row["attribution"] else "NULL"
        lines.append(
            f"INSERT INTO regulation_registry "
            f"(regulation_id, regulation_name_de, license_rule, license_type, "
            f"attribution, source_type, jurisdiction, status) "
            f"VALUES ("
            f"'{_escape_sql(row['regulation_id'])}', "
            f"'{_escape_sql(row['regulation_name_de'])}', "
            f"{row['license_rule']}, "
            f"'{_escape_sql(row['license_type'])}', "
            f"{attr}, "
            f"'{row['source_type']}', "
            f"'{row['jurisdiction']}', "
            f"'{row['status']}'"
            f") ON CONFLICT (regulation_id) DO UPDATE SET "
            f"regulation_name_de = EXCLUDED.regulation_name_de, "
            f"license_rule = EXCLUDED.license_rule, "
            f"license_type = EXCLUDED.license_type, "
            f"attribution = EXCLUDED.attribution, "
            f"source_type = EXCLUDED.source_type, "
            f"jurisdiction = EXCLUDED.jurisdiction;"
        )

    return "\n".join(lines)
|
||||
|
||||
|
||||
def _escape_sql(val: str) -> str:
|
||||
"""Escape single quotes for SQL."""
|
||||
return val.replace("'", "''")
|
||||
|
||||
|
||||
def insert_via_sqlalchemy(rows: list[dict], db_host: str) -> int:
|
||||
"""Insert rows using SQLAlchemy (same pattern as control-pipeline)."""
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
url = f"postgresql://breakpilot:breakpilot123@{db_host}:5432/breakpilot_db"
|
||||
engine = create_engine(url)
|
||||
|
||||
inserted = 0
|
||||
with engine.connect() as conn:
|
||||
conn.execute(text("SET search_path TO compliance, public"))
|
||||
for row in rows:
|
||||
conn.execute(
|
||||
text("""
|
||||
INSERT INTO regulation_registry
|
||||
(regulation_id, regulation_name_de, license_rule, license_type,
|
||||
attribution, source_type, jurisdiction, status)
|
||||
VALUES
|
||||
(:regulation_id, :regulation_name_de, :license_rule, :license_type,
|
||||
:attribution, :source_type, :jurisdiction, :status)
|
||||
ON CONFLICT (regulation_id) DO UPDATE SET
|
||||
regulation_name_de = EXCLUDED.regulation_name_de,
|
||||
license_rule = EXCLUDED.license_rule,
|
||||
license_type = EXCLUDED.license_type,
|
||||
attribution = EXCLUDED.attribution,
|
||||
source_type = EXCLUDED.source_type,
|
||||
jurisdiction = EXCLUDED.jurisdiction
|
||||
"""),
|
||||
row,
|
||||
)
|
||||
inserted += 1
|
||||
conn.commit()
|
||||
|
||||
return inserted
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Migrate regulation registry data")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Print SQL only")
|
||||
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
|
||||
args = parser.parse_args()
|
||||
|
||||
rows = build_rows()
|
||||
print(f"Built {len(rows)} rows from hardcoded dicts")
|
||||
|
||||
# Stats
|
||||
by_rule = {}
|
||||
by_status = {}
|
||||
for r in rows:
|
||||
by_rule[r["license_rule"]] = by_rule.get(r["license_rule"], 0) + 1
|
||||
by_status[r["status"]] = by_status.get(r["status"], 0) + 1
|
||||
print(f" By license_rule: {by_rule}")
|
||||
print(f" By status: {by_status}")
|
||||
|
||||
if args.dry_run:
|
||||
print("\n--- DRY RUN (SQL output) ---\n")
|
||||
print(generate_sql(rows))
|
||||
return
|
||||
|
||||
inserted = insert_via_sqlalchemy(rows, args.db_host)
|
||||
print(f"Inserted/updated {inserted} rows into regulation_registry")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
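The synthetic `regulation_id` for name-keyed entries is built by a chain of `.replace()` calls. The same transformation can be sketched as a table-driven helper, which makes the replacement order explicit and easy to test. The function name `synthetic_id` and the sample names are hypothetical.

```python
# Replacement pairs applied in order, mirroring the chained .replace() calls.
_REPLACEMENTS = [
    (" ", "_"), ("(", ""), (")", ""), ("/", "_"), ("-", "_"),
    (".", ""), (",", ""),
    ("ä", "ae"), ("ö", "oe"), ("ü", "ue"),
    ("á", "a"), ("é", "e"), ("ó", "o"),
]


def synthetic_id(name: str, max_len: int = 100) -> str:
    """Derive a lowercase, underscore-separated ID from a regulation name."""
    slug = name.lower()
    for old, new in _REPLACEMENTS:
        slug = slug.replace(old, new)
    # Trim stray underscores first, then cap the length, as in the script.
    return slug.strip("_")[:max_len]


# Hypothetical names, for illustration only.
print(synthetic_id("BSI-Grundschutz (Kompendium)"))
print(synthetic_id("Österreich DSG"))
print(synthetic_id("Maßnahme"))  # "ß" is not in the table and passes through unchanged
```

Characters outside the table, such as "ß" or an apostrophe, survive into the ID, so any SQL built from it by string formatting still needs escaping.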
|
||||
@@ -0,0 +1,206 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
F2 Migration: Populate action_types + action_synonyms from hardcoded dicts.
|
||||
|
||||
Sources:
|
||||
- ACTION_TYPES (control_ontology.py) — 26 types + ~150 aliases
|
||||
- _NEGATIVE_PATTERNS (control_ontology.py) — 22 patterns
|
||||
- _ACTION_SYNONYMS (control_dedup.py) — 65 synonyms
|
||||
|
||||
Usage:
|
||||
python3 scripts/f2_migrate_actions.py --dry-run
|
||||
python3 scripts/f2_migrate_actions.py --db-host macmini
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
from services.control_ontology import ACTION_TYPES, _NEGATIVE_PATTERNS # noqa: E402
|
||||
from services.control_dedup import _ACTION_SYNONYMS # noqa: E402
|
||||
|
||||
# Extra action types found in _ACTION_SYNONYMS but missing from ACTION_TYPES
|
||||
_EXTRA_ACTION_TYPES = {
|
||||
"audit": "evidence",
|
||||
"log": "evidence",
|
||||
"block": "implementation",
|
||||
"authorize": "governance",
|
||||
"authenticate": "implementation",
|
||||
"update": "operation",
|
||||
"backup": "operation",
|
||||
"restore": "operation",
|
||||
}
|
||||
|
||||
|
||||
def build_action_types() -> list[dict]:
|
||||
"""Build action_types rows from ACTION_TYPES + extras."""
|
||||
rows = []
|
||||
for name, info in ACTION_TYPES.items():
|
||||
rows.append({
|
||||
"canonical_name": name,
|
||||
"phase": info["phase"],
|
||||
})
|
||||
for name, phase in _EXTRA_ACTION_TYPES.items():
|
||||
if name not in ACTION_TYPES:
|
||||
rows.append({
|
||||
"canonical_name": name,
|
||||
"phase": phase,
|
||||
})
|
||||
return rows
|
||||
|
||||
|
||||
def build_action_synonyms() -> list[dict]:
|
||||
"""Build action_synonyms rows from all 3 sources."""
|
||||
rows = []
|
||||
seen: set[tuple[str, str, str]] = set() # (synonym, language, pattern_type)
|
||||
|
||||
# 1) Aliases from ACTION_TYPES
|
||||
for action_type, info in ACTION_TYPES.items():
|
||||
for alias in info.get("aliases", []):
|
||||
key = (alias.lower(), "de", "alias")
|
||||
if key not in seen:
|
||||
seen.add(key)
|
||||
rows.append({
|
||||
"canonical_action": action_type,
|
||||
"synonym": alias.lower(),
|
||||
"language": "de",
|
||||
"source": "migration",
|
||||
"pattern_type": "alias",
|
||||
})
|
||||
|
||||
# 2) Negative patterns
|
||||
for pattern, action_type in _NEGATIVE_PATTERNS:
|
||||
key = (pattern.lower(), "de", "negative_pattern")
|
||||
if key not in seen:
|
||||
seen.add(key)
|
||||
rows.append({
|
||||
"canonical_action": action_type,
|
||||
"synonym": pattern.lower(),
|
||||
"language": "de",
|
||||
"source": "migration",
|
||||
"pattern_type": "negative_pattern",
|
||||
})
|
||||
|
||||
# 3) _ACTION_SYNONYMS (German → canonical English)
|
||||
for synonym, canonical in _ACTION_SYNONYMS.items():
|
||||
# Determine language
|
||||
lang = "en" if synonym == canonical else "de"
|
||||
key = (synonym.lower(), lang, "alias")
|
||||
if key not in seen:
|
||||
seen.add(key)
|
||||
# Map canonical to valid action_type
|
||||
action = _map_dedup_canonical(canonical)
|
||||
rows.append({
|
||||
"canonical_action": action,
|
||||
"synonym": synonym.lower(),
|
||||
"language": lang,
|
||||
"source": "migration",
|
||||
"pattern_type": "alias",
|
||||
})
|
||||
|
||||
return rows
|
||||
|
||||
|
||||
def _map_dedup_canonical(canonical: str) -> str:
|
||||
"""Map control_dedup canonical names to action_types names."""
|
||||
# Most map directly, some need adjustment
|
||||
mapping = {
|
||||
"test": "test",
|
||||
"verify": "verify", # in ACTION_TYPES
|
||||
"validate": "validate", # in ACTION_TYPES
|
||||
"audit": "audit",
|
||||
"log": "log",
|
||||
"block": "block",
|
||||
"restrict": "restrict_access",
|
||||
"authorize": "authorize",
|
||||
"authenticate": "authenticate",
|
||||
"update": "update",
|
||||
"backup": "backup",
|
||||
"restore": "restore",
|
||||
}
|
||||
return mapping.get(canonical, canonical)
|
||||
|
||||
|
||||
def insert_via_sqlalchemy(action_types: list[dict], synonyms: list[dict], db_host: str):
    """Insert rows using SQLAlchemy."""
    from sqlalchemy import create_engine, text

    url = "postgresql://breakpilot:breakpilot123@%s:5432/breakpilot_db" % db_host
    engine = create_engine(url)

    with engine.connect() as conn:
        conn.execute(text("SET search_path TO compliance, public"))

        # Insert action_types
        for row in action_types:
            conn.execute(
                text("""
                    INSERT INTO action_types (canonical_name, phase)
                    VALUES (:canonical_name, :phase)
                    ON CONFLICT (canonical_name) DO UPDATE SET
                        phase = EXCLUDED.phase
                """),
                row,
            )
        print("Inserted %d action_types" % len(action_types))

        # Insert action_synonyms
        inserted = 0
        skipped = 0
        for row in synonyms:
            try:
                # Run each insert inside a SAVEPOINT. PostgreSQL aborts the
                # whole transaction after a failed statement, so without the
                # savepoint every later insert in this loop would fail too.
                with conn.begin_nested():
                    conn.execute(
                        text("""
                            INSERT INTO action_synonyms
                                (canonical_action, synonym, language, source, pattern_type)
                            VALUES
                                (:canonical_action, :synonym, :language, :source, :pattern_type)
                            ON CONFLICT (synonym, language, pattern_type) DO UPDATE SET
                                canonical_action = EXCLUDED.canonical_action,
                                source = EXCLUDED.source
                        """),
                        row,
                    )
                inserted += 1
            except Exception as e:
                print(" Skip %s: %s" % (row["synonym"], e))
                skipped += 1

        conn.commit()
        print("Inserted %d action_synonyms (%d skipped)" % (inserted, skipped))
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Migrate action types + synonyms")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Print stats only")
|
||||
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
|
||||
args = parser.parse_args()
|
||||
|
||||
action_types = build_action_types()
|
||||
synonyms = build_action_synonyms()
|
||||
|
||||
print("Action types: %d" % len(action_types))
|
||||
print("Action synonyms: %d" % len(synonyms))
|
||||
by_type = {}
|
||||
for s in synonyms:
|
||||
by_type[s["pattern_type"]] = by_type.get(s["pattern_type"], 0) + 1
|
||||
print(" By pattern_type: %s" % by_type)
|
||||
by_source = {}
|
||||
for s in synonyms:
|
||||
by_source[s["canonical_action"]] = by_source.get(s["canonical_action"], 0) + 1
|
||||
print(" Top actions: %s" % dict(sorted(by_source.items(), key=lambda x: -x[1])[:10]))
|
||||
|
||||
if args.dry_run:
|
||||
print("\n--- DRY RUN ---")
|
||||
print("\nAction types:")
|
||||
for at in action_types:
|
||||
print(" %s (%s)" % (at["canonical_name"], at["phase"]))
|
||||
return
|
||||
|
||||
insert_via_sqlalchemy(action_types, synonyms, args.db_host)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
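The synonym builder in this script merges three sources and keeps only the first row for each `(synonym, language, pattern_type)` key. A minimal sketch of that first-wins deduplication, detached from the imported dicts, is shown below. The helper name `dedup_synonyms` and the sample tuples are hypothetical.

```python
def dedup_synonyms(candidates):
    """Keep the first row per (synonym, language, pattern_type) key.

    `candidates` is an iterable of
    (canonical_action, synonym, language, pattern_type) tuples.
    """
    seen = set()
    rows = []
    for canonical, synonym, language, pattern_type in candidates:
        key = (synonym.lower(), language, pattern_type)
        if key in seen:
            continue  # an earlier source already claimed this synonym
        seen.add(key)
        rows.append({
            "canonical_action": canonical,
            "synonym": synonym.lower(),
            "language": language,
            "source": "migration",
            "pattern_type": pattern_type,
        })
    return rows


# Hypothetical input: the second tuple collides with the first after lowercasing.
sample = [
    ("verify", "prüfen", "de", "alias"),
    ("test", "Prüfen", "de", "alias"),
    ("verify", "prüfen", "de", "negative_pattern"),
]
for row in dedup_synonyms(sample):
    print(row["synonym"], row["pattern_type"], "->", row["canonical_action"])
```

Because the first source wins, the order in which the sources are processed decides which canonical action a contested synonym ends up with.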
|
||||
@@ -0,0 +1,100 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
F3 Migration: Populate object_synonyms from hardcoded dict.
|
||||
|
||||
Source: _OBJECT_SYNONYMS (control_dedup.py) — 75 synonyms
|
||||
|
||||
Usage:
|
||||
python3 scripts/f3_migrate_objects.py --dry-run
|
||||
python3 scripts/f3_migrate_objects.py --db-host macmini
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
from services.control_dedup import _OBJECT_SYNONYMS # noqa: E402
|
||||
|
||||
|
||||
def build_rows() -> list[dict]:
|
||||
"""Build object_synonyms rows."""
|
||||
rows = []
|
||||
for synonym, canonical in _OBJECT_SYNONYMS.items():
|
||||
# Detect language (heuristic: German if contains umlauts or common DE words)
|
||||
lang = "de"
|
||||
lower = synonym.lower()
|
||||
if all(c in "abcdefghijklmnopqrstuvwxyz0123456789 -_" for c in lower):
|
||||
# Pure ASCII — likely English
|
||||
lang = "en"
|
||||
# Override for known German without umlauts
|
||||
if lower in ("passwort", "kennwort", "zugangsdaten", "fernzugriff",
|
||||
"sitzung", "firewall", "netzwerk", "vorfall",
|
||||
"schwachstelle", "richtlinie", "schulung",
|
||||
"protokoll", "datensicherung", "wiederherstellung"):
|
||||
lang = "de"
|
||||
|
||||
rows.append({
|
||||
"canonical_token": canonical,
|
||||
"synonym": lower,
|
||||
"language": lang,
|
||||
"source": "migration",
|
||||
})
|
||||
return rows
|
||||
|
||||
|
||||
def insert_via_sqlalchemy(rows: list[dict], db_host: str):
|
||||
"""Insert rows using SQLAlchemy."""
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
url = "postgresql://breakpilot:breakpilot123@%s:5432/breakpilot_db" % db_host
|
||||
engine = create_engine(url)
|
||||
|
||||
with engine.connect() as conn:
|
||||
conn.execute(text("SET search_path TO compliance, public"))
|
||||
|
||||
inserted = 0
|
||||
for row in rows:
|
||||
conn.execute(
|
||||
text("""
|
||||
INSERT INTO object_synonyms
|
||||
(canonical_token, synonym, language, source)
|
||||
VALUES
|
||||
(:canonical_token, :synonym, :language, :source)
|
||||
ON CONFLICT (synonym, language) DO UPDATE SET
|
||||
canonical_token = EXCLUDED.canonical_token,
|
||||
source = EXCLUDED.source
|
||||
"""),
|
||||
row,
|
||||
)
|
||||
inserted += 1
|
||||
|
||||
conn.commit()
|
||||
print("Inserted %d object_synonyms" % inserted)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Migrate object synonyms")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Print stats only")
|
||||
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
|
||||
args = parser.parse_args()
|
||||
|
||||
rows = build_rows()
|
||||
print("Object synonyms: %d" % len(rows))
|
||||
|
||||
# Group by canonical
|
||||
by_canonical = {}
|
||||
for r in rows:
|
||||
by_canonical[r["canonical_token"]] = by_canonical.get(r["canonical_token"], 0) + 1
|
||||
print("Unique canonical tokens: %d" % len(by_canonical))
|
||||
print("Top tokens: %s" % dict(sorted(by_canonical.items(), key=lambda x: -x[1])[:10]))
|
||||
|
||||
if args.dry_run:
|
||||
return
|
||||
|
||||
insert_via_sqlalchemy(rows, args.db_host)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
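The language detection for object synonyms is a heuristic: pure-ASCII strings are treated as English unless they appear in a list of known German words. The sketch below reproduces that rule in isolation to show where it holds and where it misfires. The helper name `guess_language` is hypothetical, and the override set is shortened for illustration.

```python
_ASCII_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789 -_")

# Shortened, illustrative override list (the script uses a longer tuple).
_KNOWN_GERMAN = {"passwort", "kennwort", "netzwerk", "richtlinie"}


def guess_language(synonym: str) -> str:
    """Return "en" for pure-ASCII synonyms not known to be German, else "de"."""
    lower = synonym.lower()
    is_ascii = all(c in _ASCII_CHARS for c in lower)
    if is_ascii and lower not in _KNOWN_GERMAN:
        return "en"
    return "de"


for word in ["password", "Passwort", "schlüssel", "benutzerkonto"]:
    print(word, "->", guess_language(word))
```

German words without umlauts that are missing from the override list, such as "benutzerkonto" here, are labeled English, so the override list has to grow with the synonym dictionary.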
|
||||
@@ -0,0 +1,267 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
F4: LLM-based Synonym Enrichment for Action Types and Object Tokens.
|
||||
|
||||
Uses Ollama (qwen3.5:35b-a3b) to generate additional German synonyms
|
||||
for each canonical action type and object token. Results are stored
|
||||
with source='llm' in the DB.
|
||||
|
||||
Usage:
|
||||
# Dry run (print, no DB write):
|
||||
python3 scripts/f4_llm_enrich_synonyms.py --dry-run
|
||||
|
||||
# Against Mac Mini:
|
||||
python3 scripts/f4_llm_enrich_synonyms.py --db-host macmini --ollama-host macmini
|
||||
|
||||
# Only actions or only objects:
|
||||
python3 scripts/f4_llm_enrich_synonyms.py --actions-only
|
||||
python3 scripts/f4_llm_enrich_synonyms.py --objects-only
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("f4-enrich")
|
||||
|
||||
OLLAMA_MODEL = "qwen3.5:35b-a3b"
|
||||
|
||||
|
||||
def call_ollama(prompt: str, ollama_url: str) -> str:
    """Call Ollama with think:false for direct answers."""
    resp = httpx.post(
        f"{ollama_url}/api/chat",
        json={
            "model": OLLAMA_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
            "options": {"temperature": 0.3},
            "think": False,
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json().get("message", {}).get("content", "")


def enrich_action_types(db_url: str, ollama_url: str, dry_run: bool) -> dict:
    """Generate synonyms for each action type."""
    engine = create_engine(db_url, connect_args={"options": "-c search_path=compliance,public"})

    with engine.connect() as conn:
        # Get existing action types and their current synonyms
        types = conn.execute(text("SELECT canonical_name, phase FROM action_types")).fetchall()
        existing = {}
        for row in conn.execute(text("SELECT canonical_action, synonym FROM action_synonyms")).fetchall():
            existing.setdefault(row[0], set()).add(row[1])

    stats = {"types_processed": 0, "new_synonyms": 0, "skipped": 0}
    all_new: list[dict] = []

    for canonical, phase in types:
        current_synonyms = existing.get(canonical, set())

        prompt = f"""Du bist ein Compliance-Experte. Gib mir 5-8 deutsche Synonyme oder Umschreibungen fuer die Handlung "{canonical}" (Phase: {phase}) im Kontext von IT-Compliance und Datenschutz.

Bestehende Synonyme (NICHT wiederholen): {', '.join(sorted(current_synonyms)[:10])}

Antworte NUR mit einer JSON-Liste von Strings, z.B.: ["synonym1", "synonym2", ...]
Keine Erklaerungen, nur die JSON-Liste."""

        try:
            response = call_ollama(prompt, ollama_url)
            # Parse JSON from response
            synonyms = _parse_json_list(response)

            new_count = 0
            for syn in synonyms:
                syn_lower = syn.lower().strip()
                if not syn_lower or len(syn_lower) < 3:
                    continue
                if syn_lower in current_synonyms:
                    stats["skipped"] += 1
                    continue
                all_new.append({
                    "canonical_action": canonical,
                    "synonym": syn_lower,
                    "language": "de",
                    "source": "llm",
                    "pattern_type": "alias",
                })
                current_synonyms.add(syn_lower)
                new_count += 1

            stats["types_processed"] += 1
            stats["new_synonyms"] += new_count
            logger.info("%s: +%d new synonyms", canonical, new_count)

        except Exception as e:
            logger.warning("Error for %s: %s", canonical, e)

        time.sleep(1)  # Rate limit

    # Write to DB
    if not dry_run and all_new:
        with engine.begin() as conn:
            for row in all_new:
                conn.execute(
                    text("""
                        INSERT INTO action_synonyms (canonical_action, synonym, language, source, pattern_type)
                        VALUES (:canonical_action, :synonym, :language, :source, :pattern_type)
                        ON CONFLICT (synonym, language, pattern_type) DO NOTHING
                    """),
                    row,
                )
        logger.info("Wrote %d new action synonyms to DB", len(all_new))
    elif dry_run:
        print("\n--- DRY RUN: Action Synonyms ---")
        for row in all_new[:20]:
            print(" %s → %s" % (row["canonical_action"], row["synonym"]))
        if len(all_new) > 20:
            print(" ... and %d more" % (len(all_new) - 20))

    return stats


def enrich_object_tokens(db_url: str, ollama_url: str, dry_run: bool) -> dict:
    """Generate synonyms for each object canonical token."""
    engine = create_engine(db_url, connect_args={"options": "-c search_path=compliance,public"})

    with engine.connect() as conn:
        # Get unique canonical tokens
        tokens = conn.execute(text(
            "SELECT DISTINCT canonical_token FROM object_synonyms ORDER BY canonical_token"
        )).fetchall()
        existing = {}
        for row in conn.execute(text("SELECT canonical_token, synonym FROM object_synonyms")).fetchall():
            existing.setdefault(row[0], set()).add(row[1])

    stats = {"tokens_processed": 0, "new_synonyms": 0, "skipped": 0}
    all_new: list[dict] = []

    for (token,) in tokens:
        current_synonyms = existing.get(token, set())

        prompt = f"""Du bist ein IT-Security-Experte. Gib mir 5-8 deutsche und englische Begriffe/Synonyme fuer das Konzept "{token}" im Kontext von IT-Sicherheit und Compliance.

Bestehende Synonyme (NICHT wiederholen): {', '.join(sorted(current_synonyms)[:8])}

Antworte NUR mit einer JSON-Liste von Strings, z.B.: ["synonym1", "synonym2", ...]
Keine Erklaerungen, nur die JSON-Liste."""

        try:
            response = call_ollama(prompt, ollama_url)
            synonyms = _parse_json_list(response)

            new_count = 0
            for syn in synonyms:
                syn_lower = syn.lower().strip()
                if not syn_lower or len(syn_lower) < 2:
                    continue
                if syn_lower in current_synonyms:
                    stats["skipped"] += 1
                    continue
                # Detect language. Heuristic only: any ASCII-only string is tagged
                # "en", so German terms without umlauts or ß are tagged "en" as well.
                lang = "de"
                if all(c in "abcdefghijklmnopqrstuvwxyz0123456789 -_" for c in syn_lower):
                    lang = "en"
                all_new.append({
                    "canonical_token": token,
                    "synonym": syn_lower,
                    "language": lang,
                    "source": "llm",
                })
                current_synonyms.add(syn_lower)
                new_count += 1

            stats["tokens_processed"] += 1
            stats["new_synonyms"] += new_count
            logger.info("%s: +%d new synonyms", token, new_count)

        except Exception as e:
            logger.warning("Error for %s: %s", token, e)

        time.sleep(1)

    # Write to DB
    if not dry_run and all_new:
        with engine.begin() as conn:
            for row in all_new:
                conn.execute(
                    text("""
                        INSERT INTO object_synonyms (canonical_token, synonym, language, source)
                        VALUES (:canonical_token, :synonym, :language, :source)
                        ON CONFLICT (synonym, language) DO NOTHING
                    """),
                    row,
                )
        logger.info("Wrote %d new object synonyms to DB", len(all_new))
    elif dry_run:
        print("\n--- DRY RUN: Object Synonyms ---")
        for row in all_new[:20]:
            print(" %s → %s (%s)" % (row["canonical_token"], row["synonym"], row["language"]))
        if len(all_new) > 20:
            print(" ... and %d more" % (len(all_new) - 20))

    return stats


def _parse_json_list(raw: str) -> list[str]:
    """Extract a JSON list of strings from an LLM response."""
    # Try to find JSON array in response
    raw = raw.strip()
    # Remove markdown code fences
    if "```" in raw:
        raw = raw.split("```")[1] if raw.count("```") >= 2 else raw
        raw = raw.strip()
        if raw.startswith("json"):
            raw = raw[4:].strip()

    # Find first [ and last ]
    start = raw.find("[")
    end = raw.rfind("]")
    if start >= 0 and end > start:
        try:
            parsed = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return []
        # Keep only string items so callers can safely call .lower() on them
        if isinstance(parsed, list):
            return [item for item in parsed if isinstance(item, str)]
    return []


def main():
    parser = argparse.ArgumentParser(description="LLM Synonym Enrichment")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--db-host", default="localhost")
    parser.add_argument("--ollama-host", default="localhost")
    parser.add_argument("--actions-only", action="store_true")
    parser.add_argument("--objects-only", action="store_true")
    args = parser.parse_args()

    db_url = f"postgresql://breakpilot:breakpilot123@{args.db_host}:5432/breakpilot_db"
    ollama_url = f"http://{args.ollama_host}:11434"

    if args.dry_run:
        print("=== DRY RUN MODE ===\n")

    if not args.objects_only:
        print("=== Enriching Action Types ===")
        action_stats = enrich_action_types(db_url, ollama_url, args.dry_run)
        print("Actions: %d processed, %d new synonyms\n" % (
            action_stats["types_processed"], action_stats["new_synonyms"]))

    if not args.actions_only:
        print("=== Enriching Object Tokens ===")
        object_stats = enrich_object_tokens(db_url, ollama_url, args.dry_run)
        print("Objects: %d processed, %d new synonyms\n" % (
            object_stats["tokens_processed"], object_stats["new_synonyms"]))


if __name__ == "__main__":
    main()
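The `_parse_json_list` helper in the script above has to cope with replies that wrap the JSON array in Markdown fences or surround it with prose. The following is a minimal standalone sketch of that extraction idea, not the script's own code: `extract_json_array` is a hypothetical name, and it uses a regular expression for the fences instead of the script's `split` approach.

```python
import json
import re

# Build the fence marker programmatically (three backticks).
FENCE = "`" * 3
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json_array(raw: str) -> list[str]:
    """Return the string items of the first JSON array found in an LLM reply."""
    # Prefer the contents of a fenced block if one is present.
    fenced = FENCED_BLOCK.search(raw)
    candidate = fenced.group(1) if fenced else raw

    # Take everything between the first "[" and the last "]".
    start = candidate.find("[")
    end = candidate.rfind("]")
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Drop non-string items so callers can safely call .lower() on each entry.
    return [item for item in parsed if isinstance(item, str)]
```

Returning an empty list on every failure path keeps the caller's loop simple: a malformed reply just yields zero new synonyms for that token.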
@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Add L2 sub-topics to broad tokens. Instead of just "incident",
produces "incident_response_plan", "incident_detection", etc.

Only processes tokens with >500 controls AND <90% audit accuracy.

Usage:
    python3 /app/scripts/gpre0_add_subtopics.py --dry-run
    python3 /app/scripts/gpre0_add_subtopics.py
"""

import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-subtopics")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_subtopic_checkpoints")

# Tokens that are too broad — need L2 sub-topics
BROAD_TOKENS = {
    # Round 1 (already done)
    "risk_management", "policy", "audit_logging", "incident",
    "access_control", "compliance_audit", "asset_management",
    "key_management", "third_party_management", "monitoring",
    "financial_reporting", "data_classification", "change_management",
    "alerting", "multi_factor_auth", "api_security",
    "certificate_management", "human_resources_security",
    "training", "data_processing_agreement", "data_processing_register",
    "consumer_protection", "input_validation", "vulnerability",
    "dpia", "data_breach_notification", "backup",
    "supply_chain_due_diligence", "awareness",
    "privacy_by_design", "credentials", "logging_configuration",
    # Round 2 (remaining large tokens)
    "supervisory_authority", "certification", "secure_development",
    "product_safety", "personal_data", "data_subject_rights", "consent",
    "ai_system", "encryption", "data_retention", "disaster_recovery",
    "data_transfer", "aml", "transport_encryption", "network_security",
    "physical_security", "medical_device", "patch_management",
    "cookie_consent", "video_surveillance", "network_segmentation",
    "telecommunications", "privileged_access", "session_management",
    "password_policy", "governance", "whistleblowing", "payment_services",
    "health_data", "sensitive_data", "ecommerce", "sustainability_reporting",
    "critical_infrastructure", "regulatory",
}

SYSTEM_PROMPT = """Du bist ein Compliance-Spezialist. Jeder Control hat bereits ein Hauptthema (L1 Token).
Deine Aufgabe: Bestimme ein SPEZIFISCHES Sub-Thema (L2) innerhalb des Hauptthemas.

Das L2 Sub-Thema soll den KONKRETEN Aspekt beschreiben. Verwende kurze, klare englische Bezeichnungen.

Beispiele:
- L1=incident, Titel="Incident Response Plan erstellen" → L2="response_plan"
- L1=incident, Titel="Sicherheitsvorfälle erkennen" → L2="detection"
- L1=incident, Titel="Recovery nach Vorfall dokumentieren" → L2="recovery"
- L1=incident, Titel="Forensische Analyse durchführen" → L2="forensics"
- L1=risk_management, Titel="Risikobewertung durchführen" → L2="assessment"
- L1=risk_management, Titel="Risikominderungsmaßnahmen umsetzen" → L2="treatment"
- L1=risk_management, Titel="Restrisiko akzeptieren" → L2="acceptance"
- L1=access_control, Titel="Rollenbasierte Zugriffskontrolle" → L2="rbac"
- L1=access_control, Titel="Zugriffsrechte regelmäßig prüfen" → L2="access_review"
- L1=access_control, Titel="Identitätsmanagement implementieren" → L2="identity_management"
- L1=monitoring, Titel="Systemverfügbarkeit überwachen" → L2="availability"
- L1=monitoring, Titel="Sicherheitsereignisse überwachen" → L2="security_events"
- L1=policy, Titel="Datenschutzrichtlinie erstellen" → L2="data_protection"
- L1=policy, Titel="Acceptable Use Policy definieren" → L2="acceptable_use"
- L1=policy, Titel="Passwortrichtlinie festlegen" → L2="password"
- L1=financial_reporting, Titel="Jahresabschluss erstellen" → L2="annual_accounts"
- L1=financial_reporting, Titel="Steuererklärung einreichen" → L2="tax"
- L1=alerting, Titel="Datenpanne an Behörde melden" → L2="breach_notification"
- L1=alerting, Titel="Sicherheitswarnung eskalieren" → L2="escalation"

REGELN:
- L2 soll 1-3 Wörter sein, snake_case
- L2 soll SPEZIFISCH sein (nicht das L1 wiederholen)
- Verwende konsistente L2-Bezeichnungen für ähnliche Controls

Antworte NUR als JSON-Array: [{"id":"...","l2":"subtopic"}, ...]"""


def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
    """Send batch to Claude for L2 sub-topic assignment."""
    items = []
    for c in controls_batch:
        items.append(
            f'- id="{c["control_id"]}" '
            f'L1="{c["current_object"]}" '
            f't="{c["title"]}" '
            f'o="{c["objective"][:80]}"'
        )

    prompt = "Bestimme L2 Sub-Topics:\n" + "\n".join(items)

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 1500,
        "temperature": 0.0,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }

    try:
        resp = httpx.post(
            ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
        )
        resp.raise_for_status()
        data = resp.json()
        content = data.get("content", [{}])[0].get("text", "")
        usage = data.get("usage", {})
        start = content.find("[")
        end = content.rfind("]") + 1
        if start >= 0 and end > start:
            return json.loads(content[start:end]), usage
        return [], usage
    except httpx.TimeoutException:
        logger.error("TIMEOUT — skipping")
        return [], {}
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            logger.warning("Rate limited — waiting 60s")
            time.sleep(60)
        else:
            logger.error("API error %d", e.response.status_code)
        return [], {}
    except Exception as e:
        logger.error("Failed: %s", e)
        return [], {}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Build LIKE patterns for broad tokens
    like_clauses = " OR ".join(
        f"cc.generation_metadata->>'merge_group_hint' LIKE '%:{tok}:%'"
        for tok in BROAD_TOKENS
    )

    with engine.connect() as c:
        rows = c.execute(text(f"""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
              AND cc.release_state NOT IN ('deprecated', 'rejected')
              AND ({like_clauses})
        """)).fetchall()

    controls = []
    for uuid, cid, title, objective, hint in rows:
        parts = hint.split(":", 2) if hint else []
        obj = parts[1] if len(parts) > 1 else ""
        if obj in BROAD_TOKENS:
            controls.append({
                "uuid": str(uuid), "control_id": cid,
                "title": title or "", "objective": objective or "",
                "current_hint": hint, "current_object": obj,
            })

    logger.info("Found %d controls in broad tokens to add L2 sub-topics", len(controls))

    # Process
    total_tagged = 0
    total_skipped = 0
    total_input_tokens = 0
    total_output_tokens = 0
    corrections = []
    l2_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    for i in range(0, len(controls), args.batch_size):
        batch = controls[i:i + args.batch_size]
        results, usage = call_claude(batch)

        total_input_tokens += usage.get("input_tokens", 0)
        total_output_tokens += usage.get("output_tokens", 0)

        if not results:
            total_skipped += len(batch)
            continue

        result_map = {r.get("id", ""): r for r in results}
        for ctrl in batch:
            r = result_map.get(ctrl["control_id"], {})
            l2 = r.get("l2", "")
            if not l2:
                total_skipped += 1
                continue

            total_tagged += 1
            old_hint = ctrl["current_hint"]
            parts = old_hint.split(":", 2)
            action = parts[0] if parts else "implement"
            l1 = parts[1] if len(parts) > 1 else "unknown"
            phase = parts[2] if len(parts) > 2 else "implementation"
            # New format: action:L1_L2:phase
            new_obj = f"{l1}_{l2}"
            new_hint = f"{action}:{new_obj}:{phase}"
            corrections.append({
                "uuid": ctrl["uuid"],
                "old_hint": old_hint,
                "new_hint": new_hint,
            })
            l2_stats[l1][l2] += 1

        processed = min(i + args.batch_size, len(controls))
        if processed % 5000 < args.batch_size or processed >= len(controls):
            logger.info(
                "Progress: %d/%d (tagged=%d skip=%d)",
                processed, len(controls), total_tagged, total_skipped,
            )

        time.sleep(0.3)

    # Report
    # Assumed Claude Haiku 4.5 list prices in USD per million tokens
    # (1.00 input, 5.00 output); check current pricing before relying on this.
    cost_in = total_input_tokens / 1_000_000 * 1.00
    cost_out = total_output_tokens / 1_000_000 * 5.00
    logger.info("\n" + "=" * 60)
    logger.info("SUBTOPIC REPORT")
    logger.info("=" * 60)
    logger.info("Total: %d | Tagged: %d | Skipped: %d", len(controls), total_tagged, total_skipped)
    logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)

    # Show L2 distribution per L1
    for l1, subs in sorted(l2_stats.items()):
        top_subs = sorted(subs.items(), key=lambda x: -x[1])[:10]
        logger.info("\n%s (%d unique L2):", l1, len(subs))
        for l2, cnt in top_subs:
            logger.info(" %4d %s_%s", cnt, l1, l2)

    # Save corrections
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    corr_file = CHECKPOINT_DIR / "corrections_subtopics.json"
    corr_file.write_text(json.dumps(corrections))
    logger.info("\nSaved %d corrections to %s", len(corrections), corr_file)

    if args.dry_run:
        logger.info("DRY RUN — not updating DB")
        return

    if corrections:
        logger.info("Applying %d corrections...", len(corrections))
        with engine.begin() as c:
            c.execute(text("SET search_path TO compliance, public"))
            for corr in corrections:
                c.execute(text("""
                    UPDATE canonical_controls
                    SET generation_metadata = jsonb_set(
                        generation_metadata,
                        '{merge_group_hint}',
                        to_jsonb(CAST(:new_hint AS text))
                    )
                    WHERE id = CAST(:uuid AS uuid)
                """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
        logger.info("Done. %d hints updated.", len(corrections))


if __name__ == "__main__":
    main()
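The core transformation in the script above is the rewrite of `merge_group_hint` from `action:L1:phase` to `action:L1_L2:phase`. The sketch below isolates that step so it can be checked without a database or API call. `add_subtopic` is a hypothetical helper name; the fallback values mirror the ones used in the script's main loop.

```python
def add_subtopic(hint: str, l2: str) -> str:
    """Rewrite 'action:L1:phase' to 'action:L1_L2:phase'."""
    # Split into at most three parts so a phase containing ":" stays intact.
    parts = hint.split(":", 2)
    action = parts[0] if parts else "implement"
    l1 = parts[1] if len(parts) > 1 else "unknown"
    phase = parts[2] if len(parts) > 2 else "implementation"
    return f"{action}:{l1}_{l2}:{phase}"


if __name__ == "__main__":
    print(add_subtopic("define:incident:planning", "response_plan"))
```

Because the rewritten object (for example `incident_response_plan`) is normally no longer an exact member of `BROAD_TOKENS`, a second run of the script would be expected to skip controls that were already tagged.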
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Apply saved corrections from JSON file to DB (crash recovery)."""

import argparse
import json
import logging
import os
from pathlib import Path

from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("apply-corrections")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="Path to corrections JSON file")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    corrections = json.loads(Path(args.file).read_text())
    logger.info("Loaded %d corrections from %s", len(corrections), args.file)

    if args.dry_run:
        for c in corrections[:10]:
            logger.info(" %s: %s → %s", c["uuid"][:8], c["old_hint"], c["new_hint"])
        logger.info("DRY RUN — not applying")
        return

    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
    applied = 0
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        for corr in corrections:
            c.execute(text("""
                UPDATE canonical_controls
                SET generation_metadata = jsonb_set(
                    generation_metadata,
                    '{merge_group_hint}',
                    to_jsonb(CAST(:new_hint AS text))
                )
                WHERE id = CAST(:uuid AS uuid)
            """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
            applied += 1
    logger.info("Applied %d corrections.", applied)


if __name__ == "__main__":
    main()
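The recovery script above applies every record in the corrections file inside one transaction, so a single malformed record (for example an invalid UUID) would abort the whole run. One possible safeguard, sketched here as an assumption and not as part of the original script, is to filter the file before applying it. `load_corrections` is a hypothetical helper; the record shape (`uuid`, `old_hint`, `new_hint`) follows the JSON written by the sub-topic scripts.

```python
import json
import uuid
from pathlib import Path


def load_corrections(path: Path) -> list[dict]:
    """Load a corrections file and keep only well-formed records."""
    records = json.loads(path.read_text())
    valid = []
    for rec in records:
        try:
            # Raises if the key is missing or the value is not a valid UUID.
            uuid.UUID(rec["uuid"])
        except (KeyError, ValueError, TypeError, AttributeError):
            continue
        new_hint = rec.get("new_hint", "")
        # A hint is expected to look like "action:object:phase".
        if isinstance(new_hint, str) and len(new_hint.split(":", 2)) == 3:
            valid.append(rec)
    return valid
```

Comparing `len(valid)` with the number of records in the file before writing gives a quick signal that the checkpoint was truncated or hand-edited.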
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Fix bad L2 subtopics: stakeholder_*, escalation fragments, *_approval*, *_documentation."""

import json
import logging
import os
import time
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("fix-subtopics")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"

SYSTEM_PROMPT = """Du klassifizierst Controls mit einem L1_L2 Token. Das L2 soll den KONKRETEN fachlichen Aspekt beschreiben.

VERBOTENE L2-Wörter (zu generisch):
- stakeholder (zu vage — WER sind die Stakeholder? WAS wird getan?)
- documentation (ist eine Handlung, kein Thema)
- approval (ist eine Handlung)
- communication (zu vage)

Stattdessen SPEZIFISCH:
- "stakeholder_notification" bei Behördenmeldung → "authority_reporting"
- "stakeholder_consultation" bei DSFA → "impact_consultation"
- "stakeholder_engagement" bei Training → "participant_selection"
- "escalation_procedure" → "severity_classification" oder "response_plan"
- "access_documentation" → "access_policy" oder "permission_matrix"
- "approval_process" → "authorization_workflow" oder "sign_off"

L2 = 1-3 Wörter, snake_case, FACHLICH SPEZIFISCH.

Antworte NUR als JSON-Array: [{"id":"...","token":"L1_L2"}, ...]"""


def main():
    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})

    # Note: in a LIKE pattern "_" matches any single character, so patterns such
    # as '%_escalation_%' are broader than a literal underscore match.
    with engine.connect() as c:
        rows = c.execute(text("""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.release_state NOT IN ('deprecated', 'rejected')
              AND cc.generation_metadata->>'merge_group_hint' IS NOT NULL
              AND (
                cc.generation_metadata->>'merge_group_hint' LIKE '%stakeholder%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%_escalation_%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%_approval_%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%response_time%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%machine_re%'
                OR cc.generation_metadata->>'merge_group_hint' LIKE '%management_app%'
              )
        """)).fetchall()

    controls = []
    for uuid, cid, title, objective, hint in rows:
        parts = hint.split(":", 2) if hint else []
        controls.append({
            "uuid": str(uuid), "control_id": cid,
            "title": title or "", "objective": objective or "",
            "current_hint": hint,
            "current_object": parts[1] if len(parts) > 1 else "",
        })

    logger.info("Found %d controls with bad subtopics to fix", len(controls))

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }

    corrections = []
    total_fixed = 0
    batch_size = 20

    for i in range(0, len(controls), batch_size):
        batch = controls[i:i + batch_size]
        items = [
            f'- id="{c["control_id"]}" cur="{c["current_object"]}" t="{c["title"]}" o="{c["objective"][:80]}"'
            for c in batch
        ]

        try:
            resp = httpx.post(ANTHROPIC_URL, headers=headers, json={
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 1500, "temperature": 0.0,
                "system": SYSTEM_PROMPT,
                "messages": [{"role": "user", "content": "Fix:\n" + "\n".join(items)}],
            }, timeout=45.0)
            resp.raise_for_status()
            content = resp.json().get("content", [{}])[0].get("text", "")
            start = content.find("[")
            end = content.rfind("]") + 1
            results = json.loads(content[start:end]) if start >= 0 else []
        except Exception as e:
            logger.error("Batch %d failed: %s", i, e)
            continue

        result_map = {r.get("id", ""): r for r in results}
        for ctrl in batch:
            r = result_map.get(ctrl["control_id"], {})
            new_token = r.get("token", "")
            if not new_token or new_token == ctrl["current_object"]:
                continue
            if "stakeholder" in new_token or "approval" in new_token:
                continue  # Still bad

            parts = ctrl["current_hint"].split(":", 2)
            action = parts[0] if parts else "implement"
            phase = parts[2] if len(parts) > 2 else "implementation"
            corrections.append({
                "uuid": ctrl["uuid"],
                "old_hint": ctrl["current_hint"],
                "new_hint": f"{action}:{new_token}:{phase}",
            })
            total_fixed += 1

        if (i + batch_size) % 200 < batch_size:
            logger.info("Progress: %d/%d (fixed=%d)", min(i + batch_size, len(controls)), len(controls), total_fixed)
        time.sleep(0.3)

    logger.info("Fixed: %d of %d controls", total_fixed, len(controls))

    # Save + apply
    Path("/tmp/corrections_bad_subtopics.json").write_text(json.dumps(corrections))

    if corrections:
        logger.info("Applying %d corrections...", len(corrections))
        with engine.begin() as c:
            c.execute(text("SET search_path TO compliance, public"))
            for corr in corrections:
                c.execute(text("""
                    UPDATE canonical_controls
                    SET generation_metadata = jsonb_set(
                        generation_metadata,
                        '{merge_group_hint}',
                        to_jsonb(CAST(:new_hint AS text))
                    )
                    WHERE id = CAST(:uuid AS uuid)
                """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
        logger.info("Done.")


if __name__ == "__main__":
    main()
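The selection query in the script above uses patterns such as `'%_escalation_%'`. In SQL `LIKE`, `_` matches any single character and `%` any sequence, so those patterns match more than a literal underscore would. If a literal match is intended, the fragment can be escaped before it is bound as a parameter. The sketch below assumes PostgreSQL's default escape character (backslash); `escape_like` is a hypothetical helper, not something the script defines.

```python
def escape_like(fragment: str, escape: str = "\\") -> str:
    """Escape LIKE wildcards so the fragment is matched literally."""
    # Escape the escape character first, then the two wildcards.
    return (
        fragment.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


if __name__ == "__main__":
    # Surround the escaped fragment with real wildcards for a "contains" match.
    pattern = "%" + escape_like("_escalation_") + "%"
    print(pattern)
```

The resulting pattern would then be passed as a bound parameter (for example `LIKE :pat` with `{"pat": pattern}`) instead of being written into the SQL string, which also avoids quoting issues.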
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Fix generic tokens: Re-classify controls that were assigned to
action-based tokens (documentation, procedure, process, etc.)
instead of topic-based tokens.

Runs sequentially in 5 batches. NO retry on timeout.

Usage:
    python3 /app/scripts/gpre0_fix_generic_tokens.py --dry-run
    python3 /app/scripts/gpre0_fix_generic_tokens.py
"""

import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-fix-generic")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_fix_checkpoints")

# Tokens that are ACTION-based, not TOPIC-based → must be re-classified
FORBIDDEN_TOKENS = {
    "documentation", "procedure", "process",
    "compliance_reporting", "records_management",
}

SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control dem THEMA zu, nicht der Handlung.

KRITISCH: Die Tokens "documentation", "procedure", "process", "compliance_reporting",
"records_management" sind VERBOTEN. Klassifiziere nach dem INHALTLICHEN THEMA.

Beispiele:
- "Risikobewertung dokumentieren" → risk_management (NICHT documentation)
- "Incident-Verfahren definieren" → incident (NICHT procedure)
- "Verschlüsselungsprozess implementieren" → encryption (NICHT process)
- "Audit-Ergebnisse berichten" → compliance_audit (NICHT compliance_reporting)
- "Datenschutz-Unterlagen verwalten" → personal_data (NICHT records_management)
- "Löschkonzept dokumentieren" → data_retention (NICHT documentation)
- "Zertifizierungsverfahren definieren" → certification (NICHT procedure)
- "Schulungsprozess durchführen" → training (NICHT process)

ERLAUBTE TOKENS:

SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration

DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance

GOVERNANCE: policy, training, awareness, incident, risk_management,
third_party_management, change_management, asset_management,
human_resources_security

REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting

Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]"""


def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
|
||||
"""Send batch to Claude. NO retry on timeout."""
|
||||
items = []
|
||||
for c in controls_batch:
|
||||
items.append(
|
||||
f'- id="{c["control_id"]}" '
|
||||
f'cur="{c["current_object"]}" '
|
||||
f't="{c["title"]}" '
|
||||
f'o="{c["objective"][:100]}"'
|
||||
)
|
||||
|
||||
prompt = "Klassifiziere nach THEMA (nicht Handlung):\n" + "\n".join(items)
|
||||
|
||||
headers = {
|
||||
"x-api-key": ANTHROPIC_API_KEY,
|
||||
"anthropic-version": "2023-06-01",
|
||||
"content-type": "application/json",
|
||||
}
|
||||
payload = {
|
||||
"model": ANTHROPIC_MODEL,
|
||||
"max_tokens": 1500,
|
||||
"temperature": 0.0,
|
||||
"system": SYSTEM_PROMPT,
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
}
|
||||
|
||||
try:
|
||||
resp = httpx.post(
|
||||
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
content = data.get("content", [{}])[0].get("text", "")
|
||||
usage = data.get("usage", {})
|
||||
start = content.find("[")
|
||||
end = content.rfind("]") + 1
|
||||
if start >= 0 and end > start:
|
||||
return json.loads(content[start:end]), usage
|
||||
return [], usage
|
||||
except httpx.TimeoutException:
|
||||
logger.error("TIMEOUT — skipping batch")
|
||||
return [], {}
|
||||
except httpx.HTTPStatusError as e:
|
||||
if e.response.status_code == 429:
|
||||
            logger.warning("Rate limited, waiting 60s and then skipping this batch")
|
||||
time.sleep(60)
|
||||
else:
|
||||
logger.error("API error %d", e.response.status_code)
|
||||
return [], {}
|
||||
except Exception as e:
|
||||
logger.error("Failed: %s", e)
|
||||
return [], {}
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--batch-size", type=int, default=20)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
    # Load only controls whose hint carries a forbidden (generic) object token.
    # The LIKE clauses below are expected to mirror FORBIDDEN_TOKENS.
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT cc.id, cc.control_id, cc.title,
|
||||
COALESCE(cc.objective, '') as objective,
|
||||
cc.generation_metadata->>'merge_group_hint' as hint
|
||||
FROM canonical_controls cc
|
||||
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND cc.release_state NOT IN ('deprecated', 'rejected')
|
||||
AND (
|
||||
cc.generation_metadata->>'merge_group_hint' LIKE '%:documentation:%'
|
||||
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:procedure:%'
|
||||
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:process:%'
|
||||
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:compliance_reporting:%'
|
||||
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:records_management:%'
|
||||
)
|
||||
""")).fetchall()
|
||||
|
||||
controls = []
|
||||
for uuid, cid, title, objective, hint in rows:
|
||||
parts = hint.split(":", 2) if hint else []
|
||||
controls.append({
|
||||
"uuid": str(uuid), "control_id": cid,
|
||||
"title": title or "", "objective": objective or "",
|
||||
"current_hint": hint,
|
||||
"current_object": parts[1] if len(parts) > 1 else hint,
|
||||
})
|
||||
|
||||
logger.info("Found %d controls with forbidden tokens to re-classify", len(controls))
|
||||
|
||||
# Process
|
||||
total_fixed = 0
|
||||
total_kept = 0
|
||||
total_skipped = 0
|
||||
total_input_tokens = 0
|
||||
total_output_tokens = 0
|
||||
corrections = []
|
||||
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
||||
|
||||
for i in range(0, len(controls), args.batch_size):
|
||||
batch = controls[i:i + args.batch_size]
|
||||
results, usage = call_claude(batch)
|
||||
|
||||
total_input_tokens += usage.get("input_tokens", 0)
|
||||
total_output_tokens += usage.get("output_tokens", 0)
|
||||
|
||||
if not results:
|
||||
total_skipped += len(batch)
|
||||
continue
|
||||
|
||||
result_map = {r.get("id", ""): r for r in results}
|
||||
for ctrl in batch:
|
||||
r = result_map.get(ctrl["control_id"], {})
|
||||
new_token = r.get("token", "")
|
||||
if not new_token or new_token in FORBIDDEN_TOKENS:
|
||||
total_kept += 1
|
||||
continue
|
||||
|
||||
old_obj = ctrl["current_object"]
|
||||
if new_token != old_obj:
|
||||
total_fixed += 1
|
||||
parts = ctrl["current_hint"].split(":", 2)
|
||||
action = parts[0] if parts else "implement"
|
||||
phase = parts[2] if len(parts) > 2 else "implementation"
|
||||
corrections.append({
|
||||
"uuid": ctrl["uuid"],
|
||||
"old_hint": ctrl["current_hint"],
|
||||
"new_hint": f"{action}:{new_token}:{phase}",
|
||||
})
|
||||
change_stats[old_obj][new_token] += 1
|
||||
else:
|
||||
total_kept += 1
|
||||
|
||||
processed = min(i + args.batch_size, len(controls))
|
||||
if processed % 2000 < args.batch_size or processed >= len(controls):
|
||||
logger.info(
|
||||
"Progress: %d/%d (fixed=%d kept=%d skip=%d)",
|
||||
processed, len(controls), total_fixed, total_kept, total_skipped,
|
||||
)
|
||||
|
||||
time.sleep(0.3)
|
||||
|
||||
# Report
|
||||
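    # Hard-coded Haiku rates in USD per 1M tokens; check them against the
    # current price list for ANTHROPIC_MODEL before trusting the total.
    cost_in = total_input_tokens / 1_000_000 * 0.80
    cost_out = total_output_tokens / 1_000_000 * 4.00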
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("GENERIC TOKEN FIX REPORT")
|
||||
logger.info("=" * 60)
|
||||
logger.info("Total: %d controls", len(controls))
|
||||
logger.info("Fixed: %d", total_fixed)
|
||||
logger.info("Kept: %d (LLM also chose forbidden → kept as-is)", total_kept)
|
||||
logger.info("Skipped: %d", total_skipped)
|
||||
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
|
||||
|
||||
logger.info("\nTop changes:")
|
||||
flat = []
|
||||
for old, news in change_stats.items():
|
||||
for new, cnt in news.items():
|
||||
flat.append((cnt, old, new))
|
||||
for cnt, old, new in sorted(flat, reverse=True)[:30]:
|
||||
logger.info(" %4d × %s → %s", cnt, old, new)
|
||||
|
||||
# Save corrections
|
||||
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
|
||||
corr_file = CHECKPOINT_DIR / "corrections_generic_fix.json"
|
||||
corr_file.write_text(json.dumps(corrections))
|
||||
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN — not updating DB")
|
||||
return
|
||||
|
||||
if corrections:
|
||||
logger.info("Applying %d corrections...", len(corrections))
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
for corr in corrections:
|
||||
c.execute(text("""
|
||||
UPDATE canonical_controls
|
||||
SET generation_metadata = jsonb_set(
|
||||
generation_metadata,
|
||||
'{merge_group_hint}',
|
||||
to_jsonb(CAST(:new_hint AS text))
|
||||
)
|
||||
WHERE id = CAST(:uuid AS uuid)
|
||||
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
|
||||
logger.info("Done. %d hints corrected.", len(corrections))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
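The script above pulls a JSON array out of the model reply by searching for the outermost brackets, then swaps the object part of an `action:object:phase` hint. The following is a minimal standalone sketch of those two steps, not the script's own code: `extract_json_array` and `rebuild_hint` are hypothetical helper names, and the extra per-entry validation is an assumption added for illustration.

```python
import json


def extract_json_array(reply: str) -> list[dict]:
    """Parse the span from the first '[' to the last ']' as a JSON list, else return []."""
    start = reply.find("[")
    end = reply.rfind("]") + 1
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(reply[start:end])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Assumption: keep only dict entries that carry both "id" and "token".
    return [r for r in parsed if isinstance(r, dict) and "id" in r and "token" in r]


def rebuild_hint(current_hint: str, new_token: str) -> str:
    """Replace the object part of an action:object:phase hint, with the script's defaults."""
    parts = current_hint.split(":", 2)
    action = parts[0] if parts else "implement"
    phase = parts[2] if len(parts) > 2 else "implementation"
    return f"{action}:{new_token}:{phase}"


if __name__ == "__main__":
    reply = 'Result: [{"id": "A-1", "token": "incident", "conf": 0.9}]'
    for item in extract_json_array(reply):
        print(item["id"], rebuild_hint("define:procedure:governance", item["token"]))
```

Filtering malformed entries up front keeps the later `result_map` lookup from failing on replies that contain stray values inside the array.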
||||
@@ -0,0 +1,37 @@
|
||||
#!/bin/bash
# Run all 10 batches sequentially. Safe: if one fails, the rest don't run.
# Each batch saves corrections to JSON before applying to DB.
#
# Usage: bash /app/scripts/gpre0_run_all.sh
#        bash /app/scripts/gpre0_run_all.sh 5  # start from batch 5

set -e

START=${1:-1}
TOTAL=10

echo "=== Starting from batch $START of $TOTAL ==="

for i in $(seq "$START" "$TOTAL"); do
    echo ""
    echo "================================================================"
    echo "  BATCH $i/$TOTAL - $(date)"
    echo "================================================================"

    # Capture the exit code with `||` so that `set -e` does not abort the
    # script before the failure message and resume hint are printed.
    EXIT_CODE=0
    PYTHONPATH=/app python3 /app/scripts/gpre0_validate_hints.py \
        --batch-id "$i" \
        --total-batches "$TOTAL" \
        --batch-size 20 || EXIT_CODE=$?

    if [ "$EXIT_CODE" -ne 0 ]; then
        echo "BATCH $i FAILED with exit code $EXIT_CODE"
        echo "Resume with: bash /app/scripts/gpre0_run_all.sh $i"
        exit "$EXIT_CODE"
    fi

    echo "BATCH $i DONE - $(date)"
done

echo ""
echo "ALL $TOTAL BATCHES COMPLETE!"
|
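The runner above hands `--batch-id` and `--total-batches` to the validator, which turns them into a contiguous slice of the ordered control IDs. A small sketch of that slice arithmetic follows; `batch_bounds` is a hypothetical helper name, and the range check is an assumption added for illustration. The last batch absorbs the remainder so that every index is covered exactly once.

```python
def batch_bounds(total: int, batch_id: int, total_batches: int) -> tuple[int, int]:
    """Return the half-open index range [start, end) for a 1-based batch_id."""
    if not 1 <= batch_id <= total_batches:
        raise ValueError(f"batch_id must be in 1..{total_batches}, got {batch_id}")
    chunk = total // total_batches
    start = (batch_id - 1) * chunk
    # The final batch takes whatever integer division left over.
    end = total if batch_id == total_batches else batch_id * chunk
    return start, end


if __name__ == "__main__":
    for b in (1, 2, 10):
        print(b, batch_bounds(105, b, 10))
```

With 105 records and 10 batches, batches 1 to 9 hold 10 records each and batch 10 holds the remaining 15.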
||||
@@ -0,0 +1,351 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Phase 2: Validate and correct merge_group_hints using Claude Haiku.

Re-classifies each control's object token against the expanded ontology
(74 canonical tokens). Corrects wrong hints in the DB.

SAFETY: The controls are split into --total-batches slices (default 10, the
value gpre0_run_all.sh passes). NEVER retries on timeout (double-billing!).
Writes a checkpoint after each API call; --resume restarts from that index.
Corrections gathered before an interruption live only in memory, so a resumed
run applies only the corrections found after the checkpoint.

Usage:
    python3 /app/scripts/gpre0_validate_hints.py --batch-id 1 --dry-run
    python3 /app/scripts/gpre0_validate_hints.py --batch-id 1
    python3 /app/scripts/gpre0_validate_hints.py --batch-id 2
    python3 /app/scripts/gpre0_validate_hints.py --batch-id 2 --resume
    (continue through --batch-id 10, or pass --total-batches 4 for four slices)
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("gpre0-validate")
|
||||
|
||||
DB_URL = os.getenv(
|
||||
"DATABASE_URL",
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
)
|
||||
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
|
||||
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
|
||||
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
|
||||
CHECKPOINT_DIR = Path("/tmp/gpre0_checkpoints")
|
||||
|
||||
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control GENAU EINEM Token zu.
|
||||
|
||||
REGEL: Waehle IMMER den naechstbesten Token aus der Liste. OTHER nur wenn ABSOLUT
|
||||
kein Token auch nur entfernt passt (<1% der Faelle). Im Zweifel: den breitesten
|
||||
passenden Token waehlen (z.B. "policy" fuer Governance-Dokumente, "procedure" fuer
|
||||
Ablauf-Definitionen, "risk_management" fuer Bewertungen).
|
||||
|
||||
TOKENS:
|
||||
|
||||
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
|
||||
privileged_access, access_control, encryption, transport_encryption,
|
||||
key_management, certificate_management, network_security, network_segmentation,
|
||||
firewall, vpn, remote_access, monitoring (NUR Echtzeit-Systemueberwachung),
|
||||
audit_logging (Protokollierung/Audit Trail), siem, alerting (Meldepflichten),
|
||||
compliance_audit (externe Pruefungen), vulnerability, patch_management,
|
||||
backup, disaster_recovery, physical_security, secure_development,
|
||||
api_security, input_validation, container_security, logging_configuration
|
||||
|
||||
DATA_PROTECTION: personal_data (DSGVO-Verarbeitung), sensitive_data (Art.9),
|
||||
health_data, consent, data_subject_rights, data_retention, data_transfer,
|
||||
data_breach_notification, dpia, data_processing_agreement, privacy_by_design,
|
||||
data_processing_register, data_classification, cookie_consent, video_surveillance
|
||||
|
||||
GOVERNANCE: policy (Richtlinie definieren), procedure (Verfahren definieren),
|
||||
process (Betriebsprozess ausfuehren), training (Schulung), awareness,
|
||||
incident (Vorfallsbehandlung), risk_management, third_party_management,
|
||||
change_management, documentation, records_management, compliance_reporting,
|
||||
asset_management, human_resources_security
|
||||
|
||||
REGULATORY: supervisory_authority, certification (Zertifizierung/Konformitaet),
|
||||
product_safety, ai_system, financial_reporting, aml, whistleblowing,
|
||||
consumer_protection, ecommerce, telecommunications, medical_device,
|
||||
payment_services, critical_infrastructure, supply_chain_due_diligence,
|
||||
sustainability_reporting
|
||||
|
||||
ABGRENZUNGEN:
|
||||
- monitoring = NUR Echtzeit-Systemueberwachung, NICHT Audit/Schulung/Bewertung
|
||||
- audit_logging = Protokollierung, NICHT externe Pruefung (→ compliance_audit)
|
||||
- procedure = Verfahren DEFINIEREN, NICHT Vorfaelle behandeln (→ incident)
|
||||
- personal_data = DSGVO-Verarbeitung, NICHT Zertifizierung (→ certification)
|
||||
- alerting = Meldepflichten, NICHT Vorfallsbehandlung (→ incident)
|
||||
|
||||
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]
|
||||
KEIN weiterer Text. Nur das Array."""
|
||||
|
||||
|
||||
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
|
||||
"""Send batch to Claude. NO RETRY on timeout (double-billing risk!)."""
|
||||
items = []
|
||||
for c in controls_batch:
|
||||
items.append(
|
||||
f'- id="{c["control_id"]}" '
|
||||
f'cur="{c["current_object"]}" '
|
||||
f't="{c["title"]}" '
|
||||
f'o="{c["objective"][:100]}"'
|
||||
)
|
||||
|
||||
prompt = "Klassifiziere:\n" + "\n".join(items)
|
||||
|
||||
headers = {
|
||||
"x-api-key": ANTHROPIC_API_KEY,
|
||||
"anthropic-version": "2023-06-01",
|
||||
"content-type": "application/json",
|
||||
}
|
||||
payload = {
|
||||
"model": ANTHROPIC_MODEL,
|
||||
"max_tokens": 1500,
|
||||
"temperature": 0.0,
|
||||
"system": SYSTEM_PROMPT,
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
}
|
||||
|
||||
try:
|
||||
resp = httpx.post(
|
||||
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
content = data.get("content", [{}])[0].get("text", "")
|
||||
usage = data.get("usage", {})
|
||||
start = content.find("[")
|
||||
end = content.rfind("]") + 1
|
||||
if start >= 0 and end > start:
|
||||
return json.loads(content[start:end]), usage
|
||||
logger.warning("No JSON array in response")
|
||||
return [], usage
|
||||
except httpx.TimeoutException:
|
||||
# CRITICAL: Do NOT retry! Log and skip.
|
||||
logger.error("TIMEOUT — skipping batch (NOT retrying to avoid double-billing)")
|
||||
return [], {}
|
||||
except httpx.HTTPStatusError as e:
|
||||
if e.response.status_code == 429:
|
||||
logger.warning("Rate limited — waiting 60s then skipping")
|
||||
time.sleep(60)
|
||||
else:
|
||||
logger.error("API error %d — skipping batch", e.response.status_code)
|
||||
return [], {}
|
||||
except Exception as e:
|
||||
logger.error("Request failed — skipping: %s", e)
|
||||
return [], {}
|
||||
|
||||
|
||||
def load_checkpoint(batch_id: int) -> int:
|
||||
"""Load last processed index for this batch."""
|
||||
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
|
||||
if cp_file.exists():
|
||||
data = json.loads(cp_file.read_text())
|
||||
return data.get("last_index", 0)
|
||||
return 0
|
||||
|
||||
|
||||
def save_checkpoint(batch_id: int, last_index: int, stats: dict):
|
||||
"""Save progress checkpoint."""
|
||||
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
|
||||
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
|
||||
cp_file.write_text(json.dumps({
|
||||
"batch_id": batch_id,
|
||||
"last_index": last_index,
|
||||
**stats,
|
||||
}))
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--batch-id", type=int, required=True)
|
||||
parser.add_argument("--total-batches", type=int, default=10)
|
||||
parser.add_argument("--batch-size", type=int, default=20)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--resume", action="store_true",
|
||||
help="Resume from checkpoint")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
    # Load ALL control IDs ordered deterministically, then select this batch's slice
|
||||
with engine.connect() as c:
|
||||
all_ids = c.execute(text("""
|
||||
SELECT cc.id
|
||||
FROM canonical_controls cc
|
||||
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND cc.generation_metadata->>'merge_group_hint' != ''
|
||||
AND cc.release_state NOT IN ('deprecated', 'rejected')
|
||||
ORDER BY cc.id
|
||||
""")).fetchall()
|
||||
|
||||
total = len(all_ids)
|
||||
chunk = total // args.total_batches
|
||||
start_idx = (args.batch_id - 1) * chunk
|
||||
end_idx = total if args.batch_id == args.total_batches else args.batch_id * chunk
|
||||
batch_ids = [str(r[0]) for r in all_ids[start_idx:end_idx]]
|
||||
|
||||
logger.info("Batch %d/%d: controls %d-%d (%d controls of %d total)",
|
||||
args.batch_id, args.total_batches, start_idx, end_idx, len(batch_ids), total)
|
||||
|
||||
    # Load full data for this batch. The IDs are passed as one bound array
    # parameter, which also keeps the query valid when the slice is empty.
    with engine.connect() as c:
        rows = c.execute(text("""
            SELECT cc.id, cc.control_id, cc.title,
                   COALESCE(cc.objective, '') as objective,
                   cc.generation_metadata->>'merge_group_hint' as hint
            FROM canonical_controls cc
            WHERE cc.id = ANY(CAST(:ids AS uuid[]))
            ORDER BY cc.id
        """), {"ids": batch_ids}).fetchall()
|
||||
|
||||
controls = []
|
||||
for uuid, cid, title, objective, hint in rows:
|
||||
parts = hint.split(":", 2) if hint else []
|
||||
controls.append({
|
||||
"uuid": str(uuid), "control_id": cid,
|
||||
"title": title or "", "objective": objective or "",
|
||||
"current_hint": hint, "current_object": parts[1] if len(parts) > 1 else hint,
|
||||
})
|
||||
|
||||
# Resume from checkpoint?
|
||||
start_from = 0
|
||||
if args.resume:
|
||||
start_from = load_checkpoint(args.batch_id)
|
||||
if start_from > 0:
|
||||
logger.info("Resuming from index %d", start_from)
|
||||
|
||||
# Process
|
||||
total_same = 0
|
||||
total_changed = 0
|
||||
total_other = 0
|
||||
total_skipped = 0
|
||||
total_input_tokens = 0
|
||||
total_output_tokens = 0
|
||||
corrections: list[dict] = []
|
||||
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
||||
|
||||
for i in range(start_from, len(controls), args.batch_size):
|
||||
batch = controls[i:i + args.batch_size]
|
||||
results, usage = call_claude(batch)
|
||||
|
||||
total_input_tokens += usage.get("input_tokens", 0)
|
||||
total_output_tokens += usage.get("output_tokens", 0)
|
||||
|
||||
if not results:
|
||||
total_skipped += len(batch)
|
||||
save_checkpoint(args.batch_id, i + args.batch_size, {
|
||||
"same": total_same, "changed": total_changed,
|
||||
"other": total_other, "skipped": total_skipped,
|
||||
})
|
||||
continue
|
||||
|
||||
result_map = {r.get("id", ""): r for r in results}
|
||||
for ctrl in batch:
|
||||
r = result_map.get(ctrl["control_id"], {})
|
||||
new_token = r.get("token", "")
|
||||
if not new_token:
|
||||
total_skipped += 1
|
||||
continue
|
||||
|
||||
old_obj = ctrl["current_object"]
|
||||
if new_token == "OTHER":
|
||||
total_other += 1
|
||||
elif new_token == old_obj:
|
||||
total_same += 1
|
||||
else:
|
||||
total_changed += 1
|
||||
parts = ctrl["current_hint"].split(":", 2)
|
||||
action = parts[0] if parts else "implement"
|
||||
phase = parts[2] if len(parts) > 2 else "implementation"
|
||||
corrections.append({
|
||||
"uuid": ctrl["uuid"],
|
||||
"old_hint": ctrl["current_hint"],
|
||||
"new_hint": f"{action}:{new_token}:{phase}",
|
||||
})
|
||||
change_stats[old_obj][new_token] += 1
|
||||
|
||||
# Checkpoint every batch
|
||||
save_checkpoint(args.batch_id, i + args.batch_size, {
|
||||
"same": total_same, "changed": total_changed,
|
||||
"other": total_other, "skipped": total_skipped,
|
||||
})
|
||||
|
||||
processed = min(i + args.batch_size, len(controls))
|
||||
if processed % 1000 < args.batch_size or processed >= len(controls):
|
||||
logger.info(
|
||||
"Batch %d: %d/%d (same=%d changed=%d other=%d skip=%d)",
|
||||
args.batch_id, processed, len(controls),
|
||||
total_same, total_changed, total_other, total_skipped,
|
||||
)
|
||||
|
||||
time.sleep(0.3)
|
||||
|
||||
# Report
|
||||
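    # Hard-coded Haiku rates in USD per 1M tokens; check them against the
    # current price list for ANTHROPIC_MODEL before trusting the total.
    cost_in = total_input_tokens / 1_000_000 * 0.80
    cost_out = total_output_tokens / 1_000_000 * 4.00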
|
||||
total_cost = cost_in + cost_out
|
||||
total_proc = total_same + total_changed + total_other
|
||||
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("BATCH %d REPORT", args.batch_id)
|
||||
logger.info("=" * 60)
|
||||
logger.info("Processed: %d | Skipped: %d", total_proc, total_skipped)
|
||||
logger.info("Same: %d (%.1f%%)", total_same, total_same / max(total_proc, 1) * 100)
|
||||
logger.info("Changed: %d (%.1f%%)", total_changed, total_changed / max(total_proc, 1) * 100)
|
||||
logger.info("OTHER: %d (%.1f%%)", total_other, total_other / max(total_proc, 1) * 100)
|
||||
logger.info("Cost: $%.2f (Haiku)", total_cost)
|
||||
logger.info("Cost/ctrl: $%.5f", total_cost / max(total_proc, 1))
|
||||
|
||||
# Top changes
|
||||
flat = []
|
||||
for old, news in change_stats.items():
|
||||
for new, cnt in news.items():
|
||||
flat.append((cnt, old, new))
|
||||
logger.info("\nTop Changes:")
|
||||
for cnt, old, new in sorted(flat, reverse=True)[:20]:
|
||||
logger.info(" %4d × %s → %s", cnt, old, new)
|
||||
|
||||
# Always save corrections to file (recovery safety)
|
||||
corr_file = CHECKPOINT_DIR / f"corrections_batch_{args.batch_id}.json"
|
||||
if corrections:
|
||||
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
|
||||
corr_file.write_text(json.dumps(corrections))
|
||||
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("\nDRY RUN — not updating DB")
|
||||
return
|
||||
|
||||
# Apply corrections in single transaction
|
||||
if corrections:
|
||||
logger.info("\nApplying %d corrections...", len(corrections))
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
for corr in corrections:
|
||||
c.execute(text("""
|
||||
UPDATE canonical_controls
|
||||
SET generation_metadata = jsonb_set(
|
||||
generation_metadata,
|
||||
'{merge_group_hint}',
|
||||
to_jsonb(CAST(:new_hint AS text))
|
||||
)
|
||||
WHERE id = CAST(:uuid AS uuid)
|
||||
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
|
||||
logger.info("Done. %d hints corrected.", len(corrections))
|
||||
else:
|
||||
logger.info("No corrections needed.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
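The validator above checkpoints only the next index and the running counters, while the `corrections` list stays in memory until the end of the run. One possible way to make a resume lossless is to store the corrections in the checkpoint as well. The sketch below is an assumption about how that could look, not part of the script: `save_progress` and `load_progress` are hypothetical names, and the write goes through a temporary file so that a crash mid-write cannot leave a truncated checkpoint.

```python
import json
import os
from pathlib import Path


def save_progress(path: Path, next_index: int, corrections: list[dict]) -> None:
    """Write the next index plus all corrections so far, replacing the file atomically."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_index": next_index, "corrections": corrections}))
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_progress(path: Path) -> tuple[int, list[dict]]:
    """Return (next_index, corrections); a missing file means a fresh start."""
    if not path.exists():
        return 0, []
    data = json.loads(path.read_text())
    return data.get("last_index", 0), data.get("corrections", [])
```

A resumed run would then seed its corrections list from `load_progress` instead of starting empty, so the final DB update still covers the work done before the interruption.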
||||
@@ -0,0 +1,37 @@
|
||||
#!/usr/bin/env python3
|
||||
"""G-pre1: Analyze unique objects and test normalization reduction."""
|
||||
from collections import Counter
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
engine = create_engine(
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
connect_args={"options": "-c search_path=compliance,public"},
|
||||
)
|
||||
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT DISTINCT
|
||||
split_part(generation_metadata->>'merge_group_hint', ':', 2) AS obj
|
||||
FROM canonical_controls
|
||||
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND generation_metadata->>'merge_group_hint' != ''
|
||||
""")).fetchall()
|
||||
|
||||
objects = [r[0] for r in rows if r[0] and r[0].strip()]
|
||||
print("Unique raw objects: %d" % len(objects))
|
||||
|
||||
from services.control_dedup import normalize_object
|
||||
|
||||
norm_counts: Counter = Counter()
|
||||
for obj in objects:
|
||||
norm_counts[normalize_object(obj)] += 1
|
||||
|
||||
print("After normalize_object(): %d unique" % len(norm_counts))
|
||||
print("Reduction: %.1f%%" % ((1 - len(norm_counts) / len(objects)) * 100))
|
||||
print()
|
||||
print("Top 20 normalized objects:")
|
||||
for token, count in norm_counts.most_common(20):
|
||||
print(" %5d %s" % (count, token))
|
||||
print()
|
||||
print("Singletons (only 1 raw object): %d" % sum(1 for c in norm_counts.values() if c == 1))
|
||||
print("Groups with 2+ members: %d" % sum(1 for c in norm_counts.values() if c >= 2))
|
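The analysis script above counts how many raw object strings collapse onto the same normalized token. Since `normalize_object` from `services.control_dedup` is not shown here, the sketch below uses `toy_normalize`, a purely hypothetical stand-in (lowercase, trim, unify separators), to illustrate how the reduction percentage and the singleton count are derived. The real normalizer may behave quite differently.

```python
from collections import Counter


def toy_normalize(obj: str) -> str:
    """Hypothetical stand-in for normalize_object: lowercase and unify separators."""
    words = obj.strip().lower().replace("-", " ").replace("_", " ").split()
    return "_".join(words)


def reduction_report(raw_objects: list[str]) -> dict:
    """Summarize how many raw objects share each normalized token."""
    norm_counts = Counter(toy_normalize(o) for o in raw_objects)
    raw, normalized = len(raw_objects), len(norm_counts)
    return {
        "raw": raw,
        "normalized": normalized,
        "reduction_pct": (1 - normalized / raw) * 100 if raw else 0.0,
        "singletons": sum(1 for c in norm_counts.values() if c == 1),
        "groups_2_plus": sum(1 for c in norm_counts.values() if c >= 2),
    }


if __name__ == "__main__":
    print(reduction_report(["Access Control", "access_control", "access-control ", "Backup"]))
```

Guarding the division with `if raw` avoids a `ZeroDivisionError` when the query returns no objects, a case the script's own percentage line does not handle.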
||||
@@ -0,0 +1,219 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
G-pre1: Object Clustering via Mini-Batch K-Means on Embeddings.
|
||||
|
||||
Clusters ~144k unique normalized objects into ~15-25k semantic groups
|
||||
using bge-m3 embeddings and Mini-Batch K-Means.
|
||||
|
||||
Usage (inside control-pipeline container):
|
||||
python3 /app/scripts/gpre1_object_clustering.py --k 20000
|
||||
python3 /app/scripts/gpre1_object_clustering.py --k 20000 --dry-run
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
from collections import Counter
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from sklearn.cluster import MiniBatchKMeans
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("gpre1")
|
||||
|
||||
import os
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
|
||||
EMBEDDING_URL = "http://embedding-service:8087"
|
||||
BATCH_SIZE = 64 # Embeddings per API call
|
||||
|
||||
|
||||
def extract_objects(engine) -> tuple[list[str], dict[str, int]]:
|
||||
"""Extract unique normalized objects and their frequencies."""
|
||||
from services.control_dedup import normalize_object
|
||||
|
||||
logger.info("Extracting objects from canonical_controls...")
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT split_part(generation_metadata->>'merge_group_hint', ':', 2) AS obj,
|
||||
count(*) AS freq
|
||||
FROM canonical_controls
|
||||
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND generation_metadata->>'merge_group_hint' != ''
|
||||
GROUP BY 1
|
||||
""")).fetchall()
|
||||
|
||||
# Normalize and aggregate
|
||||
norm_freq: Counter = Counter()
|
||||
norm_to_raw: dict[str, list[str]] = {}
|
||||
for raw_obj, freq in rows:
|
||||
if not raw_obj or not raw_obj.strip():
|
||||
continue
|
||||
normed = normalize_object(raw_obj)
|
||||
norm_freq[normed] += freq
|
||||
norm_to_raw.setdefault(normed, []).append(raw_obj)
|
||||
|
||||
objects = list(norm_freq.keys())
|
||||
freqs = {obj: norm_freq[obj] for obj in objects}
|
||||
logger.info("Extracted %d unique normalized objects (from %d raw)", len(objects), len(rows))
|
||||
return objects, freqs
|
||||
|
||||
|
||||
def generate_embeddings(objects: list[str]) -> np.ndarray:
    """Generate embeddings via embedding-service in batches.

    Uses a pre-allocated numpy array to avoid Python list memory overhead
    (a CPython float object takes 24 bytes plus an 8-byte list slot,
    a numpy float32 takes 4 bytes).
    """
    total = len(objects)
    # Pre-allocate: 144k × 1024 × 4 bytes = ~590 MB (vs several GB with Python lists)
    result = np.zeros((total, 1024), dtype=np.float32)
    logger.info("Generating embeddings for %d objects (pre-allocated %.0f MB)...",
                total, result.nbytes / 1024 / 1024)

    failed_batches = []
    for i in range(0, total, BATCH_SIZE):
        batch = objects[i:i + BATCH_SIZE]
        for attempt in range(3):  # Up to 3 attempts per batch
            try:
                with httpx.Client(timeout=httpx.Timeout(60.0, connect=10.0)) as client:
                    resp = client.post(
                        f"{EMBEDDING_URL}/embed",
                        json={"texts": batch},
                    )
                    resp.raise_for_status()
                    embeddings = resp.json().get("embeddings", [])
                    end = min(i + len(embeddings), total)
                    result[i:end] = np.array(embeddings, dtype=np.float32)
                    break
            except Exception as e:
                if attempt < 2:
                    logger.warning("Batch %d attempt %d failed: %s, retrying", i, attempt + 1, e)
                    time.sleep(2)  # uses the module-level `time` import
                else:
                    logger.error("Batch %d failed after 3 attempts: %s", i, e)
                    failed_batches.append(i)

        # Log about once per 5000 objects. BATCH_SIZE does not divide 5000, so
        # test for crossing a multiple of 5000 rather than landing on one.
        done = min(i + BATCH_SIZE, total)
        if done % 5000 < BATCH_SIZE or done >= total:
            logger.info("  Embedded %d/%d (%.1f%%) [%d failed]",
                        done, total, done / total * 100, len(failed_batches))

    if failed_batches:
        # Rows of failed batches stay all-zero and will be clustered as such.
        logger.warning("%d batches failed; their rows remain zero vectors", len(failed_batches))
    return result
|
||||
|
||||
|
||||
def cluster_objects(embeddings: np.ndarray, k: int) -> np.ndarray:
|
||||
"""Run Mini-Batch K-Means clustering."""
|
||||
logger.info("Clustering %d objects into %d groups (Mini-Batch K-Means)...", len(embeddings), k)
|
||||
|
||||
# Normalize embeddings for cosine-like clustering
|
||||
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
|
||||
norms[norms == 0] = 1
|
||||
normalized = embeddings / norms
|
||||
|
||||
kmeans = MiniBatchKMeans(
|
||||
n_clusters=k,
|
||||
batch_size=1000,
|
||||
max_iter=100,
|
||||
random_state=42,
|
||||
verbose=0,
|
||||
)
|
||||
labels = kmeans.fit_predict(normalized)
|
||||
logger.info("Clustering done. Inertia: %.2f", kmeans.inertia_)
|
||||
return labels
|
||||
|
||||
|
||||
def store_results(engine, objects: list[str], freqs: dict[str, int],
|
||||
labels: np.ndarray, dry_run: bool):
|
||||
"""Store clustering results in object_groups table."""
|
||||
# Build groups
|
||||
groups: dict[int, list[tuple[str, int]]] = {}
|
||||
for i, obj in enumerate(objects):
|
||||
gid = int(labels[i])
|
||||
groups.setdefault(gid, []).append((obj, freqs.get(obj, 0)))
|
||||
|
||||
# Pick canonical name (highest frequency in group)
|
||||
results = []
|
||||
for gid, members in groups.items():
|
||||
members_sorted = sorted(members, key=lambda x: -x[1])
|
||||
canonical = members_sorted[0][0]
|
||||
results.append({
|
||||
"group_id": gid,
|
||||
"canonical_name": canonical,
|
||||
"member_count": len(members),
|
||||
"members": json.dumps([m[0] for m in members_sorted]),
|
||||
"top_controls_count": members_sorted[0][1],
|
||||
})
|
||||
|
||||
# Stats
|
||||
sizes = [r["member_count"] for r in results]
|
||||
logger.info("Groups: %d total", len(results))
|
||||
logger.info(" Singletons: %d", sum(1 for s in sizes if s == 1))
|
||||
logger.info(" Groups 2-5: %d", sum(1 for s in sizes if 2 <= s <= 5))
|
||||
logger.info(" Groups 6-20: %d", sum(1 for s in sizes if 6 <= s <= 20))
|
||||
logger.info(" Groups 21-100: %d", sum(1 for s in sizes if 21 <= s <= 100))
|
||||
logger.info(" Groups >100: %d", sum(1 for s in sizes if s > 100))
|
||||
logger.info(" Max group size: %d", max(sizes))
|
||||
logger.info(" Avg group size: %.1f", sum(sizes) / len(sizes))
|
||||
|
||||
# Top 10 largest groups
|
||||
top10 = sorted(results, key=lambda x: -x["member_count"])[:10]
|
||||
logger.info("\nTop 10 largest groups:")
|
||||
for g in top10:
|
||||
members_list = json.loads(g["members"])
|
||||
logger.info(" [%d] %s (%d members): %s",
|
||||
g["group_id"], g["canonical_name"], g["member_count"],
|
||||
", ".join(members_list[:5]))
|
||||
|
||||
if dry_run:
|
||||
logger.info("DRY RUN — not writing to DB")
|
||||
return
|
||||
|
||||
# Write to DB
|
||||
with engine.begin() as conn:
|
||||
conn.execute(text("SET search_path TO compliance, public"))
|
||||
conn.execute(text("DELETE FROM object_groups")) # Clear old results
|
||||
for r in results:
|
||||
conn.execute(text("""
|
||||
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
|
||||
VALUES (:group_id, :canonical_name, :member_count, CAST(:members AS jsonb), :top_controls_count)
|
||||
"""), r)
|
||||
logger.info("Wrote %d groups to object_groups table", len(results))
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--k", type=int, default=20000, help="Number of clusters")
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
|
||||
|
||||
# Step 1: Extract
|
||||
objects, freqs = extract_objects(engine)
|
||||
|
||||
# Step 2: Embed
|
||||
embeddings = generate_embeddings(objects)
|
||||
logger.info("Embedding matrix: %s (%.1f MB)", embeddings.shape,
|
||||
embeddings.nbytes / 1024 / 1024)
|
||||
|
||||
# Adjust k if we have fewer objects
|
||||
k = min(args.k, len(objects) // 2)
|
||||
logger.info("Using k=%d (requested %d, objects=%d)", k, args.k, len(objects))
|
||||
|
||||
# Step 3: Cluster
|
||||
labels = cluster_objects(embeddings, k)
|
||||
|
||||
# Step 4: Store
|
||||
store_results(engine, objects, freqs, labels, args.dry_run)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
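The size summary that store_results logs (singletons, 2-5, 6-20, 21-100, >100) can be sketched as a small reusable helper. This is a minimal illustration, assuming `sizes` is a plain list of member counts; `BUCKETS` and `bucket_sizes` are hypothetical names and not part of the script.

```python
# Inclusive bucket boundaries, mirroring the ranges in the log lines above.
# An upper bound of None means "no upper limit".
BUCKETS = [
    ("singletons", 1, 1),
    ("2-5", 2, 5),
    ("6-20", 6, 20),
    ("21-100", 21, 100),
    (">100", 101, None),
]


def bucket_sizes(sizes):
    """Count how many group sizes fall into each bucket."""
    counts = {name: 0 for name, _, _ in BUCKETS}
    for s in sizes:
        for name, lo, hi in BUCKETS:
            if s >= lo and (hi is None or s <= hi):
                counts[name] += 1
                break  # buckets do not overlap, so stop at the first hit
    return counts


if __name__ == "__main__":
    for name, count in bucket_sizes([1, 1, 3, 20, 21, 500]).items():
        print(f"{name}: {count}")
```

Keeping the boundaries in one table keeps the labels and the conditions in sync if the ranges ever change.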
|
||||
@@ -0,0 +1,203 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
G-pre1 INCREMENTAL: Append new objects to object_groups via embedding similarity.

Non-destructive alternative to gpre1_object_clustering.py (which DELETEs and
rebuilds all groups via K-Means). This script:
  - Finds objects referenced in atomic controls that are NOT yet in
    object_groups.members
  - Embeds each unmatched object via bge-m3 (local embedding-service)
  - Nearest-neighbor search against existing object_groups.canonical_name
  - Cosine >= --threshold (default 0.85) → APPEND to existing group's members
  - Cosine < --threshold → CREATE new object_group with next free group_id

Existing groups stay; only members get appended and new groups get added.

Usage (inside control-pipeline container):
  python3 /app/scripts/gpre1_object_groups_incremental.py --since 2026-05-18T02:53:00+00:00 --dry-run
  python3 /app/scripts/gpre1_object_groups_incremental.py --since 2026-05-18T02:53:00+00:00
  python3 /app/scripts/gpre1_object_groups_incremental.py --since 2026-05-18T02:53:00+00:00 --threshold 0.82
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("gpre1_inc")
|
||||
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
|
||||
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
|
||||
BATCH_SIZE = 64
|
||||
|
||||
|
||||
def embed_batch(texts: list[str]) -> np.ndarray:
    """Embed a list of strings via bge-m3 embedding-service."""
    with httpx.Client(timeout=120.0) as c:
        resp = c.post(f"{EMBEDDING_URL}/embed", json={"texts": texts, "normalize": True})
        resp.raise_for_status()
        return np.array(resp.json()["embeddings"], dtype=np.float32)


def embed_many(texts: list[str], label: str = "") -> np.ndarray:
    """Embed many strings in batches."""
    n = len(texts)
    out = np.zeros((n, 1024), dtype=np.float32)
    for i in range(0, n, BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        out[i:i + len(batch)] = embed_batch(batch)
        if (i // BATCH_SIZE) % 20 == 0:
            logger.info(" %s: %d/%d embedded", label, i + len(batch), n)
    return out
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--since", required=True, help="ISO datetime — consider atomics from this date onwards")
|
||||
parser.add_argument("--threshold", type=float, default=0.85,
|
||||
help="Cosine threshold for appending to existing group (default 0.85)")
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
since_dt = datetime.fromisoformat(args.since.replace("Z", "+00:00"))
|
||||
logger.info("Incremental object_groups update since %s, threshold=%.2f, dry_run=%s",
|
||||
since_dt.isoformat(), args.threshold, args.dry_run)
|
||||
|
||||
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
|
||||
|
||||
# 1. Load existing object_groups (id, canonical_name, members)
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT group_id, canonical_name, members FROM object_groups
|
||||
""")).fetchall()
|
||||
existing_groups = [(r[0], r[1], json.loads(r[2]) if isinstance(r[2], str) else r[2]) for r in rows]
|
||||
logger.info("Loaded %d existing object_groups", len(existing_groups))
|
||||
|
||||
existing_members: set[str] = set()
|
||||
for _, _, members in existing_groups:
|
||||
for m in members:
|
||||
existing_members.add(m)
|
||||
logger.info("Existing union of members: %d distinct strings", len(existing_members))
|
||||
|
||||
# 2. Find unmatched objects from atomics since `since`
|
||||
from services.control_dedup import normalize_object
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT DISTINCT split_part(generation_metadata->>'merge_group_hint', ':', 2) AS obj
|
||||
FROM canonical_controls
|
||||
WHERE decomposition_method = 'pass0b'
|
||||
AND created_at >= :since
|
||||
AND generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND generation_metadata->>'merge_group_hint' != ''
|
||||
AND release_state NOT IN ('deprecated', 'rejected', 'duplicate')
|
||||
"""), {"since": since_dt}).fetchall()
|
||||
new_objects_raw = [r[0] for r in rows if r[0]]
|
||||
logger.info("Distinct objects in new atomics: %d", len(new_objects_raw))
|
||||
|
||||
# Normalize each + dedupe; track originals → normalized
|
||||
normed_to_originals: dict[str, set[str]] = {}
|
||||
for obj in new_objects_raw:
|
||||
normed = normalize_object(obj)
|
||||
if not normed:
|
||||
continue
|
||||
if normed in existing_members or obj in existing_members:
|
||||
continue # already in some group
|
||||
normed_to_originals.setdefault(normed, set()).update([normed, obj])
|
||||
|
||||
unmatched_normed = list(normed_to_originals.keys())
|
||||
logger.info("Unmatched normalized objects: %d", len(unmatched_normed))
|
||||
|
||||
if not unmatched_normed:
|
||||
logger.info("Nothing to do — all objects already mapped.")
|
||||
return
|
||||
|
||||
# 3. Embed existing canonical_names + unmatched objects
|
||||
logger.info("Embedding %d existing canonical_names...", len(existing_groups))
|
||||
existing_emb = embed_many([g[1] for g in existing_groups], label="existing")
|
||||
logger.info("Embedding %d unmatched objects...", len(unmatched_normed))
|
||||
unmatched_emb = embed_many(unmatched_normed, label="unmatched")
|
||||
|
||||
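    # 4. Nearest-neighbor: for each unmatched, find best existing match
    #    cosine = dot product (both already L2-normalized)
    logger.info("Computing nearest-neighbor matches...")
    if existing_groups:
        sims = unmatched_emb @ existing_emb.T  # (N_unmatched, N_existing)
        best_idx = sims.argmax(axis=1)
        best_score = sims.max(axis=1)
    else:
        # Empty table: argmax over zero columns raises, so score everything below any threshold
        best_idx = np.zeros(len(unmatched_normed), dtype=np.int64)
        best_score = np.full(len(unmatched_normed), -np.inf, dtype=np.float32)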
|
||||
|
||||
appends: dict[int, list[str]] = {} # group_id → list of new members
|
||||
new_groups: list[tuple[str, list[str]]] = [] # (canonical_name, members)
|
||||
|
||||
for i, normed in enumerate(unmatched_normed):
|
||||
originals = sorted(normed_to_originals[normed])
|
||||
if best_score[i] >= args.threshold:
|
||||
gid = existing_groups[int(best_idx[i])][0]
|
||||
appends.setdefault(gid, []).extend(originals)
|
||||
else:
|
||||
# Create a new group with this object as canonical
|
||||
new_groups.append((normed, originals))
|
||||
|
||||
# Stats
|
||||
distinct_groups_to_extend = len(appends)
|
||||
total_appends = sum(len(v) for v in appends.values())
|
||||
logger.info("Plan: extend %d existing groups (+%d members), create %d new groups",
|
||||
distinct_groups_to_extend, total_appends, len(new_groups))
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN — no writes")
|
||||
# Sample
|
||||
if appends:
|
||||
sample = list(appends.items())[:5]
|
||||
for gid, members in sample:
|
||||
gname = next((g[1] for g in existing_groups if g[0] == gid), "?")
|
||||
logger.info(" Extend group_id=%d (%s) with: %s", gid, gname, members[:3])
|
||||
if new_groups:
|
||||
for name, members in new_groups[:5]:
|
||||
logger.info(" NEW group: %s — members=%s", name, members[:3])
|
||||
return
|
||||
|
||||
# 5. Write — pure INSERT/UPDATE
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
|
||||
        # UPDATE existing groups (append to members JSONB)
        for gid, new_members in appends.items():
            c.execute(text("""
                UPDATE object_groups
                SET members = (
                        SELECT jsonb_agg(DISTINCT m)
                        FROM jsonb_array_elements_text(members || CAST(:new_members AS jsonb)) AS x(m)
                    ),
                    member_count = (
                        SELECT count(DISTINCT m)
                        FROM jsonb_array_elements_text(members || CAST(:new_members AS jsonb)) AS x(m)
                    )
                WHERE group_id = :gid
            """), {"gid": gid, "new_members": json.dumps(new_members)})
|
||||
|
||||
# INSERT new groups with next free group_id
|
||||
next_gid_row = c.execute(text("SELECT COALESCE(MAX(group_id), 0) + 1 FROM object_groups")).fetchone()
|
||||
next_gid = next_gid_row[0] if next_gid_row else 1
|
||||
for name, members in new_groups:
|
||||
c.execute(text("""
|
||||
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
|
||||
VALUES (:gid, :name, :count, CAST(:members AS jsonb), 0)
|
||||
"""), {
|
||||
"gid": next_gid,
|
||||
"name": name[:200],
|
||||
"count": len(members),
|
||||
"members": json.dumps(members),
|
||||
})
|
||||
next_gid += 1
|
||||
|
||||
logger.info("DONE — extended %d existing groups (+%d members), created %d new groups",
|
||||
distinct_groups_to_extend, total_appends, len(new_groups))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
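The append-or-create decision in the script above can be sketched without the embedding service or NumPy. This is a minimal pure-Python illustration, assuming each item already comes with a vector; `l2_normalize` and `assign` are hypothetical names, and the toy 2-dimensional vectors stand in for the 1024-dimensional embeddings the script receives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def assign(unmatched, existing, threshold=0.85):
    """Split unmatched names into appends and new groups.

    unmatched: list of (name, vector)
    existing:  list of (group_id, vector of the group's canonical name)
    Vectors are normalized here, so the dot product is the cosine similarity.
    """
    existing_n = [(gid, l2_normalize(v)) for gid, v in existing]
    appends = {}      # group_id -> names appended to that group
    new_groups = []   # names that start a group of their own
    for name, vec in unmatched:
        v = l2_normalize(vec)
        best_gid, best_score = None, -math.inf
        for gid, e in existing_n:
            score = sum(a * b for a, b in zip(v, e))
            if score > best_score:
                best_gid, best_score = gid, score
        if best_gid is not None and best_score >= threshold:
            appends.setdefault(best_gid, []).append(name)
        else:
            new_groups.append(name)
    return appends, new_groups


if __name__ == "__main__":
    existing = [(1, [1.0, 0.0]), (2, [0.0, 1.0])]
    unmatched = [("a", [0.9, 0.1]), ("b", [1.0, 1.0]), ("c", [0.0, 2.0])]
    print(assign(unmatched, existing))
```

As in the script, each unmatched object is compared only against existing canonical names, so two new objects that resemble each other but no existing group still end up in separate new groups.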
|
||||
@@ -0,0 +1,254 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
G-pre1 Refinement: Re-cluster the object groups behind large master controls
(total_controls > 200) into 4-20 sub-clusters, scaled by member count
(roughly one sub-cluster per 15 members), for finer granularity.

Groups with fewer than 20 members are skipped. Replaces the large master
controls with smaller, more specific ones.
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from sklearn.cluster import MiniBatchKMeans
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("gpre1-refine")
|
||||
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
|
||||
EMBEDDING_URL = "http://embedding-service:8087"
|
||||
|
||||
|
||||
def main():
|
||||
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
|
||||
|
||||
# Step 1: Find large master controls and their object_group_ids
|
||||
with engine.connect() as c:
|
||||
large_mcs = c.execute(text("""
|
||||
SELECT mc.master_control_id, mc.object_group_id, mc.canonical_name, mc.total_controls,
|
||||
og.members, og.member_count
|
||||
FROM master_controls mc
|
||||
JOIN object_groups og ON og.group_id = mc.object_group_id
|
||||
WHERE mc.total_controls > 200
|
||||
ORDER BY mc.total_controls DESC
|
||||
""")).fetchall()
|
||||
|
||||
logger.info("Found %d large master controls to refine", len(large_mcs))
|
||||
|
||||
# Step 2: For each large group, re-cluster the object members
|
||||
with engine.connect() as c:
|
||||
max_gid = c.execute(text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")).scalar()
|
||||
next_gid = max_gid + 1
|
||||
|
||||
groups_to_delete = []
|
||||
new_groups = []
|
||||
total_sub = 0
|
||||
|
||||
for mc_id, og_id, canonical, total, members_json, member_count in large_mcs:
|
||||
members = json.loads(members_json) if isinstance(members_json, str) else members_json
|
||||
|
||||
if len(members) < 20:
|
||||
logger.info(" Skip %s (%d members) — too few to split", canonical, len(members))
|
||||
continue
|
||||
|
||||
# Determine k based on group size
|
||||
k = max(4, min(len(members) // 15, 20)) # 4-20 sub-clusters
|
||||
|
||||
# Embed members
|
||||
embeddings = _embed_texts(members)
|
||||
if embeddings is None:
|
||||
logger.error(" Failed to embed %s", canonical)
|
||||
continue
|
||||
|
||||
# Normalize + cluster
|
||||
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
|
||||
norms[norms == 0] = 1
|
||||
normalized = embeddings / norms
|
||||
|
||||
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=min(100, len(members)),
|
||||
max_iter=50, random_state=42)
|
||||
labels = kmeans.fit_predict(normalized)
|
||||
|
||||
# Build sub-groups
|
||||
subs: dict[int, list[str]] = {}
|
||||
for i, member in enumerate(members):
|
||||
subs.setdefault(int(labels[i]), []).append(member)
|
||||
|
||||
for sub_members in subs.values():
|
||||
new_groups.append({
|
||||
"group_id": next_gid,
|
||||
"canonical_name": sub_members[0],
|
||||
"member_count": len(sub_members),
|
||||
"members": json.dumps(sub_members),
|
||||
"top_controls_count": 0,
|
||||
})
|
||||
next_gid += 1
|
||||
total_sub += 1
|
||||
|
||||
groups_to_delete.append(og_id)
|
||||
logger.info(" %s (%s, %d members) → %d sub-groups (k=%d)",
|
||||
mc_id, canonical, len(members), len(subs), k)
|
||||
|
||||
logger.info("Refinement: %d groups → %d sub-groups", len(groups_to_delete), total_sub)
|
||||
|
||||
# Step 3: Update DB — replace old object_groups, delete old master_controls
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
|
||||
# Delete old master controls and their members for affected groups
|
||||
for og_id in groups_to_delete:
|
||||
c.execute(text("""
|
||||
DELETE FROM master_control_members
|
||||
WHERE master_control_uuid IN (
|
||||
SELECT id FROM master_controls WHERE object_group_id = :gid
|
||||
)
|
||||
"""), {"gid": og_id})
|
||||
c.execute(text("DELETE FROM master_controls WHERE object_group_id = :gid"), {"gid": og_id})
|
||||
c.execute(text("DELETE FROM object_groups WHERE group_id = :gid"), {"gid": og_id})
|
||||
|
||||
# Insert new sub-groups
|
||||
for g in new_groups:
|
||||
c.execute(text("""
|
||||
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
|
||||
VALUES (:group_id, :canonical_name, :member_count, CAST(:members AS jsonb), :top_controls_count)
|
||||
"""), g)
|
||||
|
||||
logger.info("DB updated: %d old groups deleted, %d new groups inserted", len(groups_to_delete), len(new_groups))
|
||||
|
||||
# Step 4: Re-run master control generation for affected groups
|
||||
logger.info("Re-generating master controls for new sub-groups...")
|
||||
_regenerate_master_controls(engine, [g["group_id"] for g in new_groups])
|
||||
|
||||
# Final stats
|
||||
with engine.connect() as c:
|
||||
mc_count = c.execute(text("SELECT count(*) FROM master_controls")).scalar()
|
||||
og_count = c.execute(text("SELECT count(*) FROM object_groups")).scalar()
|
||||
large = c.execute(text("SELECT count(*) FROM master_controls WHERE total_controls > 200")).scalar()
|
||||
logger.info("Final: %d master controls, %d object groups, %d still >200", mc_count, og_count, large)
|
||||
|
||||
|
||||
def _regenerate_master_controls(engine, group_ids: list[int]):
|
||||
"""Re-create master controls for specific object_group_ids."""
|
||||
from collections import defaultdict
|
||||
from services.control_dedup import normalize_object
|
||||
|
||||
# Build reverse index for new groups only
|
||||
object_to_group = {}
|
||||
with engine.connect() as c:
|
||||
for gid in group_ids:
|
||||
row = c.execute(text(
|
||||
"SELECT group_id, canonical_name, members FROM object_groups WHERE group_id = :gid"
|
||||
), {"gid": gid}).fetchone()
|
||||
if row:
|
||||
members = json.loads(row[2]) if isinstance(row[2], str) else row[2]
|
||||
for m in members:
|
||||
object_to_group[m] = (row[0], row[1])
|
||||
|
||||
# Load controls for these objects
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT id, control_id, generation_metadata->>'merge_group_hint' AS hint
|
||||
FROM canonical_controls
|
||||
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND release_state NOT IN ('deprecated', 'rejected')
|
||||
""")).fetchall()
|
||||
|
||||
group_phases: dict[int, dict[str, list]] = defaultdict(lambda: defaultdict(list))
|
||||
group_names: dict[int, str] = {}
|
||||
|
||||
for uuid, control_id, hint in rows:
|
||||
parts = hint.split(":", 2)
|
||||
if len(parts) < 2:
|
||||
continue
|
||||
action, obj = parts[0], parts[1]
|
||||
phase = parts[2] if len(parts) > 2 else "implementation"
|
||||
|
||||
normed = normalize_object(obj)
|
||||
if normed in object_to_group:
|
||||
gid, canonical = object_to_group[normed]
|
||||
elif obj in object_to_group:
|
||||
gid, canonical = object_to_group[obj]
|
||||
else:
|
||||
continue
|
||||
|
||||
group_phases[gid][phase].append((str(uuid), control_id, action))
|
||||
group_names[gid] = canonical
|
||||
|
||||
# Create master controls
|
||||
mc_count = 0
|
||||
mem_count = 0
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
for gid, phases in group_phases.items():
|
||||
if len(phases) < 2:
|
||||
continue
|
||||
|
||||
mc_id = "MC-%d" % gid
|
||||
canonical = group_names.get(gid, "unknown")
|
||||
sorted_phases = sorted(phases.keys())
|
||||
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
|
||||
total = sum(phase_counts.values())
|
||||
|
||||
c.execute(text("""
|
||||
INSERT INTO master_controls
|
||||
(master_control_id, object_group_id, canonical_name,
|
||||
phases_covered, phase_control_count, total_controls)
|
||||
VALUES (:mcid, :gid, :name,
|
||||
CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
|
||||
"""), {
|
||||
"mcid": mc_id, "gid": gid, "name": canonical,
|
||||
"phases": json.dumps(sorted_phases),
|
||||
"pcounts": json.dumps(phase_counts),
|
||||
"total": total,
|
||||
})
|
||||
|
||||
mc_uuid = c.execute(text(
|
||||
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
|
||||
), {"mcid": mc_id}).scalar()
|
||||
|
||||
for phase, controls in phases.items():
|
||||
for ctrl_uuid, ctrl_id, action in controls:
|
||||
c.execute(text("""
|
||||
INSERT INTO master_control_members
|
||||
(master_control_uuid, control_uuid, phase, action)
|
||||
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
|
||||
"""), {"mc": str(mc_uuid), "ctrl": ctrl_uuid, "phase": phase, "action": action})
|
||||
mem_count += 1
|
||||
|
||||
mc_count += 1
|
||||
|
||||
logger.info("Created %d new master controls with %d members", mc_count, mem_count)
|
||||
|
||||
|
||||
def _embed_texts(texts: list[str]) -> np.ndarray | None:
    """Embed texts in batches of 64, retrying each batch up to 3 times.

    Returns None if any batch still fails after the last attempt, so callers
    never cluster on zero-filled rows.
    """
    import time

    try:
        result = np.zeros((len(texts), 1024), dtype=np.float32)
        batch_size = 64
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            for attempt in range(3):
                try:
                    with httpx.Client(timeout=httpx.Timeout(60.0, connect=10.0)) as client:
                        resp = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                        resp.raise_for_status()
                        embs = resp.json().get("embeddings", [])
                        end = min(i + len(embs), len(texts))
                        result[i:end] = np.array(embs, dtype=np.float32)
                        break
                except Exception as e:
                    if attempt == 2:
                        logger.error("Embed batch %d failed: %s", i, e)
                        raise  # give up; the outer handler returns None
                    time.sleep(2)  # brief pause before the next attempt
        return result
    except Exception as e:
        logger.error("Embedding failed: %s", e)
        return None
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
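The refinement script picks the number of sub-clusters per group with `k = max(4, min(len(members) // 15, 20))`. A minimal sketch of that rule as a standalone function follows; `sub_cluster_count` and its parameter names are hypothetical, and the defaults simply restate the constants from the script.

```python
def sub_cluster_count(n_members, per_cluster=15, k_min=4, k_max=20):
    """Return the sub-cluster count for a group of n_members objects.

    Aims for roughly one sub-cluster per `per_cluster` members, clamped to
    the range [k_min, k_max].
    """
    return max(k_min, min(n_members // per_cluster, k_max))


if __name__ == "__main__":
    for n in (20, 60, 75, 150, 1000):
        print(n, "members ->", sub_cluster_count(n), "sub-clusters")
```

Because the script skips groups with fewer than 20 members, the lower clamp of 4 never exceeds the number of samples, which K-Means needs in order to fit.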
|
||||
@@ -0,0 +1,164 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
G-pre1 Step 2: Sub-cluster large object groups (>50 members) into k=4 sub-groups.
|
||||
|
||||
Reads existing object_groups, re-embeds members of large groups,
|
||||
applies K-Means with k=4 per group, and writes sub-groups back.
|
||||
|
||||
Usage (inside container or with PYTHONPATH):
|
||||
python3 /app/scripts/gpre1_subcluster.py
|
||||
python3 /app/scripts/gpre1_subcluster.py --min-size 100 # only groups >100
|
||||
python3 /app/scripts/gpre1_subcluster.py --sub-k 6 # 6 sub-clusters
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from sklearn.cluster import MiniBatchKMeans
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("gpre1-sub")
|
||||
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
|
||||
EMBEDDING_URL = "http://embedding-service:8087"
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--min-size", type=int, default=50, help="Min group size to sub-cluster")
|
||||
parser.add_argument("--sub-k", type=int, default=4, help="Sub-clusters per group")
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
|
||||
|
||||
# Load large groups
|
||||
with engine.connect() as c:
|
||||
groups = c.execute(text(
|
||||
"SELECT group_id, canonical_name, member_count, members "
|
||||
"FROM object_groups WHERE member_count > :min ORDER BY member_count DESC"
|
||||
), {"min": args.min_size}).fetchall()
|
||||
|
||||
logger.info("Found %d groups with >%d members to sub-cluster", len(groups), args.min_size)
|
||||
|
||||
# Find next available group_id
|
||||
with engine.connect() as c:
|
||||
max_gid = c.execute(text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")).scalar()
|
||||
next_gid = max_gid + 1
|
||||
|
||||
total_sub_groups = 0
|
||||
all_new_rows = []
|
||||
groups_to_delete = []
|
||||
|
||||
for group_id, canonical_name, member_count, members_json in groups:
|
||||
members = json.loads(members_json) if isinstance(members_json, str) else members_json
|
||||
|
||||
if len(members) < args.sub_k * 2:
|
||||
logger.info(" Skip group %d (%s, %d members) — too small for k=%d",
|
||||
group_id, canonical_name, len(members), args.sub_k)
|
||||
continue
|
||||
|
||||
# Embed members
|
||||
embeddings = _embed_batch(members)
|
||||
if embeddings is None:
|
||||
logger.error(" Failed to embed group %d (%s)", group_id, canonical_name)
|
||||
continue
|
||||
|
||||
# Normalize for cosine
|
||||
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
|
||||
norms[norms == 0] = 1
|
||||
normalized = embeddings / norms
|
||||
|
||||
# Sub-cluster
|
||||
k = min(args.sub_k, len(members) // 2)
|
||||
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=min(100, len(members)),
|
||||
max_iter=50, random_state=42)
|
||||
labels = kmeans.fit_predict(normalized)
|
||||
|
||||
# Build sub-groups
|
||||
sub_groups: dict[int, list[str]] = {}
|
||||
for i, member in enumerate(members):
|
||||
sub_groups.setdefault(int(labels[i]), []).append(member)
|
||||
|
||||
# Create new rows
|
||||
for sub_id, sub_members in sub_groups.items():
|
||||
sub_canonical = sub_members[0] # Most frequent would be better but we don't have freq here
|
||||
all_new_rows.append({
|
||||
"group_id": next_gid,
|
||||
"canonical_name": sub_canonical,
|
||||
"member_count": len(sub_members),
|
||||
"members": json.dumps(sub_members),
|
||||
"top_controls_count": 0,
|
||||
"parent_group_id": group_id,
|
||||
})
|
||||
next_gid += 1
|
||||
|
||||
groups_to_delete.append(group_id)
|
||||
total_sub_groups += len(sub_groups)
|
||||
|
||||
if len(groups_to_delete) % 50 == 0:
|
||||
logger.info(" Processed %d/%d groups, %d sub-groups created",
|
||||
len(groups_to_delete), len(groups), total_sub_groups)
|
||||
|
||||
logger.info("Sub-clustering complete: %d groups → %d sub-groups",
|
||||
len(groups_to_delete), total_sub_groups)
|
||||
|
||||
# Stats
|
||||
sub_sizes = [r["member_count"] for r in all_new_rows]
|
||||
if sub_sizes:
|
||||
logger.info(" Sub-group sizes: avg=%.1f, max=%d, min=%d",
|
||||
sum(sub_sizes) / len(sub_sizes), max(sub_sizes), min(sub_sizes))
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN — not writing to DB")
|
||||
for r in all_new_rows[:10]:
|
||||
logger.info(" [%d] %s (%d members)", r["group_id"], r["canonical_name"], r["member_count"])
|
||||
return
|
||||
|
||||
# Write to DB: delete old large groups, insert sub-groups
|
||||
with engine.begin() as c:
|
||||
c.execute(text("SET search_path TO compliance, public"))
|
||||
# Delete old large groups
|
||||
for gid in groups_to_delete:
|
||||
c.execute(text("DELETE FROM object_groups WHERE group_id = :gid"), {"gid": gid})
|
||||
# Insert sub-groups
|
||||
for r in all_new_rows:
|
||||
c.execute(text("""
|
||||
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
|
||||
VALUES (:group_id, :canonical_name, :member_count, CAST(:members AS jsonb), :top_controls_count)
|
||||
"""), r)
|
||||
|
||||
logger.info("Wrote %d sub-groups to DB (replaced %d large groups)", len(all_new_rows), len(groups_to_delete))
|
||||
|
||||
# Final stats
|
||||
with engine.connect() as c:
|
||||
total = c.execute(text("SELECT count(*) FROM object_groups")).scalar()
|
||||
logger.info("Total groups in DB: %d", total)
|
||||
|
||||
|
||||
def _embed_batch(texts: list[str]) -> np.ndarray | None:
    """Embed a list of texts, return numpy array."""
    try:
        all_emb = np.zeros((len(texts), 1024), dtype=np.float32)
        batch_size = 64
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            with httpx.Client(timeout=httpx.Timeout(60.0, connect=10.0)) as client:
                resp = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                resp.raise_for_status()
                embs = resp.json().get("embeddings", [])
                end = min(i + len(embs), len(texts))
                all_emb[i:end] = np.array(embs, dtype=np.float32)
        return all_emb
    except Exception as e:
        logger.error("Embedding failed: %s", e)
        return None
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
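The step that turns K-Means labels into replacement rows can be sketched in isolation. This is a minimal illustration, assuming `labels` is any sequence of integer cluster labels aligned with `members`; `build_sub_groups` is a hypothetical helper, and unlike the script it iterates labels in sorted order so that the assigned group_ids are deterministic.

```python
def build_sub_groups(members, labels, parent_group_id, next_gid):
    """Group members by cluster label and emit one row per sub-group.

    Returns (rows, next_free_group_id). As in the script, the first member
    of each sub-group serves as its canonical name.
    """
    by_label = {}
    for member, label in zip(members, labels):
        by_label.setdefault(int(label), []).append(member)

    rows = []
    for label in sorted(by_label):
        sub_members = by_label[label]
        rows.append({
            "group_id": next_gid,
            "canonical_name": sub_members[0],
            "member_count": len(sub_members),
            "members": sub_members,
            "parent_group_id": parent_group_id,
        })
        next_gid += 1
    return rows, next_gid


if __name__ == "__main__":
    rows, next_free = build_sub_groups(["a", "b", "c", "d"], [1, 0, 1, 0], 7, 100)
    for row in rows:
        print(row)
    print("next free group_id:", next_free)
```

Note that the script's INSERT statement does not write `parent_group_id`, so the link back to the original group exists only in memory unless the table gains such a column.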
|
||||
@@ -0,0 +1,214 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
G-pre2 v2: Build Master Controls directly from canonical tokens.
|
||||
|
||||
No K-Means needed — Phase 2 already normalized merge_group_hints
|
||||
to 74 canonical tokens. Each token = one object group.
|
||||
|
||||
Groups controls by (canonical_token, phase) and creates MCs
|
||||
for tokens with >=2 distinct phases.
|
||||
|
||||
Usage:
|
||||
python3 /app/scripts/gpre2_direct_mc.py --dry-run
|
||||
python3 /app/scripts/gpre2_direct_mc.py --min-phases 2
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from collections import defaultdict
|
||||
|
||||
from sqlalchemy import create_engine, text
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
|
||||
)
|
||||
logger = logging.getLogger("gpre2-direct")
|
||||
|
||||
DB_URL = os.getenv(
|
||||
"DATABASE_URL",
|
||||
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
|
||||
)
|
||||
|
||||
PHASE_ORDER = {
    "scope": 0, "definition": 1, "governance": 1,
    "design": 2, "implementation": 3, "configuration": 3,
    "operation": 4, "training": 4, "monitoring": 5,
    "testing": 6, "review": 7, "assessment": 8, "remediation": 8,
    "validation": 9, "reporting": 10, "evidence": 11,
}
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--min-phases", type=int, default=2)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
engine = create_engine(
|
||||
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
|
||||
)
|
||||
|
||||
# Step 1: Load all controls with merge_group_hint
|
||||
logger.info("Loading controls...")
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text("""
|
||||
SELECT id, control_id,
|
||||
generation_metadata->>'merge_group_hint' AS hint
|
||||
FROM canonical_controls
|
||||
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
AND generation_metadata->>'merge_group_hint' != ''
|
||||
AND release_state NOT IN ('deprecated', 'rejected')
|
||||
""")).fetchall()
|
||||
|
||||
logger.info("Loaded %d controls", len(rows))
|
||||
|
||||
# Step 2: Group by (object_token, phase)
|
||||
token_phases: dict[str, dict[str, list]] = defaultdict(
|
||||
lambda: defaultdict(list)
|
||||
)
|
||||
|
||||
for uuid, control_id, hint in rows:
|
||||
parts = hint.split(":", 2)
|
||||
if len(parts) < 2:
|
||||
continue
|
||||
action = parts[0]
|
||||
obj = parts[1]
|
||||
phase = parts[2] if len(parts) > 2 else "implementation"
|
||||
token_phases[obj][phase].append((str(uuid), control_id, action))
|
||||
|
||||
logger.info("Found %d unique object tokens", len(token_phases))
|
||||
|
||||
# Step 3: Create Master Controls
|
||||
master_controls = []
|
||||
master_members = []
|
||||
|
||||
for token, phases in token_phases.items():
|
||||
if len(phases) < args.min_phases:
|
||||
continue
|
||||
|
||||
sorted_phases = sorted(
|
||||
phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99)
|
||||
)
|
||||
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
|
||||
total = sum(phase_counts.values())
|
||||
|
||||
master_controls.append({
|
||||
"canonical_name": token,
|
||||
"phases_covered": json.dumps(sorted_phases),
|
||||
"phase_control_count": json.dumps(phase_counts),
|
||||
"total_controls": total,
|
||||
})
|
||||
|
||||
for phase, controls in phases.items():
|
||||
for ctrl_uuid, ctrl_id, action in controls:
|
||||
master_members.append({
|
||||
"canonical_name": token,
|
||||
"control_uuid": ctrl_uuid,
|
||||
"phase": phase,
|
||||
"action": action,
|
||||
})
|
||||
|
||||
logger.info(
|
||||
"Created %d Master Controls with %d members (min %d phases)",
|
||||
len(master_controls), len(master_members), args.min_phases,
|
||||
)
|
||||
|
||||
# Stats
|
||||
if master_controls:
|
||||
counts = [mc["total_controls"] for mc in master_controls]
|
||||
phases_per = [
|
||||
len(json.loads(mc["phases_covered"])) for mc in master_controls
|
||||
]
|
||||
logger.info(" Avg controls/MC: %.1f", sum(counts) / len(counts))
|
||||
logger.info(" Max controls/MC: %d", max(counts))
|
||||
logger.info(" Avg phases/MC: %.1f", sum(phases_per) / len(phases_per))
|
||||
logger.info(" Max phases/MC: %d", max(phases_per))
|
||||
|
||||
# Size distribution
|
||||
logger.info("\n Size distribution:")
|
||||
logger.info(" ≤10: %d", sum(1 for c in counts if c <= 10))
|
||||
logger.info(" 11-50: %d", sum(1 for c in counts if 11 <= c <= 50))
|
||||
logger.info(" 51-200: %d", sum(1 for c in counts if 51 <= c <= 200))
|
||||
logger.info(" 201-500: %d", sum(1 for c in counts if 201 <= c <= 500))
|
||||
logger.info(" 501-2K: %d", sum(1 for c in counts if 501 <= c <= 2000))
|
||||
logger.info(" >2K: %d", sum(1 for c in counts if c > 2000))
|
||||
|
||||
# Top 15
|
||||
top = sorted(master_controls, key=lambda x: -x["total_controls"])[:15]
|
||||
logger.info("\n Top 15 Master Controls:")
|
||||
for mc in top:
|
||||
logger.info(
|
||||
" %6d %s (%d phases)",
|
||||
mc["total_controls"],
|
||||
mc["canonical_name"],
|
||||
len(json.loads(mc["phases_covered"])),
|
||||
)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("\nDRY RUN — not writing to DB")
|
||||
return
|
||||
|
||||
    # Step 4: Write to DB
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        c.execute(text("DELETE FROM master_control_members"))
        c.execute(text("DELETE FROM master_controls"))

        # Get next object_group_id
        max_gid = c.execute(
            text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
        ).scalar()
        next_gid = max_gid + 1

        mc_uuids = {}
        for mc in master_controls:
            gid = next_gid
            next_gid += 1
            mc_id = f"MC-{gid}"

            c.execute(text("""
                INSERT INTO master_controls
                    (master_control_id, object_group_id, canonical_name,
                     phases_covered, phase_control_count, total_controls)
                VALUES (:mcid, :gid, :name,
                        CAST(:phases AS jsonb),
                        CAST(:pcounts AS jsonb), :total)
            """), {
                "mcid": mc_id, "gid": gid,
                "name": mc["canonical_name"],
                "phases": mc["phases_covered"],
                "pcounts": mc["phase_control_count"],
                "total": mc["total_controls"],
            })

            mc_uuid = c.execute(text(
                "SELECT id FROM master_controls WHERE master_control_id = :mcid"
            ), {"mcid": mc_id}).scalar()
            mc_uuids[mc["canonical_name"]] = str(mc_uuid)

        # Insert members
        mem_count = 0
        for mem in master_members:
            mc_uuid = mc_uuids.get(mem["canonical_name"])
            if not mc_uuid:
                continue
            c.execute(text("""
                INSERT INTO master_control_members
                    (master_control_uuid, control_uuid, phase, action)
                VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid),
                        :phase, :action)
            """), {
                "mc": mc_uuid,
                "ctrl": mem["control_uuid"],
                "phase": mem["phase"],
                "action": mem["action"],
            })
            mem_count += 1

    logger.info("Wrote %d MCs + %d members to DB", len(master_controls), mem_count)


if __name__ == "__main__":
    main()
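The size distribution logged by the script above uses six fixed buckets, each counted with its own generator expression. As a minimal sketch of the same bucketing in a single pass, the helper below uses `bisect_left` over the inclusive upper bounds. The names `BOUNDS`, `LABELS` and `size_distribution` are hypothetical and do not appear in the script.

```python
from bisect import bisect_left

# Inclusive upper bounds of the buckets the script logs; the last bucket
# (">2K") is open-ended. These names are illustrative, not from the script.
BOUNDS = [10, 50, 200, 500, 2000]
LABELS = ["≤10", "11-50", "51-200", "201-500", "501-2K", ">2K"]


def size_distribution(counts):
    """Return {label: how many counts fall into that bucket}."""
    dist = {label: 0 for label in LABELS}
    for c in counts:
        # bisect_left returns the index of the first bound >= c, i.e. the
        # bucket whose inclusive upper bound covers c.
        dist[LABELS[bisect_left(BOUNDS, c)]] += 1
    return dist


if __name__ == "__main__":
    for label, n in size_distribution([3, 10, 11, 250, 5000]).items():
        print(f"{label:>8}: {n}")
```

Keeping the bounds in one list means that a changed bucket edge is edited in a single place.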
@@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
G-pre2: Build Master Controls from Object Groups + Lifecycle Phases.

Groups atomic controls by (object_group_id, phase) and creates
Master Controls for groups with >=2 distinct phases.

Usage:
    python3 /app/scripts/gpre2_master_controls.py
    python3 /app/scripts/gpre2_master_controls.py --min-phases 3
    python3 /app/scripts/gpre2_master_controls.py --dry-run
"""

import argparse
import json
import logging
import os
from collections import defaultdict

from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre2")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")

# Canonical phase ordering for lifecycle chains
PHASE_ORDER = {
    "scope": 0,
    "definition": 1, "governance": 1,
    "design": 2,
    "implementation": 3, "configuration": 3,
    "operation": 4, "training": 4,
    "monitoring": 5,
    "testing": 6,
    "review": 7,
    "assessment": 8, "remediation": 8,
    "validation": 9,
    "reporting": 10,
    "evidence": 11,
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-phases", type=int, default=2, help="Min distinct phases for Master Control")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})

    # Step 1: Build reverse index (object_token → group_id)
    logger.info("Building object → group_id reverse index...")
    object_to_group = {}
    with engine.connect() as c:
        groups = c.execute(text("SELECT group_id, canonical_name, members FROM object_groups")).fetchall()

    for gid, canonical, members_json in groups:
        members = json.loads(members_json) if isinstance(members_json, str) else members_json
        for member in members:
            object_to_group[member] = (gid, canonical)

    logger.info("Reverse index: %d objects → %d groups", len(object_to_group), len(groups))

    # Step 2: Load all controls with merge_group_hint
    logger.info("Loading controls with merge_group_hint...")
    with engine.connect() as c:
        rows = c.execute(text("""
            SELECT id, control_id,
                   generation_metadata->>'merge_group_hint' AS hint,
                   title
            FROM canonical_controls
            WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
              AND generation_metadata->>'merge_group_hint' != ''
              AND release_state NOT IN ('deprecated', 'rejected')
        """)).fetchall()

    logger.info("Loaded %d controls with merge_group_hint", len(rows))

    # Step 3: Parse and group by (group_id, phase)
    # Structure: group_id → {phase → [(control_uuid, control_id, action, title)]}
    # Deferred import, done once before the loop rather than per row.
    from services.control_dedup import normalize_object

    group_phases: dict[int, dict[str, list]] = defaultdict(lambda: defaultdict(list))
    group_names: dict[int, str] = {}
    unmatched = 0

    for uuid, control_id, hint, title in rows:
        # Hint format: "action:object[:phase]"; the phase defaults to "implementation".
        parts = hint.split(":", 2)
        if len(parts) < 2:
            continue
        action = parts[0]
        obj = parts[1]
        phase = parts[2] if len(parts) > 2 else "implementation"

        # Normalize object and find group
        normed = normalize_object(obj)

        if normed in object_to_group:
            gid, canonical = object_to_group[normed]
        elif obj in object_to_group:
            gid, canonical = object_to_group[obj]
        else:
            unmatched += 1
            continue

        group_phases[gid][phase].append((str(uuid), control_id, action, title))
        group_names[gid] = canonical

    logger.info("Grouped into %d object groups (%d controls unmatched to any group)",
                len(group_phases), unmatched)

    # Step 4: Create Master Controls (groups with >= min_phases distinct phases)
    master_controls = []
    master_members = []
    mc_counter = 0

    for gid, phases in group_phases.items():
        if len(phases) < args.min_phases:
            continue

        mc_counter += 1
        mc_id = "MC-%d" % gid
        canonical = group_names.get(gid, "unknown")

        # Sort phases by lifecycle order; unknown phases sort last
        sorted_phases = sorted(phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99))
        phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
        total = sum(phase_counts.values())

        master_controls.append({
            "master_control_id": mc_id,
            "object_group_id": gid,
            "canonical_name": canonical,
            "phases_covered": json.dumps(sorted_phases),
            "phase_control_count": json.dumps(phase_counts),
            "total_controls": total,
        })

        for phase, controls in phases.items():
            for ctrl_uuid, ctrl_id, action, title in controls:
                master_members.append({
                    "mc_id": mc_id,
                    "control_uuid": ctrl_uuid,
                    "phase": phase,
                    "action": action,
                })

    logger.info("Created %d Master Controls with %d members (min %d phases)",
                len(master_controls), len(master_members), args.min_phases)

    # Stats
    if master_controls:
        # Total number of member controls per Master Control
        control_counts = [mc["total_controls"] for mc in master_controls]
        phases_per_mc = [len(json.loads(mc["phases_covered"])) for mc in master_controls]
        logger.info(" Avg controls per MC: %.1f", sum(control_counts) / len(control_counts))
        logger.info(" Avg phases per MC: %.1f", sum(phases_per_mc) / len(phases_per_mc))
        logger.info(" Max controls in MC: %d", max(control_counts))
        logger.info(" Max phases in MC: %d", max(phases_per_mc))

        # Top 10
        top10 = sorted(master_controls, key=lambda x: -x["total_controls"])[:10]
        logger.info("\nTop 10 Master Controls:")
        for mc in top10:
            logger.info(" %s: %s (%d controls, phases: %s)",
                        mc["master_control_id"], mc["canonical_name"],
                        mc["total_controls"], mc["phases_covered"])

    if args.dry_run:
        logger.info("DRY RUN — not writing to DB")
        return

    # Step 5: Write to DB
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        c.execute(text("DELETE FROM master_control_members"))
        c.execute(text("DELETE FROM master_controls"))

        for mc in master_controls:
            c.execute(text("""
                INSERT INTO master_controls
                    (master_control_id, object_group_id, canonical_name,
                     phases_covered, phase_control_count, total_controls)
                VALUES (:master_control_id, :object_group_id, :canonical_name,
                        CAST(:phases_covered AS jsonb), CAST(:phase_control_count AS jsonb),
                        :total_controls)
            """), mc)

        # Get MC UUIDs for member inserts
        mc_uuids = {}
        for row in c.execute(text("SELECT id, master_control_id FROM master_controls")).fetchall():
            mc_uuids[row[1]] = str(row[0])

        for mem in master_members:
            mc_uuid = mc_uuids.get(mem["mc_id"])
            if not mc_uuid:
                continue
            c.execute(text("""
                INSERT INTO master_control_members
                    (master_control_uuid, control_uuid, phase, action)
                VALUES (CAST(:mc_uuid AS uuid), CAST(:control_uuid AS uuid), :phase, :action)
            """), {
                "mc_uuid": mc_uuid,
                "control_uuid": mem["control_uuid"],
                "phase": mem["phase"],
                "action": mem["action"],
            })

    logger.info("Wrote %d Master Controls + %d members to DB",
                len(master_controls), len(master_members))


if __name__ == "__main__":
    main()
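The grouping rule of `gpre2_master_controls.py` (parse `action:object[:phase]` hints, bucket by group and phase, keep groups with enough distinct phases) can be tried without a database. The following is a minimal sketch under stated assumptions: `group_by_phase` is a hypothetical helper, `object_to_group` maps object tokens directly to group ids, and the `normalize_object` step from `services.control_dedup` is left out.

```python
from collections import defaultdict


def group_by_phase(rows, object_to_group, min_phases=2):
    """Group (control_uuid, hint) rows by object group and lifecycle phase.

    Hypothetical stand-in for steps 3 and 4 of the script; object
    normalization is intentionally omitted in this sketch.
    """
    group_phases = defaultdict(lambda: defaultdict(list))
    for control_uuid, hint in rows:
        parts = hint.split(":", 2)
        if len(parts) < 2:
            continue  # malformed hint: no object part
        action, obj = parts[0], parts[1]
        # Same default as the script when the hint carries no phase.
        phase = parts[2] if len(parts) > 2 else "implementation"
        gid = object_to_group.get(obj)
        if gid is None:
            continue  # object belongs to no known group
        group_phases[gid][phase].append((control_uuid, action))
    # Only groups that cover at least min_phases distinct phases qualify.
    return {
        gid: dict(phases)
        for gid, phases in group_phases.items()
        if len(phases) >= min_phases
    }


if __name__ == "__main__":
    demo_rows = [
        ("u1", "define:password_policy:definition"),
        ("u2", "implement:password_policy"),
        ("u3", "review:backup:review"),
    ]
    print(group_by_phase(demo_rows, {"password_policy": 1, "backup": 2}))
```

In the demo, group 2 is dropped because its single control covers only one phase.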
@@ -0,0 +1,267 @@
#!/usr/bin/env python3
"""
G-pre2 INCREMENTAL: Add new atomic controls to Master Controls without rebuild.

Unlike gpre2_master_controls.py which DELETEs and rebuilds the entire
master_controls table, this script is non-destructive:
  - Existing master_controls stay untouched (same UUIDs, same MC-IDs)
  - For each object_group that gained new atomic controls:
      * If MC exists: append new members + update total_controls/phase_counts
      * If MC missing AND group now has >= min_phases: create new MC + all members

Usage:
    python3 /app/scripts/gpre2_master_controls_incremental.py --since 2026-05-18T02:53:00+00:00
    python3 /app/scripts/gpre2_master_controls_incremental.py --since 2026-05-18T02:53:00+00:00 --dry-run
    python3 /app/scripts/gpre2_master_controls_incremental.py --since 2026-05-18T02:53:00+00:00 --min-phases 2
"""

import argparse
import json
import logging
import os
from collections import defaultdict
from datetime import datetime, timezone

from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre2_incremental")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--since", required=True, help="ISO datetime — only consider atomics created at/after this")
    parser.add_argument("--min-phases", type=int, default=2, help="Min distinct phases to form a new MC (default 2)")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    since_dt = datetime.fromisoformat(args.since.replace("Z", "+00:00"))
    if since_dt.tzinfo is None:
        # Assumption: a --since value without a UTC offset is meant as UTC.
        # A naive datetime cannot be compared with timezone-aware created_at
        # values (as implied by the offset in the usage examples).
        since_dt = since_dt.replace(tzinfo=timezone.utc)
    logger.info("Incremental run since %s, min_phases=%d, dry_run=%s",
                since_dt.isoformat(), args.min_phases, args.dry_run)

    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})

    # Step 1: object → group_id reverse index
    object_to_group = {}
    with engine.connect() as c:
        groups = c.execute(text("SELECT group_id, canonical_name, members FROM object_groups")).fetchall()
    for gid, canonical, members_json in groups:
        members = json.loads(members_json) if isinstance(members_json, str) else members_json
        for member in members:
            object_to_group[member] = (gid, canonical)
    logger.info("Reverse index: %d objects → %d groups", len(object_to_group), len(groups))

    # Step 2: Load ALL atomics with merge_group_hint (we need full picture)
    with engine.connect() as c:
        all_rows = c.execute(text("""
            SELECT id, control_id,
                   generation_metadata->>'merge_group_hint' AS hint,
                   title,
                   created_at
            FROM canonical_controls
            WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
              AND generation_metadata->>'merge_group_hint' != ''
              AND release_state NOT IN ('deprecated', 'rejected', 'duplicate')
        """)).fetchall()
    logger.info("Loaded %d atomic controls total", len(all_rows))

    # Step 3: Build group_phases (gid → phase → [(uuid, control_id, action, title, is_new)])
    from services.control_dedup import normalize_object
    group_phases: dict[int, dict[str, list]] = defaultdict(lambda: defaultdict(list))
    group_names: dict[int, str] = {}
    new_atomic_count = 0
    new_groups_touched: set[int] = set()
    unmatched = 0

    for uuid, control_id, hint, title, created_at in all_rows:
        parts = hint.split(":", 2)
        if len(parts) < 2:
            continue
        action = parts[0]
        obj = parts[1]
        phase = parts[2] if len(parts) > 2 else "implementation"
        normed = normalize_object(obj)
        if normed in object_to_group:
            gid, canonical = object_to_group[normed]
        elif obj in object_to_group:
            gid, canonical = object_to_group[obj]
        else:
            unmatched += 1
            continue
        is_new = created_at >= since_dt
        group_phases[gid][phase].append((str(uuid), control_id, action, title, is_new))
        group_names[gid] = canonical
        if is_new:
            new_atomic_count += 1
            new_groups_touched.add(gid)

    logger.info("Total: %d new atomics across %d object_groups (%d unmatched)",
                new_atomic_count, len(new_groups_touched), unmatched)

    if not new_groups_touched:
        logger.info("Nothing to do — no new atomics matched to any object_group.")
        return

    # Step 4: For each touched object_group, decide action
    stats = {
        "groups_examined": len(new_groups_touched),
        "mcs_existing_updated": 0,
        "mcs_new_created": 0,
        "members_inserted": 0,
        "members_skipped_existing": 0,
        "groups_skipped_below_min_phases": 0,
        "groups_skipped_no_member_change": 0,
    }

    # Load existing master_controls index: master_control_id → (uuid, total_controls)
    with engine.connect() as c:
        mc_index = {row[1]: (str(row[0]), row[2]) for row in c.execute(text(
            "SELECT id, master_control_id, total_controls FROM master_controls"
        )).fetchall()}
    logger.info("Existing master_controls: %d", len(mc_index))

    # Load existing members for touched MCs (avoid duplicate inserts)
    touched_mc_ids = ["MC-%d" % gid for gid in new_groups_touched]
    existing_members: dict[str, set[str]] = defaultdict(set)
    with engine.connect() as c:
        for mc_id_str in touched_mc_ids:
            mc_uuid_info = mc_index.get(mc_id_str)
            if not mc_uuid_info:
                continue
            mc_uuid = mc_uuid_info[0]
            for row in c.execute(text(
                "SELECT control_uuid FROM master_control_members WHERE master_control_uuid = CAST(:u AS uuid)"
            ), {"u": mc_uuid}).fetchall():
                existing_members[mc_id_str].add(str(row[0]))

    # Build INSERT/UPDATE plans
    inserts_new_mcs = []
    inserts_members = []
    updates_mcs = []

    PHASE_ORDER = {
        "scope": 0, "definition": 1, "governance": 1, "design": 2,
        "implementation": 3, "configuration": 3, "operation": 4, "training": 4,
        "monitoring": 5, "testing": 6, "review": 7, "assessment": 8,
        "remediation": 8, "validation": 9, "reporting": 10, "evidence": 11,
    }

    for gid in new_groups_touched:
        mc_id_str = "MC-%d" % gid
        phases = group_phases[gid]
        canonical = group_names[gid]
        all_phases = sorted(phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99))
        phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
        total = sum(phase_counts.values())

        existing_mc = mc_index.get(mc_id_str)

        if existing_mc:
            # MC exists: append every control of the group that is not yet a
            # member. Membership is the only filter; is_new is not consulted here.
            mc_uuid = existing_mc[0]
            existing_set = existing_members[mc_id_str]
            added_for_this_mc = 0
            for phase, controls in phases.items():
                for ctrl_uuid, ctrl_id, action, title, is_new in controls:
                    if ctrl_uuid in existing_set:
                        stats["members_skipped_existing"] += 1
                        continue
                    inserts_members.append({
                        "mc_uuid": mc_uuid, "control_uuid": ctrl_uuid,
                        "phase": phase, "action": action,
                    })
                    stats["members_inserted"] += 1
                    added_for_this_mc += 1
            if added_for_this_mc > 0:
                updates_mcs.append({
                    "mc_uuid": mc_uuid,
                    "phases_covered": json.dumps(all_phases),
                    "phase_control_count": json.dumps(phase_counts),
                    "total_controls": total,
                })
                stats["mcs_existing_updated"] += 1
            else:
                stats["groups_skipped_no_member_change"] += 1
        else:
            # MC missing: create only if the group now meets the min_phases threshold
            if len(phases) < args.min_phases:
                stats["groups_skipped_below_min_phases"] += 1
                continue
            inserts_new_mcs.append({
                "master_control_id": mc_id_str,
                "object_group_id": gid,
                "canonical_name": canonical,
                "phases_covered": json.dumps(all_phases),
                "phase_control_count": json.dumps(phase_counts),
                "total_controls": total,
                # ctrl is (uuid, control_id, action, title, is_new)
                "_members": [
                    {"control_uuid": ctrl[0], "phase": p, "action": ctrl[2]}
                    for p, ctrls in phases.items() for ctrl in ctrls
                ],
            })
            stats["mcs_new_created"] += 1

    logger.info("Plan summary: %s", stats)

    if args.dry_run:
        logger.info("DRY RUN — no writes")
        # Show first few examples
        if inserts_new_mcs:
            logger.info("Sample NEW MCs (up to 5):")
            for mc in inserts_new_mcs[:5]:
                logger.info(" %s: %s — total=%d, phases=%s",
                            mc["master_control_id"], mc["canonical_name"],
                            mc["total_controls"], mc["phases_covered"])
        if updates_mcs:
            logger.info("Updates to existing MCs: %d", len(updates_mcs))
        return

    # Step 5: WRITE — strictly INSERT/UPDATE, no DELETE
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))

        # 5a: Insert new MCs + their members
        for mc in inserts_new_mcs:
            new_uuid_row = c.execute(text("""
                INSERT INTO master_controls
                    (master_control_id, object_group_id, canonical_name,
                     phases_covered, phase_control_count, total_controls)
                VALUES (:master_control_id, :object_group_id, :canonical_name,
                        CAST(:phases_covered AS jsonb), CAST(:phase_control_count AS jsonb),
                        :total_controls)
                RETURNING id
            """), {k: v for k, v in mc.items() if k != "_members"}).fetchone()
            new_mc_uuid = str(new_uuid_row[0])
            for mem in mc["_members"]:
                c.execute(text("""
                    INSERT INTO master_control_members
                        (master_control_uuid, control_uuid, phase, action)
                    VALUES (CAST(:mc_uuid AS uuid), CAST(:control_uuid AS uuid), :phase, :action)
                """), {"mc_uuid": new_mc_uuid, **mem})

        # 5b: Append new members to existing MCs
        for mem in inserts_members:
            c.execute(text("""
                INSERT INTO master_control_members
                    (master_control_uuid, control_uuid, phase, action)
                VALUES (CAST(:mc_uuid AS uuid), CAST(:control_uuid AS uuid), :phase, :action)
            """), mem)

        # 5c: Update phase counts / totals on touched existing MCs
        for upd in updates_mcs:
            c.execute(text("""
                UPDATE master_controls
                SET phases_covered = CAST(:phases_covered AS jsonb),
                    phase_control_count = CAST(:phase_control_count AS jsonb),
                    total_controls = :total_controls
                WHERE id = CAST(:mc_uuid AS uuid)
            """), upd)

    logger.info("DONE — wrote %d new MCs, updated %d existing MCs, %d members inserted",
                stats["mcs_new_created"], stats["mcs_existing_updated"], stats["members_inserted"])


if __name__ == "__main__":
    main()
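The per-group decision in the incremental script (update an existing MC, create a new one, or skip) is a pure function of three inputs, so it can be checked in isolation. A minimal sketch, assuming a simplified data shape: `plan_incremental` is a hypothetical helper, and `phases` maps each phase to a plain list of control UUIDs rather than the script's five-field tuples.

```python
def plan_incremental(phases, existing_members, mc_exists, min_phases=2):
    """Decide what to do for one object group.

    phases: {phase: [control_uuid, ...]} covering ALL controls of the group.
    existing_members: set of control UUIDs already attached to the MC.
    Returns (action, members_to_insert, phase_counts), where action is
    "update", "create" or "skip". Hypothetical helper for illustration only.
    """
    phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
    all_members = [(p, u) for p, ctrls in phases.items() for u in ctrls]

    if mc_exists:
        # Append only what is not yet a member; counts are recomputed from
        # the full picture, not incremented.
        to_insert = [(p, u) for p, u in all_members if u not in existing_members]
        return ("update" if to_insert else "skip"), to_insert, phase_counts

    if len(phases) >= min_phases:
        # No MC yet and the group qualifies: create it with all members.
        return "create", all_members, phase_counts

    return "skip", [], phase_counts


if __name__ == "__main__":
    demo = {"design": ["a"], "testing": ["b", "c"]}
    print(plan_incremental(demo, {"a", "b"}, mc_exists=True))
```

Recomputing `phase_counts` from the full group mirrors what the script does and avoids drift between stored totals and actual membership.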
@@ -0,0 +1,298 @@
#!/usr/bin/env python3
"""
G-pre3: Split large Master Controls by regulation source.

For each MC with >200 controls:
  1. Load member controls with parent's source_citation->>'source'
  2. Group by regulation source
  3. Sources with >= MIN_SOURCE_SIZE → new sub-MC
  4. Small sources → merge into "mixed" bucket
  5. UNKNOWN (no source_citation) → sub-cluster by embedding if >MAX_MC
  6. Delete original large MC, create new sub-MCs

Usage:
    python3 /app/scripts/gpre3_regulation_split.py --dry-run
    python3 /app/scripts/gpre3_regulation_split.py --min-source 15 --max-mc 100
"""

import argparse
import json
import logging
import os
import re
from collections import defaultdict

from sqlalchemy import create_engine, text

from services.embedding_utils import subcluster_controls

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre3")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)

# ── Source key normalization ────────────────────────────────────────
# fmt: off
_SOURCE_SHORT: dict[str, str] = {
    "DSGVO (EU) 2016/679": "dsgvo", "NIS2-Richtlinie (EU) 2022/2555": "nis2",
    "KI-Verordnung (EU) 2024/1689": "ai_act", "Cyber Resilience Act (CRA)": "cra",
    "Digital Services Act (DSA)": "dsa", "Digital Markets Act (DMA)": "dma",
    "Digital Operational Resilience Act": "dora", "Data Governance Act (DGA)": "dga",
    "Data Act": "data_act", "Maschinenverordnung (EU) 2023/1230": "machinery_reg",
    "Medizinprodukteverordnung (EU) 2017/745 (MDR)": "mdr",
    "European Health Data Space": "ehds", "European Accessibility Act": "eaa",
    "EU Cybersecurity Act": "eu_csa", "EU Blue Guide 2022": "eu_blue_guide",
    "EU-US Data Privacy Framework": "eu_us_dpf", "Markets in Crypto-Assets (MiCA)": "mica",
    "Standardvertragsklauseln (SCC)": "scc", "ePrivacy-Richtlinie": "eprivacy",
    "Batterieverordnung (EU) 2023/1542": "battery_reg",
    "Bundesdatenschutzgesetz (BDSG)": "bdsg",
    "BSI-Gesetz (BSIG 2025, NIS2-Umsetzung)": "bsig",
    "BSI-Kritisverordnung (BSI-KritisV)": "bsi_kritisv",
    "Geldwaeschegesetz (GwG)": "gwg", "Hinweisgeberschutzgesetz (HinSchG)": "hinschg",
    "Lieferkettensorgfaltspflichtengesetz (LkSG)": "lksg",
    "KRITIS-Dachgesetz (KRITISDachG)": "kritisdachg",
    "NIST SP 800-53 Rev. 5": "nist_800_53", "NIST Cybersecurity Framework 2.0": "nist_csf",
    "NIST Privacy Framework 1.0": "nist_privacy",
    "NIST SP 800-207 (Zero Trust)": "nist_zero_trust",
    "NIST SP 800-218 (SSDF)": "nist_ssdf", "NIST SP 800-63-3": "nist_800_63",
    "NIST AI Risk Management Framework": "nist_ai_rmf",
    "NISTIR 8259A IoT Security": "nist_iot",
    "OWASP Top 10 (2021)": "owasp_top10", "OWASP API Security Top 10 (2023)": "owasp_api",
    "OWASP ASVS 4.0": "owasp_asvs", "OWASP SAMM 2.0": "owasp_samm",
    "OWASP MASVS 2.0": "owasp_masvs", "OWASP Mobile Top 10": "owasp_mobile",
    "ENISA": "enisa", "TDDDG": "tdddg", "TKG": "tkg", "TMG": "tmg",
    "BGB": "bgb", "UWG": "uwg", "UrhG": "urhg",
    "BAIT (BaFin 2024)": "bait", "VAIT (BaFin 2022)": "vait",
    "AML-Verordnung": "aml_reg", "Zahlungsdiensterichtlinie 2": "psd2",
    "Telekommunikationsgesetz Oesterreich": "at_tkg",
    "Österreichisches Datenschutzgesetz (DSG)": "at_dsg",
    "Allgemeines Gleichbehandlungsgesetz (AGG)": "agg",
    "Aktiengesetz (AktG)": "aktg", "Handelsgesetzbuch (HGB)": "hgb",
    "GmbH-Gesetz (GmbHG)": "gmbhg", "Insolvenzordnung (InsO)": "inso",
    "Gewerbeordnung (GewO)": "gewo", "Abgabenordnung (AO)": "ao",
}
# fmt: on


def source_to_key(source: str) -> str:
    """Normalize regulation source name to a short slug key."""
    if source in _SOURCE_SHORT:
        return _SOURCE_SHORT[source]
    s = source.lower()
    s = re.sub(r"\(.*?\)", "", s)
    s = re.sub(r"[^a-z0-9äöüß]+", "_", s)
    s = re.sub(r"_+", "_", s).strip("_")
    return s[:40] if s else "unknown"


# ── Main ───────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-source", type=int, default=15,
                        help="Min controls per source for own sub-MC")
    parser.add_argument("--max-mc", type=int, default=100,
                        help="Max controls per sub-MC before sub-clustering")
    parser.add_argument("--threshold", type=int, default=200,
                        help="Only split MCs with more than N controls")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Step 1: Find large master controls
    with engine.connect() as c:
        large_mcs = c.execute(text("""
            SELECT mc.id, mc.master_control_id, mc.object_group_id,
                   mc.canonical_name, mc.total_controls
            FROM master_controls mc
            WHERE mc.total_controls > :threshold
            ORDER BY mc.total_controls DESC
        """), {"threshold": args.threshold}).fetchall()

    logger.info("Found %d MCs with >%d controls", len(large_mcs), args.threshold)
    if not large_mcs:
        return

    # Step 2: Build split plans
    all_splits = []
    for mc_uuid, mc_id, og_id, canonical, total in large_mcs:
        plan = _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args)
        all_splits.append(plan)

    total_new = sum(len(sp["sub_groups"]) for sp in all_splits)
    total_covered = sum(
        sum(len(sg["controls"]) for sg in sp["sub_groups"]) for sp in all_splits
    )
    logger.info("SUMMARY: %d large MCs → %d sub-MCs (%d controls)", len(all_splits), total_new, total_covered)

    if args.dry_run:
        logger.info("DRY RUN — not writing to DB")
        return

    _write_splits(engine, all_splits)


def _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args) -> dict:
    """Build a regulation-source split plan for one large MC."""
    logger.info("\n━━━ %s: %s (%d controls) ━━━", mc_id, canonical, total)

    with engine.connect() as c:
        members = c.execute(text("""
            SELECT mcm.control_uuid, mcm.phase, mcm.action,
                   cc.control_id, cc.title,
                   COALESCE(pc.source_citation->>'source', 'UNKNOWN') AS src
            FROM master_control_members mcm
            JOIN canonical_controls cc ON cc.id = mcm.control_uuid
            LEFT JOIN canonical_controls pc ON pc.id = cc.parent_control_uuid
            WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
        """), {"mc_uuid": str(mc_uuid)}).fetchall()

    by_source: dict[str, list[dict]] = defaultdict(list)
    for ctrl_uuid, phase, action, cid, title, src in members:
        by_source[src].append({
            "control_uuid": str(ctrl_uuid), "phase": phase,
            "action": action, "control_id": cid, "title": title,
        })

    sorted_sources = sorted(by_source.items(), key=lambda x: -len(x[1]))
    for src, ctrls in sorted_sources[:8]:
        logger.info(" %4d %s", len(ctrls), src)
    if len(sorted_sources) > 8:
        logger.info(" ... +%d more sources", len(sorted_sources) - 8)

    plan = {"mc_uuid": str(mc_uuid), "mc_id": mc_id, "og_id": og_id,
            "canonical": canonical, "total": total, "sub_groups": []}

    own_mc_sources = []
    mixed_controls = []
    for src, ctrls in sorted_sources:
        if src == "UNKNOWN":
            continue
        if len(ctrls) >= args.min_source:
            own_mc_sources.append((src, ctrls))
        else:
            mixed_controls.extend(ctrls)

    unknown_controls = by_source.get("UNKNOWN", [])

    # (a) Named regulation sub-MCs
    for src, ctrls in own_mc_sources:
        key = source_to_key(src)
        name = f"{canonical}_{key}"
        _add_subgroups(plan, name, src, ctrls, args.max_mc)

    # (b) Mixed small-source bucket
    if mixed_controls:
        _add_subgroups(plan, f"{canonical}_mixed", "mixed", mixed_controls, args.max_mc)

    # (c) UNKNOWN bucket
    if unknown_controls:
        _add_subgroups(plan, f"{canonical}_general", "general", unknown_controls, args.max_mc)

    logger.info(" → %d sub-groups:", len(plan["sub_groups"]))
    for sg in sorted(plan["sub_groups"], key=lambda x: -len(x["controls"])):
        logger.info(" %4d %s", len(sg["controls"]), sg["name"])

    return plan


def _add_subgroups(plan: dict, name: str, source: str,
                   controls: list[dict], max_mc: int):
    """Add controls as one or more sub-groups to the plan."""
    if len(controls) <= max_mc:
        plan["sub_groups"].append({"name": name, "source": source, "controls": controls})
    else:
        clusters = subcluster_controls(controls, max_mc)
        for i, cluster in enumerate(clusters):
            sub_name = f"{name}_{i+1}" if len(clusters) > 1 else name
            plan["sub_groups"].append({"name": sub_name, "source": source, "controls": cluster})


def _write_splits(engine, splits: list[dict]):
    """Apply split plan: delete old MCs, create new object_groups + MCs."""
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        max_gid = c.execute(
            text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
        ).scalar()
        next_gid = max_gid + 1
        total_mc = 0
        total_mem = 0

        for sp in splits:
            c.execute(text(
                "DELETE FROM master_control_members "
                "WHERE master_control_uuid = CAST(:u AS uuid)"
            ), {"u": sp["mc_uuid"]})
            c.execute(text(
                "DELETE FROM master_controls WHERE id = CAST(:u AS uuid)"
            ), {"u": sp["mc_uuid"]})
            logger.info("Deleted %s (%s)", sp["mc_id"], sp["canonical"])

            for sg in sp["sub_groups"]:
                if not sg["controls"]:
                    continue
                gid = next_gid
                next_gid += 1

                members_list = list({ctrl["control_id"] for ctrl in sg["controls"]})
                c.execute(text("""
                    INSERT INTO object_groups
                    (group_id, canonical_name, member_count, members, top_controls_count)
                    VALUES (:gid, :name, :cnt, CAST(:members AS jsonb), 0)
                """), {"gid": gid, "name": sg["name"], "cnt": len(members_list),
                       "members": json.dumps(members_list)})

                by_phase: dict[str, list[dict]] = defaultdict(list)
                for ctrl in sg["controls"]:
                    by_phase[ctrl["phase"]].append(ctrl)

                sorted_phases = sorted(by_phase.keys())
                phase_counts = {p: len(v) for p, v in by_phase.items()}
                mc_id = f"MC-{gid}"

                c.execute(text("""
                    INSERT INTO master_controls
                    (master_control_id, object_group_id, canonical_name,
                     phases_covered, phase_control_count, total_controls)
                    VALUES (:mcid, :gid, :name,
                            CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
                """), {"mcid": mc_id, "gid": gid, "name": sg["name"],
                       "phases": json.dumps(sorted_phases),
                       "pcounts": json.dumps(phase_counts),
                       "total": sum(phase_counts.values())})

                mc_uuid = c.execute(text(
                    "SELECT id FROM master_controls WHERE master_control_id = :mcid"
                ), {"mcid": mc_id}).scalar()

                for ctrl in sg["controls"]:
                    c.execute(text("""
                        INSERT INTO master_control_members
                        (master_control_uuid, control_uuid, phase, action)
                        VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
                    """), {"mc": str(mc_uuid), "ctrl": ctrl["control_uuid"],
                           "phase": ctrl["phase"], "action": ctrl["action"]})
                    total_mem += 1
                total_mc += 1

        logger.info("Created %d new MCs with %d members", total_mc, total_mem)

    with engine.connect() as c:
        stats = c.execute(text("""
            SELECT count(*), count(CASE WHEN total_controls > 200 THEN 1 END),
                   AVG(total_controls)::int
            FROM compliance.master_controls
        """)).fetchone()
        logger.info("Final: %d MCs, %d still >200, avg %d controls/MC", stats[0], stats[1], stats[2])


if __name__ == "__main__":
    main()
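Review note (illustrative only, not part of the diff): the per-sub-group bookkeeping in `_write_splits` groups controls by phase and derives the values stored in `phases_covered`, `phase_control_count` and `total_controls`. A minimal sketch of that step in isolation is below; the helper name `summarize_phases` and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def summarize_phases(controls: list[dict]) -> tuple[list[str], dict[str, int], int]:
    """Group controls by phase; return sorted phases, per-phase counts, and the total."""
    by_phase: dict[str, list[dict]] = defaultdict(list)
    for ctrl in controls:
        by_phase[ctrl["phase"]].append(ctrl)
    phase_counts = {p: len(v) for p, v in by_phase.items()}
    return sorted(by_phase), phase_counts, sum(phase_counts.values())


# Hypothetical sample data, not taken from the real database.
sample = [
    {"control_id": "C-1", "phase": "detect"},
    {"control_id": "C-2", "phase": "prevent"},
    {"control_id": "C-3", "phase": "detect"},
]
phases, counts, total = summarize_phases(sample)
```

Pulling this into a pure function would make the counting logic testable without a database connection.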
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Phase 0: Quality Audit for Master Control Assignments.

Uses Claude Sonnet to validate whether controls are correctly assigned
to their Master Controls. Samples controls from large and small MCs.

Usage:
    python3 /app/scripts/gpre_quality_audit.py
    python3 /app/scripts/gpre_quality_audit.py --large-sample 50 --small-sample 10
    python3 /app/scripts/gpre_quality_audit.py --mc MC-8292  # single MC
"""

import argparse
import json
import logging
import os
import random
import time
from collections import defaultdict

import httpx
from sqlalchemy import create_engine, text

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("quality-audit")

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = os.getenv("AUDIT_MODEL", "claude-sonnet-4-20250514")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"

SYSTEM_PROMPT = """Du bist ein Compliance-Experte der prüft ob Controls korrekt zu Master Controls zugeordnet sind.

Für jeden Control beantworte:
1. MATCH: Gehört dieser Control thematisch zum Master Control Topic?
2. CONFIDENCE: Wie sicher bist du? (0.0-1.0)
3. REASON: Kurze Begründung (max 1 Satz)
4. SUGGESTED_TOPIC: Falls MATCH=false, welches Topic wäre korrekt?

Wichtige Unterscheidungen:
- "monitoring" = kontinuierliche Überwachung, Alerting, Log-Analyse
- "training" = Schulung, Awareness, Lernmaterialien
- "personal_data" = personenbezogene Daten, DSGVO-Betroffenenrechte
- "procedure" = Verfahren, Prozesse (aber NICHT wenn es spezifisch um Incidents geht)
- "incident" = Sicherheitsvorfälle, Breach Notification, Recovery
- "policy" = Richtlinien, Regelwerke, Governance-Dokumente
- "encryption" = Verschlüsselung, Kryptografie, Key Management
- "audit_logging" = Protokollierung, Audit Trail, Nachvollziehbarkeit

Antworte NUR als JSON-Array, ein Objekt pro Control."""


def call_claude(controls_batch: list[dict], mc_topic: str) -> tuple[list[dict], dict]:
    """Send a batch of controls to Claude for validation.

    Returns a (results, usage) tuple. results is empty if the call fails
    or the response contains no JSON array.
    """
    items = []
    for c in controls_batch:
        items.append(
            f"- Control '{c['control_id']}': "
            f"Titel=\"{c['title']}\", "
            f"Objective=\"{c['objective'][:150]}...\", "
            f"Phase={c['phase']}, Action={c['action']}"
        )

    prompt = (
        f"Master Control Topic: \"{mc_topic}\"\n\n"
        f"Prüfe diese {len(controls_batch)} Controls:\n\n"
        + "\n".join(items)
        + "\n\nAntwort als JSON-Array mit Feldern: "
        "control_id, match (bool), confidence (float), reason (str), "
        "suggested_topic (str, nur wenn match=false)."
    )

    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 2048,
        "temperature": 0.1,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }

    for attempt in range(3):
        try:
            resp = httpx.post(
                ANTHROPIC_URL,
                headers=headers,
                json=payload,
                timeout=60.0,
            )
            resp.raise_for_status()
            data = resp.json()
            content = data.get("content", [{}])[0].get("text", "")
            usage = data.get("usage", {})

            # Parse JSON from response
            start = content.find("[")
            end = content.rfind("]") + 1
            if start >= 0 and end > start:
                results = json.loads(content[start:end])
                return results, usage
            logger.warning("No JSON array in response: %s", content[:200])
            return [], usage
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                wait = 30 * (attempt + 1)
                logger.warning("Rate limited, waiting %ds...", wait)
                time.sleep(wait)
            else:
                logger.error("API error: %s", e)
                return [], {}
        except Exception as e:
            logger.error("Request failed (attempt %d): %s", attempt + 1, e)
            if attempt < 2:
                time.sleep(5)
    return [], {}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--large-sample", type=int, default=50,
                        help="Controls to sample per large MC")
    parser.add_argument("--small-sample", type=int, default=10,
                        help="Controls to sample per small MC")
    parser.add_argument("--small-mc-count", type=int, default=50,
                        help="Number of small MCs to audit")
    parser.add_argument("--mc", type=str, default=None,
                        help="Audit a single MC by ID (e.g., MC-8292)")
    parser.add_argument("--batch-size", type=int, default=10,
                        help="Controls per API call")
    args = parser.parse_args()

    engine = create_engine(
        DB_URL, connect_args={"options": "-c search_path=compliance,public"}
    )

    # Load MCs to audit
    with engine.connect() as c:
        if args.mc:
            mcs = c.execute(text("""
                SELECT id, master_control_id, canonical_name, total_controls
                FROM master_controls WHERE master_control_id = :mc
            """), {"mc": args.mc}).fetchall()
        else:
            # Large MCs (>200) + random small MCs
            large = c.execute(text("""
                SELECT id, master_control_id, canonical_name, total_controls
                FROM master_controls WHERE total_controls > 200
                ORDER BY total_controls DESC
            """)).fetchall()

            small = c.execute(text("""
                SELECT id, master_control_id, canonical_name, total_controls
                FROM master_controls WHERE total_controls BETWEEN 10 AND 200
                ORDER BY RANDOM() LIMIT :cnt
            """), {"cnt": args.small_mc_count}).fetchall()

            mcs = list(large) + list(small)

    logger.info("Auditing %d Master Controls", len(mcs))

    # Results tracking
    total_checked = 0
    total_match = 0
    total_mismatch = 0
    total_input_tokens = 0
    total_output_tokens = 0
    mc_results: dict[str, dict] = {}
    all_mismatches: list[dict] = []

    for mc_uuid, mc_id, canonical, total in mcs:
        is_large = total > 200
        sample_size = args.large_sample if is_large else args.small_sample

        # Sample controls
        with engine.connect() as c:
            controls = c.execute(text("""
                SELECT mcm.control_uuid, mcm.phase, mcm.action,
                       cc.control_id, cc.title,
                       COALESCE(cc.objective, '') as objective
                FROM master_control_members mcm
                JOIN canonical_controls cc ON cc.id = mcm.control_uuid
                WHERE mcm.master_control_uuid = CAST(:mc AS uuid)
                ORDER BY RANDOM()
                LIMIT :n
            """), {"mc": str(mc_uuid), "n": sample_size}).fetchall()

        if not controls:
            continue

        control_dicts = [
            {"control_uuid": str(r[0]), "phase": r[1], "action": r[2],
             "control_id": r[3], "title": r[4] or "", "objective": r[5] or ""}
            for r in controls
        ]

        logger.info("\n%s: %s (%d total, sampling %d)",
                    mc_id, canonical, total, len(control_dicts))

        mc_match = 0
        mc_mismatch = 0

        # Process in batches
        for i in range(0, len(control_dicts), args.batch_size):
            batch = control_dicts[i:i + args.batch_size]
            results, usage = call_claude(batch, canonical)

            total_input_tokens += usage.get("input_tokens", 0)
            total_output_tokens += usage.get("output_tokens", 0)

            for r in results:
                if r.get("match", True):
                    mc_match += 1
                    total_match += 1
                else:
                    mc_mismatch += 1
                    total_mismatch += 1
                    mismatch = {
                        "mc_id": mc_id,
                        "mc_topic": canonical,
                        "control_id": r.get("control_id", "?"),
                        "confidence": r.get("confidence", 0),
                        "reason": r.get("reason", ""),
                        "suggested_topic": r.get("suggested_topic", ""),
                    }
                    all_mismatches.append(mismatch)

            total_checked += len(results)

            # Rate limit
            time.sleep(1)

        accuracy = mc_match / (mc_match + mc_mismatch) if (mc_match + mc_mismatch) > 0 else 1.0
        mc_results[mc_id] = {
            "canonical": canonical, "total": total,
            "checked": mc_match + mc_mismatch,
            "match": mc_match, "mismatch": mc_mismatch,
            "accuracy": accuracy,
        }
        logger.info(" → %d/%d correct (%.1f%%)",
                    mc_match, mc_match + mc_mismatch, accuracy * 100)

    # Final report
    _print_report(mc_results, all_mismatches, total_checked, total_match,
                  total_mismatch, total_input_tokens, total_output_tokens)


def _print_report(mc_results, mismatches, checked, match, mismatch,
                  input_tok, output_tok):
    """Print the quality audit report."""
    logger.info("\n" + "=" * 70)
    logger.info("QUALITY AUDIT REPORT")
    logger.info("=" * 70)
    logger.info("Total controls checked: %d", checked)
    logger.info("Correct assignments: %d (%.1f%%)",
                match, match / max(checked, 1) * 100)
    logger.info("Wrong assignments: %d (%.1f%%)",
                mismatch, mismatch / max(checked, 1) * 100)

    # Cost estimate
    cost_input = input_tok / 1_000_000 * 3.0  # Sonnet input: $3/MTok
    cost_output = output_tok / 1_000_000 * 15.0  # Sonnet output: $15/MTok
    logger.info("\nAPI Usage: %d input + %d output tokens",
                input_tok, output_tok)
    logger.info("Estimated cost: $%.2f", cost_input + cost_output)

    # Per-MC breakdown (worst first)
    logger.info("\n--- Per-MC Accuracy (worst first) ---")
    sorted_mcs = sorted(mc_results.values(), key=lambda x: x["accuracy"])
    for mc in sorted_mcs:
        flag = "❌" if mc["accuracy"] < 0.9 else "⚠️" if mc["accuracy"] < 0.95 else "✅"
        logger.info(" %s %s (%s): %d/%d = %.1f%% [total: %d]",
                    flag, mc["canonical"][:30].ljust(30),
                    "large" if mc["total"] > 200 else "small",
                    mc["match"], mc["checked"],
                    mc["accuracy"] * 100, mc["total"])

    # Top mismatches
    if mismatches:
        logger.info("\n--- Mismatches (all %d) ---", len(mismatches))
        for m in sorted(mismatches, key=lambda x: -x.get("confidence", 0)):
            logger.info(" %s in %s (%s) → should be '%s': %s",
                        m["control_id"], m["mc_id"], m["mc_topic"],
                        m["suggested_topic"], m["reason"])

    # Size-class breakdown
    large_mcs = [m for m in mc_results.values() if m["total"] > 200]
    small_mcs = [m for m in mc_results.values() if m["total"] <= 200]

    if large_mcs:
        lg_acc = sum(m["match"] for m in large_mcs) / max(sum(m["checked"] for m in large_mcs), 1)
        logger.info("\nLarge MCs (>200): %.1f%% accuracy (%d MCs)",
                    lg_acc * 100, len(large_mcs))
    if small_mcs:
        sm_acc = sum(m["match"] for m in small_mcs) / max(sum(m["checked"] for m in small_mcs), 1)
        logger.info("Small MCs (≤200): %.1f%% accuracy (%d MCs)",
                    sm_acc * 100, len(small_mcs))


if __name__ == "__main__":
    main()
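Review note (illustrative only, not part of the diff): `call_claude` recovers the JSON array from the model reply by slicing from the first `[` to the last `]`. The sketch below isolates that technique so it can be tested without an API call. The helper name `extract_json_array` and the sample reply are assumptions, and unlike the script, which retries on a parse error, this version simply returns an empty list.

```python
import json


def extract_json_array(content: str) -> list:
    """Return the JSON array embedded in free-form text, or [] if none parses."""
    start = content.find("[")
    end = content.rfind("]") + 1
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(content[start:end])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Hypothetical model reply with prose around the JSON payload.
reply = 'Here is the result:\n[{"control_id": "C-1", "match": true}]\nDone.'
rows = extract_json_array(reply)
```

The slice is deliberately greedy, so a reply containing two separate arrays will not parse and yields an empty result.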
@@ -0,0 +1,242 @@
#!/usr/bin/env python3
"""Parse BSI QUAIDAL Markdown catalog into a structural index.

Clean-Room principle: this script does NOT persist any QUAIDAL prose to disk.
It only extracts non-protectable structural facts (IDs, type, file paths,
cross-references to other QUAIDAL entries, references to external norms).

The derivation step (derive_quaidal_mcs.py) reads the index plus the original
.md files from the gitignored clone and asks the LLM to produce our own
wordings, never copying the BSI prose into our own controls/database.

Input: legal-sources/bsi-quaidal/0000_Markdown/**/*.md (gitignored clone)
Output: control-pipeline/data/quaidal/quaidal_index.json (structural only)

Usage:
    python3 control-pipeline/scripts/ingest_bsi_quaidal.py
    python3 control-pipeline/scripts/ingest_bsi_quaidal.py --check  # validate only
"""

from __future__ import annotations

import argparse
import json
import re
import subprocess
import sys
from dataclasses import asdict, dataclass, field
from pathlib import Path

try:
    import yaml
except ImportError:
    print("ERROR: PyYAML missing. Install with: pip install pyyaml", file=sys.stderr)
    sys.exit(2)

REPO_ROOT = Path(__file__).resolve().parents[2]
SOURCE_ROOT = REPO_ROOT / "legal-sources" / "bsi-quaidal"
MARKDOWN_ROOT = SOURCE_ROOT / "0000_Markdown"
OUTPUT_DIR = REPO_ROOT / "control-pipeline" / "data" / "quaidal"
OUTPUT_FILE = OUTPUT_DIR / "quaidal_index.json"

# Map folder name -> our internal kind. Sub-folders inside the Methoden tree
# (e.g. "QM-10_Dimension Reduction") are treated as method variants of their
# parent QM.
KIND_BY_PARENT_DIR = {
    "0000_Qualitätskriterien": "criterion",  # QKB → Master Control candidates
    "0001_Qualitätsbausteine": "building_block",  # QB → atomic controls
    "0002_Maßnahmen": "measure",  # M → mitigations
    "0003_Qualitätsmetriken_methoden": "metric",  # QM → runtime check / metric
    "0002_Referenz-Matrizen": "matrix",  # cross-walk matrix
    "9998_CustomTemplates": "template",
}

FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)
ID_RE = re.compile(r"\b((?:QKB|QB|MA|QM)-\d+[a-zA-Z]?)", re.IGNORECASE)


@dataclass
class IndexEntry:
    id: str  # Canonical ID: QKB-01, QB-03, M-12, QM-07
    kind: str  # criterion / building_block / measure / metric / matrix / template
    title_de: str
    title_en: str
    source_path: str  # relative to SOURCE_ROOT
    referenced_ids: list[str] = field(default_factory=list)  # other QUAIDAL IDs linked in this file
    external_refs: list[dict] = field(default_factory=list)  # {framework, citation}
    tags: list[str] = field(default_factory=list)
    share: bool | None = None


def parse_frontmatter(text: str) -> dict:
    m = FRONTMATTER_RE.match(text)
    if not m:
        return {}
    try:
        return yaml.safe_load(m.group(1)) or {}
    except yaml.YAMLError:
        return {}


def canonical_id(raw_id: str | list | None, filename: str) -> str | None:
    """QUAIDAL files sometimes list multiple IDs or odd casing — normalise."""
    candidates: list[str] = []
    if isinstance(raw_id, list):
        candidates.extend(str(x) for x in raw_id)
    elif isinstance(raw_id, str):
        candidates.append(raw_id)
    # Fallback: derive from filename
    candidates.append(filename)
    for c in candidates:
        m = ID_RE.search(c)
        if m:
            return m.group(1).upper().replace(" ", "-")
    return None


def determine_kind(path: Path) -> str:
    for parent in path.parents:
        if parent.name in KIND_BY_PARENT_DIR:
            return KIND_BY_PARENT_DIR[parent.name]
    return "unknown"


def collect_referenced_ids(body: str, own_id: str) -> list[str]:
    found = {m.group(1).upper() for m in ID_RE.finditer(body)}
    found.discard(own_id)
    return sorted(found)


REF_FRAMEWORKS = [
    ("AI Act", ["AI-Act", "AI Act", "Verordnung (EU) 2024/1689", "KI-VO"]),
    ("EU GDPR", ["DSGVO", "Verordnung (EU) 2016/679", "GDPR"]),
    ("ISO/IEC 25012", ["ISO/IEC 25012", "ISO 25012"]),
    ("ISO/IEC 25024", ["ISO/IEC 25024", "ISO 25024"]),
    ("ISO/IEC 23894", ["ISO/IEC 23894", "ISO 23894"]),
    ("ISO/IEC 42001", ["ISO/IEC 42001", "ISO 42001"]),
    ("NIST AI RMF", ["NIST AI RMF", "AI Risk Management Framework"]),
    ("BSI Grundschutz", ["IT-Grundschutz", "Grundschutz"]),
    ("BSI AIC4", ["AIC4", "AI Cloud Service Compliance Criteria"]),
]


def detect_external_refs(body: str) -> list[dict]:
    refs: list[dict] = []
    seen: set[tuple[str, str]] = set()
    # Scan every line for a known framework name and, where present, an
    # article/section number on the same line. Each (framework, citation)
    # pair is recorded once. We do NOT store the BSI "Kurzbeschr." column
    # to avoid copying their prose.
    for line in body.splitlines():
        for framework, patterns in REF_FRAMEWORKS:
            for pat in patterns:
                if pat.lower() in line.lower():
                    # Try to grab an article/section nearby (e.g. "Artikel 10")
                    art = re.search(r"(Artikel|Art\.?|Section|§)\s*([0-9]+[a-z]?)", line, re.IGNORECASE)
                    citation = f"{art.group(1)} {art.group(2)}" if art else None
                    key = (framework, citation or "")
                    if key in seen:
                        continue
                    seen.add(key)
                    refs.append({"framework": framework, "citation": citation})
                    break
    return refs


def parse_file(path: Path) -> IndexEntry | None:
    text = path.read_text(encoding="utf-8")
    fm = parse_frontmatter(text)
    body = text[text.find("---", 3) + 3 :] if text.startswith("---") else text

    own_id = canonical_id(fm.get("ID"), path.stem)
    if not own_id:
        return None

    title_de = str(fm.get("TitleGer") or fm.get("Title") or path.stem).strip()
    title_en = str(fm.get("Title") or "").strip()
    tags_raw = fm.get("tags") or []
    if isinstance(tags_raw, str):
        tags_raw = [tags_raw]
    tags = [str(t).strip() for t in tags_raw if t]

    share_val = fm.get("share")
    share = bool(share_val) if share_val is not None else None

    return IndexEntry(
        id=own_id,
        kind=determine_kind(path),
        title_de=title_de,
        title_en=title_en,
        source_path=str(path.relative_to(SOURCE_ROOT)),
        referenced_ids=collect_referenced_ids(body, own_id),
        external_refs=detect_external_refs(body),
        tags=tags,
        share=share,
    )


def get_commit_sha() -> str | None:
    try:
        out = subprocess.run(
            ["git", "-C", str(SOURCE_ROOT), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None


def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--check", action="store_true", help="Parse + validate, do not write output")
    args = ap.parse_args()

    if not MARKDOWN_ROOT.exists():
        print(f"ERROR: clone not found at {SOURCE_ROOT}", file=sys.stderr)
        print("Run: git clone --depth=1 https://github.com/BSI-Bund/QUAIDAL.git legal-sources/bsi-quaidal", file=sys.stderr)
        return 2

    entries: list[IndexEntry] = []
    skipped: list[Path] = []
    for path in sorted(MARKDOWN_ROOT.rglob("*.md")):
        entry = parse_file(path)
        if entry is None:
            skipped.append(path)
            continue
        entries.append(entry)

    by_kind: dict[str, int] = {}
    for e in entries:
        by_kind[e.kind] = by_kind.get(e.kind, 0) + 1

    print(f"Parsed {len(entries)} entries (skipped {len(skipped)} without ID):")
    for kind, count in sorted(by_kind.items()):
        print(f" {kind:18s} {count}")

    if args.check:
        return 0

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    payload = {
        "source": "BSI QUAIDAL",
        "source_url": "https://github.com/BSI-Bund/QUAIDAL",
        "commit_sha": get_commit_sha(),
        "license_note": (
            "BSI-Veroeffentlichung. Repo enthaelt keine SPDX-Lizenzdatei. "
            "Frontmatter share:true. Veroeffentlichung durch Bundesbehoerde, "
            "§ 5 UrhG (amtliche Werke) anwendbar. BSI hat 05/2026 die Annahme "
            "CC-BY-SA-4.0 in unserer Anfrage nicht widersprochen, aber auch "
            "nicht aktiv bestaetigt. Wir derivieren Clean-Room (eigene "
            "Formulierungen, nur Referenz auf BSI QUAIDAL Sektion)."
        ),
        "entries": [asdict(e) for e in entries],
    }
    OUTPUT_FILE.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"\nWrote index: {OUTPUT_FILE.relative_to(REPO_ROOT)}")
    print(f"Commit SHA: {payload['commit_sha']}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
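Review note (illustrative only, not part of the diff): `canonical_id` tries the frontmatter ID first and falls back to the filename, taking the first candidate that matches `ID_RE`. The sketch below reproduces that lookup with the same pattern. The helper name `first_id` and the sample strings are assumptions, apart from the folder name quoted in the script's own comment.

```python
import re
from typing import Optional

# Same pattern as ID_RE in the script above.
ID_RE = re.compile(r"\b((?:QKB|QB|MA|QM)-\d+[a-zA-Z]?)", re.IGNORECASE)


def first_id(*candidates: str) -> Optional[str]:
    """Return the first QUAIDAL-style ID found in the candidates, upper-cased."""
    for text in candidates:
        m = ID_RE.search(text)
        if m:
            return m.group(1).upper()
    return None


from_frontmatter = first_id("qkb-01", "some_file_name")
from_filename = first_id("", "QM-10_Dimension Reduction")
bare_m_prefix = first_id("M-12")
```

Under this pattern a bare `M-` prefix does not match, so it may be worth checking whether measure IDs in the catalog are written `MA-…` (as the regex expects) or `M-…` (as the `IndexEntry` comment suggests).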
@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""Ingest missing German laws from gesetze-im-internet.de.

Downloads full HTML, strips to text, uploads with legal chunking strategy.
Handles ISO-8859-1 charset typical for gesetze-im-internet.de.

Usage (on Mac Mini):
    python3 control-pipeline/scripts/ingest_de_laws.py --dry-run
    python3 control-pipeline/scripts/ingest_de_laws.py
"""

import argparse
import json
import logging
import time
from typing import Optional

import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-laws")

RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_gesetze"

# ---- Laws to ingest ----
# Each entry is a dict with the keys: url, regulation_id, name, short
# URL pattern: https://www.gesetze-im-internet.de/{slug}/BJNR*.html (full text)

LAWS = [
    {
        "url": "https://www.gesetze-im-internet.de/arbzg/BJNR117100994.html",
        "regulation_id": "de_arbzg",
        "name": "Arbeitszeitgesetz (ArbZG)",
        "short": "ArbZG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/muschg_2018/BJNR122810017.html",
        "regulation_id": "de_muschg",
        "name": "Mutterschutzgesetz (MuSchG)",
        "short": "MuSchG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/nachwg/BJNR094610995.html",
        "regulation_id": "de_nachwg",
        "name": "Nachweisgesetz (NachwG)",
        "short": "NachwG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/milog/BJNR134810014.html",
        "regulation_id": "de_milog",
        "name": "Mindestlohngesetz (MiLoG)",
        "short": "MiLoG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/gmbhg/BJNR004770892.html",
        "regulation_id": "de_gmbhg",
        "name": "GmbH-Gesetz (GmbHG)",
        "short": "GmbHG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/aktg/BJNR010890965.html",
        "regulation_id": "de_aktg",
        "name": "Aktiengesetz (AktG)",
        "short": "AktG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/inso/BJNR286600994.html",
        "regulation_id": "de_inso",
        "name": "Insolvenzordnung (InsO)",
        "short": "InsO",
    },
    # BEG IV is an amending act, so it has no standalone text on gesetze-im-internet.de
    {
        "url": "https://www.gesetze-im-internet.de/verpflg/BJNR009690974.html",
        "regulation_id": "de_verpflichtungsgesetz",
        "name": "Verpflichtungsgesetz",
        "short": "VerpflG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/burlg/BJNR000020963.html",
        "regulation_id": "de_burlg",
        "name": "Bundesurlaubsgesetz (BUrlG)",
        "short": "BUrlG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/entgfg/BJNR118010994.html",
        "regulation_id": "de_entgfg",
        "name": "Entgeltfortzahlungsgesetz (EntgFG)",
        "short": "EntgFG",
    },
]


def download_law(url: str) -> Optional[str]:
    """Download law HTML from gesetze-im-internet.de, handle charset."""
    with httpx.Client(timeout=30.0, follow_redirects=True) as c:
        resp = c.get(url)
        if resp.status_code != 200:
            logger.error(" HTTP %d for %s", resp.status_code, url)
            return None

        # gesetze-im-internet.de uses ISO-8859-1
        content_type = resp.headers.get("content-type", "")
        if "charset" in content_type:
            # Use declared charset
            html = resp.text
        else:
            # Try UTF-8 first, fall back to ISO-8859-1
            try:
                html = resp.content.decode("utf-8")
                if "\ufffd" in html:
                    raise UnicodeDecodeError("utf-8", b"", 0, 1, "replacement chars")
            except (UnicodeDecodeError, ValueError):
                html = resp.content.decode("iso-8859-1")

        return html


def upload_html(
    html: str,
    filename: str,
    regulation_id: str,
    name: str,
    short: str,
    dry_run: bool = False,
) -> Optional[dict]:
    """Upload HTML to RAG service with legal chunking."""
    if dry_run:
        logger.info(" DRY RUN — would upload %d chars", len(html))
        return {"chunks_count": 0, "document_id": "dry-run"}

    meta = {
        "regulation_id": regulation_id,
        "regulation_name_de": name,
        "regulation_short": short,
        "source": "gesetze-im-internet.de",
        "license": "public_domain_de_law",
        "jurisdiction": "DE",
        "source_type": "law",
    }
    form_data = {
        "collection": COLLECTION,
        "data_type": "compliance",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        "chunk_strategy": "legal",
        "chunk_size": "1500",
        "chunk_overlap": "100",
        "metadata_json": json.dumps(meta, ensure_ascii=False),
    }
    with httpx.Client(timeout=600.0, verify=False) as c:
        resp = c.post(
            f"{RAG_URL}/api/v1/documents/upload",
            files={"file": (filename, html.encode("utf-8"), "text/html")},
            data=form_data,
        )
        resp.raise_for_status()
        return resp.json()


def count_existing(regulation_id: str) -> int:
    """Check if regulation already exists in Qdrant."""
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
            json={
                "filter": {"must": [
                    {"key": "regulation_id", "match": {"value": regulation_id}}
                ]},
                "exact": True,
            },
        )
        resp.raise_for_status()
        return resp.json()["result"]["count"]


def main():
    parser = argparse.ArgumentParser(description="Ingest DE laws from gesetze-im-internet.de")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    logger.info("=" * 60)
    logger.info("Ingest German Laws")
    logger.info(" Laws: %d", len(LAWS))
    logger.info(" Collection: %s", COLLECTION)
    logger.info(" Dry run: %s", args.dry_run)
    logger.info("=" * 60)

    results = []
    for i, law in enumerate(LAWS, 1):
        logger.info("\n[%d/%d] %s (%s)", i, len(LAWS), law["name"], law["regulation_id"])

        # Check if already exists
        existing = count_existing(law["regulation_id"])
        if existing > 0:
            logger.info(" Already exists: %d chunks — SKIPPING", existing)
            results.append({"law": law["short"], "status": "exists", "chunks": existing})
            continue

        # Download
        logger.info(" Downloading: %s", law["url"])
        html = download_law(law["url"])
        if not html:
            results.append({"law": law["short"], "status": "download_failed", "chunks": 0})
            continue
        logger.info(" Downloaded: %d chars", len(html))

        # Upload
        filename = f"{law['regulation_id']}.html"
        try:
            result = upload_html(
                html, filename, law["regulation_id"],
                law["name"], law["short"], args.dry_run,
            )
            chunks = result.get("chunks_count", 0) if result else 0
            logger.info(" Uploaded: %d chunks", chunks)
            results.append({"law": law["short"], "status": "ok", "chunks": chunks})
        except Exception as e:
            logger.error(" Upload FAILED: %s", e)
            results.append({"law": law["short"], "status": "error", "chunks": 0})

        if i < len(LAWS):
            time.sleep(1)

    # Summary
    logger.info("\n" + "=" * 60)
    logger.info("RESULTS")
    logger.info("=" * 60)
    for r in results:
        logger.info(" %-10s %s chunks=%d", r["law"], r["status"].upper(), r["chunks"])

    total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
    logger.info("\nTotal new chunks: %d", total_new)


if __name__ == "__main__":
    main()
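Review note (illustrative only, not part of the diff): when the server declares no charset, `download_law` tries UTF-8 and falls back to ISO-8859-1. A strict UTF-8 decode raises `UnicodeDecodeError` on invalid bytes instead of emitting replacement characters, so the fallback can be sketched as below. The helper name `decode_html` and the sample text are assumptions.

```python
def decode_html(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to ISO-8859-1 on invalid sequences."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


# Hypothetical sample text containing non-ASCII characters.
sample_text = "Änderungsgesetz §5"
from_latin1 = decode_html(sample_text.encode("iso-8859-1"))
from_utf8 = decode_html(sample_text.encode("utf-8"))
```

Both inputs decode to the same string, which is the property the script relies on.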
@@ -0,0 +1,414 @@
#!/usr/bin/env python3
"""Ingest CRA-relevant ENISA documents into the RAG (collection `bp_compliance_ce`).

Source files live under `legal-sources/enisa/` in this repo. The script extracts
PDF text with pdfplumber (HTML for the SRP FAQ), normalizes it, and uploads via
the RAG service with `chunk_strategy='legal'` so that section metadata is
attached to every chunk.

Each document carries a `requirement_strength` field so downstream consumers
can distinguish normative material from guidance and consultation drafts:
  - mandatory — binding (none in this batch; CRA itself is the law)
  - guidance — official ENISA / EUCC guidance, citable
  - consultation_draft — public-consultation drafts (use with caveat)

Usage (run on Mac Mini after copying the legal-sources/enisa/ folder, or via SSH
with the repo mounted):
    python3 control-pipeline/scripts/ingest_enisa_cra.py --dry-run
    python3 control-pipeline/scripts/ingest_enisa_cra.py
"""

import argparse
import json
import re
import sys
import time
import unicodedata
from html.parser import HTMLParser
from pathlib import Path

import httpx
import pdfplumber

RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
UPLOAD_TIMEOUT = 1800.0
COLLECTION = "bp_compliance_ce"

REPO_ROOT = Path(__file__).resolve().parents[2]
SOURCE_DIR = REPO_ROOT / "legal-sources" / "enisa"

DOCS = [
|
||||
{
|
||||
"regulation_id": "enisa_cra_requirements_standards_mapping",
|
||||
"filename": "enisa_cra_requirements_standards_mapping.pdf",
|
||||
"upload_filename": "enisa_cra_requirements_standards_mapping.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_cra_requirements_standards_mapping",
|
||||
"regulation_short": "ENISA CRA Standards Mapping",
|
||||
"guideline_name": "Cyber Resilience Act Requirements Standards Mapping",
|
||||
"doc_type": "standards_mapping",
|
||||
"requirement_strength": "guidance",
|
||||
"publication_year": "2024",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_cra_implementation_via_eucc",
|
||||
"filename": "enisa_cra_implementation_via_eucc.pdf",
|
||||
"upload_filename": "enisa_cra_implementation_via_eucc.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_cra_implementation_via_eucc",
|
||||
"regulation_short": "ENISA CRA via EUCC",
|
||||
"guideline_name": "CRA Implementation via EUCC and its Applicable Technical Elements",
|
||||
"doc_type": "certification_guidance",
|
||||
"requirement_strength": "guidance",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_cra_implementation_via_eucc_annex",
|
||||
"filename": "enisa_cra_implementation_via_eucc_annex.pdf",
|
||||
"upload_filename": "enisa_cra_implementation_via_eucc_annex.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_cra_implementation_via_eucc_annex",
|
||||
"regulation_short": "ENISA CRA via EUCC (Annex)",
|
||||
"guideline_name": "Annex — CRA Implementation via EUCC",
|
||||
"doc_type": "certification_guidance_annex",
|
||||
"requirement_strength": "guidance",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_eucc_vulnerability_management_disclosure",
|
||||
"filename": "enisa_eucc_vulnerability_management_disclosure.pdf",
|
||||
"upload_filename": "enisa_eucc_vulnerability_management_disclosure.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_eucc_vulnerability_management_disclosure",
|
||||
"regulation_short": "EUCC Vuln Management & Disclosure",
|
||||
"guideline_name": "EUCC Guidelines — Vulnerability Management and Disclosure v1.1",
|
||||
"doc_type": "vulnerability_guidance",
|
||||
"requirement_strength": "guidance",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_eccg_opinion_vulnerability_management",
|
||||
"filename": "enisa_eccg_opinion_vulnerability_management.pdf",
|
||||
"upload_filename": "enisa_eccg_opinion_vulnerability_management.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_eccg_opinion_vulnerability_management",
|
||||
"regulation_short": "ECCG Opinion Vuln Management",
|
||||
"guideline_name": "Final ECCG Opinion — Guidance on Vulnerability Management",
|
||||
"doc_type": "eccg_opinion",
|
||||
"requirement_strength": "guidance",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_nis2_technical_implementation_guidance",
|
||||
"filename": "enisa_nis2_technical_implementation_guidance.pdf",
|
||||
"upload_filename": "enisa_nis2_technical_implementation_guidance.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_nis2_technical_implementation_guidance",
|
||||
"regulation_short": "ENISA NIS2 TIG v1.0",
|
||||
"guideline_name": "ENISA Technical Implementation Guidance on Cybersecurity Risk Management Measures v1.0",
|
||||
"doc_type": "technical_guidance",
|
||||
"requirement_strength": "guidance",
|
||||
"publication_year": "2025",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_nis2_security_measures_consultation",
|
||||
"filename": "enisa_nis2_security_measures_implementation_guidance_consultation.pdf",
|
||||
"upload_filename": "enisa_nis2_security_measures_consultation.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_nis2_security_measures_consultation",
|
||||
"regulation_short": "ENISA NIS2 Security Measures (Draft)",
|
||||
"guideline_name": "Implementation Guidance on Security Measures — Public Consultation Draft",
|
||||
"doc_type": "consultation_draft",
|
||||
"requirement_strength": "consultation_draft",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_cra_single_reporting_platform_faq",
|
||||
"filename": "enisa_cra_single_reporting_platform_faq.html",
|
||||
"upload_filename": "enisa_cra_single_reporting_platform_faq.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_cra_single_reporting_platform_faq",
|
||||
"regulation_short": "ENISA SRP FAQ",
|
||||
"guideline_name": "CRA Single Reporting Platform (SRP) FAQ",
|
||||
"doc_type": "faq",
|
||||
"requirement_strength": "guidance",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_eucc_evaluation_methodology_product_series",
|
||||
"filename": "enisa_eucc_evaluation_methodology_product_series.pdf",
|
||||
"upload_filename": "enisa_eucc_evaluation_methodology_product_series.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_eucc_evaluation_methodology_product_series",
|
||||
"regulation_short": "EUCC Eval Methodology Product Series",
|
||||
"guideline_name": "EUCC Guidelines — Evaluation Methodology for Product Series v1.0",
|
||||
"doc_type": "evaluation_methodology",
|
||||
"requirement_strength": "guidance",
|
||||
"publication_year": "2025",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_threat_landscape_2025",
|
||||
"filename": "enisa_threat_landscape_2025.pdf",
|
||||
"upload_filename": "enisa_threat_landscape_2025.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_threat_landscape_2025",
|
||||
"regulation_short": "ENISA Threat Landscape 2025",
|
||||
"guideline_name": "ENISA Threat Landscape 2025 v1.2",
|
||||
"doc_type": "threat_landscape",
|
||||
"requirement_strength": "evidentiary",
|
||||
"publication_year": "2025",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_cvd_policies_eu_2022",
|
||||
"filename": "enisa_cvd_policies_eu_2022.pdf",
|
||||
"upload_filename": "enisa_cvd_policies_eu_2022.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_cvd_policies_eu_2022",
|
||||
"regulation_short": "ENISA CVD Policies EU 2022",
|
||||
"guideline_name": "Coordinated Vulnerability Disclosure Policies in the EU (2022)",
|
||||
"doc_type": "policy_study",
|
||||
"requirement_strength": "guidance",
|
||||
"publication_year": "2022",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
"attribution": "ENISA, CC BY 4.0",
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def normalize_text(text: str) -> str:
|
||||
text = unicodedata.normalize("NFKC", text)
|
||||
    # Strip soft hyphens and zero-width spaces left behind by PDF extraction.
    text = text.replace("\u00ad", "").replace("\u200b", "")
|
||||
prev = None
|
||||
while prev != text:
|
||||
prev = text
|
||||
text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)
|
||||
text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)
|
||||
text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)
|
||||
text = re.sub(r"[^\S\n]{2,}", " ", text)
|
||||
return text
|
||||
|
||||
|
||||
class _HTMLToText(HTMLParser):
|
||||
SKIP = {"script", "style", "nav", "header", "footer", "noscript"}
|
||||
BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "section"}
|
||||
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
self._buf: list[str] = []
|
||||
self._skip_depth = 0
|
||||
|
||||
def handle_starttag(self, tag, attrs):
|
||||
if tag in self.SKIP:
|
||||
self._skip_depth += 1
|
||||
if tag in self.BLOCK:
|
||||
self._buf.append("\n")
|
||||
|
||||
def handle_endtag(self, tag):
|
||||
if tag in self.SKIP and self._skip_depth > 0:
|
||||
self._skip_depth -= 1
|
||||
if tag in self.BLOCK:
|
||||
self._buf.append("\n")
|
||||
|
||||
def handle_data(self, data):
|
||||
if self._skip_depth == 0:
|
||||
self._buf.append(data)
|
||||
|
||||
def text(self) -> str:
|
||||
raw = "".join(self._buf)
|
||||
raw = re.sub(r"\n{3,}", "\n\n", raw)
|
||||
return raw.strip()
|
||||
|
||||
|
||||
def extract_pdf(path: Path) -> str:
|
||||
print(f" Extracting PDF: {path.name}")
|
||||
parts: list[str] = []
|
||||
with pdfplumber.open(path) as pdf:
|
||||
for i, page in enumerate(pdf.pages):
|
||||
t = page.extract_text(x_tolerance=3, y_tolerance=4)
|
||||
if t:
|
||||
parts.append(t)
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f" {i + 1}/{len(pdf.pages)} pages...")
|
||||
return normalize_text("\n\n".join(parts))
|
||||
|
||||
|
||||
def extract_html(path: Path) -> str:
|
||||
print(f" Extracting HTML: {path.name}")
|
||||
html = path.read_text(encoding="utf-8", errors="replace")
|
||||
parser = _HTMLToText()
|
||||
parser.feed(html)
|
||||
return normalize_text(parser.text())
|
||||
|
||||
|
||||
def get_text(doc) -> str:
|
||||
path = SOURCE_DIR / doc["filename"]
|
||||
if not path.exists():
|
||||
raise FileNotFoundError(path)
|
||||
if path.suffix.lower() == ".pdf":
|
||||
text = extract_pdf(path)
|
||||
elif path.suffix.lower() in {".html", ".htm"}:
|
||||
text = extract_html(path)
|
||||
else:
|
||||
raise ValueError(f"Unsupported file type: {path.suffix}")
|
||||
print(f" Extracted {len(text):,} chars")
|
||||
return text
|
||||
|
||||
|
||||
def upload_text_legal(text: str, filename: str, extra_metadata: dict) -> dict:
|
||||
form_data = {
|
||||
"collection": COLLECTION,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "legal",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, text.encode("utf-8"), "text/plain")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def count_chunks(regulation_id: str) -> int:
|
||||
with httpx.Client(timeout=30) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
|
||||
json={
|
||||
"filter": {
|
||||
"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]
|
||||
},
|
||||
"exact": True,
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Extract text and report sizes, but do not upload.")
|
||||
parser.add_argument("--only", action="append", default=[],
|
||||
help="Limit run to one or more regulation_ids.")
|
||||
args = parser.parse_args()
|
||||
|
||||
if not SOURCE_DIR.exists():
|
||||
print(f"ERROR: source dir not found: {SOURCE_DIR}")
|
||||
return 2
|
||||
|
||||
docs = DOCS
|
||||
if args.only:
|
||||
wanted = set(args.only)
|
||||
docs = [d for d in DOCS if d["regulation_id"] in wanted]
|
||||
missing = wanted - {d["regulation_id"] for d in docs}
|
||||
if missing:
|
||||
print(f"ERROR: unknown regulation_id(s): {sorted(missing)}")
|
||||
return 2
|
||||
|
||||
print("=" * 70)
|
||||
print(f"ENISA CRA ingestion → collection={COLLECTION}")
|
||||
print(f"Source dir: {SOURCE_DIR}")
|
||||
print(f"Documents: {len(docs)} Dry run: {args.dry_run}")
|
||||
print("=" * 70)
|
||||
|
||||
results = []
|
||||
for i, doc in enumerate(docs, 1):
|
||||
reg_id = doc["regulation_id"]
|
||||
print(f"\n[{i}/{len(docs)}] {reg_id}")
|
||||
|
||||
existing = count_chunks(reg_id) if not args.dry_run else "?"
|
||||
print(f" Existing chunks in Qdrant: {existing}")
|
||||
|
||||
try:
|
||||
text = get_text(doc)
|
||||
except Exception as e:
|
||||
print(f" ERROR extracting text: {e}")
|
||||
results.append({"id": reg_id, "chars": 0, "new": 0,
|
||||
"strength": doc["extra_metadata"]["requirement_strength"]})
|
||||
continue
|
||||
|
||||
if args.dry_run:
|
||||
results.append({"id": reg_id, "chars": len(text), "new": "?",
|
||||
"strength": doc["extra_metadata"]["requirement_strength"]})
|
||||
continue
|
||||
|
||||
if existing and existing > 0:
|
||||
print(f" SKIP — {existing} chunks already present. "
|
||||
f"Use Qdrant delete-by-filter before re-ingesting.")
|
||||
results.append({"id": reg_id, "chars": len(text), "new": 0,
|
||||
"strength": doc["extra_metadata"]["requirement_strength"]})
|
||||
continue
|
||||
|
||||
print(" Uploading with chunk_strategy='legal'...")
|
||||
result = upload_text_legal(
|
||||
text, doc["upload_filename"], doc["extra_metadata"]
|
||||
)
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
new_doc_id = result.get("document_id", "")
|
||||
print(f" -> {new_chunks} chunks (doc_id={new_doc_id})")
|
||||
|
||||
results.append({"id": reg_id, "chars": len(text), "new": new_chunks,
|
||||
"strength": doc["extra_metadata"]["requirement_strength"]})
|
||||
|
||||
if i < len(docs):
|
||||
time.sleep(2)
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("SUMMARY")
|
||||
print("=" * 70)
|
||||
for r in results:
|
||||
print(f" {r['id']:<55} chars={r['chars']:<9} new={r['new']:<5} "
|
||||
f"strength={r['strength']}")
|
||||
total_new = sum(r["new"] for r in results if isinstance(r["new"], int))
|
||||
print(f"\nTotal new chunks: {total_new}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
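The extraction path of `ingest_enisa_cra.py` (HTML to text, then `normalize_text`) can be tried without pdfplumber or the RAG service. The following is a minimal sketch using only the standard library. It is a trimmed copy of the rules above, not the script itself: `html_to_text` is a hypothetical helper, the sample strings are made up, and placing all four substitutions inside the loop is an assumption, since the diff view does not preserve indentation.

```python
import re
import unicodedata
from html.parser import HTMLParser


def normalize_text(text: str) -> str:
    """Trimmed copy of the script's normalization rules."""
    text = unicodedata.normalize("NFKC", text)  # e.g. ligature "ﬁ" -> "fi"
    prev = None
    while prev != text:  # repeat until a pass changes nothing
        prev = text
        text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)            # "1 . 2" -> "1.2"
        text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)  # "AVA - 5" -> "AVA-5"
        text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)                # "( 4 )" -> "(4)"
        text = re.sub(r"[^\S\n]{2,}", " ", text)                        # collapse spaces, keep newlines
    return text


class HTMLToText(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "noscript"}
    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "section"}

    def __init__(self) -> None:
        super().__init__()
        self._buf: list[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag in self.BLOCK:
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in self.BLOCK:
            self._buf.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def text(self) -> str:
        return re.sub(r"\n{3,}", "\n\n", "".join(self._buf)).strip()


def html_to_text(html: str) -> str:
    """Hypothetical helper: parse the HTML, then normalize the result."""
    parser = HTMLToText()
    parser.feed(html)
    parser.close()
    return normalize_text(parser.text())


sample_pdf = "Section 1 . 2 . 3 of AVA - 5 ( 4 )   applies"
sample_html = (
    "<html><head><style>p{}</style></head><body><nav>Menu</nav>"
    "<h1>FAQ</h1><p>What is the  SRP?</p><script>var a;</script></body></html>"
)
print(normalize_text(sample_pdf))
print(html_to_text(sample_html))
```

The loop matters for chained numbering: a single pass turns `1 . 2 . 3` into `1.2 . 3`, and only the second pass yields `1.2.3`.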
@@ -0,0 +1,201 @@
|
||||
#!/usr/bin/env python3
"""Ingest missing EU regulations and directives from EUR-Lex (HTML).

Downloads the German HTML from EUR-Lex by CELEX number and uploads it with legal chunking.

Usage (on Mac Mini):
    python3 control-pipeline/scripts/ingest_eu_regulations.py --dry-run
    python3 control-pipeline/scripts/ingest_eu_regulations.py
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("ingest-eu")
|
||||
|
||||
RAG_URL = "https://localhost:8097"
|
||||
QDRANT_URL = "http://localhost:6333"
|
||||
COLLECTION = "bp_compliance_ce"
|
||||
|
||||
EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
|
||||
|
||||
# ---- EU Regulations to ingest ----
|
||||
REGULATIONS = [
|
||||
{
|
||||
"celex": "32022L2464",
|
||||
"regulation_id": "csrd_2022",
|
||||
"name": "Corporate Sustainability Reporting Directive (CSRD)",
|
||||
"short": "CSRD",
|
||||
"category": "sustainability",
|
||||
},
|
||||
{
|
||||
"celex": "32024L1760",
|
||||
"regulation_id": "csddd_2024",
|
||||
"name": "Corporate Sustainability Due Diligence Directive (CSDDD)",
|
||||
"short": "CSDDD",
|
||||
"category": "sustainability",
|
||||
},
|
||||
{
|
||||
"celex": "32020R0852",
|
||||
"regulation_id": "eu_taxonomy_2020",
|
||||
"name": "EU-Taxonomie-Verordnung",
|
||||
"short": "EU Taxonomy",
|
||||
"category": "sustainability",
|
||||
},
|
||||
{
|
||||
"celex": "32024R1183",
|
||||
"regulation_id": "eidas_2_0_2024",
|
||||
"name": "eIDAS 2.0 Verordnung (EU Digital Identity)",
|
||||
"short": "eIDAS 2.0",
|
||||
"category": "digital_identity",
|
||||
},
|
||||
{
|
||||
"celex": "32023L0970",
|
||||
"regulation_id": "pay_transparency_2023",
|
||||
"name": "Entgelttransparenz-Richtlinie",
|
||||
"short": "Pay Transparency",
|
||||
"category": "employment",
|
||||
},
|
||||
{
|
||||
"celex": "32022R2065",
|
||||
"regulation_id": "dsa_2022_updated",
|
||||
"name": "Digital Services Act (DSA) — aktualisiert",
|
||||
"short": "DSA",
|
||||
"category": "digital_services",
|
||||
"skip_if_exists": "dsa_2022", # already exists under different ID
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def download_eurlex(celex: str) -> str:
|
||||
"""Download EU regulation HTML from EUR-Lex."""
|
||||
url = EURLEX_URL.format(celex=celex)
|
||||
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
|
||||
resp = c.get(url)
|
||||
resp.raise_for_status()
|
||||
return resp.text
|
||||
|
||||
|
||||
def upload_html(html: str, filename: str, reg: dict, dry_run: bool = False):
|
||||
"""Upload HTML to RAG service."""
|
||||
if dry_run:
|
||||
logger.info(" DRY RUN — would upload %d chars", len(html))
|
||||
return {"chunks_count": 0}
|
||||
|
||||
meta = {
|
||||
"regulation_id": reg["regulation_id"],
|
||||
"regulation_name_de": reg["name"],
|
||||
"regulation_short": reg["short"],
|
||||
"celex": reg["celex"],
|
||||
"category": reg["category"],
|
||||
"source": "EUR-Lex",
|
||||
"license": "EU_law",
|
||||
"jurisdiction": "EU",
|
||||
"source_type": "law",
|
||||
}
|
||||
form_data = {
|
||||
"collection": COLLECTION,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "legal",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(meta, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=600.0, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, html.encode("utf-8"), "text/html")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def count_existing(regulation_id: str) -> int:
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
|
||||
json={"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]}, "exact": True},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("Ingest EU Regulations from EUR-Lex")
|
||||
logger.info(" Regulations: %d", len(REGULATIONS))
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, reg in enumerate(REGULATIONS, 1):
|
||||
logger.info("\n[%d/%d] %s (CELEX: %s)", i, len(REGULATIONS), reg["name"], reg["celex"])
|
||||
|
||||
# Skip if variant already exists
|
||||
skip_id = reg.get("skip_if_exists")
|
||||
if skip_id:
|
||||
existing = count_existing(skip_id)
|
||||
if existing > 0:
|
||||
logger.info(" Already exists as '%s' (%d chunks) — SKIPPING", skip_id, existing)
|
||||
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
|
||||
continue
|
||||
|
||||
# Check if this exact ID exists
|
||||
existing = count_existing(reg["regulation_id"])
|
||||
if existing > 0:
|
||||
logger.info(" Already exists: %d chunks — SKIPPING", existing)
|
||||
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
|
||||
continue
|
||||
|
||||
# Download from EUR-Lex
|
||||
logger.info(" Downloading from EUR-Lex...")
|
||||
try:
|
||||
html = download_eurlex(reg["celex"])
|
||||
logger.info(" Downloaded: %d chars", len(html))
|
||||
except Exception as e:
|
||||
logger.error(" Download FAILED: %s", e)
|
||||
results.append({"reg": reg["short"], "status": "download_failed", "chunks": 0})
|
||||
continue
|
||||
|
||||
# Upload
|
||||
filename = f"{reg['regulation_id']}.html"
|
||||
try:
|
||||
result = upload_html(html, filename, reg, args.dry_run)
|
||||
chunks = result.get("chunks_count", 0)
|
||||
logger.info(" Uploaded: %d chunks", chunks)
|
||||
results.append({"reg": reg["short"], "status": "ok", "chunks": chunks})
|
||||
except Exception as e:
|
||||
logger.error(" Upload FAILED: %s", e)
|
||||
results.append({"reg": reg["short"], "status": "error", "chunks": 0})
|
||||
|
||||
if i < len(REGULATIONS):
|
||||
time.sleep(2)
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
for r in results:
|
||||
logger.info(" %-20s %s chunks=%d", r["reg"], r["status"].upper(), r["chunks"])
|
||||
|
||||
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
|
||||
logger.info("\nTotal new chunks: %d", total_new)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
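The skip-or-download decision in `ingest_eu_regulations.py` can be separated from the network calls, which makes it easy to check. The sketch below is one way to do that, under stated assumptions: `plan_ingestion` and `fake_counts` are hypothetical names, the chunk count is invented for illustration, and the counter is injected in place of the Qdrant-backed `count_existing`.

```python
from typing import Callable

EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"


def plan_ingestion(regulations: list[dict], count_existing: Callable[[str], int]) -> list[dict]:
    """Decide per entry whether to skip or download, without any network I/O."""
    plan = []
    for reg in regulations:
        # The alias ID (skip_if_exists) is checked first, then the entry's own ID.
        candidates = [reg.get("skip_if_exists"), reg["regulation_id"]]
        hit = next((rid for rid in candidates if rid and count_existing(rid) > 0), None)
        if hit:
            plan.append({"short": reg["short"], "action": "skip", "found_as": hit})
        else:
            plan.append({
                "short": reg["short"],
                "action": "download",
                "url": EURLEX_URL.format(celex=reg["celex"]),
            })
    return plan


# Hypothetical state: the DSA is already stored under its older ID.
fake_counts = {"dsa_2022": 412}
regs = [
    {"celex": "32022L2464", "regulation_id": "csrd_2022", "short": "CSRD"},
    {"celex": "32022R2065", "regulation_id": "dsa_2022_updated", "short": "DSA",
     "skip_if_exists": "dsa_2022"},
]
plan = plan_ingestion(regs, lambda rid: fake_counts.get(rid, 0))
for step in plan:
    print(step)
```

Injecting the counter keeps the decision logic testable on a machine that cannot reach Qdrant or EUR-Lex.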
@@ -0,0 +1,303 @@
|
||||
#!/usr/bin/env python3
"""
E2E Quality Report: Verify controls have correct source citations.

Loads N random controls from PostgreSQL, cross-references with Qdrant chunks,
and reports mismatches between source_citation and actual chunk metadata.

Usage:
    # Against Mac Mini
    python3 scripts/quality_report.py --db-host macmini --qdrant-url http://macmini:6333

    # Smaller sample
    python3 scripts/quality_report.py --db-host macmini --sample 100
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import create_engine, text
|
||||
from sqlalchemy.orm import sessionmaker
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger("quality-report")
|
||||
|
||||
COLLECTIONS = [
|
||||
"bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
|
||||
"bp_dsfa_corpus", "bp_legal_templates",
|
||||
]
|
||||
|
||||
|
||||
def load_controls(db_url: str, sample_size: int) -> list[dict]:
|
||||
"""Load random controls with source_citation from PostgreSQL."""
|
||||
engine = create_engine(db_url)
|
||||
Session = sessionmaker(bind=engine)
|
||||
|
||||
with Session() as db:
|
||||
rows = db.execute(text("""
|
||||
SELECT id::text, control_id, title,
|
||||
source_citation::text, source_original_text,
|
||||
generation_metadata::text, release_state
|
||||
FROM compliance.canonical_controls
|
||||
WHERE source_citation IS NOT NULL
|
||||
AND source_original_text IS NOT NULL
|
||||
AND release_state = 'draft'
|
||||
ORDER BY RANDOM()
|
||||
LIMIT :n
|
||||
"""), {"n": sample_size}).fetchall()
|
||||
|
||||
controls = []
|
||||
for row in rows:
|
||||
citation = json.loads(row[3]) if row[3] else {}
|
||||
metadata = json.loads(row[5]) if row[5] else {}
|
||||
controls.append({
|
||||
"id": row[0],
|
||||
"control_id": row[1],
|
||||
"title": row[2],
|
||||
"citation": citation,
|
||||
"source_text": row[4],
|
||||
"metadata": metadata,
|
||||
"release_state": row[6],
|
||||
})
|
||||
return controls
|
||||
|
||||
|
||||
def build_qdrant_index(qdrant_url: str) -> dict:
    """Build regulation_id → list[chunk] index from Qdrant.

    Controls were generated from OLD chunks (512 chars). Qdrant now holds
    NEW chunks (1500 chars), so hash matching won't work; match on
    regulation + section instead.
    """
|
||||
logger.info("Building Qdrant chunk index by regulation_id...")
|
||||
index = {} # regulation_id → [{"section": ..., "text_snippet": ..., ...}]
|
||||
client = httpx.Client(timeout=60.0)
|
||||
|
||||
for coll in COLLECTIONS:
|
||||
offset = None
|
||||
for _ in range(600):
|
||||
body = {"limit": 250, "with_payload": True, "with_vector": False}
|
||||
if offset:
|
||||
body["offset"] = offset
|
||||
r = client.post(f"{qdrant_url}/collections/{coll}/points/scroll", json=body)
|
||||
if r.status_code != 200:
|
||||
break
|
||||
data = r.json()["result"]
|
||||
for pt in data["points"]:
|
||||
reg_id = pt["payload"].get("regulation_id", "")
|
||||
if not reg_id:
|
||||
continue
|
||||
chunk = {
|
||||
"section": pt["payload"].get("section", ""),
|
||||
"section_title": pt["payload"].get("section_title", ""),
|
||||
"paragraph": pt["payload"].get("paragraph", ""),
|
||||
"text_snippet": pt["payload"].get("chunk_text", "")[:200],
|
||||
"filename": pt["payload"].get("filename", ""),
|
||||
"collection": coll,
|
||||
}
|
||||
index.setdefault(reg_id, []).append(chunk)
|
||||
offset = data.get("next_page_offset")
|
||||
if not offset:
|
||||
break
|
||||
|
||||
client.close()
|
||||
total = sum(len(v) for v in index.values())
|
||||
logger.info("Qdrant index: %d regulations, %d chunks", len(index), total)
|
||||
return index
|
||||
|
||||
|
||||
def check_control(ctrl: dict, qdrant_index: dict) -> dict:
    """Check a single control's source_citation against Qdrant chunks.

    Strategy: Find chunks by regulation_id from generation_metadata,
    then check if any chunk has a matching section/article.
    """
|
||||
result = {
|
||||
"control_id": ctrl["control_id"],
|
||||
"title": (ctrl["title"] or "")[:60],
|
||||
"citation_source": ctrl["citation"].get("source", ""),
|
||||
"citation_article": ctrl["citation"].get("article", ""),
|
||||
"citation_paragraph": ctrl["citation"].get("paragraph", ""),
|
||||
"citation_page": ctrl["citation"].get("page"),
|
||||
"issues": [],
|
||||
}
|
||||
|
||||
# Get regulation_id from generation_metadata
|
||||
reg_code = ctrl["metadata"].get("source_regulation", "")
|
||||
citation_article = ctrl["citation"].get("article", "")
|
||||
|
||||
# Check 1: Does the control have a regulation reference?
|
||||
if not reg_code:
|
||||
result["issues"].append("NO_REGULATION_CODE")
|
||||
return result
|
||||
|
||||
# Check 2: Does this regulation exist in Qdrant?
|
||||
chunks = qdrant_index.get(reg_code, [])
|
||||
if not chunks:
|
||||
result["issues"].append(f"REGULATION_NOT_IN_QDRANT: {reg_code}")
|
||||
result["reg_found"] = False
|
||||
return result
|
||||
|
||||
result["reg_found"] = True
|
||||
result["reg_chunks"] = len(chunks)
|
||||
|
||||
# Check 3: Does the control have an article citation?
|
||||
if not citation_article:
|
||||
result["issues"].append("NO_ARTICLE_IN_CITATION")
|
||||
# Still check if chunks have section metadata at all
|
||||
has_section = any(c["section"] for c in chunks)
|
||||
if has_section:
|
||||
result["issues"].append("CHUNKS_HAVE_SECTIONS_BUT_CONTROL_MISSING")
|
||||
return result
|
||||
|
||||
# Check 4: Is the cited article found in any chunk's section?
|
||||
norm_article = citation_article.strip().lower()
|
||||
matching_chunks = [
|
||||
c for c in chunks
|
||||
if c["section"] and (
|
||||
norm_article == c["section"].strip().lower()
|
||||
or norm_article in c["section"].strip().lower()
|
||||
or c["section"].strip().lower() in norm_article
|
||||
)
|
||||
]
|
||||
|
||||
if matching_chunks:
|
||||
result["article_match"] = True
|
||||
result["matched_section"] = matching_chunks[0]["section"]
|
||||
else:
|
||||
# Check if ANY chunk has sections (the article might just not match)
|
||||
sections_in_regulation = sorted(set(c["section"] for c in chunks if c["section"]))
|
||||
if sections_in_regulation:
|
||||
result["issues"].append(
|
||||
f"ARTICLE_NOT_FOUND_IN_CHUNKS: '{citation_article}' not in {sections_in_regulation[:5]}"
|
||||
)
|
||||
else:
|
||||
result["issues"].append("NO_SECTIONS_IN_REGULATION_CHUNKS")
|
||||
|
||||
    # Check 5: Does source_original_text contain the cited article?
    source_text = ctrl["source_text"] or ""
    if citation_article and source_text:
        # The case-insensitive substring test also covers bracketed forms like "[Art. 5".
        if citation_article.lower() not in source_text.lower():
            result["issues"].append("ARTICLE_NOT_IN_SOURCE_TEXT")
|
||||
|
||||
if not result["issues"]:
|
||||
result["issues"] = ["OK"]
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def generate_report(results: list[dict]):
|
||||
"""Print the quality report."""
|
||||
total = len(results)
|
||||
ok = sum(1 for r in results if r["issues"] == ["OK"])
|
||||
chunk_found = sum(1 for r in results if r.get("chunk_found", False))
|
||||
no_chunk = sum(1 for r in results if "CHUNK_NOT_FOUND" in r["issues"])
|
||||
no_article = sum(1 for r in results if "NO_ARTICLE_IN_CITATION" in r["issues"])
|
||||
no_section = sum(1 for r in results if "NO_SECTION_IN_CHUNK" in r["issues"])
|
||||
mismatch = sum(1 for r in results if any("MISMATCH" in i for i in r["issues"]))
|
||||
not_in_text = sum(1 for r in results if "ARTICLE_NOT_IN_SOURCE_TEXT" in r["issues"])
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("QUALITAETSREPORT: CONTROL SOURCE CITATION VERIFICATION")
|
||||
print("=" * 100)
|
||||
|
||||
print(f"\nStichprobe: {total} Controls")
|
||||
print(f"\n{'Metrik':<45} {'Anzahl':>8} {'Anteil':>8}")
|
||||
print("-" * 65)
|
||||
print(f"{'OK (keine Probleme)':<45} {ok:>8} {ok*100//max(total,1):>7}%")
|
||||
print(f"{'Chunk in Qdrant gefunden':<45} {chunk_found:>8} {chunk_found*100//max(total,1):>7}%")
|
||||
print(f"{'Chunk NICHT gefunden':<45} {no_chunk:>8} {no_chunk*100//max(total,1):>7}%")
|
||||
print(f"{'Kein article in source_citation':<45} {no_article:>8} {no_article*100//max(total,1):>7}%")
|
||||
print(f"{'Kein section im Qdrant-Chunk':<45} {no_section:>8} {no_section*100//max(total,1):>7}%")
|
||||
print(f"{'Article/Section Mismatch':<45} {mismatch:>8} {mismatch*100//max(total,1):>7}%")
|
||||
print(f"{'Article nicht im Source-Text':<45} {not_in_text:>8} {not_in_text*100//max(total,1):>7}%")
|
||||
|
||||
# Show sample mismatches
|
||||
mismatches = [r for r in results if any("MISMATCH" in i for i in r["issues"])]
|
||||
if mismatches:
|
||||
print("\n=== MISMATCHES (erste 10) ===\n")
|
||||
for r in mismatches[:10]:
|
||||
issues = [i for i in r["issues"] if "MISMATCH" in i]
|
||||
print(f" {r['control_id']:20s} {r['title'][:40]:40s}")
|
||||
for i in issues:
|
||||
print(f" → {i}")
|
||||
|
||||
# Show sample NOT_FOUND
|
||||
not_found = [r for r in results if "CHUNK_NOT_FOUND" in r["issues"]]
|
||||
if not_found:
|
||||
print("\n=== CHUNK NOT FOUND (erste 10) ===\n")
|
||||
for r in not_found[:10]:
|
||||
src = r.get("citation_source", "?")
|
||||
art = r.get("citation_article", "?")
|
||||
print(f" {r['control_id']:20s} {src[:25]:25s} {art}")
|
||||
|
||||
# Distribution by source
|
||||
print("\n=== NACH QUELLE ===\n")
|
||||
source_stats = {}
|
||||
for r in results:
|
||||
src = r.get("citation_source", "?")[:30]
|
||||
if src not in source_stats:
|
||||
source_stats[src] = {"total": 0, "ok": 0, "no_chunk": 0, "no_section": 0}
|
||||
source_stats[src]["total"] += 1
|
||||
if r["issues"] == ["OK"]:
|
||||
source_stats[src]["ok"] += 1
|
||||
if "CHUNK_NOT_FOUND" in r["issues"]:
|
||||
source_stats[src]["no_chunk"] += 1
|
||||
if "NO_SECTION_IN_CHUNK" in r["issues"]:
|
||||
source_stats[src]["no_section"] += 1
|
||||
|
||||
print(f" {'Quelle':<32} {'Total':>6} {'OK':>6} {'OK%':>6} {'NoChunk':>8} {'NoSect':>8}")
|
||||
print(f" {'-'*72}")
|
||||
for src in sorted(source_stats.keys(), key=lambda s: -source_stats[s]["total"]):
|
||||
s = source_stats[src]
|
||||
pct = s["ok"] * 100 // max(s["total"], 1)
|
||||
print(f" {src:<32} {s['total']:>6} {s['ok']:>6} {pct:>5}% {s['no_chunk']:>8} {s['no_section']:>8}")
|
||||
|
||||
print(f"\n{'='*100}")
|
||||
verdict = "PASS" if ok * 100 // max(total, 1) >= 50 else "NEEDS IMPROVEMENT"
|
||||
print(f"RESULT: {verdict} - {ok}/{total} controls ({ok*100//max(total,1)}%) fully correct")
|
||||
print(f"{'='*100}")
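Review note: the per-source aggregation above can be expressed as a small pure function, which makes it testable without a database or Qdrant. The sketch below is illustrative only; `summarize_by_source` and `ok_percent` are hypothetical names, and the `or "?"` fallback is an assumption that `citation_source` may be present but `None`.

```python
from collections import defaultdict


def summarize_by_source(results: list[dict]) -> dict[str, dict[str, int]]:
    """Aggregate per-source counters from check results (hypothetical helper)."""
    stats: dict[str, dict[str, int]] = defaultdict(
        lambda: {"total": 0, "ok": 0, "no_chunk": 0, "no_section": 0}
    )
    for r in results:
        # Fall back to "?" when the key is missing or explicitly None.
        src = (r.get("citation_source") or "?")[:30]
        issues = r.get("issues", [])
        stats[src]["total"] += 1
        if issues == ["OK"]:
            stats[src]["ok"] += 1
        if "CHUNK_NOT_FOUND" in issues:
            stats[src]["no_chunk"] += 1
        if "NO_SECTION_IN_CHUNK" in issues:
            stats[src]["no_section"] += 1
    return dict(stats)


def ok_percent(ok: int, total: int) -> int:
    """Integer percentage, guarded against division by zero."""
    return ok * 100 // max(total, 1)


sample = [
    {"citation_source": "DSGVO", "issues": ["OK"]},
    {"citation_source": "DSGVO", "issues": ["CHUNK_NOT_FOUND"]},
    {"citation_source": None, "issues": ["NO_SECTION_IN_CHUNK"]},
]
summary = summarize_by_source(sample)
```

Keeping the counting separate from the `print` calls would also let the PASS threshold be checked in a unit test.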
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Control Source Citation Quality Report")
|
||||
parser.add_argument("--db-host", default="macmini")
|
||||
parser.add_argument("--db-port", type=int, default=5432)
|
||||
parser.add_argument("--db-name", default="breakpilot_db")
|
||||
parser.add_argument("--db-user", default="breakpilot")
|
||||
parser.add_argument("--db-pass", default="breakpilot123")
|
||||
parser.add_argument("--qdrant-url", default="http://macmini:6333")
|
||||
parser.add_argument("--sample", type=int, default=500)
|
||||
args = parser.parse_args()
|
||||
|
||||
db_url = f"postgresql://{args.db_user}:{args.db_pass}@{args.db_host}:{args.db_port}/{args.db_name}"
|
||||
|
||||
# Load controls
|
||||
logger.info("Loading %d random controls from DB...", args.sample)
|
||||
controls = load_controls(db_url, args.sample)
|
||||
logger.info("Loaded %d controls with source_citation", len(controls))
|
||||
|
||||
if not controls:
|
||||
print("ERROR: No controls found with source_citation")
|
||||
sys.exit(1)
|
||||
|
||||
# Build Qdrant index
|
||||
qdrant_index = build_qdrant_index(args.qdrant_url)
|
||||
|
||||
# Check each control
|
||||
logger.info("Checking %d controls against Qdrant...", len(controls))
|
||||
results = []
|
||||
for ctrl in controls:
|
||||
result = check_control(ctrl, qdrant_index)
|
||||
results.append(result)
|
||||
|
||||
# Report
|
||||
generate_report(results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,486 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
D5 Re-Ingestion: Re-chunk all ~297 legal sources with structural metadata.

Usage:
    # Dry-run: build manifest, no changes
    python3 scripts/reingest_d5.py --dry-run

    # Re-ingest one collection (test)
    python3 scripts/reingest_d5.py --collection bp_compliance_gesetze

    # Re-ingest all collections (resume-capable)
    python3 scripts/reingest_d5.py --resume

    # Custom URLs
    python3 scripts/reingest_d5.py --rag-url https://macmini:8097 --qdrant-url http://macmini:6333
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
|
||||
import httpx
|
||||
|
||||
from reingest_d5_config import (
|
||||
CHUNK_OVERLAP,
|
||||
CHUNK_SIZE,
|
||||
CHUNK_STRATEGY,
|
||||
DEFAULT_QDRANT_URL,
|
||||
DEFAULT_RAG_URL,
|
||||
MANIFEST_FILE,
|
||||
TARGET_COLLECTIONS,
|
||||
content_type_from_filename,
|
||||
doc_key,
|
||||
extract_doc_metadata,
|
||||
load_progress,
|
||||
save_progress,
|
||||
)
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger("d5-reingest")
|
||||
|
||||
UPLOAD_TIMEOUT = httpx.Timeout(timeout=3600.0, connect=30.0)
|
||||
SCROLL_TIMEOUT = httpx.Timeout(timeout=60.0, connect=10.0)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 0: Preflight
|
||||
# ---------------------------------------------------------------------------
|
||||
def preflight_checks(rag_url: str, qdrant_url: str) -> dict:
|
||||
"""Verify services are reachable and record baseline chunk counts."""
|
||||
logger.info("Phase 0: Preflight checks...")
|
||||
|
||||
with httpx.Client(timeout=10.0, verify=False) as c:
|
||||
r = c.get(f"{rag_url}/health")
|
||||
r.raise_for_status()
|
||||
logger.info(" RAG service: OK")
|
||||
|
||||
with httpx.Client(timeout=10.0) as c:
|
||||
r = c.get(f"{qdrant_url}/collections")
|
||||
r.raise_for_status()
|
||||
logger.info(" Qdrant: OK")
|
||||
|
||||
before_counts = {}
|
||||
with httpx.Client(timeout=10.0) as c:
|
||||
for coll in TARGET_COLLECTIONS:
|
||||
try:
|
||||
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
|
||||
json={"exact": True})
|
||||
r.raise_for_status()
|
||||
count = r.json()["result"]["count"]
|
||||
except Exception:
|
||||
count = 0
|
||||
before_counts[coll] = count
|
||||
logger.info(" %s: %d chunks", coll, count)
|
||||
|
||||
return before_counts
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 1: Build manifest
|
||||
# ---------------------------------------------------------------------------
|
||||
def build_manifest(qdrant_url: str, collections: list[str]) -> list[dict]:
|
||||
"""Scroll Qdrant and build a deduplicated document manifest."""
|
||||
logger.info("Phase 1: Building document manifest...")
|
||||
documents: dict[str, dict] = {} # keyed by doc_key(object_name, collection)
|
||||
|
||||
with httpx.Client(timeout=SCROLL_TIMEOUT) as client:
|
||||
for coll in collections:
|
||||
logger.info(" Scrolling %s...", coll)
|
||||
offset = None
|
||||
points_seen = 0
|
||||
|
||||
            while True:
                body: dict = {
                    "limit": 250,
                    "with_payload": True,
                    "with_vector": False,
                }
                # Compare against None: a next_page_offset of 0 is a valid point ID.
                if offset is not None:
                    body["offset"] = offset

                resp = client.post(
                    f"{qdrant_url}/collections/{coll}/points/scroll",
                    json=body,
                )
                resp.raise_for_status()
                data = resp.json()["result"]
                points = data["points"]

                for pt in points:
                    payload = pt.get("payload", {})
                    obj_name = payload.get("object_name", "")
                    if not obj_name:
                        continue

                    key = doc_key(obj_name, coll)
                    if key not in documents:
                        meta = extract_doc_metadata(payload)
                        documents[key] = {
                            "object_name": obj_name,
                            "collection": coll,
                            "filename": payload.get("filename", obj_name.split("/")[-1]),
                            "form": meta["form"],
                            "extra_metadata": meta["extra"],
                            "old_chunk_count": 0,
                        }
                    documents[key]["old_chunk_count"] += 1

                points_seen += len(points)
                offset = data.get("next_page_offset")
                if offset is None:
                    break
|
||||
|
||||
logger.info(" %d points → %d unique docs",
|
||||
points_seen,
|
||||
sum(1 for d in documents.values() if d["collection"] == coll))
|
||||
|
||||
manifest = list(documents.values())
|
||||
logger.info(" Total: %d unique documents across %d collections",
|
||||
len(manifest), len(collections))
|
||||
return manifest
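Review note: the deduplication step in `build_manifest` can be checked in isolation from the HTTP scrolling. The following is a minimal sketch under that assumption; `build_manifest_from_points` is a hypothetical helper, and it omits the `form` / `extra_metadata` fields to stay self-contained.

```python
def build_manifest_from_points(points: list[dict], collection: str) -> list[dict]:
    """Group chunk payloads by object_name and count chunks per document."""
    documents: dict[str, dict] = {}
    for pt in points:
        payload = pt.get("payload") or {}
        obj_name = payload.get("object_name", "")
        if not obj_name:
            continue  # chunks without an object_name cannot be re-ingested
        key = f"{obj_name}|{collection}"  # same shape as doc_key()
        if key not in documents:
            documents[key] = {
                "object_name": obj_name,
                "collection": collection,
                # Fall back to the last path segment when no filename is stored.
                "filename": payload.get("filename", obj_name.split("/")[-1]),
                "old_chunk_count": 0,
            }
        documents[key]["old_chunk_count"] += 1
    return list(documents.values())


points = [
    {"payload": {"object_name": "compliance/a.pdf"}},
    {"payload": {"object_name": "compliance/a.pdf"}},
    {"payload": {"object_name": "compliance/b.pdf", "filename": "B.pdf"}},
    {"payload": {}},  # skipped: no object_name
]
manifest = build_manifest_from_points(points, "bp_compliance_ce")
```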
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 2: Per-document re-ingestion
|
||||
# ---------------------------------------------------------------------------
|
||||
def download_file(rag_url: str, object_name: str) -> bytes:
|
||||
"""Download file bytes via MinIO presigned URL."""
|
||||
with httpx.Client(timeout=60.0, verify=False) as c:
|
||||
resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
|
||||
resp.raise_for_status()
|
||||
presigned_url = resp.json()["url"]
|
||||
|
||||
with httpx.Client(timeout=120.0, verify=False) as c:
|
||||
resp = c.get(presigned_url)
|
||||
resp.raise_for_status()
|
||||
return resp.content
|
||||
|
||||
|
||||
def delete_old_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
    """Delete all chunks for a document from Qdrant.

    Always returns 0: the Qdrant delete response does not include a count.
    """
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{qdrant_url}/collections/{collection}/points/delete",
            json={
                "filter": {
                    "must": [{
                        "key": "object_name",
                        "match": {"value": object_name},
                    }]
                }
            },
        )
        resp.raise_for_status()
    return 0  # Qdrant delete doesn't return a count
|
||||
|
||||
|
||||
def _delete_old_chunks_safe(
|
||||
qdrant_url: str, collection: str, object_name: str, keep_doc_id: str,
|
||||
) -> None:
|
||||
"""Delete old chunks for a document, keeping chunks with keep_doc_id."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{collection}/points/delete",
|
||||
json={
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "object_name",
|
||||
"match": {"value": object_name},
|
||||
}],
|
||||
"must_not": [{
|
||||
"key": "document_id",
|
||||
"match": {"value": keep_doc_id},
|
||||
}],
|
||||
}
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
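Review note: the `must` / `must_not` filter only protects the new chunks if `keep_doc_id` is non-empty. The sketch below builds the same filter shape and evaluates it against sample payloads with a simplified local model of exact-match conditions. `build_safe_delete_filter` and `matches` are hypothetical helpers, not part of Qdrant or of this change, and the model ignores every filter feature other than exact `match.value`.

```python
def build_safe_delete_filter(object_name: str, keep_doc_id: str) -> dict:
    """Build a Qdrant-style filter that matches only old chunks of one document."""
    if not keep_doc_id:
        # An empty ID would exclude nothing, so the delete would hit new chunks too.
        raise ValueError("keep_doc_id must be a non-empty document_id")
    return {
        "must": [{"key": "object_name", "match": {"value": object_name}}],
        "must_not": [{"key": "document_id", "match": {"value": keep_doc_id}}],
    }


def matches(payload: dict, flt: dict) -> bool:
    """Simplified local model of must / must_not for exact-match conditions."""
    def hit(cond: dict) -> bool:
        return payload.get(cond["key"]) == cond["match"]["value"]

    return all(hit(c) for c in flt.get("must", [])) and not any(
        hit(c) for c in flt.get("must_not", [])
    )


chunks = [
    {"object_name": "a.pdf", "document_id": "old-1"},
    {"object_name": "a.pdf", "document_id": "new-2"},
    {"object_name": "b.pdf", "document_id": "old-9"},
]
flt = build_safe_delete_filter("a.pdf", keep_doc_id="new-2")
to_delete = [c["document_id"] for c in chunks if matches(c, flt)]
```

Only the old chunk of `a.pdf` is selected; the new chunk and the unrelated document are left alone.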
|
||||
|
||||
|
||||
def reupload_document(
|
||||
rag_url: str,
|
||||
file_bytes: bytes,
|
||||
filename: str,
|
||||
collection: str,
|
||||
form_fields: dict,
|
||||
extra_metadata: dict,
|
||||
) -> dict:
|
||||
"""Upload document to RAG service with new chunking parameters."""
|
||||
ct = content_type_from_filename(filename)
|
||||
|
||||
form_data = {
|
||||
"collection": collection,
|
||||
"data_type": form_fields.get("data_type", "compliance"),
|
||||
"bundesland": form_fields.get("bundesland", "bund"),
|
||||
"use_case": form_fields.get("use_case", "compliance"),
|
||||
"year": form_fields.get("year", "2026"),
|
||||
"chunk_strategy": CHUNK_STRATEGY,
|
||||
"chunk_size": str(CHUNK_SIZE),
|
||||
"chunk_overlap": str(CHUNK_OVERLAP),
|
||||
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
|
||||
}
|
||||
|
||||
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{rag_url}/api/v1/documents/upload",
|
||||
files={"file": (filename, file_bytes, ct)},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def process_document(
|
||||
doc: dict,
|
||||
rag_url: str,
|
||||
qdrant_url: str,
|
||||
progress: dict,
|
||||
max_retries: int = 2,
|
||||
) -> bool:
    """Process a single document: download → upload → check chunk count → delete old.

    Safe order: new chunks are created FIRST; old chunks are deleted only after
    the upload reports a non-zero chunk count (upload-before-delete pattern).
    """
|
||||
key = doc_key(doc["object_name"], doc["collection"])
|
||||
|
||||
# Skip if already done
|
||||
if progress.get("documents", {}).get(key, {}).get("status") == "done":
|
||||
return True
|
||||
|
||||
for attempt in range(max_retries + 1):
|
||||
try:
|
||||
# 1. Download
|
||||
file_bytes = download_file(rag_url, doc["object_name"])
|
||||
if not file_bytes:
|
||||
logger.warning(" Empty file: %s — skipping", doc["object_name"])
|
||||
progress.setdefault("documents", {})[key] = {
|
||||
"status": "skipped", "reason": "empty_file"}
|
||||
return False
|
||||
|
||||
# 2. Upload FIRST (creates new chunks alongside old ones)
|
||||
result = reupload_document(
|
||||
rag_url, file_bytes, doc["filename"],
|
||||
doc["collection"], doc["form"], doc["extra_metadata"],
|
||||
)
|
||||
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
new_doc_id = result.get("document_id", "")
|
||||
if new_chunks == 0:
|
||||
logger.error(" Upload produced 0 chunks — keeping old data: %s",
|
||||
doc["object_name"])
|
||||
progress.setdefault("documents", {})[key] = {
|
||||
"status": "error", "error": "0 new chunks"}
|
||||
return False
|
||||
|
||||
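            # 3. Delete OLD chunks only (exclude the new document_id).
            # Without a document_id the must_not filter cannot protect the new
            # chunks, so keep all data and record an error instead.
            if not new_doc_id:
                logger.error(" Upload returned no document_id, keeping old data: %s",
                             doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": "no document_id in upload response"}
                return False
            _delete_old_chunks_safe(
                qdrant_url, doc["collection"],
                doc["object_name"], new_doc_id,
            )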
|
||||
|
||||
# 4. Record success
|
||||
progress.setdefault("documents", {})[key] = {
|
||||
"status": "done",
|
||||
"old_chunks": doc["old_chunk_count"],
|
||||
"new_chunks": new_chunks,
|
||||
"new_document_id": result.get("document_id", ""),
|
||||
"completed_at": datetime.now(timezone.utc).isoformat(),
|
||||
}
|
||||
return True
|
||||
|
||||
except httpx.HTTPStatusError as e:
|
||||
if e.response.status_code == 404:
|
||||
logger.warning(" File not in MinIO (404): %s — skipping", doc["object_name"])
|
||||
progress.setdefault("documents", {})[key] = {
|
||||
"status": "skipped", "reason": "not_in_minio"}
|
||||
return False
|
||||
if attempt < max_retries:
|
||||
wait = 5 * (attempt + 1)
|
||||
logger.warning(" HTTP %d on attempt %d, retrying in %ds...",
|
||||
e.response.status_code, attempt + 1, wait)
|
||||
time.sleep(wait)
|
||||
else:
|
||||
logger.error(" FAILED after %d retries: %s", max_retries, e)
|
||||
progress.setdefault("documents", {})[key] = {
|
||||
"status": "error", "error": str(e), "retries": max_retries}
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
if attempt < max_retries:
|
||||
wait = 10 * (attempt + 1)
|
||||
logger.warning(" Error on attempt %d: %s — retrying in %ds",
|
||||
attempt + 1, e, wait)
|
||||
time.sleep(wait)
|
||||
else:
|
||||
logger.error(" FAILED after %d retries: %s", max_retries, e)
|
||||
progress.setdefault("documents", {})[key] = {
|
||||
"status": "error", "error": str(e), "retries": max_retries}
|
||||
return False
|
||||
|
||||
return False
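Review note: both `except` branches use the same linear backoff shape (`wait = base * (attempt + 1)`), differing only in the base. A minimal sketch of that pattern as a reusable wrapper follows; `retry_with_linear_backoff` is a hypothetical helper, and the injectable `sleep` parameter is an assumption added so the schedule can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_linear_backoff(
    fn: Callable[[], T],
    max_retries: int = 2,
    base_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_wait * (attempt + 1) seconds and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_wait * (attempt + 1))
    raise AssertionError("unreachable")


calls = {"n": 0}
waits: list[float] = []


def flaky() -> str:
    """Fail twice, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = retry_with_linear_backoff(flaky, max_retries=2, base_wait=5.0, sleep=waits.append)
```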
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 3: Verification
|
||||
# ---------------------------------------------------------------------------
|
||||
def verify_results(
|
||||
qdrant_url: str,
|
||||
before_counts: dict,
|
||||
collections: list[str],
|
||||
manifest: list[dict],
|
||||
):
|
||||
"""Compare before/after counts and spot-check metadata."""
|
||||
logger.info("Phase 3: Verification...")
|
||||
|
||||
print("\n" + "=" * 65)
|
||||
print("D5 RE-INGESTION VERIFICATION REPORT")
|
||||
print("=" * 65)
|
||||
|
||||
after_counts = {}
|
||||
with httpx.Client(timeout=10.0) as c:
|
||||
for coll in collections:
|
||||
try:
|
||||
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
|
||||
json={"exact": True})
|
||||
r.raise_for_status()
|
||||
after_counts[coll] = r.json()["result"]["count"]
|
||||
except Exception:
|
||||
after_counts[coll] = -1
|
||||
|
||||
print(f"\n{'Collection':<35} {'Before':>8} {'After':>8} {'Delta':>8}")
|
||||
print("-" * 65)
|
||||
for coll in collections:
|
||||
before = before_counts.get(coll, 0)
|
||||
after = after_counts.get(coll, -1)
|
||||
delta = after - before if after >= 0 else "?"
|
||||
print(f"{coll:<35} {before:>8} {after:>8} {str(delta):>8}")
|
||||
|
||||
# Spot-check: pick 3 random docs and verify metadata
|
||||
print("\nSpot-check (3 random docs):")
|
||||
sample = random.sample(manifest, min(3, len(manifest)))
|
||||
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
for doc in sample:
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{doc['collection']}/points/scroll",
|
||||
json={
|
||||
"limit": 3,
|
||||
"with_payload": True,
|
||||
"with_vector": False,
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "object_name",
|
||||
"match": {"value": doc["object_name"]},
|
||||
}]
|
||||
},
|
||||
},
|
||||
)
|
||||
if resp.status_code != 200:
|
||||
print(f" {doc['object_name']}: QUERY FAILED")
|
||||
continue
|
||||
|
||||
points = resp.json()["result"]["points"]
|
||||
if not points:
|
||||
print(f" {doc['object_name']}: NO CHUNKS FOUND")
|
||||
continue
|
||||
|
||||
has_section = sum(1 for p in points if p["payload"].get("section"))
|
||||
has_para = sum(1 for p in points if p["payload"].get("paragraph"))
|
||||
print(f" {doc['filename'][:40]:<42} "
|
||||
f"chunks={len(points):>3} "
|
||||
f"with_section={has_section}/{len(points)} "
|
||||
f"with_para={has_para}/{len(points)}")
|
||||
|
||||
print()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="D5 Re-Ingestion Script")
|
||||
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
|
||||
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Build manifest only, no changes")
|
||||
parser.add_argument("--collection", default=None,
|
||||
help="Process only this collection")
|
||||
parser.add_argument("--resume", action="store_true",
|
||||
help="Resume from progress file")
|
||||
args = parser.parse_args()
|
||||
|
||||
collections = [args.collection] if args.collection else TARGET_COLLECTIONS
|
||||
|
||||
# Phase 0
|
||||
before_counts = preflight_checks(args.rag_url, args.qdrant_url)
|
||||
|
||||
# Phase 1
|
||||
manifest = build_manifest(args.qdrant_url, collections)
|
||||
|
||||
# Save manifest for inspection
|
||||
with open(MANIFEST_FILE, "w", encoding="utf-8") as f:
|
||||
json.dump(manifest, f, indent=2, ensure_ascii=False)
|
||||
logger.info("Manifest saved to %s", MANIFEST_FILE)
|
||||
|
||||
if args.dry_run:
|
||||
print(f"\nDRY RUN: {len(manifest)} documents found. See {MANIFEST_FILE}")
|
||||
for doc in manifest[:10]:
|
||||
reg = doc["extra_metadata"].get("regulation_code", "?")
|
||||
print(f" {reg:<30} {doc['collection']:<35} chunks={doc['old_chunk_count']}")
|
||||
if len(manifest) > 10:
|
||||
print(f" ... and {len(manifest) - 10} more")
|
||||
sys.exit(0)
|
||||
|
||||
# Phase 2
|
||||
progress = load_progress() if args.resume else {"documents": {}}
|
||||
progress["started_at"] = datetime.now(timezone.utc).isoformat()
|
||||
progress["before_counts"] = before_counts
|
||||
|
||||
done = 0
|
||||
skipped = 0
|
||||
failed = 0
|
||||
|
||||
for i, doc in enumerate(manifest, 1):
|
||||
key = doc_key(doc["object_name"], doc["collection"])
|
||||
reg = doc["extra_metadata"].get("regulation_code", "?")
|
||||
|
||||
if progress.get("documents", {}).get(key, {}).get("status") == "done":
|
||||
done += 1
|
||||
continue
|
||||
|
||||
logger.info("[%d/%d] %s (%s) — %d old chunks",
|
||||
i, len(manifest), reg, doc["collection"], doc["old_chunk_count"])
|
||||
|
||||
ok = process_document(doc, args.rag_url, args.qdrant_url, progress)
|
||||
if ok:
|
||||
done += 1
|
||||
new_chunks = progress["documents"][key].get("new_chunks", "?")
|
||||
logger.info(" OK: %d old → %s new chunks", doc["old_chunk_count"], new_chunks)
|
||||
elif progress["documents"][key].get("status") == "skipped":
|
||||
skipped += 1
|
||||
else:
|
||||
failed += 1
|
||||
|
||||
save_progress(progress)
|
||||
time.sleep(2)
|
||||
|
||||
logger.info("Phase 2 complete: %d done, %d skipped, %d failed", done, skipped, failed)
|
||||
|
||||
# Phase 3
|
||||
verify_results(args.qdrant_url, before_counts, collections, manifest)
|
||||
|
||||
print(f"Summary: {done} done, {skipped} skipped, {failed} failed")
|
||||
if failed:
|
||||
print(f"Re-run with --resume to retry {failed} failed documents")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,92 @@
|
||||
"""D5 Re-Ingestion: Constants, helpers, progress tracking."""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
|
||||
logger = logging.getLogger("d5-reingest")
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Defaults (overridable via CLI args)
|
||||
# ---------------------------------------------------------------------------
|
||||
DEFAULT_RAG_URL = "https://macmini:8097"
|
||||
DEFAULT_QDRANT_URL = "http://macmini:6333"
|
||||
|
||||
TARGET_COLLECTIONS = [
|
||||
"bp_compliance_ce",
|
||||
"bp_compliance_gesetze",
|
||||
"bp_compliance_datenschutz",
|
||||
"bp_dsfa_corpus",
|
||||
"bp_legal_templates",
|
||||
"bp_compliance_schulrecht",
|
||||
]
|
||||
|
||||
# New chunking parameters (D1-D4 validated)
|
||||
CHUNK_STRATEGY = "recursive"
|
||||
CHUNK_SIZE = 1500
|
||||
CHUNK_OVERLAP = 100
|
||||
|
||||
PROGRESS_FILE = "d5_reingest_progress.json"
|
||||
MANIFEST_FILE = "d5_manifest.json"
|
||||
|
||||
# Per-chunk fields (NOT carried as extra metadata during re-upload)
|
||||
PER_CHUNK_FIELDS = frozenset({
|
||||
"chunk_text", "chunk_index", "document_id", "object_name",
|
||||
"filename", "data_type", "bundesland", "use_case", "year",
|
||||
"section", "section_title", "paragraph", "paragraph_num", "page",
|
||||
})
|
||||
|
||||
# Upload form fields that come from the payload (not metadata_json)
|
||||
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Progress tracking
|
||||
# ---------------------------------------------------------------------------
|
||||
def load_progress(path: str = PROGRESS_FILE) -> dict:
|
||||
if os.path.exists(path):
|
||||
with open(path, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
return {"documents": {}}
|
||||
|
||||
|
||||
def save_progress(data: dict, path: str = PROGRESS_FILE):
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
json.dump(data, f, indent=2, ensure_ascii=False, default=str)
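Review note: `save_progress` is called after every document, so an interruption during the write could leave a truncated progress file and break `--resume`. One possible hardening, sketched here under that assumption, is to write to a temporary file and rename it into place. `save_progress_atomic` is a hypothetical name and not part of this change.

```python
import json
import os
import tempfile


def save_progress_atomic(data: dict, path: str) -> None:
    """Write JSON to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False, default=str)
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    progress_path = os.path.join(tmp_dir, "d5_reingest_progress.json")
    save_progress_atomic({"documents": {"a.pdf|coll": {"status": "done"}}}, progress_path)
    with open(progress_path, encoding="utf-8") as f:
        reloaded = json.load(f)
    leftovers = [n for n in os.listdir(tmp_dir) if n.endswith(".tmp")]
```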
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Metadata extraction
|
||||
# ---------------------------------------------------------------------------
|
||||
def extract_doc_metadata(payload: dict) -> dict:
|
||||
"""Split Qdrant payload into form fields + extra metadata.
|
||||
|
||||
Returns: {"form": {data_type, bundesland, ...}, "extra": {regulation_code, ...}}
|
||||
"""
|
||||
form = {}
|
||||
extra = {}
|
||||
    for k, v in payload.items():
        # Check form fields first: they are also listed in PER_CHUNK_FIELDS,
        # so testing that set first would leave `form` permanently empty.
        if k in FORM_FIELDS:
            form[k] = v
        elif k in PER_CHUNK_FIELDS:
            continue
        else:
            extra[k] = v
    return {"form": form, "extra": extra}
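Review note: `FORM_FIELDS` is a subset of `PER_CHUNK_FIELDS`, so the order of the membership checks decides whether `form` is ever populated. The sketch below copies the two sets from this file and tests form fields first; `split_payload` is a hypothetical stand-in for `extract_doc_metadata`, and the sample payload values are invented for illustration.

```python
PER_CHUNK_FIELDS = frozenset({
    "chunk_text", "chunk_index", "document_id", "object_name",
    "filename", "data_type", "bundesland", "use_case", "year",
    "section", "section_title", "paragraph", "paragraph_num", "page",
})
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})


def split_payload(payload: dict) -> dict:
    """Split a chunk payload into upload form fields and extra metadata."""
    form: dict = {}
    extra: dict = {}
    for k, v in payload.items():
        if k in FORM_FIELDS:  # must come before the PER_CHUNK_FIELDS check
            form[k] = v
        elif k in PER_CHUNK_FIELDS:
            continue  # per-chunk values are regenerated on re-upload
        else:
            extra[k] = v
    return {"form": form, "extra": extra}


example = {
    "chunk_text": "...",
    "chunk_index": 3,
    "document_id": "abc",
    "bundesland": "by",
    "year": "2024",
    "regulation_code": "DSGVO",
    "license": "public",
}
result = split_payload(example)
```

With the checks in the opposite order, `bundesland` and `year` would be dropped and the upload would fall back to its defaults.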
|
||||
|
||||
|
||||
def doc_key(object_name: str, collection: str) -> str:
|
||||
"""Unique key for a document in the progress file."""
|
||||
return f"{object_name}|{collection}"
|
||||
|
||||
|
||||
def content_type_from_filename(filename: str) -> str:
|
||||
"""Infer MIME type from file extension."""
|
||||
ext = os.path.splitext(filename)[1].lower()
|
||||
return {
|
||||
".pdf": "application/pdf",
|
||||
".html": "text/html",
|
||||
".htm": "text/html",
|
||||
".md": "text/markdown",
|
||||
".txt": "text/plain",
|
||||
}.get(ext, "application/octet-stream")
|
||||
@@ -0,0 +1,485 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Safe re-ingestion of NIST/BSI/ENISA PDFs from MinIO.
|
||||
|
||||
Uses upload-before-delete pattern: new chunks are created FIRST,
|
||||
old chunks are only deleted after successful verification.
|
||||
|
||||
Usage:
|
||||
python3 control-pipeline/scripts/reingest_nist.py [--dry-run]
|
||||
python3 control-pipeline/scripts/reingest_nist.py --only-missing
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
sys.path.insert(0, "control-pipeline/scripts")
|
||||
from reingest_d5_config import ( # noqa: E402
|
||||
CHUNK_OVERLAP,
|
||||
CHUNK_SIZE,
|
||||
CHUNK_STRATEGY,
|
||||
DEFAULT_QDRANT_URL,
|
||||
DEFAULT_RAG_URL,
|
||||
content_type_from_filename,
|
||||
)
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
||||
)
|
||||
logger = logging.getLogger("reingest-nist")
|
||||
|
||||
UPLOAD_TIMEOUT = 1800.0 # 30 min for large PDFs
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Documents to re-ingest
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
# 4 documents with 0 chunks (deleted by D5, upload failed)
|
||||
MISSING_DOCS = [
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "NIST_SP_800_53r5.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_53r5",
|
||||
"source_id": "nist",
|
||||
"doc_type": "controls_catalog",
|
||||
"guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
|
||||
"license": "public_domain_us_gov",
|
||||
"attribution": "NIST",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "nist_sp_800_82r3.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_82r3",
|
||||
"regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
|
||||
"regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
|
||||
"regulation_short": "NIST SP 800-82",
|
||||
"category": "ot_security",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "nist_sp_800_160v1r1.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_160v1r1",
|
||||
"regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
|
||||
"regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
|
||||
"regulation_short": "NIST SP 800-160",
|
||||
"category": "security_engineering",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "NIST_SP_800_207.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_207",
|
||||
"source_id": "nist",
|
||||
"doc_type": "architecture",
|
||||
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
|
||||
"license": "public_domain_us_gov",
|
||||
"attribution": "NIST",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
# Additional NIST/BSI/ENISA docs with <10% section rate (re-ingest for quality)
|
||||
LOW_QUALITY_DOCS = [
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "nist_csf_2_0.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_csf_2_0",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nistir_8259a.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "nistir_8259a.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nistir_8259a",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "nist_ai_rmf.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_ai_rmf",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "nist_sp_800_30r1.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_30r1",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "enisa_supply_chain_good_practices.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_supply_chain_good_practices",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "enisa_ics_scada.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_ics_scada_dependencies",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "enisa_supply_chain_security.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_threat_landscape_supply_chain",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "cisa_secure_by_design.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "cisa_secure_by_design",
|
||||
"license": "public_domain_us",
|
||||
"source": "cisa.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "cvss_v4_0.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "cvss_v4_0",
|
||||
"license": "public_domain_us",
|
||||
"source": "first.org",
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Qdrant helpers
|
||||
# -------------------------------------------------------------------
|
||||
def count_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
|
||||
"""Count existing chunks for a document in Qdrant."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{collection}/points/count",
|
||||
json={
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "object_name",
|
||||
"match": {"value": object_name},
|
||||
}]
|
||||
},
|
||||
"exact": True,
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def get_old_document_ids(
|
||||
qdrant_url: str, collection: str, object_name: str,
|
||||
) -> set:
|
||||
"""Get all document_ids for existing chunks of this document."""
|
||||
doc_ids = set()
|
||||
offset = None
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "object_name",
|
||||
"match": {"value": object_name},
|
||||
}]
|
||||
},
|
||||
"limit": 100,
|
||||
"with_payload": ["document_id"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
did = pt.get("payload", {}).get("document_id")
|
||||
if did:
|
||||
doc_ids.add(did)
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return doc_ids
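Review note: the scroll loop here and the one in `check_section_rate` share the same pagination skeleton. A minimal sketch of that skeleton as a generator is shown below, with the page fetch injected so it runs without a Qdrant instance. `scroll_all` and the `PAGES` fixture are hypothetical; the point is that the `is None` test keeps paging when `next_page_offset` is `0`.

```python
from typing import Callable, Iterator, Optional


def scroll_all(fetch_page: Callable[[Optional[object]], dict]) -> Iterator[dict]:
    """Yield every point, following next_page_offset until it is None."""
    offset = None
    while True:
        page = fetch_page(offset)
        yield from page["points"]
        offset = page.get("next_page_offset")
        if offset is None:  # 0 is a valid offset, so do not test truthiness
            break


# Fake three-page result keyed by the offset that requests each page.
PAGES = {
    None: {"points": [{"id": 7}], "next_page_offset": 0},
    0: {"points": [{"id": 0}], "next_page_offset": 5},
    5: {"points": [{"id": 5}], "next_page_offset": None},
}
ids = [pt["id"] for pt in scroll_all(PAGES.__getitem__)]
```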
|
||||
|
||||
|
||||
def delete_by_document_ids(
    qdrant_url: str, collection: str, doc_ids: set,
) -> None:
    """Delete chunks matching specific document_ids."""
    # Reuse one client for all deletes instead of opening one per document_id.
    with httpx.Client(timeout=30.0) as c:
        for did in doc_ids:
            c.post(
                f"{qdrant_url}/collections/{collection}/points/delete",
                json={
                    "filter": {
                        "must": [{
                            "key": "document_id",
                            "match": {"value": did},
                        }]
                    }
                },
            ).raise_for_status()
|
||||
|
||||
|
||||
def check_section_rate(
|
||||
qdrant_url: str, collection: str, object_name: str,
|
||||
) -> tuple:
|
||||
"""Check section rate for a document's chunks. Returns (total, with_section)."""
|
||||
total = 0
|
||||
with_section = 0
|
||||
offset = None
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "object_name",
|
||||
"match": {"value": object_name},
|
||||
}]
|
||||
},
|
||||
"limit": 100,
|
||||
"with_payload": ["section"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
total += 1
|
||||
sec = pt.get("payload", {}).get("section", "")
|
||||
if sec and sec.strip():
|
||||
with_section += 1
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return total, with_section
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Upload
|
||||
# -------------------------------------------------------------------
|
||||
def download_from_minio(rag_url: str, object_name: str) -> bytes:
|
||||
"""Download file from MinIO via RAG service presigned URL."""
|
||||
with httpx.Client(timeout=60.0, verify=False) as c:
|
||||
resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
|
||||
resp.raise_for_status()
|
||||
presigned_url = resp.json()["url"]
|
||||
|
||||
with httpx.Client(timeout=300.0, verify=False) as c:
|
||||
resp = c.get(presigned_url)
|
||||
resp.raise_for_status()
|
||||
return resp.content
|
||||
|
||||
|
||||
def upload_document(
|
||||
rag_url: str,
|
||||
file_bytes: bytes,
|
||||
filename: str,
|
||||
collection: str,
|
||||
extra_metadata: dict,
|
||||
) -> dict:
|
||||
"""Upload document to RAG service."""
|
||||
ct = content_type_from_filename(filename)
|
||||
form_data = {
|
||||
"collection": collection,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": CHUNK_STRATEGY,
|
||||
"chunk_size": str(CHUNK_SIZE),
|
||||
"chunk_overlap": str(CHUNK_OVERLAP),
|
||||
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{rag_url}/api/v1/documents/upload",
|
||||
files={"file": (filename, file_bytes, ct)},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Main processing
|
||||
# -------------------------------------------------------------------
|
||||
def process_document(
|
||||
doc: dict,
|
||||
rag_url: str,
|
||||
qdrant_url: str,
|
||||
dry_run: bool = False,
|
||||
) -> dict:
|
||||
"""Safe re-ingest: upload first, then delete old. Returns result dict."""
|
||||
obj = doc["object_name"]
|
||||
coll = doc["collection"]
|
||||
fname = doc["filename"]
|
||||
|
||||
# 1. Check existing state
|
||||
old_count = count_chunks(qdrant_url, coll, obj)
|
||||
old_doc_ids = get_old_document_ids(qdrant_url, coll, obj) if old_count > 0 else set()
|
||||
logger.info(" [%s] existing: %d chunks, %d document_ids",
|
||||
fname, old_count, len(old_doc_ids))
|
||||
|
||||
if dry_run:
|
||||
logger.info(" [%s] DRY RUN — would download + upload + delete old", fname)
|
||||
return {"status": "dry_run", "old_chunks": old_count}
|
||||
|
||||
# 2. Download from MinIO
|
||||
logger.info(" [%s] downloading from MinIO...", fname)
|
||||
file_bytes = download_from_minio(rag_url, obj)
|
||||
size_mb = len(file_bytes) / (1024 * 1024)
|
||||
logger.info(" [%s] downloaded %.1f MB", fname, size_mb)
|
||||
|
||||
# 3. Upload FIRST (creates new chunks)
|
||||
logger.info(" [%s] uploading to RAG service...", fname)
|
||||
result = upload_document(rag_url, file_bytes, fname, coll, doc["extra_metadata"])
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
new_doc_id = result.get("document_id", "")
|
||||
logger.info(" [%s] uploaded: %d new chunks (doc_id=%s)", fname, new_chunks, new_doc_id)
|
||||
|
||||
# 4. Verify new chunks exist
|
||||
if new_chunks == 0:
|
||||
logger.error(" [%s] UPLOAD PRODUCED 0 CHUNKS — keeping old data!", fname)
|
||||
return {"status": "error", "error": "0 new chunks", "old_chunks": old_count}
|
||||
|
||||
# 5. Delete old chunks (only if there were any)
|
||||
if old_doc_ids:
|
||||
logger.info(" [%s] deleting %d old document_ids...", fname, len(old_doc_ids))
|
||||
delete_by_document_ids(qdrant_url, coll, old_doc_ids)
|
||||
logger.info(" [%s] old chunks deleted", fname)
|
||||
|
||||
# 6. Check section rate
|
||||
total, with_sec = check_section_rate(qdrant_url, coll, obj)
|
||||
pct = (with_sec / total * 100) if total > 0 else 0
|
||||
logger.info(" [%s] section rate: %d/%d (%.0f%%)", fname, with_sec, total, pct)
|
||||
|
||||
return {
|
||||
"status": "ok",
|
||||
"old_chunks": old_count,
|
||||
"new_chunks": new_chunks,
|
||||
"new_document_id": new_doc_id,
|
||||
"section_rate": round(pct, 1),
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Safe NIST/BSI/ENISA re-ingestion")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
|
||||
parser.add_argument("--only-missing", action="store_true",
|
||||
help="Only re-ingest the 4 missing docs (skip low-quality)")
|
||||
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
|
||||
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
|
||||
args = parser.parse_args()
|
||||
|
||||
docs = list(MISSING_DOCS)
|
||||
if not args.only_missing:
|
||||
docs.extend(LOW_QUALITY_DOCS)
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("NIST/BSI/ENISA Safe Re-Ingestion")
|
||||
logger.info(" Documents: %d (%d missing + %d low-quality)",
|
||||
len(docs), len(MISSING_DOCS),
|
||||
0 if args.only_missing else len(LOW_QUALITY_DOCS))
|
||||
logger.info(" RAG: %s", args.rag_url)
|
||||
logger.info(" Qdrant: %s", args.qdrant_url)
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info("=" * 60)
|
||||
|
||||
results = {}
|
||||
ok = 0
|
||||
errors = 0
|
||||
|
||||
for i, doc in enumerate(docs, 1):
|
||||
logger.info("[%d/%d] %s → %s", i, len(docs), doc["filename"], doc["collection"])
|
||||
try:
|
||||
r = process_document(doc, args.rag_url, args.qdrant_url, args.dry_run)
|
||||
results[doc["filename"]] = r
|
||||
if r["status"] == "ok":
|
||||
ok += 1
|
||||
elif r["status"] == "error":
|
||||
errors += 1
|
||||
except Exception as e:
|
||||
logger.error(" FAILED: %s", e)
|
||||
results[doc["filename"]] = {"status": "error", "error": str(e)}
|
||||
errors += 1
|
||||
|
||||
if i < len(docs):
|
||||
time.sleep(2)
|
||||
|
||||
# Summary
|
||||
logger.info("")
|
||||
logger.info("=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
for fname, r in results.items():
|
||||
status = r["status"].upper()
|
||||
old = r.get("old_chunks", "?")
|
||||
new = r.get("new_chunks", "?")
|
||||
sec = r.get("section_rate", "?")
|
||||
logger.info(" %-40s %s old=%s new=%s sect=%.0f%%",
|
||||
fname, status, old, new, sec if isinstance(sec, float) else 0)
|
||||
|
||||
logger.info("")
|
||||
logger.info("OK: %d, Errors: %d, Total: %d", ok, errors, len(docs))
|
||||
|
||||
if errors > 0:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
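Review note: `get_old_document_ids` and `check_section_rate` in this script repeat the same `next_page_offset` loop. Below is a minimal sketch of how that loop could be factored out. It assumes a hypothetical `fetch_page` callable that posts the body to the collection's `/points/scroll` endpoint and returns the `result` object. The names `iter_scroll_points` and `section_rate` are illustrative and not part of this commit.

```python
from typing import Callable, Iterable, Iterator, Tuple


def iter_scroll_points(
    fetch_page: Callable[[dict], dict],
    scroll_filter: dict,
    payload_fields: list,
    limit: int = 100,
) -> Iterator[dict]:
    """Yield every point of a scroll query, following next_page_offset."""
    offset = None
    while True:
        body = {
            "filter": scroll_filter,
            "limit": limit,
            "with_payload": payload_fields,
        }
        # Compare against None so that a legitimate offset of 0 is still sent.
        if offset is not None:
            body["offset"] = offset
        result = fetch_page(body)  # hypothetical: wraps the HTTP POST
        yield from result["points"]
        offset = result.get("next_page_offset")
        if offset is None:
            break


def section_rate(points: Iterable[dict]) -> Tuple[int, int]:
    """Return (total, with_section) for an iterable of scrolled points."""
    total = 0
    with_section = 0
    for pt in points:
        total += 1
        sec = (pt.get("payload") or {}).get("section") or ""
        if sec.strip():
            with_section += 1
    return total, with_section
```

Because the HTTP call is injected, the pagination logic can be exercised with canned pages and no running Qdrant instance.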
@@ -0,0 +1,213 @@
|
||||
#!/usr/bin/env python3
"""
Replace EU regulation PDFs with clean HTML from EUR-Lex.

Downloads HTML versions of EU regulations (using CELEX numbers),
deletes old PDF chunks from Qdrant, uploads HTML via RAG service.

Usage:
    python3 scripts/replace_eu_pdfs_with_html.py --dry-run
    python3 scripts/replace_eu_pdfs_with_html.py
    python3 scripts/replace_eu_pdfs_with_html.py --celex 32016R0679  # single doc
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger("eurlex-replace")
|
||||
|
||||
DEFAULT_RAG_URL = "https://macmini:8097"
|
||||
DEFAULT_QDRANT_URL = "http://macmini:6333"
|
||||
|
||||
EURLEX_HTML_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
|
||||
|
||||
# EU regulations with CELEX numbers and their current collection + metadata
|
||||
EU_REGULATIONS = [
|
||||
{"celex": "32024R1689", "reg_id": "ai_act_2024", "name": "AI Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32024R2847", "reg_id": "cra_2024", "name": "Cyber Resilience Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022L2555", "reg_id": "nis2_2022", "name": "NIS2-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32016R0679", "reg_id": "dsgvo_2016", "name": "DSGVO", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32024R1624", "reg_id": "amlr_2024", "name": "Anti-Geldwaesche-VO", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32017R0745", "reg_id": "eu_mdr_2017", "name": "Medical Device Regulation", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R2065", "reg_id": "dsa_2022", "name": "Digital Services Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R1925", "reg_id": "dma_2022", "name": "Digital Markets Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R2554", "reg_id": "dora_2022", "name": "DORA", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R0868", "reg_id": "dga_2022", "name": "Data Governance Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R2854", "reg_id": "dataact_2023", "name": "Data Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R0988", "reg_id": "gpsr_2023", "name": "General Product Safety Regulation", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R1230", "reg_id": "machinery_2023", "name": "Maschinenverordnung", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R1803", "reg_id": "ifrs_2023", "name": "IFRS Regulation", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023D1795", "reg_id": "dpf_2023", "name": "Data Privacy Framework", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32019L2161", "reg_id": "omnibus_2019", "name": "Omnibus-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32019L0790", "reg_id": "dsm_2019", "name": "DSM-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32019L0770", "reg_id": "digital_content_2019", "name": "Digital Content Directive", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32002L0058", "reg_id": "eprivacy_2002", "name": "ePrivacy-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32000L0031", "reg_id": "ecommerce_2000", "name": "E-Commerce-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
]
|
||||
|
||||
|
||||
def download_eurlex_html(celex: str) -> bytes:
|
||||
"""Download HTML from EUR-Lex for a given CELEX number."""
|
||||
url = EURLEX_HTML_URL.format(celex=celex)
|
||||
with httpx.Client(timeout=60.0, follow_redirects=True) as c:
|
||||
r = c.get(url)
|
||||
r.raise_for_status()
|
||||
return r.content
|
||||
|
||||
|
||||
def delete_old_chunks(qdrant_url: str, collection: str, reg_id: str):
|
||||
"""Delete chunks whose regulation_id payload exactly matches reg_id."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
# Only "regulation_id" is tried for now; add further payload field names to this list if needed
|
||||
for field in ["regulation_id"]:
|
||||
r = c.post(f"{qdrant_url}/collections/{collection}/points/delete", json={
|
||||
"filter": {"must": [{"key": field, "match": {"value": reg_id}}]}
|
||||
})
|
||||
if r.status_code == 200:
|
||||
return
|
||||
|
||||
|
||||
def find_old_chunks_by_filename(qdrant_url: str, collection: str, filename_pattern: str) -> int:
|
||||
"""Count existing chunks whose regulation_id exactly matches filename_pattern."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
r = c.post(f"{qdrant_url}/collections/{collection}/points/count", json={
|
||||
"exact": True,
|
||||
"filter": {"must": [{"key": "regulation_id", "match": {"value": filename_pattern}}]}
|
||||
})
|
||||
if r.status_code == 200:
|
||||
return r.json()["result"]["count"]
|
||||
return 0
|
||||
|
||||
|
||||
def upload_html(rag_url: str, html_bytes: bytes, reg: dict) -> dict:
|
||||
"""Upload HTML to RAG service."""
|
||||
filename = f"{reg['reg_id']}.html"
|
||||
metadata = json.dumps({
|
||||
"regulation_id": reg["reg_id"],
|
||||
"regulation_name_de": reg["name"],
|
||||
"celex": reg["celex"],
|
||||
"source": "EUR-Lex",
|
||||
"license": "EU_law",
|
||||
"source_type": "law",
|
||||
"category": "eu_regulation",
|
||||
}, ensure_ascii=False)
|
||||
|
||||
with httpx.Client(timeout=3600.0, verify=False) as c:
|
||||
r = c.post(f"{rag_url}/api/v1/documents/upload",
|
||||
files={"file": (filename, html_bytes, "text/html")},
|
||||
data={
|
||||
"collection": reg["coll"],
|
||||
"data_type": "compliance",
|
||||
"bundesland": "eu",
|
||||
"use_case": "regulation",
|
||||
"year": reg["celex"][1:5],
|
||||
"chunk_strategy": "recursive",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": metadata,
|
||||
},
|
||||
)
|
||||
r.raise_for_status()
|
||||
return r.json()
|
||||
|
||||
|
||||
def check_section_rate(qdrant_url: str, collection: str, reg_id: str) -> tuple:
|
||||
"""Check section rate for a regulation, sampling at most the first 100 chunks. Returns (total, with_section)."""
|
||||
total = 0
|
||||
with_section = 0
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
r = c.post(f"{qdrant_url}/collections/{collection}/points/scroll", json={
|
||||
"limit": 100, "with_payload": True, "with_vector": False,
|
||||
"filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]}
|
||||
})
|
||||
if r.status_code == 200:
|
||||
pts = r.json()["result"]["points"]
|
||||
total = len(pts)
|
||||
with_section = sum(1 for p in pts if p["payload"].get("section"))
|
||||
return total, with_section
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Replace EU PDFs with EUR-Lex HTML")
|
||||
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
|
||||
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--celex", default=None, help="Process only this CELEX number")
|
||||
args = parser.parse_args()
|
||||
|
||||
regs = EU_REGULATIONS
|
||||
if args.celex:
|
||||
regs = [r for r in regs if r["celex"] == args.celex]
|
||||
if not regs:
|
||||
print(f"CELEX {args.celex} not found in list")
|
||||
return
|
||||
|
||||
results = []
|
||||
|
||||
for reg in regs:
|
||||
logger.info("[%s] %s (%s)", reg["celex"], reg["name"], reg["reg_id"])
|
||||
|
||||
# Download HTML
|
||||
try:
|
||||
html_bytes = download_eurlex_html(reg["celex"])
|
||||
logger.info(" Downloaded: %d bytes", len(html_bytes))
|
||||
except Exception as e:
|
||||
logger.error(" Download FAILED: %s", e)
|
||||
results.append({"reg": reg, "status": "download_failed", "error": str(e)})
|
||||
continue
|
||||
|
||||
if args.dry_run:
|
||||
results.append({"reg": reg, "status": "dry_run", "html_size": len(html_bytes)})
|
||||
continue
|
||||
|
||||
# Delete old chunks
|
||||
old_count = find_old_chunks_by_filename(args.qdrant_url, reg["coll"], reg["reg_id"])
|
||||
delete_old_chunks(args.qdrant_url, reg["coll"], reg["reg_id"])
|
||||
logger.info(" Deleted %d old chunks", old_count)
|
||||
|
||||
# Upload HTML
|
||||
try:
|
||||
result = upload_html(args.rag_url, html_bytes, reg)
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
logger.info(" Uploaded: %d new chunks", new_chunks)
|
||||
except Exception as e:
|
||||
logger.error(" Upload FAILED: %s", e)
|
||||
results.append({"reg": reg, "status": "upload_failed", "error": str(e)})
|
||||
time.sleep(2)
|
||||
continue
|
||||
|
||||
# Check quality
|
||||
time.sleep(2)
|
||||
total, with_sec = check_section_rate(args.qdrant_url, reg["coll"], reg["reg_id"])
|
||||
pct = with_sec * 100 // max(total, 1)
|
||||
logger.info(" Section rate: %d/%d = %d%%", with_sec, total, pct)
|
||||
|
||||
results.append({
|
||||
"reg": reg, "status": "ok",
|
||||
"old_chunks": old_count, "new_chunks": new_chunks,
|
||||
"section_rate": pct,
|
||||
})
|
||||
time.sleep(2)
|
||||
|
||||
# Report
|
||||
print("\n" + "=" * 90)
|
||||
print("EUR-LEX REPLACEMENT REPORT")
|
||||
print("=" * 90)
|
||||
print(f"{'CELEX':<15} {'Name':<30} {'Status':<10} {'Old':>5} {'New':>5} {'Sect%':>6}")
|
||||
print("-" * 90)
|
||||
for r in results:
|
||||
reg = r["reg"]
|
||||
status = r["status"]
|
||||
old = r.get("old_chunks", "")
|
||||
new = r.get("new_chunks", r.get("html_size", ""))
|
||||
sect = f"{r.get('section_rate', '')}%" if "section_rate" in r else ""
|
||||
print(f"{reg['celex']:<15} {reg['name'][:30]:<30} {status:<10} {str(old):>5} {str(new):>5} {sect:>6}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
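Review note: `upload_html` derives `year` from `reg["celex"][1:5]`, which relies on the CELEX layout of one sector digit, a four-digit year, a document-type code and a four-digit number. Below is a minimal sketch that makes this assumption explicit and fails loudly on anything else. `parse_celex`, `Celex` and `eurlex_url` are hypothetical helpers, not part of this commit, and the pattern deliberately rejects CELEX variants with suffixes (for example corrigenda).

```python
import re
from typing import NamedTuple

# URL template copied from the script above.
EURLEX_HTML_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"

# Assumed layout: sector digit, year, type letter(s), number, e.g. 32016R0679.
_CELEX_RE = re.compile(r"^(\d)(\d{4})([A-Z]{1,2})(\d{4})$")


class Celex(NamedTuple):
    sector: str
    year: str
    doc_type: str
    number: str


def parse_celex(celex: str) -> Celex:
    """Split a plain CELEX number into its parts or raise ValueError."""
    m = _CELEX_RE.match(celex)
    if m is None:
        raise ValueError(f"unexpected CELEX format: {celex!r}")
    return Celex(*m.groups())


def eurlex_url(celex: str) -> str:
    """Build the EUR-Lex HTML URL after validating the CELEX number."""
    parse_celex(celex)
    return EURLEX_HTML_URL.format(celex=celex)
```

For every entry in `EU_REGULATIONS`, `parse_celex(celex).year` gives the same string as the slice `celex[1:5]`; the difference is that a malformed id raises instead of silently producing a wrong year.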
@@ -0,0 +1,437 @@
|
||||
#!/usr/bin/env python3
"""Re-upload NIST/BSI/ENISA docs with chunk_strategy='legal' for section metadata.

The docs were already uploaded with 'recursive' strategy (no section detection).
This script re-uploads with 'legal' strategy, then deletes old recursive chunks.

Usage (on Mac Mini):
    python3 control-pipeline/scripts/reupload_legal_strategy.py
    python3 control-pipeline/scripts/reupload_legal_strategy.py --dry-run
"""
|
||||
|
||||
import argparse
|
||||
import io
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import unicodedata
|
||||
|
||||
import httpx
|
||||
import pdfplumber
|
||||
|
||||
RAG_URL = "https://localhost:8097"
|
||||
QDRANT_URL = "http://localhost:6333"
|
||||
UPLOAD_TIMEOUT = 1800.0
|
||||
|
||||
# ---- Documents to process ----
|
||||
|
||||
DOCS = [
|
||||
# 4 NIST docs: the first three are already extracted under /tmp, the fourth is extracted from its MinIO PDF
|
||||
{
|
||||
"regulation_id": "nist_sp800_53r5",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "NIST_SP_800_53r5.txt",
|
||||
"local_txt": "/tmp/nist_nist_sp800_53r5.txt",
|
||||
"minio_pdf": None, # already extracted
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_53r5",
|
||||
"source_id": "nist",
|
||||
"doc_type": "controls_catalog",
|
||||
"guideline_name": "NIST SP 800-53 Rev. 5",
|
||||
"license": "public_domain_us_gov",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp_800_82r3",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "nist_sp_800_82r3.txt",
|
||||
"local_txt": "/tmp/nist_nist_sp_800_82r3.txt",
|
||||
"minio_pdf": None,
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_82r3",
|
||||
"regulation_short": "NIST SP 800-82",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp_800_160v1r1",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "nist_sp_800_160v1r1.txt",
|
||||
"local_txt": "/tmp/nist_160.txt",
|
||||
"minio_pdf": None,
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_160v1r1",
|
||||
"regulation_short": "NIST SP 800-160",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp800_207",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "NIST_SP_800_207.txt",
|
||||
"local_txt": None, # needs extraction
|
||||
"minio_pdf": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_207",
|
||||
"source_id": "nist",
|
||||
"doc_type": "architecture",
|
||||
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
|
||||
"license": "public_domain_us_gov",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
# Additional low-quality docs (need extraction from MinIO)
|
||||
{
|
||||
"regulation_id": "nist_csf_2_0",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "nist_csf_2_0.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_csf_2_0",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nistir_8259a",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "nistir_8259a.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nistir_8259a.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nistir_8259a",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_ai_rmf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "nist_ai_rmf.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_ai_rmf",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp_800_30r1",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "nist_sp_800_30r1.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_30r1",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "cisa_secure_by_design",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "cisa_secure_by_design.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "cisa_secure_by_design",
|
||||
"license": "public_domain_us",
|
||||
"source": "cisa.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "cvss_v4_0",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "cvss_v4_0.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "cvss_v4_0",
|
||||
"license": "public_domain_us",
|
||||
"source": "first.org",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_ics_scada_dependencies",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "enisa_ics_scada.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_ics_scada_dependencies",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_threat_landscape_supply_chain",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "enisa_supply_chain_security.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_threat_landscape_supply_chain",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_supply_chain_good_practices",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "enisa_supply_chain_good_practices.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_supply_chain_good_practices",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def normalize_pdf_text(text):
|
||||
text = unicodedata.normalize('NFKC', text)
|
||||
text = text.replace('\u00ad', '').replace('\u200b', '')
|
||||
prev = None
|
||||
while prev != text:
|
||||
prev = text
|
||||
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
|
||||
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
|
||||
text = re.sub(
|
||||
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
|
||||
)
|
||||
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
|
||||
text = re.sub(r'[^\S\n]{2,}', ' ', text)
|
||||
return text
|
||||
|
||||
|
||||
def get_text(doc):
|
||||
"""Get document text: from local file or extract from MinIO PDF."""
|
||||
if doc["local_txt"]:
|
||||
print(f" Reading local: {doc['local_txt']}")
|
||||
with open(doc["local_txt"], encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
print(f" Downloading from MinIO: {doc['minio_pdf']}")
|
||||
with httpx.Client(timeout=60, verify=False) as c:
|
||||
resp = c.get(f"{RAG_URL}/api/v1/documents/download/{doc['minio_pdf']}")
|
||||
resp.raise_for_status()
|
||||
url = resp.json()["url"]
|
||||
with httpx.Client(timeout=300, verify=False) as c:
|
||||
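resp = c.get(url)
# Fail early on an expired or invalid presigned URL instead of parsing an error body as PDF
resp.raise_for_status()
pdf_bytes = resp.content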
|
||||
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
|
||||
|
||||
print(" Extracting with pdfplumber...")
|
||||
parts = []
|
||||
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
|
||||
for i, page in enumerate(pdf.pages):
|
||||
t = page.extract_text(x_tolerance=3, y_tolerance=4)
|
||||
if t:
|
||||
parts.append(t)
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f" {i + 1}/{len(pdf.pages)} pages...")
|
||||
text = "\n\n".join(parts)
|
||||
text = normalize_pdf_text(text)
|
||||
print(f" Extracted {len(text):,} chars")
|
||||
return text
|
||||
|
||||
|
||||
def get_old_doc_ids(collection, regulation_id):
|
||||
"""Get all document_ids for existing chunks."""
|
||||
doc_ids = set()
|
||||
offset = None
|
||||
with httpx.Client(timeout=60) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]},
|
||||
"limit": 100,
|
||||
"with_payload": ["document_id"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
did = pt.get("payload", {}).get("document_id")
|
||||
if did:
|
||||
doc_ids.add(did)
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return doc_ids
|
||||
|
||||
|
||||
def upload_text_legal(text, filename, collection, extra_metadata):
|
||||
"""Upload with chunk_strategy='legal'."""
|
||||
form_data = {
|
||||
"collection": collection,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "legal",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, text.encode("utf-8"), "text/plain")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def delete_by_doc_ids(collection, doc_ids):
|
||||
"""Delete chunks matching specific document_ids."""
|
||||
with httpx.Client(timeout=30) as c:
|
||||
for did in doc_ids:
|
||||
c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/delete",
|
||||
json={"filter": {"must": [
|
||||
{"key": "document_id", "match": {"value": did}}
|
||||
]}},
|
||||
).raise_for_status()
|
||||
|
||||
|
||||
def count_chunks(collection, regulation_id):
|
||||
with httpx.Client(timeout=30) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/count",
|
||||
json={"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]}, "exact": True},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def check_section_rate(collection, regulation_id):
|
||||
total = 0
|
||||
with_sec = 0
|
||||
offset = None
|
||||
with httpx.Client(timeout=60) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]},
|
||||
"limit": 100,
|
||||
"with_payload": ["section"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
total += 1
|
||||
s = pt.get("payload", {}).get("section", "")
|
||||
if s and s.strip():
|
||||
with_sec += 1
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return total, with_sec
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
print("=" * 60)
|
||||
print("Re-upload with chunk_strategy='legal'")
|
||||
print(f"Documents: {len(DOCS)}, Dry run: {args.dry_run}")
|
||||
print("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, doc in enumerate(DOCS, 1):
|
||||
reg_id = doc["regulation_id"]
|
||||
coll = doc["collection"]
|
||||
print(f"\n[{i}/{len(DOCS)}] {doc['upload_filename']} → {coll}")
|
||||
|
||||
# 1. Check existing
|
||||
old_count = count_chunks(coll, reg_id)
|
||||
old_doc_ids = get_old_doc_ids(coll, reg_id) if old_count > 0 else set()
|
||||
print(f" Old: {old_count} chunks, {len(old_doc_ids)} doc_ids")
|
||||
|
||||
if args.dry_run:
|
||||
print(" DRY RUN — skipping")
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": "?", "sect": "?"})
|
||||
continue
|
||||
|
||||
# 2. Get text
|
||||
try:
|
||||
text = get_text(doc)
|
||||
except Exception as e:
|
||||
print(f" ERROR extracting text: {e}")
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": 0, "sect": 0})
|
||||
continue
|
||||
|
||||
# 3. Upload with legal strategy
|
||||
print(" Uploading with strategy='legal'...")
|
||||
result = upload_text_legal(
|
||||
text, doc["upload_filename"], coll, doc["extra_metadata"])
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
new_doc_id = result.get("document_id", "")
|
||||
print(f" New: {new_chunks} chunks (doc_id={new_doc_id})")
|
||||
|
||||
if new_chunks == 0:
|
||||
print(" ERROR: 0 chunks — keeping old!")
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": 0, "sect": 0})
|
||||
continue
|
||||
|
||||
# 4. Delete old chunks (safe: new ones already exist)
|
||||
if old_doc_ids:
|
||||
# Exclude the new document_id just in case
|
||||
old_doc_ids.discard(new_doc_id)
|
||||
if old_doc_ids:
|
||||
print(f" Deleting {len(old_doc_ids)} old doc_ids...")
|
||||
delete_by_doc_ids(coll, old_doc_ids)
|
||||
|
||||
# 5. Check section rate
|
||||
total, with_sec = check_section_rate(coll, reg_id)
|
||||
pct = (with_sec / total * 100) if total > 0 else 0
|
||||
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
|
||||
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": new_chunks, "sect": round(pct, 1)})
|
||||
|
||||
if i < len(DOCS):
|
||||
time.sleep(2)
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("RESULTS")
|
||||
print("=" * 60)
|
||||
for r in results:
|
||||
print(f" {r['file']:<45} old={r['old']:<6} new={r['new']:<6} sect={r['sect']}%")
|
||||
|
||||
total_new = sum(r["new"] for r in results if isinstance(r["new"], int))
|
||||
print(f"\nTotal new chunks: {total_new}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
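Review note: the upload-first, delete-afterwards ordering used in `main()` above can be isolated from the HTTP calls. Below is a minimal sketch, assuming hypothetical `upload` and `delete` callables that wrap `upload_text_legal` and a per-`document_id` delete. `replace_chunks` is an illustrative name, not part of this commit.

```python
from typing import Callable, Iterable


def replace_chunks(
    old_doc_ids: Iterable[str],
    upload: Callable[[], dict],
    delete: Callable[[str], None],
) -> dict:
    """Upload new chunks first; delete old document_ids only if the upload produced chunks."""
    old_ids = set(old_doc_ids)
    result = upload()
    new_chunks = result.get("chunks_count", 0)
    if new_chunks == 0:
        # Nothing was created, so the old data stays untouched.
        return {"status": "error", "new_chunks": 0, "deleted": 0}
    # Never delete the document that was just created.
    old_ids.discard(result.get("document_id", ""))
    for did in sorted(old_ids):
        delete(did)
    return {"status": "ok", "new_chunks": new_chunks, "deleted": len(old_ids)}
```

Keeping the ordering in one small function makes the two safety properties (no delete after an empty upload, no delete of the new `document_id`) checkable without Qdrant or the RAG service.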
@@ -0,0 +1,268 @@
|
||||
#!/usr/bin/env python3
"""
D4 Integration Test: Upload BGB excerpt → verify Qdrant payloads.

Usage:
    # Dry-run (local chunking only, no services needed)
    python3 scripts/test_d4_integration.py --dry-run

    # Against Mac Mini
    python3 scripts/test_d4_integration.py \
        --rag-url https://macmini:8097 \
        --qdrant-url http://macmini:6333

    # Against production
    python3 scripts/test_d4_integration.py \
        --rag-url https://rag-prod:8097 \
        --qdrant-url http://qdrant-prod:6333
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
FIXTURE_PATH = os.path.join(
|
||||
os.path.dirname(__file__), "..", "..", "embedding-service",
|
||||
"tests", "fixtures", "bgb_312_excerpt.txt",
|
||||
)
|
||||
COLLECTION = "bp_compliance_gesetze"
|
||||
REG_CODE = "BGB_D4_TEST"
|
||||
|
||||
# Expected sections in the BGB excerpt
|
||||
EXPECTED_SECTIONS = {"§ 312", "§ 312a", "§ 312g", "§ 312k"}
|
||||
|
||||
|
||||
def load_fixture() -> str:
|
||||
with open(FIXTURE_PATH, encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
|
||||
def upload_document(rag_url: str, text: str) -> dict:
|
||||
"""Upload BGB excerpt to RAG service."""
|
||||
metadata = json.dumps({
|
||||
"regulation_code": REG_CODE,
|
||||
"regulation_name_de": "BGB (D4 Test)",
|
||||
"source_type": "law",
|
||||
})
|
||||
|
||||
with httpx.Client(timeout=60.0, verify=False) as client:
|
||||
resp = client.post(
|
||||
f"{rag_url}/api/v1/documents/upload",
|
||||
files={"file": ("bgb_312_test.txt", text.encode(), "text/plain")},
|
||||
data={
|
||||
"collection": COLLECTION,
|
||||
"data_type": "law",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "recursive",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": metadata,
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def scroll_chunks(qdrant_url: str, document_id: str) -> list[dict]:
|
||||
"""Scroll Qdrant for chunks matching this document_id."""
|
||||
all_points = []
|
||||
offset = None
|
||||
|
||||
with httpx.Client(timeout=30.0) as client:
|
||||
while True:
|
||||
body: dict = {
|
||||
"limit": 100,
|
||||
"with_payload": True,
|
||||
"with_vector": False,
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "document_id",
|
||||
"match": {"value": document_id},
|
||||
}]
|
||||
},
|
||||
}
|
||||
if offset:
|
||||
body["offset"] = offset
|
||||
|
||||
resp = client.post(
|
||||
f"{qdrant_url}/collections/{COLLECTION}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
all_points.extend(data["points"])
|
||||
offset = data.get("next_page_offset")
|
||||
if not offset:
|
||||
break
|
||||
|
||||
return all_points
|
||||
|
||||
|
||||
def delete_test_data(qdrant_url: str, document_id: str):
|
||||
"""Clean up test chunks from Qdrant."""
|
||||
with httpx.Client(timeout=30.0) as client:
|
||||
resp = client.post(
|
||||
f"{qdrant_url}/collections/{COLLECTION}/points/delete",
|
||||
json={
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "document_id",
|
||||
"match": {"value": document_id},
|
||||
}]
|
||||
}
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
|
||||
|
||||
def verify_chunks(points: list[dict]) -> dict:
|
||||
"""Analyze chunks and return a verification report."""
|
||||
report = {
|
||||
"total_chunks": len(points),
|
||||
"sections_found": set(),
|
||||
"chunks_with_section": 0,
|
||||
"chunks_with_paragraph": 0,
|
||||
"chunks_with_page": 0,
|
||||
"section_details": [],
|
||||
"issues": [],
|
||||
}
|
||||
|
||||
for pt in points:
|
||||
payload = pt.get("payload", {})
|
||||
section = payload.get("section", "")
|
||||
section_title = payload.get("section_title", "")
|
||||
paragraph = payload.get("paragraph", "")
|
||||
paragraph_num = payload.get("paragraph_num")
|
||||
page = payload.get("page")
|
||||
chunk_idx = payload.get("chunk_index", "?")
|
||||
|
||||
if section:
|
||||
report["sections_found"].add(section)
|
||||
report["chunks_with_section"] += 1
|
||||
if paragraph:
|
||||
report["chunks_with_paragraph"] += 1
|
||||
if page is not None:
|
||||
report["chunks_with_page"] += 1
|
||||
|
||||
report["section_details"].append({
|
||||
"chunk_index": chunk_idx,
|
||||
"section": section,
|
||||
"section_title": section_title[:40],
|
||||
"paragraph": paragraph,
|
||||
"paragraph_num": paragraph_num,
|
||||
"page": page,
|
||||
"text_preview": payload.get("chunk_text", "")[:60],
|
||||
})
|
||||
|
||||
# Checks
|
||||
missing = EXPECTED_SECTIONS - report["sections_found"]
|
||||
if missing:
|
||||
report["issues"].append(f"Missing sections: {missing}")
|
||||
|
||||
if "§ 312k" not in report["sections_found"]:
|
||||
report["issues"].append("CRITICAL: § 312k not found!")
|
||||
|
||||
section_ratio = report["chunks_with_section"] / max(report["total_chunks"], 1)
|
||||
if section_ratio < 0.9:
|
||||
report["issues"].append(
|
||||
f"Only {section_ratio:.0%} chunks have section metadata (expected >= 90%)"
|
||||
)
|
||||
|
||||
return report
|
||||
|
||||
|
||||
def print_report(report: dict):
|
||||
"""Print verification report."""
|
||||
print("\n" + "=" * 60)
|
||||
print("D4 VALIDATION REPORT")
|
||||
print("=" * 60)
|
||||
print(f"Total chunks: {report['total_chunks']}")
|
||||
print(f"With section: {report['chunks_with_section']}")
|
||||
print(f"With paragraph: {report['chunks_with_paragraph']}")
|
||||
print(f"With page: {report['chunks_with_page']}")
|
||||
print(f"Sections found: {sorted(report['sections_found'])}")
|
||||
|
||||
print("\nChunk details:")
|
||||
for d in sorted(report["section_details"], key=lambda x: x["chunk_index"]):
|
||||
print(
|
||||
f" [{d['chunk_index']:2}] "
|
||||
f"section={d['section']!r:12s} "
|
||||
f"title={d['section_title']!r:30s} "
|
||||
f"para={d['paragraph']!r:8s}"
|
||||
)
|
||||
|
||||
if report["issues"]:
|
||||
print(f"\nISSUES ({len(report['issues'])}):")
|
||||
for issue in report["issues"]:
|
||||
print(f" - {issue}")
|
||||
print("\nRESULT: FAIL")
|
||||
else:
|
||||
print("\nRESULT: PASS — all sections detected, metadata quality OK")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="D4 Integration Test")
|
||||
parser.add_argument("--rag-url", default="https://macmini:8097")
|
||||
parser.add_argument("--qdrant-url", default="http://macmini:6333")
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Only test local chunking, no upload")
|
||||
parser.add_argument("--keep", action="store_true",
|
||||
help="Don't delete test data after verification")
|
||||
args = parser.parse_args()
|
||||
|
||||
text = load_fixture()
|
||||
print(f"Loaded BGB excerpt: {len(text)} chars")
|
||||
|
||||
if args.dry_run:
|
||||
# Import chunking directly
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "embedding-service"))
|
||||
from main import chunk_text_legal_structured
|
||||
chunks = chunk_text_legal_structured(text, 1500, 100)
|
||||
# Build fake points for verification
|
||||
points = [{"payload": {
|
||||
"chunk_index": c["index"],
|
||||
"chunk_text": c["text"],
|
||||
"section": c["section"],
|
||||
"section_title": c["section_title"],
|
||||
"paragraph": c["paragraph"],
|
||||
"paragraph_num": c["paragraph_num"],
|
||||
"page": c["page"],
|
||||
}} for c in chunks]
|
||||
report = verify_chunks(points)
|
||||
print_report(report)
|
||||
sys.exit(1 if report["issues"] else 0)
|
||||
|
||||
# Full integration test
|
||||
print(f"Uploading to {args.rag_url} → collection={COLLECTION}...")
|
||||
result = upload_document(args.rag_url, text)
|
||||
doc_id = result["document_id"]
|
||||
print(f" document_id: {doc_id}")
|
||||
print(f" chunks_count: {result['chunks_count']}")
|
||||
print(f" vectors_indexed: {result['vectors_indexed']}")
|
||||
|
||||
print("Waiting 2s for indexing...")
|
||||
time.sleep(2)
|
||||
|
||||
print(f"Scrolling Qdrant at {args.qdrant_url}...")
|
||||
points = scroll_chunks(args.qdrant_url, doc_id)
|
||||
print(f" Found {len(points)} points")
|
||||
|
||||
report = verify_chunks(points)
|
||||
print_report(report)
|
||||
|
||||
if not args.keep:
|
||||
print(f"\nCleaning up test data (document_id={doc_id})...")
|
||||
delete_test_data(args.qdrant_url, doc_id)
|
||||
print(" Deleted.")
|
||||
|
||||
sys.exit(1 if report["issues"] else 0)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
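The scroll loop in `scroll_chunks` can be exercised without a running Qdrant instance. The following is a minimal sketch, not part of the script: `scroll_all`, `fetch_page`, and `make_fake_fetch` are hypothetical names, and the fake only mimics the `points` / `next_page_offset` shape that the script reads from the scroll response. It shows the pattern of following `next_page_offset` until it comes back as `None`.

```python
from typing import Callable, Optional


def scroll_all(fetch_page: Callable[[Optional[int]], dict]) -> list:
    """Collect every point by following next_page_offset until it is None."""
    points = []
    offset = None
    while True:
        result = fetch_page(offset)
        points.extend(result["points"])
        offset = result.get("next_page_offset")
        # Compare against None: an offset of 0 would be falsy but still valid.
        if offset is None:
            break
    return points


def make_fake_fetch(items: list, page_size: int):
    """Hypothetical in-memory stand-in for a paginated scroll endpoint."""
    def fetch_page(offset: Optional[int]) -> dict:
        start = 0 if offset is None else offset
        end = start + page_size
        next_offset = end if end < len(items) else None
        return {"points": items[start:end], "next_page_offset": next_offset}
    return fetch_page


# Five items in pages of two: offsets go None -> 2 -> 4 -> done.
all_points = scroll_all(make_fake_fetch(list(range(5)), page_size=2))
```

Swapping `make_fake_fetch` for a function that posts to the real scroll endpoint would keep the loop unchanged, which is the main reason to pass the page fetcher in as a parameter.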
@@ -17,9 +17,6 @@ import httpx
|
||||
|
||||
from .control_generator import (
|
||||
GeneratedControl,
|
||||
REGULATION_LICENSE_MAP,
|
||||
_RULE2_PREFIXES,
|
||||
_RULE3_PREFIXES,
|
||||
_classify_regulation,
|
||||
)
|
||||
|
||||
|
||||
@@ -22,6 +22,7 @@ import json
 import logging
 import time
 from collections import defaultdict
+from datetime import datetime
 
 from sqlalchemy import text
 
@@ -108,34 +109,56 @@ class BatchDedupRunner:
         self._progress_phase = ""
         self._progress_count = 0
         self._progress_total = 0
+        self._since = None  # set by run() when scoped run requested
 
     async def run(
         self,
         dry_run: bool = False,
         hint_filter: str = None,
+        since: datetime = None,
     ) -> dict:
         """Run the full batch dedup pipeline.
 
         Args:
             dry_run: If True, compute stats but don't modify DB/Qdrant.
             hint_filter: If set, only process groups matching this hint prefix.
+            since: If set, only process controls with created_at >= since.
+                Useful for incremental dedup after single-document ingestion.
 
         Returns:
             Stats dict with counts.
         """
         start = time.monotonic()
-        logger.info("BatchDedup starting (dry_run=%s, hint_filter=%s)",
-                    dry_run, hint_filter)
+        logger.info("BatchDedup starting (dry_run=%s, hint_filter=%s, since=%s)",
+                    dry_run, hint_filter, since)
+
+        # Scoped runs reset checkpoint to avoid skipping new controls whose
+        # control_id sorts before the stale last_id of a previous full run.
+        self._since = since
+        if since and not dry_run:
+            self.db.execute(text(
+                "DELETE FROM canonical_generation_jobs WHERE status = 'dedup_phase2_checkpoint'"
+            ))
+            self.db.commit()
 
         if not dry_run:
             await ensure_qdrant_collection(collection=self.collection)
 
         # Phase 1: Intra-group dedup (same merge_group_hint)
+        # Optimization: skip singleton groups (they're automatically masters)
         self._progress_phase = "phase1"
-        groups = self._load_merge_groups(hint_filter)
+        groups = self._load_merge_groups(hint_filter, since)
         self._progress_total = self.stats["total_controls"]
 
-        for hint, controls in groups:
+        multi_groups = [(h, c) for h, c in groups if len(c) > 1]
+        singleton_count = len(groups) - len(multi_groups)
+        self.stats["singleton_groups_skipped"] = singleton_count
+        logger.info(
+            "BatchDedup Phase 1: %d multi-control groups to process, %d singletons skipped",
+            len(multi_groups), singleton_count,
+        )
+
+        for hint, controls in multi_groups:
             try:
                 await self._process_hint_group(hint, controls, dry_run)
                 self.stats["phase1_groups_processed"] += 1
@@ -148,8 +171,8 @@ class BatchDedupRunner:
                     pass
 
         logger.info(
-            "BatchDedup Phase 1 done: %d masters, %d linked, %d review",
-            self.stats["masters"], self.stats["linked"], self.stats["review"],
+            "BatchDedup Phase 1 done: %d masters, %d linked, %d review (skipped %d singletons)",
+            self.stats["masters"], self.stats["linked"], self.stats["review"], singleton_count,
         )
 
         # Phase 2: Cross-group dedup via embeddings
@@ -162,7 +185,7 @@ class BatchDedupRunner:
         logger.info("BatchDedup completed in %.1fs: %s", elapsed, self.stats)
         return self.stats
 
-    def _load_merge_groups(self, hint_filter: str = None) -> list:
+    def _load_merge_groups(self, hint_filter: str = None, since: datetime = None) -> list:
         """Load all Pass 0b controls grouped by merge_group_hint, largest first."""
         conditions = [
             "decomposition_method = 'pass0b'",
@@ -175,6 +198,10 @@ class BatchDedupRunner:
             conditions.append("generation_metadata->>'merge_group_hint' LIKE :hf")
             params["hf"] = f"{hint_filter}%"
 
+        if since:
+            conditions.append("created_at >= :since")
+            params["since"] = since
+
         where = " AND ".join(conditions)
         rows = self.db.execute(text(f"""
             SELECT id::text, control_id, title, objective,
@@ -321,114 +348,200 @@ class BatchDedupRunner:
|
||||
async def _run_cross_group_pass(self):
|
||||
"""Phase 2: Find cross-group duplicates among surviving masters.
|
||||
|
||||
After Phase 1, ~52k masters remain. Many have similar semantics
|
||||
despite different merge_group_hints (e.g. different German spellings).
|
||||
This pass embeds all masters and finds near-duplicates via Qdrant.
|
||||
Paginated DB queries + individual error handling per control.
|
||||
Never loads all rows into memory at once.
|
||||
"""
|
||||
logger.info("BatchDedup Phase 2: Cross-group pass starting...")
|
||||
|
||||
rows = self.db.execute(text("""
|
||||
SELECT id::text, control_id, title,
|
||||
generation_metadata->>'merge_group_hint' as merge_group_hint
|
||||
FROM canonical_controls
|
||||
# Count total — respect scoped run if since is set
|
||||
since_clause = " AND created_at >= :since" if self._since else ""
|
||||
params = {"since": self._since} if self._since else {}
|
||||
total_row = self.db.execute(text(f"""
|
||||
SELECT COUNT(*) FROM canonical_controls
|
||||
WHERE decomposition_method = 'pass0b'
|
||||
AND release_state != 'duplicate'
|
||||
AND release_state != 'deprecated'
|
||||
ORDER BY control_id
|
||||
""")).fetchall()
|
||||
AND release_state != 'deprecated'{since_clause}
|
||||
"""), params).fetchone()
|
||||
total = total_row[0] if total_row else 0
|
||||
|
||||
self._progress_total = len(rows)
|
||||
self._progress_total = total
|
||||
self._progress_count = 0
|
||||
logger.info("BatchDedup Cross-group: %d masters to check", len(rows))
|
||||
cross_linked = 0
|
||||
cross_review = 0
|
||||
|
||||
# Process in parallel batches for embedding + Qdrant search
|
||||
PARALLEL_BATCH = 10
|
||||
# Checkpoint: resume from last processed control_id
|
||||
DB_PAGE = 100
|
||||
# Checkpoint: resume from last processed control_id (survives container restart)
|
||||
checkpoint_row = self.db.execute(text("""
|
||||
SELECT config FROM canonical_generation_jobs
|
||||
WHERE status = 'dedup_phase2_checkpoint'
|
||||
LIMIT 1
|
||||
""")).fetchone()
|
||||
last_control_id = checkpoint_row[0] if checkpoint_row else ""
|
||||
|
||||
async def _embed_and_search(r):
|
||||
"""Embed one control and search Qdrant — safe for asyncio.gather."""
|
||||
hint = r[3] or ""
|
||||
parts = hint.split(":", 2)
|
||||
action = parts[0] if len(parts) > 0 else ""
|
||||
obj = parts[1] if len(parts) > 1 else ""
|
||||
canonical = canonicalize_text(action, obj, r[2])
|
||||
embedding = await get_embedding(canonical)
|
||||
if not embedding:
|
||||
return None
|
||||
results = await qdrant_search_cross_regulation(
|
||||
embedding, top_k=5, collection=self.collection,
|
||||
)
|
||||
return (r, results)
|
||||
if last_control_id:
|
||||
skip_params = {"last_id": last_control_id}
|
||||
if self._since:
|
||||
skip_params["since"] = self._since
|
||||
skip_row = self.db.execute(text(f"""
|
||||
SELECT COUNT(*) FROM canonical_controls
|
||||
WHERE decomposition_method = 'pass0b'
|
||||
AND release_state != 'duplicate'
|
||||
AND release_state != 'deprecated'
|
||||
AND control_id <= :last_id{since_clause}
|
||||
"""), skip_params).fetchone()
|
||||
skipped = skip_row[0] if skip_row else 0
|
||||
self._progress_count = skipped
|
||||
logger.info("BatchDedup Cross-group: RESUMING from %s (skipping %d already processed)",
|
||||
last_control_id, skipped)
|
||||
else:
|
||||
self.db.execute(text("""
|
||||
INSERT INTO canonical_generation_jobs (id, status, config)
|
||||
VALUES (gen_random_uuid(), 'dedup_phase2_checkpoint', '')
|
||||
"""))
|
||||
self.db.commit()
|
||||
|
||||
for batch_start in range(0, len(rows), PARALLEL_BATCH):
|
||||
batch = rows[batch_start:batch_start + PARALLEL_BATCH]
|
||||
tasks = [_embed_and_search(r) for r in batch]
|
||||
results_batch = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
logger.info("BatchDedup Cross-group: %d masters to check (starting from %s)",
|
||||
total, last_control_id or "beginning")
|
||||
|
||||
for res in results_batch:
|
||||
if res is None or isinstance(res, Exception):
|
||||
if isinstance(res, Exception):
|
||||
logger.error("BatchDedup embed/search error: %s", res)
|
||||
while True:
|
||||
page_params = {"last_id": last_control_id, "page_size": DB_PAGE}
|
||||
if self._since:
|
||||
page_params["since"] = self._since
|
||||
rows = self.db.execute(text(f"""
|
||||
SELECT id::text, control_id, title,
|
||||
generation_metadata->>'merge_group_hint' as merge_group_hint
|
||||
FROM canonical_controls
|
||||
WHERE decomposition_method = 'pass0b'
|
||||
AND release_state != 'duplicate'
|
||||
AND release_state != 'deprecated'
|
||||
AND control_id > :last_id{since_clause}
|
||||
ORDER BY control_id
|
||||
LIMIT :page_size
|
||||
"""), page_params).fetchall()
|
||||
|
||||
if not rows:
|
||||
break
|
||||
|
||||
last_control_id = rows[-1][1]
|
||||
|
||||
# Process each control individually (no asyncio.gather — more stable)
|
||||
for r in rows:
|
||||
try:
|
||||
hint = r[3] or ""
|
||||
parts = hint.split(":", 2)
|
||||
action = parts[0] if len(parts) > 0 else ""
|
||||
obj = parts[1] if len(parts) > 1 else ""
|
||||
canonical = canonicalize_text(action, obj, r[2])
|
||||
|
||||
# Timeout per embedding call
|
||||
try:
|
||||
embedding = await asyncio.wait_for(
|
||||
get_embedding(canonical), timeout=30.0
|
||||
)
|
||||
except asyncio.TimeoutError:
|
||||
self.stats["errors"] += 1
|
||||
continue
|
||||
|
||||
r, results = res
|
||||
ctrl_uuid = r[0]
|
||||
hint = r[3] or ""
|
||||
|
||||
if not results:
|
||||
continue
|
||||
|
||||
for match in results:
|
||||
match_score = match.get("score", 0.0)
|
||||
match_payload = match.get("payload", {})
|
||||
match_uuid = match_payload.get("control_uuid", "")
|
||||
|
||||
if match_uuid == ctrl_uuid:
|
||||
continue
|
||||
|
||||
if match_score > LINK_THRESHOLD:
|
||||
try:
|
||||
self.db.execute(text("""
|
||||
UPDATE canonical_controls
|
||||
SET release_state = 'duplicate', merged_into_uuid = CAST(:master AS uuid)
|
||||
WHERE id = CAST(:dup AS uuid)
|
||||
AND release_state != 'duplicate'
|
||||
"""), {"master": match_uuid, "dup": ctrl_uuid})
|
||||
if not embedding:
|
||||
continue
|
||||
|
||||
self.db.execute(text("""
|
||||
INSERT INTO control_parent_links
|
||||
(control_uuid, parent_control_uuid, link_type, confidence)
|
||||
VALUES (CAST(:cu AS uuid), CAST(:pu AS uuid), 'cross_regulation', :conf)
|
||||
ON CONFLICT (control_uuid, parent_control_uuid) DO NOTHING
|
||||
"""), {"cu": match_uuid, "pu": ctrl_uuid, "conf": match_score})
|
||||
|
||||
transferred = self._transfer_parent_links(match_uuid, ctrl_uuid)
|
||||
self.stats["parent_links_transferred"] += transferred
|
||||
|
||||
self.db.commit()
|
||||
cross_linked += 1
|
||||
except Exception as e:
|
||||
logger.error("BatchDedup cross-group link error %s→%s: %s",
|
||||
ctrl_uuid, match_uuid, e)
|
||||
self.db.rollback()
|
||||
self.stats["errors"] += 1
|
||||
break
|
||||
elif match_score > REVIEW_THRESHOLD:
|
||||
self._write_review(
|
||||
{"control_id": r[1], "title": r[2], "objective": "",
|
||||
"merge_group_hint": hint, "pattern_id": None},
|
||||
match_payload, match_score,
|
||||
try:
|
||||
results = await asyncio.wait_for(
|
||||
qdrant_search_cross_regulation(
|
||||
embedding, top_k=5, collection=self.collection,
|
||||
), timeout=30.0
|
||||
)
|
||||
cross_review += 1
|
||||
break
|
||||
except asyncio.TimeoutError:
|
||||
self.stats["errors"] += 1
|
||||
continue
|
||||
|
||||
processed = min(batch_start + PARALLEL_BATCH, len(rows))
|
||||
self._progress_count = processed
|
||||
if processed % 500 < PARALLEL_BATCH:
|
||||
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review",
|
||||
processed, len(rows), cross_linked, cross_review)
|
||||
ctrl_uuid = r[0]
|
||||
|
||||
for match in (results or []):
|
||||
match_score = match.get("score", 0.0)
|
||||
match_payload = match.get("payload", {})
|
||||
match_uuid = match_payload.get("control_uuid", "")
|
||||
|
||||
if match_uuid == ctrl_uuid:
|
||||
continue
|
||||
|
||||
if match_score > LINK_THRESHOLD:
|
||||
try:
|
||||
self.db.execute(text("""
|
||||
UPDATE canonical_controls
|
||||
SET release_state = 'duplicate', merged_into_uuid = CAST(:master AS uuid)
|
||||
WHERE id = CAST(:dup AS uuid)
|
||||
AND release_state != 'duplicate'
|
||||
"""), {"master": match_uuid, "dup": ctrl_uuid})
|
||||
|
||||
self.db.execute(text("""
|
||||
INSERT INTO control_parent_links
|
||||
(control_uuid, parent_control_uuid, link_type, confidence)
|
||||
VALUES (CAST(:cu AS uuid), CAST(:pu AS uuid), 'cross_regulation', :conf)
|
||||
ON CONFLICT (control_uuid, parent_control_uuid) DO NOTHING
|
||||
"""), {"cu": match_uuid, "pu": ctrl_uuid, "conf": match_score})
|
||||
|
||||
transferred = self._transfer_parent_links(match_uuid, ctrl_uuid)
|
||||
self.stats["parent_links_transferred"] += transferred
|
||||
self.db.commit()
|
||||
cross_linked += 1
|
||||
except Exception as e:
|
||||
logger.error("BatchDedup cross-group link error %s→%s: %s",
|
||||
ctrl_uuid, match_uuid, e)
|
||||
try:
|
||||
self.db.rollback()
|
||||
except Exception:
|
||||
pass
|
||||
self.stats["errors"] += 1
|
||||
break
|
||||
elif match_score > REVIEW_THRESHOLD:
|
||||
self._write_review(
|
||||
{"control_id": r[1], "title": r[2], "objective": "",
|
||||
"merge_group_hint": hint, "pattern_id": None},
|
||||
match_payload, match_score,
|
||||
)
|
||||
cross_review += 1
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
logger.error("BatchDedup cross-group control %s error: %s", r[1], e)
|
||||
self.stats["errors"] += 1
|
||||
try:
|
||||
self.db.rollback()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
self._progress_count += 1
|
||||
|
||||
# Save checkpoint + log progress every page
|
||||
try:
|
||||
self.db.execute(text("""
|
||||
UPDATE canonical_generation_jobs
|
||||
SET config = :cid
|
||||
WHERE status = 'dedup_phase2_checkpoint'
|
||||
"""), {"cid": last_control_id})
|
||||
self.db.commit()
|
||||
except Exception:
|
||||
try:
|
||||
self.db.rollback()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
processed = self._progress_count
|
||||
if processed % 500 < DB_PAGE:
|
||||
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review (checkpoint: %s)",
|
||||
processed, total, cross_linked, cross_review, last_control_id)
|
||||
|
||||
# Clear checkpoint on completion
|
||||
try:
|
||||
self.db.execute(text("""
|
||||
DELETE FROM canonical_generation_jobs
|
||||
WHERE status = 'dedup_phase2_checkpoint'
|
||||
"""))
|
||||
self.db.commit()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
self.stats["cross_group_linked"] = cross_linked
|
||||
self.stats["cross_group_review"] = cross_review
|
||||
|
||||
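The paginated Phase 2 loop combines keyset pagination (`control_id > :last_id ... ORDER BY control_id LIMIT :page_size`) with a checkpoint that is saved after each page and cleared on completion. The following is a minimal sketch of that pattern under stated assumptions: it uses the standard-library `sqlite3` module instead of the project's database session, and the `controls` and `checkpoint` tables, `PAGE_SIZE`, `process_in_pages`, and `demo` are hypothetical names chosen for illustration.

```python
import sqlite3

PAGE_SIZE = 3  # hypothetical; the diff uses DB_PAGE = 100


def process_in_pages(conn, handle, last_id=""):
    """Keyset-paginate over controls ordered by control_id, checkpointing per page."""
    while True:
        rows = conn.execute(
            "SELECT control_id, title FROM controls "
            "WHERE control_id > ? ORDER BY control_id LIMIT ?",
            (last_id, PAGE_SIZE),
        ).fetchall()
        if not rows:
            break
        for control_id, title in rows:
            handle(control_id, title)
        last_id = rows[-1][0]
        # Persist the checkpoint so a restart resumes after this page.
        conn.execute("UPDATE checkpoint SET last_id = ?", (last_id,))
        conn.commit()
    # Clear the checkpoint once every page has been processed.
    conn.execute("UPDATE checkpoint SET last_id = ''")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE controls (control_id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE checkpoint (last_id TEXT)")
    conn.execute("INSERT INTO checkpoint VALUES ('')")
    conn.executemany(
        "INSERT INTO controls VALUES (?, ?)",
        [(f"C-{n:03d}", f"title {n}") for n in range(1, 8)],
    )
    conn.commit()
    seen = []
    # Simulate a resume: a previous run stopped after C-002.
    process_in_pages(conn, lambda cid, _title: seen.append(cid), last_id="C-002")
    return seen
```

Because each page is selected by `control_id > last_id` rather than by `OFFSET`, only one page of rows is held in memory at a time, and a resumed run picks up at the first row after the stored checkpoint.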
@@ -7,7 +7,6 @@ Citation Backfill Service — enrich existing controls with article/paragraph pr
 Tier 3 — Ollama LLM: ask local LLM to identify article/paragraph from text
 """
 
-import hashlib
 import json
 import logging
 import os
@@ -28,12 +27,13 @@ OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
 OLLAMA_MODEL = os.getenv("CONTROL_GEN_OLLAMA_MODEL", "qwen3.5:35b-a3b")
 LLM_TIMEOUT = float(os.getenv("CONTROL_GEN_LLM_TIMEOUT", "180"))
 
-ALL_COLLECTIONS = [
-    "bp_compliance_ce",
+# Tier-1 semantic re-link: min cosine for a source_original_text → chunk match.
+EMBED_THRESHOLD = float(os.getenv("CITATION_EMBED_THRESHOLD", "0.80"))
+# Collections that carry re-ingested, article_label-bearing chunks.
+RELINK_COLLECTIONS = [
     "bp_compliance_gesetze",
     "bp_compliance_datenschutz",
-    "bp_dsfa_corpus",
-    "bp_legal_templates",
+    "bp_compliance_ce",
 ]
 
 BACKFILL_SYSTEM_PROMPT = (
@@ -51,13 +51,14 @@ _SOURCE_ARTICLE_RE = re.compile(
 class MatchResult:
     article: str
     paragraph: str
-    method: str  # "hash", "regex", "llm"
+    method: str  # "embed", "regex", "llm"
+    source: str = ""  # regulation short/name (embed tier sets the cleaned source)
 
 
 @dataclass
 class BackfillResult:
     total_controls: int = 0
-    matched_hash: int = 0
+    matched_embed: int = 0
     matched_regex: int = 0
     matched_llm: int = 0
     unmatched: int = 0
@@ -71,7 +72,6 @@ class CitationBackfill:
     def __init__(self, db: Session, rag_client: ComplianceRAGClient):
         self.db = db
         self.rag = rag_client
-        self._rag_index: dict[str, RAGSearchResult] = {}
 
     async def run(self, dry_run: bool = True, limit: int = 0) -> BackfillResult:
         """Main entry: iterate controls missing article/paragraph, match to RAG, update."""
@@ -85,20 +85,10 @@ class CitationBackfill:
         if not controls:
             return result
 
-        # Collect hashes we need to find — only build index for controls with source text
-        needed_hashes: set[str] = set()
-        for ctrl in controls:
-            src = ctrl.get("source_original_text")
-            if src:
-                needed_hashes.add(hashlib.sha256(src.encode()).hexdigest())
-
-        if needed_hashes:
-            # Build targeted RAG index — only scroll collections that our controls reference
-            logger.info("Building targeted RAG hash index for %d source texts...", len(needed_hashes))
-            await self._build_rag_index_targeted(controls)
-            logger.info("RAG index built: %d chunks indexed, %d hashes needed", len(self._rag_index), len(needed_hashes))
-        else:
-            logger.info("No source_original_text found — skipping RAG index build")
+        # Tier-1 = per-control semantic search against the re-ingested, labeled chunks.
+        # (The old sha256(chunk.text) hash index died with re-chunking and is gone.)
+        with_source = sum(1 for c in controls if c.get("source_original_text"))
+        logger.info("Embedding-relink candidates (with source_original_text): %d", with_source)
 
         # Process each control
         for i, ctrl in enumerate(controls):
@@ -108,8 +98,8 @@ class CitationBackfill:
             try:
                 match = await self._match_control(ctrl)
                 if match:
-                    if match.method == "hash":
-                        result.matched_hash += 1
+                    if match.method == "embed":
+                        result.matched_embed += 1
                     elif match.method == "regex":
                         result.matched_regex += 1
                     elif match.method == "llm":
@@ -139,8 +129,8 @@ class CitationBackfill:
                 result.errors.append(f"Commit failed: {e}")
 
         logger.info(
-            "Backfill complete: %d total, hash=%d regex=%d llm=%d unmatched=%d updated=%d",
-            result.total_controls, result.matched_hash, result.matched_regex,
+            "Backfill complete: %d total, embed=%d regex=%d llm=%d unmatched=%d updated=%d",
+            result.total_controls, result.matched_embed, result.matched_regex,
             result.matched_llm, result.unmatched, result.updated,
         )
         return result
@@ -178,93 +168,13 @@ class CitationBackfill:
             controls.append(ctrl)
         return controls
 
-    async def _build_rag_index_targeted(self, controls: list[dict]):
-        """Build RAG index by scrolling only collections relevant to our controls.
-
-        Uses regulation codes from generation_metadata to identify which collections
-        to search, falling back to all collections only if needed.
-        """
-        # Determine which collections are relevant based on regulation codes
-        regulation_to_collection = self._map_regulations_to_collections(controls)
-        collections_to_search = set(regulation_to_collection.values()) or set(ALL_COLLECTIONS)
-
-        logger.info("Targeted index: searching %d collections: %s",
-                    len(collections_to_search), ", ".join(collections_to_search))
-
-        for collection in collections_to_search:
-            offset = None
-            page = 0
-            seen_offsets: set[str] = set()
-            while True:
-                chunks, next_offset = await self.rag.scroll(
-                    collection=collection, offset=offset, limit=200,
-                )
-                if not chunks:
-                    break
-                for chunk in chunks:
-                    if chunk.text and len(chunk.text.strip()) >= 50:
-                        h = hashlib.sha256(chunk.text.encode()).hexdigest()
-                        self._rag_index[h] = chunk
-                page += 1
-                if page % 50 == 0:
-                    logger.info("Indexing %s: page %d (%d chunks so far)",
-                                collection, page, len(self._rag_index))
-                if not next_offset:
-                    break
-                if next_offset in seen_offsets:
-                    logger.warning("Scroll loop in %s at page %d — stopping", collection, page)
-                    break
-                seen_offsets.add(next_offset)
-                offset = next_offset
-
-            logger.info("Indexed collection %s: %d pages", collection, page)
-
-    def _map_regulations_to_collections(self, controls: list[dict]) -> dict[str, str]:
-        """Map regulation codes from controls to likely Qdrant collections."""
-        # Heuristic: regulation code prefix → collection
-        collection_map = {
-            "eu_": "bp_compliance_gesetze",
-            "dsgvo": "bp_compliance_datenschutz",
-            "bdsg": "bp_compliance_gesetze",
-            "ttdsg": "bp_compliance_gesetze",
-            "nist_": "bp_compliance_ce",
-            "owasp": "bp_compliance_ce",
-            "bsi_": "bp_compliance_ce",
-            "enisa": "bp_compliance_ce",
-            "at_": "bp_compliance_recht",
-            "fr_": "bp_compliance_recht",
-            "es_": "bp_compliance_recht",
-        }
-        result: dict[str, str] = {}
-        for ctrl in controls:
-            meta = ctrl.get("generation_metadata") or {}
-            reg = meta.get("source_regulation", "")
-            if not reg:
-                continue
-            for prefix, coll in collection_map.items():
-                if reg.startswith(prefix):
-                    result[reg] = coll
-                    break
-            else:
-                # Unknown regulation — search all
-                for coll in ALL_COLLECTIONS:
-                    result[f"_all_{coll}"] = coll
-        return result
-
     async def _match_control(self, ctrl: dict) -> Optional[MatchResult]:
         """3-tier matching: hash → regex → LLM."""
 
-        # Tier 1: Hash match against RAG index
-        source_text = ctrl.get("source_original_text")
-        if source_text:
-            h = hashlib.sha256(source_text.encode()).hexdigest()
-            chunk = self._rag_index.get(h)
-            if chunk and (chunk.article or chunk.paragraph):
-                return MatchResult(
-                    article=chunk.article or "",
-                    paragraph=chunk.paragraph or "",
-                    method="hash",
-                )
+        # Tier 1: Semantic search against the re-ingested, labeled chunks
+        embed = await self._embedding_match(ctrl)
+        if embed:
+            return embed
 
         # Tier 2: Regex parse concatenated source
         citation = ctrl.get("source_citation") or {}
@@ -278,11 +188,60 @@ class CitationBackfill:
|
||||
)
|
||||
|
||||
# Tier 3: Ollama LLM
|
||||
if source_text:
|
||||
if ctrl.get("source_original_text"):
|
||||
return await self._llm_match(ctrl)
|
||||
|
||||
return None
|
||||
|
||||
async def _embedding_match(self, ctrl: dict) -> Optional[MatchResult]:
|
||||
"""Tier 1: semantic-search source_original_text against the labeled chunks.
|
||||
|
||||
Takes the top hit (cosine >= EMBED_THRESHOLD) that carries a real article
|
||||
and turns its article_label into a precise citation.
|
||||
"""
|
||||
source_text = ctrl.get("source_original_text")
|
||||
if not source_text:
|
||||
return None
|
||||
query = source_text.strip()[:512]
|
||||
best: Optional[RAGSearchResult] = None
|
||||
for collection in self._collections_for(ctrl):
|
||||
try:
|
||||
hits = await self.rag.search(query, collection=collection, top_k=3)
|
||||
except Exception as e:
|
||||
logger.debug("embed search failed (%s): %s", collection, e)
|
||||
hits = []
|
||||
if hits and (best is None or hits[0].score > best.score):
|
||||
best = hits[0]
|
||||
if not best or best.score < EMBED_THRESHOLD:
|
||||
return None
|
||||
article = _article_part(best)
|
||||
if not article:
|
||||
return None
|
||||
return MatchResult(
|
||||
article=article,
|
||||
paragraph=best.paragraph or "",
|
||||
method="embed",
|
||||
source=best.regulation_short or best.regulation_name or "",
|
||||
)
|
||||
|
||||
    def _collections_for(self, ctrl: dict) -> list[str]:
        """Likely collection(s) for a control's regulation; falls back to all three."""
        meta = ctrl.get("generation_metadata") or {}
        reg = (meta.get("source_regulation") or "").lower()
        prefix_map = {
            "eu_": "bp_compliance_gesetze", "bdsg": "bp_compliance_gesetze",
            "de_": "bp_compliance_gesetze", "at_": "bp_compliance_gesetze",
            "ch_": "bp_compliance_gesetze", "dsgvo": "bp_compliance_gesetze",
            "trgs": "bp_compliance_ce", "trbs": "bp_compliance_ce", "asr": "bp_compliance_ce",
            "nist": "bp_compliance_ce", "owasp": "bp_compliance_ce", "enisa": "bp_compliance_ce",
            "edpb": "bp_compliance_datenschutz", "dsk": "bp_compliance_datenschutz",
            "bfdi": "bp_compliance_datenschutz",
        }
        for prefix, coll in prefix_map.items():
            if reg.startswith(prefix):
                return [coll]
        return list(RELINK_COLLECTIONS)
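The prefix routing can be tried out as a standalone function. This is a sketch under assumptions: the contents of `RELINK_COLLECTIONS` are inferred from the three collection names in the map and the "all three" wording of the docstring, and `PREFIX_MAP` is a shortened copy of the map above.

```python
# Assumed contents, inferred from the docstring; the real constant is not shown.
RELINK_COLLECTIONS = (
    "bp_compliance_gesetze",
    "bp_compliance_ce",
    "bp_compliance_datenschutz",
)

# Shortened copy of the prefix map, for illustration.
PREFIX_MAP = {
    "eu_": "bp_compliance_gesetze",
    "bdsg": "bp_compliance_gesetze",
    "nist": "bp_compliance_ce",
    "edpb": "bp_compliance_datenschutz",
}


def collections_for(source_regulation) -> list[str]:
    """Return the one collection matching the regulation prefix, else all of them."""
    reg = (source_regulation or "").lower()
    for prefix, collection in PREFIX_MAP.items():
        if reg.startswith(prefix):
            return [collection]
    return list(RELINK_COLLECTIONS)
```

A missing or unrecognized regulation code widens the search to every collection, which costs more queries but avoids missing a match.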
|
||||
|
||||
async def _llm_match(self, ctrl: dict) -> Optional[MatchResult]:
|
||||
"""Use Ollama to identify article/paragraph from source text."""
|
||||
citation = ctrl.get("source_citation") or {}
|
||||
@@ -331,6 +290,9 @@ Bei deutschen Gesetzen mit § verwende: "§ XX" statt "Art. XX"."""
|
||||
if parsed:
|
||||
citation["source"] = parsed["name"]
|
||||
|
||||
# Embed tier carries the cleaned regulation name → prefer it as source.
|
||||
if match.source:
|
||||
citation["source"] = match.source
|
||||
# Add separate article/paragraph fields
|
||||
citation["article"] = match.article
|
||||
citation["paragraph"] = match.paragraph
|
||||
@@ -359,6 +321,23 @@ Bei deutschen Gesetzen mit § verwende: "§ XX" statt "Art. XX"."""
|
||||
)
|
||||
|
||||
|
||||
def _article_part(chunk: RAGSearchResult) -> str:
|
||||
"""Precise article from a chunk: article_label minus the regulation name.
|
||||
|
||||
'BDSG § 38' -> '§ 38'; 'Art. 39 DSGVO' -> 'Art. 39'; 'NIST SP 800-53r5 SA-12' -> 'SA-12'.
|
||||
Falls back to the bare article field. Returns '' if only a doc-level name is present.
|
||||
"""
|
||||
label = (chunk.article_label or "").strip()
|
||||
reg = (chunk.regulation_short or "").strip()
|
||||
if label:
|
||||
part = label
|
||||
if reg and reg in label:
|
||||
part = label.replace(reg, "").strip(" ,;-")
|
||||
if part and part != reg:
|
||||
return part
|
||||
return (chunk.article or "").strip()
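The docstring examples of `_article_part` can be checked with a small standalone version. Indentation is lost in this view, so the nesting below (the `part != reg` check sitting directly under `if label:`) is an assumption, and the function takes plain strings instead of a `RAGSearchResult`.

```python
def article_part(article_label: str, regulation_short: str, article: str = "") -> str:
    """Strip the regulation name from an article label; fall back to `article`."""
    label = (article_label or "").strip()
    reg = (regulation_short or "").strip()
    if label:
        part = label
        if reg and reg in label:
            # Remove the regulation name, then trim leftover separators.
            part = label.replace(reg, "").strip(" ,;-")
        if part and part != reg:
            return part
    return (article or "").strip()
```

When the label consists only of the regulation name, nothing remains after stripping and the bare `article` value is used instead.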
|
||||
|
||||
|
||||
def _parse_concatenated_source(source: str) -> Optional[dict]:
|
||||
"""Parse 'DSGVO Art. 35' → {name: 'DSGVO', article: 'Art. 35'}.
|
||||
|
||||
|
||||
@@ -126,22 +126,29 @@ _ACTION_SYNONYMS: dict[str, str] = {
|
||||
|
||||
|
||||
def normalize_action(action: str) -> str:
|
||||
"""Normalize an action verb to a canonical English form."""
|
||||
"""Normalize an action verb to a canonical English form.
|
||||
|
||||
Delegates to DB-backed OntologyRegistry with dict fallback.
|
||||
"""
|
||||
try:
|
||||
from .ontology_registry import get_ontology_registry
|
||||
return get_ontology_registry().normalize_action(action)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Fallback: original logic
|
||||
if not action:
|
||||
return ""
|
||||
action = action.strip().lower()
|
||||
# Strip German infinitive/conjugation suffixes for lookup
|
||||
action_base = re.sub(r"(en|t|st|e|te|tet|end)$", "", action)
|
||||
# Try exact match first, then base form
|
||||
if action in _ACTION_SYNONYMS:
|
||||
return _ACTION_SYNONYMS[action]
|
||||
if action_base in _ACTION_SYNONYMS:
|
||||
return _ACTION_SYNONYMS[action_base]
|
||||
# Fuzzy: check if action starts with any known verb
|
||||
for verb, canonical in _ACTION_SYNONYMS.items():
|
||||
if action.startswith(verb) or verb.startswith(action):
|
||||
return canonical
|
||||
return action # fallback: return as-is
|
||||
return action
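The fallback lookup order (exact match, suffix-stripped base form, then prefix match) can be sketched as follows. The `ACTION_SYNONYMS` table here is hypothetical; the real `_ACTION_SYNONYMS` entries are not shown in this hunk.

```python
import re

# Hypothetical stand-in for _ACTION_SYNONYMS.
ACTION_SYNONYMS = {
    "implementier": "implement",
    "dokumentier": "document",
    "pruef": "test",
}

# Same suffix pattern as in the fallback code.
_SUFFIX_RE = re.compile(r"(en|t|st|e|te|tet|end)$")


def normalize_action(action: str) -> str:
    """Map a (German) verb form to its canonical English action."""
    if not action:
        return ""
    action = action.strip().lower()
    base = _SUFFIX_RE.sub("", action)  # drop one trailing inflection suffix
    if action in ACTION_SYNONYMS:
        return ACTION_SYNONYMS[action]
    if base in ACTION_SYNONYMS:
        return ACTION_SYNONYMS[base]
    for verb, canonical in ACTION_SYNONYMS.items():
        # Prefix match in either direction, as in the original fallback.
        if action.startswith(verb) or verb.startswith(action):
            return canonical
    return action
```

The two-way prefix test is permissive: a very short input such as a single letter matches the first table entry that starts with it, so callers may want to guard against short strings.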
|
||||
|
||||
|
||||
# ── Object Normalization ─────────────────────────────────────────────
|
||||
@@ -237,7 +244,19 @@ _OBJECT_KEYS_SORTED = sorted(_OBJECT_SYNONYMS.keys(), key=len, reverse=True)
|
||||
|
||||
|
||||
def normalize_object(obj: str) -> str:
|
||||
"""Normalize a compliance object to a canonical token."""
|
||||
"""Normalize a compliance object to a canonical token.
|
||||
|
||||
Delegates to DB-backed OntologyRegistry with dict fallback.
|
||||
"""
|
||||
# Try DB-backed registry first
|
||||
try:
|
||||
from .ontology_registry import get_ontology_registry
|
||||
result = get_ontology_registry().normalize_object(obj)
|
||||
if result != obj.strip().lower():
|
||||
return result
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if not obj:
|
||||
return ""
|
||||
obj_lower = obj.strip().lower()
|
||||
|
||||
@@ -25,8 +25,7 @@ import re
|
||||
import uuid
|
||||
from collections import defaultdict
|
||||
from dataclasses import dataclass, field, asdict
|
||||
from datetime import datetime, timezone
|
||||
from typing import Dict, List, Optional, Set
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
import httpx
|
||||
from pydantic import BaseModel
|
||||
@@ -34,7 +33,8 @@ from sqlalchemy import text
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from .rag_client import ComplianceRAGClient, RAGSearchResult, get_rag_client
|
||||
from .similarity_detector import check_similarity, SimilarityReport
|
||||
from .regulation_registry import get_registry as _get_regulation_registry
|
||||
from .similarity_detector import check_similarity
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -246,28 +246,21 @@ def _classify_regulation(regulation_code: str) -> dict:
|
||||
|
||||
Returns dict with keys: license, rule, name, source_type.
|
||||
source_type is one of: law, guideline, standard, restricted.
|
||||
|
||||
Delegates to DB-backed RegulationRegistry (with 5min cache).
|
||||
Falls back to REGULATION_LICENSE_MAP if DB is unavailable.
|
||||
"""
|
||||
code = regulation_code.lower().strip()
|
||||
registry = _get_regulation_registry()
|
||||
result = registry.classify_regulation(regulation_code)
|
||||
|
||||
# Exact match first
|
||||
if code in REGULATION_LICENSE_MAP:
|
||||
return REGULATION_LICENSE_MAP[code]
|
||||
# If registry returned the unknown fallback AND we have a local match,
|
||||
# prefer the local dict (graceful degradation during migration)
|
||||
if result.get("license") == "UNKNOWN":
|
||||
code = regulation_code.lower().strip()
|
||||
if code in REGULATION_LICENSE_MAP:
|
||||
return REGULATION_LICENSE_MAP[code]
|
||||
|
||||
# Prefix match for Rule 2 (ENISA = standard)
|
||||
for prefix in _RULE2_PREFIXES:
|
||||
if code.startswith(prefix):
|
||||
return {"license": "CC-BY-4.0", "rule": 2, "source_type": "standard",
|
||||
"name": "ENISA", "attribution": "ENISA, CC BY 4.0"}
|
||||
|
||||
# Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
|
||||
for prefix in _RULE3_PREFIXES:
|
||||
if code.startswith(prefix):
|
||||
return {"license": f"{prefix.rstrip('_').upper()}_RESTRICTED", "rule": 3,
|
||||
"source_type": "restricted", "name": "INTERNAL_ONLY"}
|
||||
|
||||
# Unknown → treat as restricted (safe default)
|
||||
logger.warning("Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code)
|
||||
return {"license": "UNKNOWN", "rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}
|
||||
return result
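The migration-time behaviour (registry first, local dict only when the registry reports UNKNOWN) can be sketched with an injected lookup. `LOCAL_MAP` and its single entry are hypothetical, and `registry_lookup` stands in for `registry.classify_regulation`.

```python
from typing import Callable

# Mirrors the unknown-regulation fallback defined in the registry module.
UNKNOWN = {
    "license": "UNKNOWN",
    "rule": 3,
    "source_type": "restricted",
    "name": "INTERNAL_ONLY",
}

# Hypothetical local map; values are illustrative only.
LOCAL_MAP = {
    "eu_2016_679": {"license": "EU_LAW", "rule": 1, "source_type": "law", "name": "DSGVO"},
}


def classify(regulation_code: str, registry_lookup: Callable[[str], dict]) -> dict:
    """Prefer the registry result; use the local dict only for UNKNOWN codes."""
    result = registry_lookup(regulation_code)
    if result.get("license") == "UNKNOWN":
        code = regulation_code.lower().strip()
        if code in LOCAL_MAP:
            return LOCAL_MAP[code]
    return result
```

Codes that neither source knows keep the restrictive Rule 3 default, so an unrecognized regulation is never treated as freely licensed.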
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
@@ -1019,11 +1012,12 @@ class ControlGeneratorPipeline:
|
||||
regulation_name=reg_name,
|
||||
regulation_short=reg_short,
|
||||
category=payload.get("category", "") or payload.get("data_type", ""),
|
||||
article=payload.get("article", "") or payload.get("section_title", "") or payload.get("section", ""),
|
||||
article=payload.get("section", "") or payload.get("article", "") or payload.get("section_title", ""),
|
||||
paragraph=payload.get("paragraph", ""),
|
||||
source_url=payload.get("source_url", "") or payload.get("source", "") or payload.get("url", ""),
|
||||
score=0.0,
|
||||
collection=collection,
|
||||
page=payload.get("page"),
|
||||
)
|
||||
all_results.append(chunk)
|
||||
collection_new += 1
|
||||
@@ -1127,6 +1121,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
|
||||
"source": canonical_source,
|
||||
"article": effective_article,
|
||||
"paragraph": effective_paragraph,
|
||||
"page": chunk.page,
|
||||
"license": license_info.get("license", ""),
|
||||
"source_type": license_info.get("source_type", "law"),
|
||||
"url": chunk.source_url or "",
|
||||
@@ -1141,6 +1136,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
|
||||
"source_regulation": chunk.regulation_code,
|
||||
"source_article": effective_article,
|
||||
"source_paragraph": effective_paragraph,
|
||||
"source_page": chunk.page,
|
||||
}
|
||||
return control
|
||||
|
||||
@@ -1194,6 +1190,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
|
||||
"source": canonical_source,
|
||||
"article": effective_article,
|
||||
"paragraph": effective_paragraph,
|
||||
"page": chunk.page,
|
||||
"license": license_info.get("license", ""),
|
||||
"license_notice": attribution,
|
||||
"source_type": license_info.get("source_type", "standard"),
|
||||
@@ -1209,6 +1206,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
|
||||
"source_regulation": chunk.regulation_code,
|
||||
"source_article": effective_article,
|
||||
"source_paragraph": effective_paragraph,
|
||||
"source_page": chunk.page,
|
||||
}
|
||||
return control
|
||||
|
||||
@@ -1368,6 +1366,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
|
||||
"source": canonical_source,
|
||||
"article": effective_article,
|
||||
"paragraph": effective_paragraph,
|
||||
"page": chunk.page,
|
||||
"license": lic.get("license", ""),
|
||||
"license_notice": lic.get("attribution", ""),
|
||||
"source_type": lic.get("source_type", "law"),
|
||||
@@ -1384,6 +1383,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
|
||||
"source_regulation": chunk.regulation_code,
|
||||
"source_article": effective_article,
|
||||
"source_paragraph": effective_paragraph,
|
||||
"source_page": chunk.page,
|
||||
"batch_size": len(chunks),
|
||||
"document_grouped": same_doc,
|
||||
}
|
||||
@@ -1479,14 +1479,14 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Aspekte ohne
|
||||
) -> list[Optional[GeneratedControl]]:
|
||||
"""Process a batch of (chunk, license_info) through stages 3-5."""
|
||||
# Split by license rule: Rule 1+2 → structure, Rule 3 → reform
|
||||
structure_items = [(c, l) for c, l in batch_items if l["rule"] in (1, 2)]
|
||||
reform_items = [(c, l) for c, l in batch_items if l["rule"] == 3]
|
||||
structure_items = [(c, lic) for c, lic in batch_items if lic["rule"] in (1, 2)]
|
||||
reform_items = [(c, lic) for c, lic in batch_items if lic["rule"] == 3]
|
||||
|
||||
all_controls: dict[int, Optional[GeneratedControl]] = {}
|
||||
|
||||
if structure_items:
|
||||
s_chunks = [c for c, _ in structure_items]
|
||||
s_lics = [l for _, l in structure_items]
|
||||
s_lics = [lic for _, lic in structure_items]
|
||||
try:
|
||||
s_controls = await self._structure_batch(s_chunks, s_lics)
|
||||
except Exception as e:
|
||||
|
||||
@@ -223,31 +223,43 @@ _FRAMEWORK_PATTERNS: list[str] = [
|
||||
|
||||
|
||||
def classify_action(text: str) -> str:
|
||||
"""Classify an obligation action text into a canonical action_type."""
|
||||
text_lower = text.lower().strip()
|
||||
"""Classify an obligation action text into a canonical action_type.
|
||||
|
||||
# Check negative patterns first
|
||||
Delegates to DB-backed OntologyRegistry (with 5min cache).
|
||||
Falls back to hardcoded dicts if DB is unavailable.
|
||||
"""
|
||||
try:
|
||||
from .ontology_registry import get_ontology_registry
|
||||
return get_ontology_registry().classify_action(text)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Fallback: original logic
|
||||
text_lower = text.lower().strip()
|
||||
for pattern, action_type in _NEGATIVE_PATTERNS:
|
||||
if pattern in text_lower:
|
||||
return action_type
|
||||
|
||||
# Direct alias match
|
||||
if text_lower in _ALIAS_TO_ACTION:
|
||||
return _ALIAS_TO_ACTION[text_lower]
|
||||
|
||||
# Substring match (longest first)
|
||||
best_match = ""
|
||||
best_action = "implement" # default fallback
|
||||
best_action = "implement"
|
||||
for alias, action_type in sorted(_ALIAS_TO_ACTION.items(), key=lambda x: -len(x[0])):
|
||||
if alias in text_lower and len(alias) > len(best_match):
|
||||
best_match = alias
|
||||
best_action = action_type
|
||||
|
||||
return best_action
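The three matching stages of the fallback (negative patterns, exact alias, longest contained alias) can be sketched with small tables. Both tables and the action names in them are hypothetical; the real `_NEGATIVE_PATTERNS` and `_ALIAS_TO_ACTION` are not shown here.

```python
# Hypothetical tables for illustration.
NEGATIVE_PATTERNS = [("darf nicht", "prevent"), ("must not", "prevent")]
ALIAS_TO_ACTION = {
    "review": "review",
    "log": "monitor",
    "catalog": "document",
}


def classify_action(text: str) -> str:
    """Classify free text into an action type, defaulting to 'implement'."""
    text_lower = text.lower().strip()
    # 1. Negative patterns win over everything else.
    for pattern, action_type in NEGATIVE_PATTERNS:
        if pattern in text_lower:
            return action_type
    # 2. Exact alias match.
    if text_lower in ALIAS_TO_ACTION:
        return ALIAS_TO_ACTION[text_lower]
    # 3. Longest alias contained in the text.
    best_match, best_action = "", "implement"
    for alias, action_type in ALIAS_TO_ACTION.items():
        if alias in text_lower and len(alias) > len(best_match):
            best_match, best_action = alias, action_type
    return best_action
```

The sketch omits the sort used in the original: the strict length comparison already selects a longest matching alias, so sorting by length first should not change the result.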
|
||||
|
||||
|
||||
def get_phase(action_type: str) -> str:
|
||||
"""Get the control_phase for an action_type."""
|
||||
"""Get the control_phase for an action_type.
|
||||
|
||||
Delegates to DB-backed OntologyRegistry with dict fallback.
|
||||
"""
|
||||
try:
|
||||
from .ontology_registry import get_ontology_registry
|
||||
return get_ontology_registry().get_phase(action_type)
|
||||
except Exception:
|
||||
pass
|
||||
info = ACTION_TYPES.get(action_type, {})
|
||||
return info.get("phase", "implementation")
|
||||
|
||||
|
||||
@@ -24,7 +24,6 @@ import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import uuid
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Optional
|
||||
|
||||
@@ -56,7 +55,7 @@ ANTHROPIC_API_URL = "https://api.anthropic.com/v1"
|
||||
# Patterns are defined in normative_patterns.py and imported here
|
||||
# with local aliases for backward compatibility.
|
||||
|
||||
from .normative_patterns import (
|
||||
from .normative_patterns import ( # noqa: E402
|
||||
PFLICHT_RE as _PFLICHT_RE,
|
||||
EMPFEHLUNG_RE as _EMPFEHLUNG_RE,
|
||||
KANN_RE as _KANN_RE,
|
||||
@@ -461,12 +460,50 @@ WICHTIGE REGELN:
|
||||
|
||||
7. MERGE-KEY: Erzeuge im JSON-Output ein zusaetzliches Feld "merge_key" mit
|
||||
dem Format: "action_type:normalized_object:control_phase"
|
||||
|
||||
WICHTIG: Waehle normalized_object NUR aus dieser Liste kanonischer Tokens:
|
||||
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
|
||||
privileged_access, access_control, encryption, transport_encryption,
|
||||
key_management, certificate_management, network_security, network_segmentation,
|
||||
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
|
||||
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
|
||||
physical_security, secure_development, api_security, input_validation,
|
||||
container_security, logging_configuration
|
||||
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
|
||||
data_subject_rights, data_retention, data_transfer, data_breach_notification,
|
||||
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
|
||||
data_classification, cookie_consent, video_surveillance
|
||||
GOVERNANCE: policy, procedure, process, training, awareness, incident,
|
||||
risk_management, third_party_management, change_management, documentation,
|
||||
records_management, compliance_reporting, asset_management,
|
||||
human_resources_security
|
||||
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
|
||||
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
|
||||
telecommunications, medical_device, payment_services, critical_infrastructure,
|
||||
supply_chain_due_diligence, sustainability_reporting
|
||||
|
||||
Wenn KEIN Token passt: "OTHER:kurzbeschreibung" (z.B. "OTHER:battery_recycling")
|
||||
|
||||
ABGRENZUNGEN (haeufige Fehler vermeiden!):
|
||||
- monitoring = NUR kontinuierliche Echtzeit-Ueberwachung von Systemen
|
||||
- audit_logging = Protokollierung, Audit Trail, Nachvollziehbarkeit
|
||||
- compliance_audit = externe Pruefungen, Zertifizierungsaudits
|
||||
- training = Schulungen DURCHFUEHREN (nicht "ueberwachen")
|
||||
- procedure = Verfahren DEFINIEREN (nicht Incident-Behandlung)
|
||||
- incident = Sicherheitsvorfaelle BEHANDELN
|
||||
- alerting = Meldepflichten und Benachrichtigungen
|
||||
- personal_data = DSGVO-Verarbeitungsgrundsaetze (nicht Zertifizierung!)
|
||||
- certification = Zertifizierung/Konformitaet (nicht Datenschutz)
|
||||
|
||||
Beispiele:
|
||||
- "implement:api_rate_limiting:implementation"
|
||||
- "define:access_control_policy:definition"
|
||||
- "monitor:third_party_vulnerabilities:monitoring"
|
||||
- "test:authentication_mechanism:testing"
|
||||
- "implement:multi_factor_auth:implementation"
|
||||
- "define:access_control:definition"
|
||||
- "monitor:network_security:monitoring"
|
||||
- "test:vulnerability:testing"
|
||||
- "report:supervisory_authority:reporting"
|
||||
- "implement:audit_logging:implementation" (NICHT monitoring!)
|
||||
- "define:incident:definition" (Incident-Verfahren, NICHT procedure!)
|
||||
- "train:training:operation" (Schulung, NICHT monitoring!)
|
||||
|
||||
8. APPLICABILITY + SCANNER: Bestimme fuer jedes Control:
|
||||
- applicability: Unter welchen Bedingungen gilt dieses Control?
|
||||
@@ -2473,6 +2510,81 @@ def _ensure_list(val) -> list:
|
||||
return []
|
||||
|
||||
|
||||
# Canonical object tokens from object_ontology (loaded once)
|
||||
_CANONICAL_OBJECTS: set[str] | None = None
|
||||
|
||||
|
||||
def _load_canonical_objects() -> set[str]:
|
||||
"""Load canonical tokens from DB, fallback to hardcoded set."""
|
||||
global _CANONICAL_OBJECTS
|
||||
if _CANONICAL_OBJECTS is not None:
|
||||
return _CANONICAL_OBJECTS
|
||||
try:
|
||||
from db.session import get_engine
|
||||
from sqlalchemy import text
|
||||
engine = get_engine()
|
||||
with engine.connect() as c:
|
||||
rows = c.execute(text(
|
||||
"SELECT canonical_token FROM compliance.object_ontology"
|
||||
)).fetchall()
|
||||
_CANONICAL_OBJECTS = {r[0] for r in rows}
|
||||
except Exception:
|
||||
_CANONICAL_OBJECTS = set()
|
||||
if not _CANONICAL_OBJECTS:
|
||||
_CANONICAL_OBJECTS = {
|
||||
"multi_factor_auth", "password_policy", "credentials",
|
||||
"session_management", "privileged_access", "access_control",
|
||||
"encryption", "transport_encryption", "key_management",
|
||||
"certificate_management", "network_security",
|
||||
"network_segmentation", "firewall", "vpn", "remote_access",
|
||||
"monitoring", "audit_logging", "siem", "alerting",
|
||||
"compliance_audit", "vulnerability", "patch_management",
|
||||
"backup", "disaster_recovery", "personal_data",
|
||||
"sensitive_data", "consent", "data_subject_rights",
|
||||
"data_retention", "data_transfer", "data_breach_notification",
|
||||
"dpia", "data_processing_agreement", "privacy_by_design",
|
||||
"policy", "procedure", "process", "training", "awareness",
|
||||
"incident", "risk_management", "third_party_management",
|
||||
"change_management", "documentation", "supervisory_authority",
|
||||
"certification", "product_safety", "ai_system", "aml",
|
||||
"critical_infrastructure", "medical_device",
|
||||
}
|
||||
return _CANONICAL_OBJECTS
|
||||
|
||||
|
||||
def _validate_merge_key(merge_key: str) -> str:
    """Validate merge_key object against canonical ontology.

    Returns the merge_key (possibly corrected). Unknown objects are logged
    at debug level so they can be tracked.
    """
    parts = merge_key.split(":", 2)
    if len(parts) < 2:
        return merge_key

    action, obj = parts[0], parts[1]
    phase = parts[2] if len(parts) > 2 else "implementation"

    # Accept the OTHER marker (LLM signaling unknown object). Splitting on ":"
    # leaves the bare "OTHER" in obj and moves its description into phase.
    if obj == "OTHER" or obj.startswith("OTHER:"):
        return merge_key

    # Check against canonical ontology
    canonical = _load_canonical_objects()
    if obj in canonical:
        return merge_key

    # Try normalize_object() as fallback
    from services.control_dedup import normalize_object
    normed = normalize_object(obj)
    if normed in canonical:
        return f"{action}:{normed}:{phase}"

    # Unknown object: log and keep as-is (will be clustered by embedding)
    logger.debug("merge_key unknown object: %s (normed: %s)", obj, normed)
    return merge_key
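The validation flow for an `action:object:phase` key can be exercised with a reduced ontology. This is a sketch under assumptions: `CANONICAL` is a small subset of the fallback token set, and `SYNONYMS` is a hypothetical stand-in for `normalize_object()`. Note that splitting on ":" separates the `OTHER` marker from its description, so the sketch tests for the bare marker.

```python
# Small subset of the canonical tokens, for illustration.
CANONICAL = {"multi_factor_auth", "access_control", "audit_logging"}

# Hypothetical stand-in for normalize_object().
SYNONYMS = {"mfa": "multi_factor_auth"}


def validate_merge_key(merge_key: str) -> str:
    """Return the key unchanged, or with its object mapped to a canonical token."""
    parts = merge_key.split(":", 2)
    if len(parts) < 2:
        return merge_key
    action, obj = parts[0], parts[1]
    phase = parts[2] if len(parts) > 2 else "implementation"
    if obj == "OTHER":  # "x:OTHER:desc:phase" splits into ["x", "OTHER", "desc:phase"]
        return merge_key
    if obj in CANONICAL:
        return merge_key
    normed = SYNONYMS.get(obj.strip().lower(), obj)
    if normed in CANONICAL:
        return f"{action}:{normed}:{phase}"
    return merge_key  # unknown object: keep as-is
```

Keys with an unknown object pass through untouched, which keeps the LLM output intact for later clustering rather than forcing a wrong canonical token.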
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Decomposition Pass
|
||||
# ---------------------------------------------------------------------------
|
||||
@@ -3026,10 +3138,10 @@ class DecompositionPass:
|
||||
evidence_type=parsed.get("evidence_type", ""),
|
||||
provides_context=_ensure_list(parsed.get("provides_context", [])),
|
||||
)
|
||||
# Store merge_key from LLM output in metadata
|
||||
# Store merge_key from LLM output in metadata — with validation
|
||||
llm_merge_key = parsed.get("merge_key", "")
|
||||
if llm_merge_key:
|
||||
atomic.merge_group_hint = llm_merge_key
|
||||
atomic.merge_group_hint = _validate_merge_key(llm_merge_key)
|
||||
|
||||
atomic.parent_control_uuid = obl["parent_uuid"]
|
||||
atomic.obligation_candidate_id = obl["candidate_id"]
|
||||
@@ -3472,7 +3584,7 @@ class DecompositionPass:
|
||||
"category": atomic.category,
|
||||
"parent_uuid": parent_uuid,
|
||||
"gen_meta": json.dumps({
|
||||
"decomposition_source": candidate_id,
|
||||
"decomposition_source_id": candidate_id,
|
||||
"decomposition_method": "pass0b",
|
||||
"engine_version": "v2",
|
||||
"action_object_class": getattr(atomic, "domain", ""),
|
||||
@@ -4104,6 +4216,8 @@ def _format_citation(citation) -> str:
|
||||
parts.append(c["article"])
|
||||
if c.get("paragraph"):
|
||||
parts.append(c["paragraph"])
|
||||
if c.get("page") is not None:
|
||||
parts.append(f"S. {c['page']}")
|
||||
return " ".join(parts) if parts else citation
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
return citation
|
||||
|
||||
@@ -0,0 +1,84 @@
|
||||
"""Shared embedding + sub-clustering utilities for the control pipeline."""
|
||||
|
||||
import logging
|
||||
import os
|
||||
from collections import defaultdict
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from sklearn.cluster import MiniBatchKMeans
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
EMBEDDING_URL = os.getenv(
|
||||
"EMBEDDING_SERVICE_URL", "http://embedding-service:8087"
|
||||
)
|
||||
|
||||
|
||||
def embed_texts(texts: list[str]) -> np.ndarray | None:
    """Embed texts via the embedding-service in batches of 64.

    Each batch is tried up to three times. A batch that still fails is
    logged and its rows stay zero-filled.
    """
    import time  # local import: this module does not import time at the top

    try:
        result = np.zeros((len(texts), 1024), dtype=np.float32)
        batch_size = 64
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            for attempt in range(3):
                try:
                    with httpx.Client(
                        timeout=httpx.Timeout(60.0, connect=10.0)
                    ) as client:
                        resp = client.post(
                            f"{EMBEDDING_URL}/embed", json={"texts": batch}
                        )
                        resp.raise_for_status()
                        embs = resp.json().get("embeddings", [])
                        end = min(i + len(embs), len(texts))
                        result[i:end] = np.array(embs, dtype=np.float32)
                        break
                except Exception as e:
                    if attempt == 2:
                        logger.error("Embed batch %d failed: %s", i, e)
                    else:
                        time.sleep(2)  # back off only when another attempt follows
        return result
    except Exception as e:
        logger.error("Embedding failed: %s", e)
        return None
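The batching and retry pattern can be sketched without a network dependency by injecting the transport. Everything here is illustrative: `embed_in_batches`, `post_batch`, and `flaky_post` are hypothetical names, and unlike the function above this sketch returns `None` as soon as one batch exhausts its retries instead of leaving zero rows.

```python
import time
from typing import Callable, Optional


def embed_in_batches(
    texts: list[str],
    post_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list[list[float]]]:
    """Embed texts batch by batch, retrying each batch a few times."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        for attempt in range(attempts):
            try:
                vectors.extend(post_batch(batch))
                break
            except Exception:
                if attempt == attempts - 1:
                    return None  # retries exhausted: let the caller fall back
                sleep(delay)  # wait only when another attempt follows
    return vectors


# Demo with a fake transport that fails on its first call only.
calls: list[int] = []


def flaky_post(batch: list[str]) -> list[list[float]]:
    calls.append(len(batch))
    if len(calls) == 1:
        raise RuntimeError("simulated network error")
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(
    ["a", "bb", "ccc"], flaky_post, batch_size=2, sleep=lambda _: None
)
```

Returning `None` on a failed batch lets a caller such as `subcluster_controls` take its naive-chunking path rather than cluster on partially zero vectors; whether that trade-off suits the pipeline is a design decision for the authors.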
|
||||
|
||||
|
||||
def subcluster_controls(
|
||||
controls: list[dict], target_size: int = 50
|
||||
) -> list[list[dict]]:
|
||||
"""Sub-cluster controls by embedding similarity.
|
||||
|
||||
Returns a list of clusters. Falls back to naive chunking
|
||||
if embedding fails.
|
||||
"""
|
||||
if len(controls) <= target_size:
|
||||
return [controls]
|
||||
|
||||
texts = [c.get("title", "") or c.get("control_id", "") for c in controls]
|
||||
embeddings = embed_texts(texts)
|
||||
if embeddings is None:
|
||||
return [
|
||||
controls[i : i + target_size]
|
||||
for i in range(0, len(controls), target_size)
|
||||
]
|
||||
|
||||
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
|
||||
norms[norms == 0] = 1
|
||||
normalized = embeddings / norms
|
||||
|
||||
k = max(2, min(len(controls) // target_size, 30))
|
||||
kmeans = MiniBatchKMeans(
|
||||
n_clusters=k,
|
||||
batch_size=min(100, len(controls)),
|
||||
max_iter=50,
|
||||
random_state=42,
|
||||
)
|
||||
labels = kmeans.fit_predict(normalized)
|
||||
|
||||
clusters: dict[int, list[dict]] = defaultdict(list)
|
||||
for i, ctrl in enumerate(controls):
|
||||
clusters[int(labels[i])].append(ctrl)
|
||||
return list(clusters.values())
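The cluster-count formula and the naive fallback can be checked with plain Python. The helper names `choose_k` and `naive_chunks` are hypothetical; the arithmetic is taken from the function above.

```python
def choose_k(n_controls: int, target_size: int = 50) -> int:
    """Number of k-means clusters: at least 2, at most 30."""
    return max(2, min(n_controls // target_size, 30))


def naive_chunks(items: list, target_size: int = 50) -> list[list]:
    """Fallback used when embedding fails: fixed-size consecutive slices."""
    return [items[i : i + target_size] for i in range(0, len(items), target_size)]
```

Because `k` is capped at 30, `target_size` is only a hint for large inputs: 5000 controls yield 30 clusters, about 167 controls each on average, and k-means does not bound the size of any single cluster.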
|
||||
@@ -0,0 +1,217 @@
|
||||
"""
|
||||
DB-backed Action & Object Ontology Registry with in-memory cache.
|
||||
|
||||
Replaces hardcoded ACTION_TYPES, _NEGATIVE_PATTERNS, _ACTION_SYNONYMS,
|
||||
and _OBJECT_SYNONYMS with PostgreSQL tables.
|
||||
|
||||
Cache TTL: 5 minutes. Thread-safe via simple timestamp check.
|
||||
Falls back to hardcoded dicts if DB is unavailable.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import re
|
||||
import time
|
||||
from typing import Optional
|
||||
|
||||
from sqlalchemy import text
|
||||
from sqlalchemy.exc import SQLAlchemyError
|
||||
|
||||
from db.session import SessionLocal
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
_CACHE_TTL_SECONDS = 300 # 5 minutes
|
||||
|
||||
|
||||
class OntologyRegistry:
|
||||
"""In-memory cache of action_types, action_synonyms, and object_synonyms."""
|
||||
|
||||
def __init__(self):
|
||||
# Action types: canonical_name → phase
|
||||
self._action_phases: dict[str, str] = {}
|
||||
# Alias → canonical action (for classify_action)
|
||||
self._alias_to_action: dict[str, str] = {}
|
||||
# Negative patterns: [(pattern, action_type)] ordered longest first
|
||||
self._negative_patterns: list[tuple[str, str]] = []
|
||||
# Action synonyms for dedup: synonym → canonical (for normalize_action)
|
||||
self._action_synonyms: dict[str, str] = {}
|
||||
# Object synonyms: synonym → canonical_token (for normalize_object)
|
||||
self._object_synonyms: dict[str, str] = {}
|
||||
# Sorted object keys (longest first) for substring matching
|
||||
self._object_keys_sorted: list[str] = []
|
||||
self._loaded_at: float = 0.0
|
||||
|
||||
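    def _is_stale(self) -> bool:
        # time.monotonic() has an undefined reference point, so an empty cache
        # must count as stale regardless of the elapsed-time comparison.
        return not self.is_loaded or (
            time.monotonic() - self._loaded_at
        ) > _CACHE_TTL_SECONDS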
|
||||
|
||||
def _load(self) -> bool:
|
||||
"""Load all ontology data from DB into memory."""
|
||||
try:
|
||||
db = SessionLocal()
|
||||
try:
|
||||
return self._load_from_db(db)
|
||||
finally:
|
||||
db.close()
|
||||
except SQLAlchemyError:
|
||||
logger.warning(
|
||||
"Failed to load ontology from DB — using stale cache",
|
||||
exc_info=True,
|
||||
)
|
||||
return False
|
||||
|
||||
def _load_from_db(self, db) -> bool:
|
||||
"""Load from DB session."""
|
||||
# 1. Action types
|
||||
rows = db.execute(text(
|
||||
"SELECT canonical_name, phase FROM action_types"
|
||||
)).fetchall()
|
||||
action_phases = {r[0]: r[1] for r in rows}
|
||||
|
||||
# 2. Action synonyms (aliases + negative patterns)
|
||||
rows = db.execute(text(
|
||||
"SELECT canonical_action, synonym, pattern_type FROM action_synonyms"
|
||||
)).fetchall()
|
||||
|
||||
alias_to_action: dict[str, str] = {}
|
||||
negative_patterns: list[tuple[str, str]] = []
|
||||
action_synonyms: dict[str, str] = {}
|
||||
|
||||
for canonical, synonym, ptype in rows:
|
||||
if ptype == "negative_pattern":
|
||||
negative_patterns.append((synonym, canonical))
|
||||
else:
|
||||
alias_to_action[synonym] = canonical
|
||||
action_synonyms[synonym] = canonical
|
||||
|
||||
# Sort negative patterns: longest first (for priority matching)
|
||||
negative_patterns.sort(key=lambda x: -len(x[0]))
|
||||
|
||||
# 3. Object synonyms
|
||||
rows = db.execute(text(
|
||||
"SELECT canonical_token, synonym FROM object_synonyms"
|
||||
)).fetchall()
|
||||
object_synonyms = {r[1]: r[0] for r in rows}
|
||||
object_keys_sorted = sorted(object_synonyms.keys(), key=len, reverse=True)
|
||||
|
||||
# Commit to cache
|
||||
self._action_phases = action_phases
|
||||
self._alias_to_action = alias_to_action
|
||||
self._negative_patterns = negative_patterns
|
||||
self._action_synonyms = action_synonyms
|
||||
self._object_synonyms = object_synonyms
|
||||
self._object_keys_sorted = object_keys_sorted
|
||||
self._loaded_at = time.monotonic()
|
||||
|
||||
logger.info(
|
||||
"Ontology loaded: %d action_types, %d aliases, %d neg_patterns, %d object_synonyms",
|
||||
len(action_phases), len(alias_to_action),
|
||||
len(negative_patterns), len(object_synonyms),
|
||||
)
|
||||
return True
|
||||
|
||||
@property
|
||||
def is_loaded(self) -> bool:
|
||||
"""True if the cache has any data."""
|
||||
return len(self._action_phases) > 0
|
||||
|
||||
def _ensure_loaded(self) -> None:
|
||||
if self._is_stale():
|
||||
self._load()
|
||||
if not self.is_loaded:
|
||||
raise RuntimeError("OntologyRegistry has no data")
|
||||
|
||||
# ── Action Classification (replaces control_ontology.classify_action) ──
|
||||
|
||||
def classify_action(self, text_input: str) -> str:
|
||||
"""Classify text into a canonical action_type."""
|
||||
self._ensure_loaded()
|
||||
text_lower = text_input.lower().strip()
|
||||
|
||||
# Check negative patterns first
|
||||
for pattern, action_type in self._negative_patterns:
|
||||
if pattern in text_lower:
|
||||
return action_type
|
||||
|
||||
# Direct alias match
|
||||
if text_lower in self._alias_to_action:
|
||||
return self._alias_to_action[text_lower]
|
||||
|
||||
# Substring match (longest first)
|
||||
best_match = ""
|
||||
best_action = "implement"
|
||||
for alias, action_type in sorted(
|
||||
self._alias_to_action.items(), key=lambda x: -len(x[0])
|
||||
):
|
||||
if alias in text_lower and len(alias) > len(best_match):
|
||||
best_match = alias
|
||||
best_action = action_type
|
||||
|
||||
return best_action
|
||||
|
||||
def get_phase(self, action_type: str) -> str:
|
||||
"""Get the control_phase for an action_type."""
|
||||
self._ensure_loaded()
|
||||
return self._action_phases.get(action_type, "implementation")
|
||||
|
||||
# ── Action Normalization (replaces control_dedup.normalize_action) ──
|
||||
|
||||
def normalize_action(self, action: str) -> str:
|
||||
"""Normalize an action verb to a canonical English form."""
|
||||
self._ensure_loaded()
|
||||
if not action:
|
||||
return ""
|
||||
action = action.strip().lower()
|
||||
action_base = re.sub(r"(en|t|st|e|te|tet|end)$", "", action)
|
||||
|
||||
if action in self._action_synonyms:
|
||||
return self._action_synonyms[action]
|
||||
if action_base in self._action_synonyms:
|
||||
return self._action_synonyms[action_base]
|
||||
|
||||
for verb, canonical in self._action_synonyms.items():
|
||||
if action.startswith(verb) or verb.startswith(action):
|
||||
return canonical
|
||||
|
||||
return action
|
||||
|
||||
# ── Object Normalization (replaces control_dedup.normalize_object) ──
|
||||
|
||||
def normalize_object(self, obj: str) -> str:
|
||||
"""Normalize an object to a canonical token."""
|
||||
self._ensure_loaded()
|
||||
if not obj:
|
||||
return ""
|
||||
obj_lower = obj.strip().lower()
|
||||
|
||||
# Exact match
|
||||
if obj_lower in self._object_synonyms:
|
||||
return self._object_synonyms[obj_lower]
|
||||
|
||||
# Substring match (longest phrase first)
|
||||
for phrase in self._object_keys_sorted:
|
||||
if phrase in obj_lower:
|
||||
return self._object_synonyms[phrase]
|
||||
|
||||
return obj_lower
|
||||
|
||||
def get_action_types(self) -> dict[str, str]:
|
||||
"""Return all action_type → phase mappings."""
|
||||
self._ensure_loaded()
|
||||
return dict(self._action_phases)
|
||||
|
||||
def get_object_synonyms(self) -> dict[str, str]:
|
||||
"""Return all object synonym → canonical mappings."""
|
||||
self._ensure_loaded()
|
||||
return dict(self._object_synonyms)
|
||||
|
||||
|
||||
# Module-level singleton
|
||||
_registry: Optional[OntologyRegistry] = None
|
||||
|
||||
|
||||
def get_ontology_registry() -> OntologyRegistry:
|
||||
"""Get or create the singleton OntologyRegistry instance."""
|
||||
global _registry
|
||||
if _registry is None:
|
||||
_registry = OntologyRegistry()
|
||||
return _registry
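The caching behaviour of the registry (reload after the TTL, keep the old snapshot when a reload fails) can be sketched with an injected clock so it is testable without waiting. `TTLCache`, `fake_loader`, and the demo values are hypothetical; only the 300-second TTL comes from the module.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Minimal time-based cache around a loader function."""

    def __init__(
        self,
        loader: Callable[[], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: dict = {}
        self._loaded_at: Optional[float] = None  # None means "never loaded"

    def _is_stale(self) -> bool:
        return self._loaded_at is None or (self._clock() - self._loaded_at) > self._ttl

    def get(self) -> dict:
        if self._is_stale():
            try:
                self._data = self._loader()
                self._loaded_at = self._clock()
            except Exception:
                pass  # keep serving the previous snapshot
        if not self._data:
            raise RuntimeError("cache has no data")
        return self._data


# Demo with a fake clock and a loader that counts its calls.
now = [0.0]
load_count = [0]


def fake_loader() -> dict:
    load_count[0] += 1
    return {"implement": "implementation"}


cache = TTLCache(fake_loader, ttl_seconds=300.0, clock=lambda: now[0])
cache.get()
cache.get()
loads_within_ttl = load_count[0]
now[0] = 301.0
cache.get()
loads_after_ttl = load_count[0]
```

Using `None` for "never loaded" avoids comparing a monotonic clock reading against a literal zero. The sketch takes no lock, so concurrent callers may trigger duplicate reloads.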
|
||||
@@ -33,7 +33,9 @@ class RAGSearchResult:
    paragraph: str
    source_url: str
    score: float
    article_label: str = ""
    collection: str = ""
    page: Optional[int] = None


class ComplianceRAGClient:
@@ -89,6 +91,7 @@ class ComplianceRAGClient:
                regulation_short=r.get("regulation_short", ""),
                category=r.get("category", ""),
                article=r.get("article", ""),
                article_label=r.get("article_label", ""),
                paragraph=r.get("paragraph", ""),
                source_url=r.get("source_url", ""),
                score=r.get("score", 0.0),
@@ -170,6 +173,7 @@ class ComplianceRAGClient:
                regulation_short=r.get("regulation_short", ""),
                category=r.get("category", ""),
                article=r.get("article", ""),
                article_label=r.get("article_label", ""),
                paragraph=r.get("paragraph", ""),
                source_url=r.get("source_url", ""),
                score=0.0,

@@ -0,0 +1,220 @@
"""
DB-backed Regulation Registry with in-memory cache.

Replaces hardcoded REGULATION_LICENSE_MAP and SOURCE_REGULATION_CLASSIFICATION
with a single PostgreSQL table (compliance.regulation_registry).

Cache TTL: 5 minutes, checked against a monotonic timestamp. No lock is taken,
so concurrent callers may trigger overlapping reloads.
If the DB is unavailable, lookups fall back to the stale cache and the prefix
rules below (graceful degradation).
"""

import logging
import time
from typing import Optional

from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError

from db.session import SessionLocal

logger = logging.getLogger(__name__)

_CACHE_TTL_SECONDS = 300  # 5 minutes

# Prefix-based fallback rules (unchanged from original logic)
_RULE2_PREFIXES = ("enisa_",)
_RULE3_PREFIXES = ("bsi_", "iso_", "etsi_")

# Fallback for unknown regulations
_UNKNOWN_REGULATION = {
    "license": "UNKNOWN",
    "rule": 3,
    "source_type": "restricted",
    "name": "INTERNAL_ONLY",
    "attribution": None,
}


class RegulationRegistry:
    """In-memory cache of the regulation_registry table.

    Provides two lookup modes:
    1. classify_regulation(regulation_code): replaces REGULATION_LICENSE_MAP[code]
    2. source_type_by_name(name): replaces SOURCE_REGULATION_CLASSIFICATION[name]
    """

    def __init__(self):
        self._by_code: dict[str, dict] = {}
        self._by_name: dict[str, str] = {}
        # None means "never loaded". time.monotonic() has an undefined reference
        # point, so 0.0 is not a safe marker for that state.
        self._loaded_at: Optional[float] = None

    def _is_stale(self) -> bool:
        if self._loaded_at is None:
            return True
        return (time.monotonic() - self._loaded_at) > _CACHE_TTL_SECONDS

    def _load(self) -> bool:
        """Load all rows from regulation_registry into memory."""
        try:
            db = SessionLocal()
            try:
                # The table name is unqualified here, so this relies on the
                # session's search_path resolving it to the compliance schema.
                rows = db.execute(
                    text("""
                        SELECT regulation_id, regulation_name_de, license_rule,
                               license_type, attribution, source_type, jurisdiction,
                               status
                        FROM regulation_registry
                        WHERE status != 'deprecated'
                    """)
                ).fetchall()
            finally:
                db.close()

            by_code: dict[str, dict] = {}
            by_name: dict[str, str] = {}

            for row in rows:
                entry = {
                    "license": row[3] or "",        # license_type
                    "rule": row[2],                 # license_rule
                    "source_type": row[5] or "law", # source_type
                    "name": row[1] or row[0],       # regulation_name_de or regulation_id
                    "attribution": row[4],          # attribution
                    "jurisdiction": row[6],         # jurisdiction
                }
                by_code[row[0].lower()] = entry

                # Also index by name for source_type lookups
                if row[1]:
                    by_name[row[1]] = row[5] or "law"

            # Swap in the freshly built dicts only after a complete, successful read.
            self._by_code = by_code
            self._by_name = by_name
            self._loaded_at = time.monotonic()
            logger.info(
                "Regulation registry loaded: %d entries by code, %d by name",
                len(by_code), len(by_name),
            )
            return True

        except SQLAlchemyError:
            logger.warning(
                "Failed to load regulation_registry from DB — using stale cache",
                exc_info=True,
            )
            return False

    def _ensure_loaded(self) -> None:
        """Reload cache if stale."""
        if self._is_stale():
            self._load()

    def classify_regulation(self, regulation_code: str) -> dict:
        """Look up license info for a regulation_code.

        Returns dict with keys: license, rule, name, source_type, attribution
        (DB-backed entries also carry jurisdiction).
        Equivalent to the old _classify_regulation() function.
        """
        self._ensure_loaded()
        code = regulation_code.lower().strip()

        # Exact match from DB (return a copy so callers cannot mutate the cache)
        if code in self._by_code:
            return dict(self._by_code[code])

        # Prefix match for Rule 2 (ENISA = standard)
        for prefix in _RULE2_PREFIXES:
            if code.startswith(prefix):
                return {
                    "license": "CC-BY-4.0",
                    "rule": 2,
                    "source_type": "standard",
                    "name": "ENISA",
                    "attribution": "ENISA, CC BY 4.0",
                }

        # Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
        for prefix in _RULE3_PREFIXES:
            if code.startswith(prefix):
                return {
                    "license": f"{prefix.rstrip('_').upper()}_RESTRICTED",
                    "rule": 3,
                    "source_type": "restricted",
                    "name": "INTERNAL_ONLY",
                    "attribution": None,
                }

        # Unknown → restricted (safe default)
        logger.warning(
            "Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code
        )
        return dict(_UNKNOWN_REGULATION)

    def source_type_by_name(self, source_regulation: str) -> str:
        """Look up source_type by regulation display name.

        Equivalent to old classify_source_regulation().
        Falls back to heuristic for unknown names.
        """
        self._ensure_loaded()

        if not source_regulation:
            return "framework"

        # Exact match from DB
        if source_regulation in self._by_name:
            return self._by_name[source_regulation]

        # Heuristic fallback for unknown sources.
        # Note: these are plain substring tests, so short tokens such as "act"
        # also match inside longer words.
        lower = source_regulation.lower()

        law_indicators = [
            "verordnung", "richtlinie", "gesetz", "directive", "regulation",
            "(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
        ]
        if any(ind in lower for ind in law_indicators):
            return "law"

        guideline_indicators = [
            "edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
        ]
        if any(ind in lower for ind in guideline_indicators):
            return "guideline"

        # Same result as the default below; kept so the framework keywords stay documented.
        framework_indicators = [
            "enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
        ]
        if any(ind in lower for ind in framework_indicators):
            return "framework"

        return "framework"

    def get_all(self) -> dict[str, dict]:
        """Return all cached entries (by regulation_code)."""
        self._ensure_loaded()
        return dict(self._by_code)

    def is_open_source(self, regulation_code: str) -> bool:
        """Check if regulation is Rule 1 or 2 (safe to reference)."""
        info = self.classify_regulation(regulation_code)
        return info["rule"] in (1, 2)
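The TTL check in `RegulationRegistry` can be illustrated with a small stand-alone cache. This is a sketch under stated assumptions, not the module's code: the `TTLCache` class, its `loader` callback, and the injectable `clock` are hypothetical names introduced so the behaviour can be exercised without a database. It uses `None` rather than `0.0` as the "never loaded" marker, because the reference point of `time.monotonic()` is undefined.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Reload-on-read cache that keeps serving old data when a reload fails."""

    def __init__(
        self,
        loader: Callable[[], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: dict = {}
        self._loaded_at: Optional[float] = None  # None = never loaded

    def _is_stale(self) -> bool:
        if self._loaded_at is None:
            return True
        return (self._clock() - self._loaded_at) > self._ttl

    def get(self) -> dict:
        if self._is_stale():
            try:
                fresh = self._loader()
            except Exception:
                # Graceful degradation: keep the previous snapshot.
                return self._data
            self._data = fresh
            self._loaded_at = self._clock()
        return self._data
```

Injecting the clock keeps the staleness logic testable without sleeping; production code would simply use the default `time.monotonic`.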


# Module-level singleton
_registry: Optional[RegulationRegistry] = None


def get_registry() -> RegulationRegistry:
    """Get or create the singleton RegulationRegistry instance."""
    global _registry
    if _registry is None:
        _registry = RegulationRegistry()
    return _registry


def classify_regulation(regulation_code: str) -> dict:
    """Convenience: look up license info for a regulation_code."""
    return get_registry().classify_regulation(regulation_code)


def classify_source_regulation(source_regulation: str) -> str:
    """Convenience: look up source_type by regulation display name."""
    return get_registry().source_type_by_name(source_regulation)
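The keyword heuristic in `source_type_by_name` relies on substring tests, which is convenient for German compounds such as "Grundverordnung" but means a short token like `act` also matches inside unrelated words. The sketch below shows the effect and one possible mitigation; the split into `SUBSTRING_INDICATORS` and `WHOLE_WORD_INDICATORS` and the function names are assumptions made for illustration, not part of the module.

```python
import re

# Long keywords may appear inside compounds, so substring matching is kept for them.
SUBSTRING_INDICATORS = ("verordnung", "richtlinie", "gesetz", "directive", "regulation")
# Short keywords are only accepted as whole words (assumed split, for illustration).
WHOLE_WORD_INDICATORS = ("act", "ley", "loi")

_WORD = re.compile(r"\w+")


def looks_like_law_substring(name: str) -> bool:
    """Substring-only variant, mirroring the matching style of the heuristic."""
    lower = name.lower()
    return any(ind in lower for ind in SUBSTRING_INDICATORS + WHOLE_WORD_INDICATORS)


def looks_like_law(name: str) -> bool:
    """Variant that requires short indicators to be whole words."""
    lower = name.lower()
    if any(ind in lower for ind in SUBSTRING_INDICATORS):
        return True
    words = set(_WORD.findall(lower))
    return any(ind in words for ind in WHOLE_WORD_INDICATORS)
```

With these definitions, "Best Practice Guide" counts as a law under the substring variant (because "practice" contains "act") but not under the whole-word variant, while "Cyber Resilience Act" and "Datenschutz-Grundverordnung" are recognised by both.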
@@ -0,0 +1,318 @@
# Adversarial Test Suite: 30 tricky cases that challenge the controls and the agent
version: "1.0"
purpose: "Test whether controls and agent decide correctly on borderline wording"

tests:

  # A. Wrong legal basis (plausible-sounding): 8 cases

  - id: ADV-LIT-001
    category: wrong_legal_basis
    input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
    context: "DSE-Abschnitt ueber Google Analytics"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
      reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
    difficulty: medium

  - id: ADV-LIT-002
    category: wrong_legal_basis
    input: "Der Versand unseres Newsletters erfolgt auf Grundlage des Vertrages (Art. 6 Abs. 1 lit. b DSGVO)."
    context: "DSE-Abschnitt ueber Marketing"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
      reason: "Newsletter ist kein Vertragsbestandteil, erfordert separate Einwilligung"
    difficulty: medium

  - id: ADV-LIT-003
    category: wrong_legal_basis
    input: "Die Ueberwachung der Arbeitsleistung unserer Mitarbeiter erfolgt auf Grundlage unseres berechtigten Interesses."
    context: "Interne Datenschutzrichtlinie"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Betriebsvereinbarung + Art. 88 DSGVO i.V.m. § 26 BDSG"
      reason: "Mitarbeiterueberwachung erfordert Betriebsvereinbarung (BAG Keylogger-Urteil)"
    difficulty: hard

  - id: ADV-LIT-004
    category: wrong_legal_basis
    input: "Biometrische Zutrittskontrolle auf Basis von Art. 6 Abs. 1 lit. f DSGVO."
    context: "Sicherheitskonzept"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 9 Abs. 2 DSGVO (ausdrueckliche Einwilligung oder Arbeitsrecht)"
      reason: "Biometrische Daten = besondere Kategorie nach Art. 9, lit. f reicht nicht"
    difficulty: hard

  - id: ADV-LIT-005
    category: wrong_legal_basis
    input: "Wir erstellen automatisierte Kreditentscheidungen auf Grundlage berechtigter Interessen."
    context: "DSE einer Bank"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 22 DSGVO (ausdrueckliche Einwilligung oder gesetzliche Erlaubnis)"
      reason: "Automatisierte Einzelentscheidungen erfordern Art. 22 Schutz (EuGH SCHUFA C-634/21)"
    difficulty: hard

  - id: ADV-LIT-006
    category: wrong_legal_basis
    input: "Social Login ueber Google wird als Vertragsdurchfuehrung (lit. b) verarbeitet."
    context: "DSE mit Social Login"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
      reason: "Social Login ist keine Vertragspflicht, Nutzer kann sich auch ohne Google anmelden"
    difficulty: medium

  - id: ADV-LIT-007
    category: wrong_legal_basis
    input: "Personalisierte Werbung basiert auf unserem berechtigten Interesse an Direktmarketing."
    context: "DSE eines marktbeherrschenden Unternehmens"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
      reason: "Marktbeherrschende Unternehmen koennen sich nicht auf lit. f fuer Werbung berufen (EuGH Meta C-252/21)"
    difficulty: hard

  - id: ADV-LIT-008
    category: wrong_legal_basis
    input: "Die Einbindung von Facebook Pixel erfolgt zur Vertragserfuellung (Art. 6 Abs. 1 lit. b DSGVO)."
    context: "DSE eines Online-Shops"
    expected:
      finding: true
      finding_type: wrong_legal_basis
      correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
      reason: "Facebook Pixel dient Tracking/Marketing, nicht der Vertragserfuellung"
    difficulty: easy

  # B. Dark patterns (subtle): 6 cases

  - id: ADV-DP-001
    category: dark_pattern
    input:
      accept_button: {text: "Alle akzeptieren", size: "16px", color: "#ffffff", background: "#0066cc", prominent: true}
      reject_button: {text: "Ablehnen", size: "10px", color: "#cccccc", background: "transparent", prominent: false}
    expected:
      finding: true
      finding_type: dark_pattern_visual_bias
      reason: "Ablehnen-Button ist kleiner, weniger sichtbar (OLG Koeln 6 U 58/21)"
    difficulty: easy

  - id: ADV-DP-002
    category: dark_pattern
    input:
      accept_button: {text: "Alle akzeptieren", clicks_to_complete: 1}
      reject_option: {text: "Einstellungen verwalten", clicks_to_complete: 3, label: "Einstellungen"}
    expected:
      finding: true
      finding_type: dark_pattern_friction_asymmetry
      reason: "Ablehnen erfordert 3 Klicks, Akzeptieren nur 1 (CNIL Cookie-Banner)"
    difficulty: medium

  - id: ADV-DP-003
    category: dark_pattern
    input:
      type: "cookie_wall"
      description: "Inhalt erst nach Cookie-Zustimmung sichtbar"
    expected:
      finding: true
      finding_type: dark_pattern_cookie_wall
      reason: "Cookie-Wall = keine freiwillige Einwilligung (EDPB Guidelines 05/2020)"
    difficulty: medium

  - id: ADV-DP-004
    category: dark_pattern
    input:
      type: "prechecked_boxes"
      description: "Checkboxen fuer Marketing und Analytics sind vorausgefuellt"
    expected:
      finding: true
      finding_type: dark_pattern_prechecked
      reason: "Vorausgefuellte Checkboxen sind keine wirksame Einwilligung (BGH Planet49)"
    difficulty: easy

  - id: ADV-DP-005
    category: dark_pattern
    input:
      type: "confirm_shaming"
      accept_text: "Ja, ich moechte sicher surfen"
      reject_text: "Nein, ich verzichte auf Sicherheit"
    expected:
      finding: true
      finding_type: dark_pattern_confirm_shaming
      reason: "Manipulative Formulierung beeinflusst Entscheidung"
    difficulty: medium

  - id: ADV-DP-006
    category: dark_pattern
    input:
      type: "hidden_reject"
      description: "Ablehnen-Link ist 3px gross, Farbe #f0f0f0 auf weissem Hintergrund"
    expected:
      finding: true
      finding_type: dark_pattern_hidden_option
      reason: "Ablehnen-Option praktisch unsichtbar (OLG Koeln)"
    difficulty: easy

  # C. Almost-complete documents: 6 cases

  - id: ADV-DOC-001
    category: incomplete_document
    input: "Impressum: Max Mustermann GmbH, Musterstr. 1, 10115 Berlin, info@example.com, HRB 12345"
    expected:
      finding: true
      finding_type: missing_field
      missing: "USt-ID"
      reason: "§ 5 Abs. 1 Nr. 6 DDG: USt-IdNr. oder Wirtschafts-ID Pflicht"
    difficulty: easy

  - id: ADV-DOC-002
    category: incomplete_document
    input: "Datenschutzerklaerung mit Zwecken, Rechtsgrundlagen, Empfaengern, Betroffenenrechten — aber ohne Speicherdauer"
    expected:
      finding: true
      finding_type: missing_field
      missing: "Speicherdauer"
      reason: "Art. 13 Abs. 2 lit. a DSGVO: Dauer der Speicherung oder Kriterien"
    difficulty: medium

  - id: ADV-DOC-003
    category: incomplete_document
    input: "DSE ohne Kontaktdaten des Datenschutzbeauftragten"
    expected:
      finding: true
      finding_type: missing_field
      missing: "DSB-Kontakt"
      reason: "Art. 13 Abs. 1 lit. b DSGVO: Kontaktdaten des DSB"
    difficulty: easy

  - id: ADV-DOC-004
    category: incomplete_document
    input: "Widerrufsbelehrung mit 14-Tage-Frist, Muster-Formular, aber Fristbeginn fehlt"
    expected:
      finding: true
      finding_type: missing_field
      missing: "Fristbeginn"
      reason: "Anlage 1 zu Art. 246a § 1 EGBGB: Fristbeginn muss angegeben werden"
    difficulty: medium

  - id: ADV-DOC-005
    category: incomplete_document
    input: "AGB eines Online-Shops ohne Angabe des Gerichtsstands"
    expected:
      finding: false
      reason: "Gerichtsstand in AGB ist bei B2C nicht erforderlich (sogar oft unzulaessig)"
    difficulty: hard

  - id: ADV-DOC-006
    category: incomplete_document
    input: "Cookie-Policy listet Google Analytics und Facebook Pixel auf, aber nicht das CMP-Cookie selbst"
    expected:
      finding: true
      finding_type: missing_field
      missing: "CMP-eigene Cookies"
      reason: "Auch technisch notwendige Cookies muessen in der Cookie-Policy stehen"
    difficulty: hard

  # D. Semantically similar but different: 5 cases

  - id: ADV-SEM-001
    category: similar_but_different
    control_a: "MFA fuer privilegierte Admin-Accounts aktivieren"
    control_b: "MFA fuer alle Endnutzer-Accounts aktivieren"
    expected:
      is_duplicate: false
      reason: "Verschiedene Scopes (Admin vs. Endnutzer) = verschiedene Controls"
    difficulty: medium

  - id: ADV-SEM-002
    category: similar_but_different
    control_a: "Daten nach Vertragsende loeschen"
    control_b: "Daten nach Ablauf der gesetzlichen Aufbewahrungsfrist loeschen"
    expected:
      is_duplicate: false
      reason: "Verschiedene Trigger (Vertragsende vs. Aufbewahrungsfrist)"
    difficulty: hard

  - id: ADV-SEM-003
    category: similar_but_different
    control_a: "Rate Limiting fuer oeffentliche API-Endpunkte"
    control_b: "Rate Limiting fuer Login-Endpunkte"
    expected:
      is_duplicate: false
      reason: "Verschiedene Asset-Scopes (API vs. Login)"
    difficulty: medium

  - id: ADV-SEM-004
    category: similar_but_different
    control_a: "Verschluesselung personenbezogener Daten at rest"
    control_b: "Verschluesselung personenbezogener Daten in transit"
    expected:
      is_duplicate: false
      reason: "Verschiedene Phasen (Speicherung vs. Uebertragung)"
    difficulty: easy

  - id: ADV-SEM-005
    category: similar_but_different
    control_a: "Incident Response Plan erstellen"
    control_b: "Business Continuity Plan erstellen"
    expected:
      is_duplicate: false
      reason: "IRP = Sicherheitsvorfaelle, BCP = Geschaeftskontinuitaet (verschiedene Ziele)"
    difficulty: medium

  # E. Semantically different but similar-sounding: 5 cases

  - id: ADV-HOM-001
    category: homonym_different
    control_a: "Einwilligung des Nutzers fuer Datenverarbeitung einholen (DSGVO)"
    control_b: "Einwilligung des Nutzers fuer Werbeanrufe einholen (UWG)"
    expected:
      is_duplicate: false
      reason: "Verschiedene Rechtsgrundlagen (DSGVO vs. UWG) und verschiedene Rechtsfolgen"
    difficulty: hard

  - id: ADV-HOM-002
    category: homonym_different
    control_a: "Risikobewertung fuer Datenschutz-Folgenabschaetzung (DSFA)"
    control_b: "Risikobewertung fuer finanzielle Risiken (MaRisk)"
    expected:
      is_duplicate: false
      reason: "Verschiedene Risikokategorien und verschiedene regulatorische Grundlagen"
    difficulty: hard

  - id: ADV-HOM-003
    category: homonym_different
    control_a: "Audit der Datenschutz-Compliance (Art. 5 Abs. 2 DSGVO)"
    control_b: "Audit der Jahresabschlusspruefung (HGB)"
    expected:
      is_duplicate: false
      reason: "Verschiedene Audit-Typen mit verschiedenen Pruefungsstandards"
    difficulty: medium

  - id: ADV-HOM-004
    category: homonym_different
    control_a: "Zertifizierung nach ISO 27001 (Informationssicherheit)"
    control_b: "Zertifizierung nach CE-Konformitaet (Produktsicherheit)"
    expected:
      is_duplicate: false
      reason: "Verschiedene Zertifizierungsrahmen, verschiedene Pruefer, verschiedene Ziele"
    difficulty: easy

  - id: ADV-HOM-005
    category: homonym_different
    control_a: "Verarbeitung personenbezogener Daten dokumentieren (DSGVO VVT)"
    control_b: "Verarbeitung von Lebensmitteln dokumentieren (HACCP)"
    expected:
      is_duplicate: false
      reason: "Komplett verschiedene Domaenen trotz gleicher Woerter"
    difficulty: easy
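To see why the `similar_but_different` and `homonym_different` cases are adversarial, it helps to run them against a deliberately naive duplicate check. The sketch below is an illustration only: `naive_is_duplicate` (token-level Jaccard similarity with an arbitrary 0.4 threshold) and `failing_cases` are hypothetical helpers, not the pipeline's dedup engine, and the three cases are copied inline from the suite so the example needs no YAML loader.

```python
import re

# Three cases copied from the suite above.
CASES = [
    {"id": "ADV-SEM-001",
     "control_a": "MFA fuer privilegierte Admin-Accounts aktivieren",
     "control_b": "MFA fuer alle Endnutzer-Accounts aktivieren",
     "expected": {"is_duplicate": False}},
    {"id": "ADV-SEM-004",
     "control_a": "Verschluesselung personenbezogener Daten at rest",
     "control_b": "Verschluesselung personenbezogener Daten in transit",
     "expected": {"is_duplicate": False}},
    {"id": "ADV-HOM-005",
     "control_a": "Verarbeitung personenbezogener Daten dokumentieren (DSGVO VVT)",
     "control_b": "Verarbeitung von Lebensmitteln dokumentieren (HACCP)",
     "expected": {"is_duplicate": False}},
]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two control titles."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def naive_is_duplicate(a: str, b: str, threshold: float = 0.4) -> bool:
    """Hypothetical baseline: call two controls duplicates if enough words overlap."""
    return jaccard(a, b) >= threshold


def failing_cases(cases, predicate) -> list:
    """Return the IDs of cases where the predicate disagrees with the expectation."""
    return [
        c["id"] for c in cases
        if predicate(c["control_a"], c["control_b"]) != c["expected"]["is_duplicate"]
    ]


print(failing_cases(CASES, naive_is_duplicate))
```

The word-overlap baseline wrongly merges ADV-SEM-001 and ADV-SEM-004, where only the scope or phase words differ, while the cross-domain pair ADV-HOM-005 shares too few tokens to be confused.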
@@ -0,0 +1,36 @@
"""Shared test fixtures for the control pipeline test suite."""

import os
import sys
import pytest

# Ensure control-pipeline is in path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests; skipped if DATABASE_URL is not set."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set — skipping DB tests")
    from db.session import SessionLocal
    db = SessionLocal()
    yield db
    db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load up to 100 random draft controls (pass0b) for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' as assertion,
               generation_metadata->>'check_type' as check_type,
               generation_metadata->>'merge_group_hint' as merge_key
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
@@ -0,0 +1,94 @@
# Golden Dataset for MC Assignment Quality
# Manually verified controls with their expected MC topics.
# Used for regression testing after pipeline changes.
# Created: 2026-05-10, verified by manual review (19/20 correct)

golden_controls:
  # ── Data Protection ──
  - control_id: "DATA-3291-A06"
    expected_topic_prefix: "data_retention"
    reason: "Speicherfristen für personenbezogene Daten definieren"

  - control_id: "SEC-7449-A01"
    expected_topic_prefix: "personal_data"
    reason: "Fahrzeugnutzungsdaten in Telematikbox (Datenminimierung)"

  - control_id: "DATA-3518-A06"
    expected_topic_prefix: "data_subject_rights"
    reason: "Betroffene über Lösch-Ausnahmen informieren"

  - control_id: "GOV-963-A02"
    expected_topic_prefix: "consent"
    reason: "Zustimmung des Urhebers vor Veröffentlichung einholen"

  # ── Security ──
  - control_id: "CRYP-1454-A07"
    expected_topic_prefix: "encryption"
    reason: "RSASSA-PSS in TLS 1.3 verifizieren"

  - control_id: "NET-1141-A08"
    expected_topic_prefix: "monitoring"
    reason: "Sampling-Strategien konfigurieren"

  - control_id: "SEC-2244-A05"
    expected_topic_prefix: "asset_management"
    reason: "Systeminventar kontinuierlich aktualisieren"

  - control_id: "AUTH-3468-A06"
    expected_topic_prefix: "access_control"
    reason: "Rollenkonzept mit abgestuften Zugriffsrechten"

  # ── Governance ──
  - control_id: "AUTH-2364-A09"
    expected_topic_prefix: "supervisory_authority"
    reason: "Zusammenarbeit mit Wirtschaftsakteuren dokumentieren"

  - control_id: "SEC-5972-A14"
    expected_topic_prefix: "third_party_management"
    reason: "Cybersicherheitsrichtlinien kritischer Lieferanten prüfen"

  - control_id: "SEC-3441-A02"
    expected_topic_prefix: "human_resources_security"
    reason: "Mitarbeiter vor Nachteil bei Verweigerung schützen"

  - control_id: "SEC-3502-A06"
    expected_topic_prefix: "awareness"
    reason: "Organisationskultur für Sicherheitsverbesserung"

  - control_id: "GOV-1748-A04"
    expected_topic_prefix: "policy"
    reason: "Annahme von Geschenken untersagen"

  # ── Regulatory ──
  - control_id: "AI-1287-A01"
    expected_topic_prefix: "ai_system"
    reason: "Akteure des KI-Systems identifizieren"

  - control_id: "AI-1732-A11"
    expected_topic_prefix: "ai_system"
    reason: "Menschliche Kontrolle für KI-Entscheidungen"

  - control_id: "COMP-1352-A04"
    expected_topic_prefix: "certification"
    reason: "Amateurfunkprüfungszeugnis vorlegen"

  - control_id: "FIN-1212-A02"
    expected_topic_prefix: "financial_reporting"
    reason: "Jahresabschluss gemäß EU-Richtlinie aufstellen"

  - control_id: "AUTH-1165-A01"
    expected_topic_prefix: "data_classification"
    reason: "Öffentliche IP-Adressen als Stammdaten klassifizieren"

  - control_id: "SEC-7367-A10"
    expected_topic_prefix: "audit_logging"
    reason: "Banner-Version Rückverfolgung testen"

  - control_id: "LAB-034-A03"
    expected_topic_prefix: "third_party_management"
    reason: "Verträge auf unzulässige Klauseln prüfen"

quality_thresholds:
  min_accuracy: 0.90
  max_controls_per_mc: 300
  min_master_controls: 10000
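One way the golden dataset and its `min_accuracy` threshold might be consumed is a prefix comparison between the expected topic and whatever topic the pipeline assigned. The sketch below assumes exactly that; `golden_accuracy`, the `assigned` mapping, and the topic names in the usage note are hypothetical and only illustrate the check.

```python
def golden_accuracy(golden: list, assigned: dict) -> float:
    """Share of golden controls whose assigned topic starts with the expected prefix.

    golden:   entries shaped like the golden_controls list above
    assigned: hypothetical mapping control_id -> topic produced by the pipeline
    """
    if not golden:
        return 0.0
    hits = sum(
        1
        for entry in golden
        if assigned.get(entry["control_id"], "").startswith(entry["expected_topic_prefix"])
    )
    return hits / len(golden)


def meets_threshold(golden: list, assigned: dict, min_accuracy: float = 0.90) -> bool:
    """Compare the measured accuracy with quality_thresholds.min_accuracy."""
    return golden_accuracy(golden, assigned) >= min_accuracy
```

For example, with three golden entries and assigned topics `data_retention_policy`, `encryption_tls`, and `awareness_training` for `DATA-3291-A06`, `CRYP-1454-A07`, and `GOV-1748-A04`, two prefixes match and one does not, so the accuracy is 2/3 and the 0.90 threshold is missed. A control without any assignment counts as a miss.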
@@ -0,0 +1,190 @@
"""
Adversarial Test Suite: 30 tricky cases that challenge the control ontology
and dedup engine with edge cases.

Test categories:
A. Wrong legal basis (plausible but incorrect): 8 cases
B. Dark patterns (subtle UI manipulation): 6 cases
C. Almost-complete documents (missing 1 field): 6 cases
D. Semantically similar but different controls: 5 cases
E. Homonyms (different meaning, same words): 5 cases
"""

import os
import sys
import yaml
import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from services.control_ontology import classify_obligation, classify_action

ADVERSARIAL_PATH = os.path.join(os.path.dirname(__file__), "adversarial_cases.yaml")

# Explicit encoding: the YAML file contains non-ASCII characters such as "§".
with open(ADVERSARIAL_PATH, encoding="utf-8") as f:
    _ADV = yaml.safe_load(f)

TESTS = _ADV["tests"]


def _tests_by_category(cat: str) -> list:
    return [t for t in TESTS if t["category"] == cat]


# ============================================================================
# D. Semantically similar but different: must NOT be deduped
# ============================================================================

class TestSimilarButDifferent:
    """Controls that sound alike but are different; dedup must keep both."""

    @pytest.mark.parametrize("case", _tests_by_category("similar_but_different"),
                             ids=lambda c: c["id"])
    def test_not_duplicate(self, case):
        # Structural check on the fixture only; it does not call the dedup engine.
        assert case["expected"]["is_duplicate"] is False, (
            f"{case['id']}: These controls MUST NOT be marked as duplicates"
        )

    def test_admin_vs_user_mfa(self):
        """ADV-SEM-001: Admin-MFA and User-MFA are different controls."""
        case = next(t for t in TESTS if t["id"] == "ADV-SEM-001")
        a = classify_obligation(case["control_a"], "")
        b = classify_obligation(case["control_b"], "")
        # Both should be atomic (not filtered out)
        assert a["routing"] == "atomic"
        assert b["routing"] == "atomic"

    def test_encryption_at_rest_vs_in_transit(self):
        """ADV-SEM-004: at rest vs in transit are different controls."""
        a_action = classify_action("Verschluesselung at rest implementieren")
        b_action = classify_action("Verschluesselung in transit implementieren")
        # Both should classify as "encrypt" or "implement"
        assert a_action in ("encrypt", "implement")
        assert b_action in ("encrypt", "implement")


# ============================================================================
# E. Homonyms: same words, different domains
# ============================================================================

class TestHomonymDifferent:
    """Controls using the same words but from different domains; must NOT merge."""

    @pytest.mark.parametrize("case", _tests_by_category("homonym_different"),
                             ids=lambda c: c["id"])
    def test_not_duplicate(self, case):
        # Structural check on the fixture only; it does not call the dedup engine.
        assert case["expected"]["is_duplicate"] is False, (
            f"{case['id']}: Homonyms must NOT be treated as duplicates"
        )

    def test_dsgvo_audit_vs_hgb_audit(self):
        """ADV-HOM-003: Data protection audit vs financial audit."""
        a = classify_obligation("Audit der Datenschutz-Compliance durchfuehren", "")
        b = classify_obligation("Audit der Jahresabschlusspruefung durchfuehren", "")
        assert a["routing"] == "atomic"
        assert b["routing"] == "atomic"
        # "durchfuehren" maps to "implement"; the key point is that both are atomic, not filtered


# ============================================================================
# A. Wrong legal basis: structural tests
# ============================================================================

class TestWrongLegalBasis:
    """Verify that wrong legal basis cases have correct expected metadata."""

    @pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
                             ids=lambda c: c["id"])
    def test_finding_expected(self, case):
        """All wrong_legal_basis cases must expect a finding."""
        assert case["expected"]["finding"] is True

    @pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
                             ids=lambda c: c["id"])
    def test_has_correct_basis(self, case):
        """All cases must specify what the correct basis should be."""
        assert "correct_basis" in case["expected"]
        assert len(case["expected"]["correct_basis"]) > 0

    def test_analytics_requires_consent(self):
        """ADV-LIT-001: Analytics on lit. f is always wrong."""
        case = next(t for t in TESTS if t["id"] == "ADV-LIT-001")
        assert "lit. a" in case["expected"]["correct_basis"]
        assert "Planet49" in case["expected"]["reason"]


# ============================================================================
# B. Dark Patterns: structural tests
# ============================================================================

class TestDarkPatterns:
    """Verify dark pattern test case structure."""

    @pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
                             ids=lambda c: c["id"])
    def test_finding_expected(self, case):
        """All dark pattern cases must expect a finding."""
        assert case["expected"]["finding"] is True

    @pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
                             ids=lambda c: c["id"])
    def test_has_finding_type(self, case):
        """All cases must specify the dark pattern type."""
        assert "finding_type" in case["expected"]
        assert case["expected"]["finding_type"].startswith("dark_pattern_")


# ============================================================================
# C. Incomplete documents: structural tests
# ============================================================================

class TestIncompleteDocuments:
    """Verify incomplete document test case structure."""

    @pytest.mark.parametrize("case", _tests_by_category("incomplete_document"),
                             ids=lambda c: c["id"])
    def test_has_reason(self, case):
        """All cases must have a reason."""
        assert "reason" in case["expected"]
        assert len(case["expected"]["reason"]) > 0

    def test_agb_gerichtsstand_no_finding(self):
        """ADV-DOC-005: Missing Gerichtsstand in B2C AGB is NOT a finding."""
        case = next(t for t in TESTS if t["id"] == "ADV-DOC-005")
        assert case["expected"]["finding"] is False


# ============================================================================
|
||||
# Meta tests — validate test suite integrity
|
||||
# ============================================================================
|
||||
|
||||
class TestSuiteIntegrity:
|
||||
"""Verify the adversarial test suite itself is complete and consistent."""
|
||||
|
||||
def test_total_count(self):
|
||||
assert len(TESTS) == 30
|
||||
|
||||
def test_unique_ids(self):
|
||||
ids = [t["id"] for t in TESTS]
|
||||
assert len(ids) == len(set(ids)), "Duplicate test IDs found"
|
||||
|
||||
def test_all_categories_present(self):
|
||||
categories = {t["category"] for t in TESTS}
|
||||
expected = {"wrong_legal_basis", "dark_pattern", "incomplete_document",
|
||||
"similar_but_different", "homonym_different"}
|
||||
assert categories == expected
|
||||
|
||||
def test_category_counts(self):
|
||||
counts = {}
|
||||
for t in TESTS:
|
||||
counts[t["category"]] = counts.get(t["category"], 0) + 1
|
||||
assert counts["wrong_legal_basis"] == 8
|
||||
assert counts["dark_pattern"] == 6
|
||||
assert counts["incomplete_document"] == 6
|
||||
assert counts["similar_but_different"] == 5
|
||||
assert counts["homonym_different"] == 5
|
||||
|
||||
def test_all_have_difficulty(self):
|
||||
for t in TESTS:
|
||||
assert "difficulty" in t, f"{t['id']} missing difficulty"
|
||||
assert t["difficulty"] in ("easy", "medium", "hard")
|
||||
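The per-category count check in `TestSuiteIntegrity` stops at the first failing assert. As a possible alternative, the sketch below collects every mismatch in one pass with `collections.Counter`. It assumes only that each test case is a dict with a `category` key, as the tests above do; `EXPECTED_COUNTS`, `count_mismatches` and the `sample` data are hypothetical names introduced for illustration and are not part of the suite.

```python
from collections import Counter

# Expected counts per category, copied from test_category_counts above.
EXPECTED_COUNTS = {
    "wrong_legal_basis": 8,
    "dark_pattern": 6,
    "incomplete_document": 6,
    "similar_but_different": 5,
    "homonym_different": 5,
}


def count_mismatches(tests, expected=EXPECTED_COUNTS):
    """Return {category: (actual, expected)} for every category that is off."""
    actual = Counter(t["category"] for t in tests)
    categories = set(actual) | set(expected)
    return {
        cat: (actual.get(cat, 0), expected.get(cat, 0))
        for cat in sorted(categories)
        if actual.get(cat, 0) != expected.get(cat, 0)
    }


# Hypothetical stand-in for the real TESTS list (30 entries in total).
sample = [
    {"id": f"X-{cat}-{i}", "category": cat}
    for cat, n in EXPECTED_COUNTS.items()
    for i in range(n)
]
```

An empty result means every category has the expected size; unknown or missing categories show up with a count of zero on one side.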
@@ -0,0 +1,166 @@
"""Tests for D3: Structural metadata flow (section priority, page in citation)."""

import json
from typing import Optional

from services.rag_client import RAGSearchResult


def _make_chunk(
    article: str = "",
    paragraph: str = "",
    page: Optional[int] = None,
) -> RAGSearchResult:
    return RAGSearchResult(
        text="Test chunk text",
        regulation_code="DSGVO",
        regulation_name="Datenschutz-Grundverordnung",
        regulation_short="DSGVO",
        category="data_protection",
        article=article,
        paragraph=paragraph,
        source_url="https://example.com",
        score=0.95,
        collection="bp_compliance_de",
        page=page,
    )


class TestRAGSearchResultPage:
    """RAGSearchResult now carries a page field."""

    def test_page_default_none(self):
        chunk = _make_chunk()
        assert chunk.page is None

    def test_page_set(self):
        chunk = _make_chunk(page=42)
        assert chunk.page == 42

    def test_page_zero(self):
        chunk = _make_chunk(page=0)
        assert chunk.page == 0


class TestQdrantPayloadPriority:
    """section (D2) should take priority over article (legacy)."""

    def test_section_preferred_over_article(self):
        payload = {"section": "§ 312k", "article": "Art. 312", "section_title": "Kuendigungsbutton"}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == "§ 312k"

    def test_article_fallback_when_no_section(self):
        payload = {"section": "", "article": "Art. 35", "section_title": ""}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == "Art. 35"

    def test_section_title_last_resort(self):
        payload = {"section": "", "article": "", "section_title": "Informationspflichten"}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == "Informationspflichten"

    def test_all_empty(self):
        payload = {"section": "", "article": "", "section_title": ""}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == ""

    def test_page_from_payload(self):
        payload = {"page": 847}
        assert payload.get("page") == 847

    def test_page_none_from_payload(self):
        payload = {}
        assert payload.get("page") is None


class TestSourceCitationPage:
    """source_citation dict should include page when available."""

    def _build_citation(self, chunk: RAGSearchResult) -> dict:
        """Mirrors the citation-building logic from control_generator.py."""
        return {
            "source": chunk.regulation_name,
            "article": chunk.article,
            "paragraph": chunk.paragraph,
            "page": chunk.page,
            "license": "free_use",
            "source_type": "law",
            "url": chunk.source_url or "",
        }

    def test_citation_with_page(self):
        chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1", page=847)
        citation = self._build_citation(chunk)
        assert citation["page"] == 847

    def test_citation_without_page(self):
        chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1")
        citation = self._build_citation(chunk)
        assert citation["page"] is None

    def test_citation_serializable(self):
        chunk = _make_chunk(article="Art. 35", page=12)
        citation = self._build_citation(chunk)
        serialized = json.dumps(citation)
        restored = json.loads(serialized)
        assert restored["page"] == 12


class TestFormatCitation:
    """_format_citation should include page number."""

    def _format_citation(self, citation) -> str:
        """Mirrors _format_citation from decomposition_pass.py."""
        if not citation:
            return ""
        if isinstance(citation, str):
            try:
                c = json.loads(citation)
                if isinstance(c, dict):
                    parts = []
                    if c.get("source"):
                        parts.append(c["source"])
                    if c.get("article"):
                        parts.append(c["article"])
                    if c.get("paragraph"):
                        parts.append(c["paragraph"])
                    if c.get("page") is not None:
                        parts.append(f"S. {c['page']}")
                    return " ".join(parts) if parts else citation
            except (json.JSONDecodeError, TypeError):
                return citation
        return str(citation)

    def test_format_with_page(self):
        citation = json.dumps({
            "source": "DSGVO",
            "article": "Art. 35",
            "paragraph": "Abs. 1",
            "page": 42,
        })
        result = self._format_citation(citation)
        assert result == "DSGVO Art. 35 Abs. 1 S. 42"

    def test_format_without_page(self):
        citation = json.dumps({
            "source": "BGB",
            "article": "§ 312k",
            "paragraph": "",
        })
        result = self._format_citation(citation)
        assert result == "BGB § 312k"

    def test_format_page_zero(self):
        citation = json.dumps({
            "source": "BGB",
            "article": "§ 1",
            "paragraph": "",
            "page": 0,
        })
        result = self._format_citation(citation)
        assert result == "BGB § 1 S. 0"

    def test_format_empty_citation(self):
        assert self._format_citation("") == ""
        assert self._format_citation(None) == ""
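The payload tests above repeat the same `section` / `article` / `section_title` fallback expression inline, and the citation tests rely on page `0` being treated as a real page. A minimal sketch of how both rules could be captured in small helpers follows; `pick_article` and `page_label` are hypothetical names, not functions from `control_generator.py` or `decomposition_pass.py`.

```python
from typing import Any, Mapping, Optional


def pick_article(payload: Mapping[str, Any]) -> str:
    """Return the first non-empty value among section, article, section_title."""
    for key in ("section", "article", "section_title"):
        value = payload.get(key) or ""
        if value:
            return value
    return ""


def page_label(page: Optional[int]) -> str:
    """Format a page as 'S. <n>'; only None means 'no page', so 0 is kept."""
    return f"S. {page}" if page is not None else ""
```

The explicit `is not None` check is the important detail: a plain truthiness test would silently drop page `0`.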
@@ -0,0 +1,122 @@
"""F5 Validation: Verify DB-backed lookups match old hardcoded dicts."""

import pytest


class TestRegulationRegistryConsistency:
    """Ensure all old REGULATION_LICENSE_MAP entries are in the DB."""

    def test_all_old_entries_in_db(self):
        from services.control_generator import REGULATION_LICENSE_MAP
        from scripts.f1_migrate_regulation_registry import build_rows

        db_ids = {r["regulation_id"] for r in build_rows()}
        for reg_id in REGULATION_LICENSE_MAP:
            assert reg_id in db_ids, f"Missing from DB: {reg_id}"

    def test_classify_regulation_matches_old(self):
        """DB-backed classify_regulation returns same rule as old dict."""
        from services.control_generator import REGULATION_LICENSE_MAP
        from services.regulation_registry import RegulationRegistry
        from unittest.mock import patch, MagicMock

        # Build mock DB with migration data
        from scripts.f1_migrate_regulation_registry import build_rows
        rows = build_rows()
        mock_rows = [
            (r["regulation_id"], r["regulation_name_de"], r["license_rule"],
             r["license_type"], r.get("attribution"), r["source_type"],
             r["jurisdiction"], r["status"])
            for r in rows
        ]

        reg = RegulationRegistry()
        with patch("services.regulation_registry.SessionLocal") as mock_cls:
            mock_session = MagicMock()
            mock_result = MagicMock()
            mock_result.fetchall.return_value = mock_rows
            mock_session.execute.return_value = mock_result
            mock_cls.return_value = mock_session
            reg._load()

        # Compare every entry
        mismatches = []
        for reg_id, info in REGULATION_LICENSE_MAP.items():
            db_result = reg.classify_regulation(reg_id)
            if db_result["rule"] != info["rule"]:
                mismatches.append(f"{reg_id}: DB rule={db_result['rule']} vs dict rule={info['rule']}")

        assert not mismatches, f"Rule mismatches:\n" + "\n".join(mismatches)


class TestActionOntologyConsistency:
    """Ensure all old ACTION_TYPES entries are in the DB."""

    def test_all_action_types_migrated(self):
        from services.control_ontology import ACTION_TYPES
        from scripts.f2_migrate_actions import build_action_types

        db_names = {t["canonical_name"] for t in build_action_types()}
        for action in ACTION_TYPES:
            assert action in db_names, f"Missing action_type: {action}"

    def test_all_aliases_migrated(self):
        from services.control_ontology import ACTION_TYPES
        from scripts.f2_migrate_actions import build_action_synonyms

        db_synonyms = {s["synonym"] for s in build_action_synonyms() if s["pattern_type"] == "alias"}
        missing = []
        for action, info in ACTION_TYPES.items():
            for alias in info.get("aliases", []):
                if alias.lower() not in db_synonyms:
                    missing.append(f"{action}: {alias}")

        assert not missing, f"Missing aliases:\n" + "\n".join(missing)

    def test_all_negative_patterns_migrated(self):
        from services.control_ontology import _NEGATIVE_PATTERNS
        from scripts.f2_migrate_actions import build_action_synonyms

        db_patterns = {s["synonym"] for s in build_action_synonyms() if s["pattern_type"] == "negative_pattern"}
        for pattern, _ in _NEGATIVE_PATTERNS:
            assert pattern.lower() in db_patterns, f"Missing negative pattern: {pattern}"


class TestObjectSynonymsConsistency:
    """Ensure all old _OBJECT_SYNONYMS are in the DB."""

    def test_all_objects_migrated(self):
        from services.control_dedup import _OBJECT_SYNONYMS
        from scripts.f3_migrate_objects import build_rows

        db_synonyms = {r["synonym"] for r in build_rows()}
        missing = []
        for syn in _OBJECT_SYNONYMS:
            if syn.lower() not in db_synonyms:
                missing.append(syn)

        assert not missing, f"Missing object synonyms:\n" + "\n".join(missing)


class TestLLMEnrichmentQuality:
    """Basic quality checks on LLM-generated synonyms."""

    def test_no_empty_synonyms_in_db(self):
        """All synonyms should have content."""
        from scripts.f2_migrate_actions import build_action_synonyms
        for s in build_action_synonyms():
            assert len(s["synonym"].strip()) >= 2, f"Too short: {s}"

    def test_no_duplicate_canonical_in_actions(self):
        """Each synonym should map to exactly one canonical action."""
        from scripts.f2_migrate_actions import build_action_synonyms
        synonyms = build_action_synonyms()
        seen = {}
        for s in synonyms:
            key = (s["synonym"], s["language"], s["pattern_type"])
            if key in seen:
                assert seen[key] == s["canonical_action"], (
                    f"Duplicate synonym '{s['synonym']}' maps to both "
                    f"'{seen[key]}' and '{s['canonical_action']}'"
                )
            seen[key] = s["canonical_action"]
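`test_no_duplicate_canonical_in_actions` fails on the first synonym that maps to two canonical actions. When cleaning up migration data it may be more useful to see all conflicts at once. The sketch below is one way to do that, assuming rows shaped like the dicts used above (`synonym`, `language`, `pattern_type`, `canonical_action`); `find_conflicting_synonyms` and the sample rows are hypothetical and not part of `f2_migrate_actions`.

```python
from collections import defaultdict


def find_conflicting_synonyms(synonyms):
    """Return keys that map to more than one canonical action.

    The key is (synonym, language, pattern_type), the same triple the
    test above uses; the value is the sorted list of competing actions.
    """
    targets = defaultdict(set)
    for s in synonyms:
        key = (s["synonym"], s["language"], s["pattern_type"])
        targets[key].add(s["canonical_action"])
    return {
        key: sorted(actions)
        for key, actions in targets.items()
        if len(actions) > 1
    }


# Hypothetical rows for illustration only.
rows = [
    {"synonym": "umsetzen", "language": "de", "pattern_type": "alias",
     "canonical_action": "implement"},
    {"synonym": "umsetzen", "language": "de", "pattern_type": "alias",
     "canonical_action": "define"},
    {"synonym": "testen", "language": "de", "pattern_type": "alias",
     "canonical_action": "test"},
]
conflicts = find_conflicting_synonyms(rows)
```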
@@ -0,0 +1,166 @@
"""
Master Control Quality Tests.

Regression tests to ensure MC assignment quality stays above 90%.
Uses golden dataset of manually verified controls.
"""

import os
import yaml
import pytest
from sqlalchemy import create_engine, text

DB_URL = os.getenv(
    "DATABASE_URL",
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)

_engine = None


def get_engine():
    global _engine
    if _engine is None:
        _engine = create_engine(
            DB_URL,
            connect_args={"options": "-c search_path=compliance,public"},
        )
    return _engine


def load_golden():
    path = os.path.join(os.path.dirname(__file__), "golden_mc_assignments.yaml")
    with open(path) as f:
        return yaml.safe_load(f)


# ── Golden Dataset Tests ──


class TestGoldenMCAssignments:
    """Each golden control must be in the correct MC."""

    @pytest.fixture(autouse=True)
    def setup(self):
        self.golden = load_golden()
        self.engine = get_engine()

    def test_golden_controls_correctly_assigned(self):
        """All golden controls must be in an MC matching their expected topic prefix."""
        errors = []
        with self.engine.connect() as c:
            for gc in self.golden["golden_controls"]:
                row = c.execute(text("""
                    SELECT mc.canonical_name
                    FROM master_controls mc
                    JOIN master_control_members mcm ON mcm.master_control_uuid = mc.id
                    JOIN canonical_controls cc ON cc.id = mcm.control_uuid
                    WHERE cc.control_id = :cid
                    LIMIT 1
                """), {"cid": gc["control_id"]}).fetchone()

                if row is None:
                    errors.append(f"{gc['control_id']}: not found in any MC")
                elif not row[0].startswith(gc["expected_topic_prefix"]):
                    errors.append(
                        f"{gc['control_id']}: expected {gc['expected_topic_prefix']}*, "
                        f"got {row[0]}"
                    )

        if errors:
            pytest.fail(
                f"{len(errors)} golden controls misassigned:\n"
                + "\n".join(f"  - {e}" for e in errors)
            )


# ── Structural Quality Tests ──


class TestMCStructuralQuality:
    """Structural invariants for Master Controls."""

    @pytest.fixture(autouse=True)
    def setup(self):
        self.golden = load_golden()
        self.thresholds = self.golden["quality_thresholds"]
        self.engine = get_engine()

    def test_minimum_master_controls(self):
        """Must have at least 10K Master Controls."""
        with self.engine.connect() as c:
            count = c.execute(
                text("SELECT count(*) FROM master_controls")
            ).scalar()
        assert count >= self.thresholds["min_master_controls"], (
            f"Only {count} MCs, expected >= {self.thresholds['min_master_controls']}"
        )

    def test_max_controls_per_mc(self):
        """No MC should have more than 300 controls."""
        with self.engine.connect() as c:
            max_mc = c.execute(
                text("SELECT max(total_controls) FROM master_controls")
            ).scalar()
        assert max_mc <= self.thresholds["max_controls_per_mc"], (
            f"Max MC has {max_mc} controls, limit is {self.thresholds['max_controls_per_mc']}"
        )

    def test_no_empty_master_controls(self):
        """Every MC must have at least 1 member."""
        with self.engine.connect() as c:
            empty = c.execute(text("""
                SELECT count(*) FROM master_controls
                WHERE total_controls = 0
            """)).scalar()
        assert empty == 0, f"{empty} empty MCs found"

    def test_all_members_reference_valid_controls(self):
        """Every MC member must reference an existing control."""
        with self.engine.connect() as c:
            orphans = c.execute(text("""
                SELECT count(*) FROM master_control_members mcm
                LEFT JOIN canonical_controls cc ON cc.id = mcm.control_uuid
                WHERE cc.id IS NULL
            """)).scalar()
        assert orphans == 0, f"{orphans} orphan members found"


# ── Doc Check Controls Tests ──


class TestDocCheckControls:
    """Validate doc_check_controls table."""

    @pytest.fixture(autouse=True)
    def setup(self):
        self.engine = get_engine()

    def test_doc_check_controls_exist(self):
        """Must have doc_check_controls."""
        with self.engine.connect() as c:
            count = c.execute(
                text("SELECT count(*) FROM doc_check_controls")
            ).scalar()
        assert count > 100, f"Only {count} doc_check_controls"

    def test_all_doc_types_covered(self):
        """All 8 document types must have controls."""
        expected = {"dse", "cookie", "impressum", "widerruf",
                    "agb", "dsfa", "avv", "loeschkonzept"}
        with self.engine.connect() as c:
            rows = c.execute(text(
                "SELECT DISTINCT doc_type FROM doc_check_controls"
            )).fetchall()
        actual = {r[0] for r in rows}
        missing = expected - actual
        assert not missing, f"Missing doc types: {missing}"

    def test_check_questions_not_empty(self):
        """Every doc_check_control must have a check_question."""
        with self.engine.connect() as c:
            empty = c.execute(text("""
                SELECT count(*) FROM doc_check_controls
                WHERE check_question IS NULL OR check_question = ''
            """)).scalar()
        assert empty == 0, f"{empty} controls without check_question"
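The golden-dataset check above needs a live database. Its prefix-matching rule can also be exercised in isolation, which may help when editing `golden_mc_assignments.yaml`. The sketch below assumes golden entries carry `control_id` and `expected_topic_prefix`, as in the test above; `find_misassigned`, the `assignments` mapping and all sample IDs and names are hypothetical stand-ins for the query result.

```python
def find_misassigned(golden_controls, assignments):
    """Return one error string per golden control that is missing or misplaced.

    `assignments` maps control_id -> canonical MC name and stands in for
    the master_controls join performed by the real test.
    """
    errors = []
    for gc in golden_controls:
        mc_name = assignments.get(gc["control_id"])
        if mc_name is None:
            errors.append(f"{gc['control_id']}: not found in any MC")
        elif not mc_name.startswith(gc["expected_topic_prefix"]):
            errors.append(
                f"{gc['control_id']}: expected {gc['expected_topic_prefix']}*, "
                f"got {mc_name}"
            )
    return errors


# Hypothetical sample data; real IDs and MC names live in the YAML file and DB.
golden = [
    {"control_id": "CTRL-001", "expected_topic_prefix": "encryption"},
    {"control_id": "CTRL-002", "expected_topic_prefix": "access_control"},
    {"control_id": "CTRL-003", "expected_topic_prefix": "logging"},
]
assignments = {
    "CTRL-001": "encryption_at_rest",
    "CTRL-002": "logging_audit_trail",
}
errors = find_misassigned(golden, assignments)
```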
@@ -0,0 +1,226 @@
"""Tests for OntologyRegistry - DB-backed action/object normalization."""

import time
from unittest.mock import MagicMock, patch

import pytest

from services.ontology_registry import OntologyRegistry, _CACHE_TTL_SECONDS


# ── Mock DB data ──────────────────────────────────────────────────────

_MOCK_ACTION_TYPES = [
    ("implement", "implementation"),
    ("monitor", "monitoring"),
    ("prevent", "implementation"),
    ("exclude", "implementation"),
    ("test", "testing"),
    ("encrypt", "implementation"),
    ("document", "evidence"),
    ("train", "training"),
]

_MOCK_ACTION_SYNONYMS = [
    # (canonical_action, synonym, pattern_type)
    ("implement", "implementieren", "alias"),
    ("implement", "umsetzen", "alias"),
    ("implement", "einführen", "alias"),
    ("monitor", "überwachen", "alias"),
    ("test", "testen", "alias"),
    ("encrypt", "verschlüsseln", "alias"),
    ("document", "dokumentieren", "alias"),
    ("train", "schulen", "alias"),
    # Negative patterns
    ("exclude", "dürfen nicht", "negative_pattern"),
    ("exclude", "darf nicht", "negative_pattern"),
    ("prevent", "verhindern", "negative_pattern"),
    ("prevent", "nicht gespeichert", "negative_pattern"),
]

_MOCK_OBJECT_SYNONYMS = [
    ("multi_factor_auth", "mfa"),
    ("multi_factor_auth", "2fa"),
    ("password_policy", "passwort"),
    ("encryption", "verschlüsselung"),
    ("audit_logging", "audit-log"),
    ("firewall", "firewall"),
    ("personal_data", "personenbezogene daten"),
]


def _mock_execute(query):
    """Route mock queries to correct test data."""
    q = str(query)
    mock_result = MagicMock()
    if "action_types" in q:
        mock_result.fetchall.return_value = _MOCK_ACTION_TYPES
    elif "action_synonyms" in q:
        mock_result.fetchall.return_value = _MOCK_ACTION_SYNONYMS
    elif "object_synonyms" in q:
        mock_result.fetchall.return_value = _MOCK_OBJECT_SYNONYMS
    else:
        mock_result.fetchall.return_value = []
    return mock_result


@pytest.fixture
def registry():
    """Create a registry with mocked DB."""
    reg = OntologyRegistry()
    with patch("services.ontology_registry.SessionLocal") as mock_cls:
        mock_session = MagicMock()
        mock_session.execute = _mock_execute
        mock_cls.return_value = mock_session
        reg._load()
    return reg


# ── classify_action tests ────────────────────────────────────────────


class TestClassifyAction:
    def test_direct_alias(self, registry):
        assert registry.classify_action("implementieren") == "implement"
        assert registry.classify_action("überwachen") == "monitor"
        assert registry.classify_action("testen") == "test"

    def test_case_insensitive(self, registry):
        assert registry.classify_action("IMPLEMENTIEREN") == "implement"

    def test_negative_pattern(self, registry):
        assert registry.classify_action("dürfen nicht verwendet werden") == "exclude"
        assert registry.classify_action("darf nicht gespeichert werden") == "prevent"

    def test_negative_pattern_priority(self, registry):
        # "nicht gespeichert" is more specific than "darf nicht"
        assert registry.classify_action("nicht gespeichert") == "prevent"

    def test_substring_match(self, registry):
        assert registry.classify_action("Maßnahmen implementieren und dokumentieren") == "implement"

    def test_unknown_defaults_to_implement(self, registry):
        assert registry.classify_action("fliegen") == "implement"


# ── get_phase tests ──────────────────────────────────────────────────


class TestGetPhase:
    def test_known_phase(self, registry):
        assert registry.get_phase("implement") == "implementation"
        assert registry.get_phase("monitor") == "monitoring"
        assert registry.get_phase("test") == "testing"

    def test_unknown_defaults_to_implementation(self, registry):
        assert registry.get_phase("unknown_action") == "implementation"


# ── normalize_action tests ───────────────────────────────────────────


class TestNormalizeAction:
    def test_exact_match(self, registry):
        assert registry.normalize_action("implementieren") == "implement"
        assert registry.normalize_action("testen") == "test"

    def test_empty(self, registry):
        assert registry.normalize_action("") == ""

    def test_passthrough_unknown(self, registry):
        assert registry.normalize_action("fliegen") == "fliegen"


# ── normalize_object tests ───────────────────────────────────────────


class TestNormalizeObject:
    def test_exact_match(self, registry):
        assert registry.normalize_object("mfa") == "multi_factor_auth"
        assert registry.normalize_object("2fa") == "multi_factor_auth"
        assert registry.normalize_object("passwort") == "password_policy"

    def test_case_insensitive(self, registry):
        assert registry.normalize_object("MFA") == "multi_factor_auth"

    def test_substring_match(self, registry):
        assert registry.normalize_object("die personenbezogene daten verarbeiten") == "personal_data"

    def test_empty(self, registry):
        assert registry.normalize_object("") == ""

    def test_unknown_passthrough(self, registry):
        assert registry.normalize_object("raumschiff") == "raumschiff"


# ── Cache behavior tests ────────────────────────────────────────────


class TestCacheBehavior:
    def test_fresh_cache_not_stale(self, registry):
        assert registry._is_stale() is False

    def test_old_cache_is_stale(self, registry):
        registry._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
        assert registry._is_stale() is True


# ── Migration data consistency ───────────────────────────────────────


class TestF2MigrationData:
    def test_build_action_types(self):
        from scripts.f2_migrate_actions import build_action_types
        types = build_action_types()
        assert len(types) >= 26
        names = {t["canonical_name"] for t in types}
        assert "implement" in names
        assert "monitor" in names
        assert "encrypt" in names

    def test_build_action_synonyms(self):
        from scripts.f2_migrate_actions import build_action_synonyms
        synonyms = build_action_synonyms()
        assert len(synonyms) > 100

        # Check pattern types
        aliases = [s for s in synonyms if s["pattern_type"] == "alias"]
        negatives = [s for s in synonyms if s["pattern_type"] == "negative_pattern"]
        assert len(aliases) > 80
        assert len(negatives) > 15

    def test_no_duplicate_synonyms(self):
        from scripts.f2_migrate_actions import build_action_synonyms
        synonyms = build_action_synonyms()
        keys = [(s["synonym"], s["language"], s["pattern_type"]) for s in synonyms]
        assert len(keys) == len(set(keys))

    def test_all_canonical_actions_exist(self):
        from scripts.f2_migrate_actions import build_action_types, build_action_synonyms
        type_names = {t["canonical_name"] for t in build_action_types()}
        synonyms = build_action_synonyms()
        for s in synonyms:
            assert s["canonical_action"] in type_names, (
                "Synonym '%s' references unknown action '%s'" % (s["synonym"], s["canonical_action"])
            )


class TestF3MigrationData:
    def test_build_object_rows(self):
        from scripts.f3_migrate_objects import build_rows
        rows = build_rows()
        assert len(rows) >= 70

    def test_no_duplicate_objects(self):
        from scripts.f3_migrate_objects import build_rows
        rows = build_rows()
        keys = [(r["synonym"], r["language"]) for r in rows]
        assert len(keys) == len(set(keys))

    def test_known_objects_present(self):
        from scripts.f3_migrate_objects import build_rows
        rows = build_rows()
        synonyms = {r["synonym"] for r in rows}
        assert "mfa" in synonyms
        assert "passwort" in synonyms
        assert "firewall" in synonyms
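The `TestClassifyAction` cases pin down observable behaviour (negative patterns win over aliases, matching is case-insensitive and substring-based, unknown input falls back to `implement`) without showing how `OntologyRegistry` achieves it. The sketch below is one matching order that satisfies those cases. It is an assumption, not the actual implementation in `services/ontology_registry.py`: `TinyOntology`, `_longest_match` and the "longest synonym wins" rule are all hypothetical. The rows are a subset of the mock synonym data from the test file.

```python
from typing import Dict, List, Tuple

DEFAULT_ACTION = "implement"

# (canonical_action, synonym, pattern_type), a subset of the mock data above.
ROWS: List[Tuple[str, str, str]] = [
    ("implement", "implementieren", "alias"),
    ("monitor", "überwachen", "alias"),
    ("document", "dokumentieren", "alias"),
    ("exclude", "dürfen nicht", "negative_pattern"),
    ("exclude", "darf nicht", "negative_pattern"),
    ("prevent", "verhindern", "negative_pattern"),
    ("prevent", "nicht gespeichert", "negative_pattern"),
]


class TinyOntology:
    """Hypothetical in-memory stand-in for the DB-backed registry."""

    def __init__(self, rows: List[Tuple[str, str, str]]):
        self._aliases: Dict[str, str] = {}
        self._negatives: Dict[str, str] = {}
        for canonical, synonym, pattern_type in rows:
            table = self._negatives if pattern_type == "negative_pattern" else self._aliases
            table[synonym.lower()] = canonical

    @staticmethod
    def _longest_match(text: str, table: Dict[str, str]) -> str:
        """Return the action of the longest synonym contained in text, or ''."""
        hits = [syn for syn in table if syn in text]
        return table[max(hits, key=len)] if hits else ""

    def classify_action(self, text: str) -> str:
        lowered = text.lower()
        # Assumed order: negative patterns first, then aliases, then the default.
        return (
            self._longest_match(lowered, self._negatives)
            or self._longest_match(lowered, self._aliases)
            or DEFAULT_ACTION
        )


ontology = TinyOntology(ROWS)
```

Preferring the longest matching synonym is what lets "darf nicht gespeichert werden" resolve to `prevent` even though it also contains the shorter `exclude` pattern "darf nicht".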
@@ -0,0 +1,196 @@
"""
Regression Tests - verify pipeline updates don't break existing controls.

Requires: DATABASE_URL environment variable for DB tests.
Tests that need no DB always run (structural checks).
"""

import os
import sys
import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


# ============================================================================
# Structural tests (no DB needed)
# ============================================================================

class TestOntologyStability:
    """Verify ontology constants haven't accidentally changed."""

    def test_action_types_count(self):
        from services.control_ontology import ACTION_TYPES
        assert len(ACTION_TYPES) >= 26, f"ACTION_TYPES shrank to {len(ACTION_TYPES)}"

    def test_phase_order_count(self):
        from services.control_ontology import PHASE_ORDER
        assert len(PHASE_ORDER) >= 15, f"PHASE_ORDER shrank to {len(PHASE_ORDER)}"

    def test_key_action_types_exist(self):
        from services.control_ontology import ACTION_TYPES
        required = ["define", "implement", "monitor", "test", "prevent", "exclude", "train"]
        for action in required:
            assert action in ACTION_TYPES, f"Missing action_type: {action}"

    def test_classify_action_deterministic(self):
        """Same input must always produce same output."""
        from services.control_ontology import classify_action
        for _ in range(10):
            assert classify_action("implementieren") == "implement"
            assert classify_action("überwachen") == "monitor"
            assert classify_action("verhindern") == "prevent"


class TestDependencyEngineStability:
    """Verify dependency engine core functions haven't changed behavior."""

    def test_evaluate_condition_empty(self):
        from services.dependency_engine import evaluate_condition
        assert evaluate_condition({}, {}) is True

    def test_evaluate_condition_simple(self):
        from services.dependency_engine import evaluate_condition
        cond = {"field": "source.status", "op": "==", "value": "pass"}
        assert evaluate_condition(cond, {"source": {"status": "pass"}}) is True
        assert evaluate_condition(cond, {"source": {"status": "fail"}}) is False

    def test_apply_effect_not_applicable(self):
        from services.dependency_engine import apply_effect
        assert apply_effect({"set_status": "not_applicable"}, "fail") == "not_applicable"

    def test_default_priorities_unchanged(self):
        from services.dependency_engine import DEFAULT_PRIORITIES
        assert DEFAULT_PRIORITIES["supersedes"] == 10
        assert DEFAULT_PRIORITIES["scope_exclusion"] == 20
        assert DEFAULT_PRIORITIES["prerequisite"] == 50
        assert DEFAULT_PRIORITIES["compensating_control"] == 80


class TestDocumentComplianceStability:
    """Verify document compliance rules haven't changed."""

    def test_basic_website_requires_impressum(self):
        from services.document_scope_resolver import resolve_required_documents
        result = resolve_required_documents({"has_website": True})
        docs = result.get("required_documents", [])
        doc_types = [d["document_type"] if isinstance(d, dict) else d.document_type for d in docs]
        assert "impressum" in doc_types
        assert "privacy_policy" in doc_types


# ============================================================================
# DB tests (require DATABASE_URL)
# ============================================================================

@pytest.mark.skipif(
    not os.getenv("DATABASE_URL"),
    reason="DATABASE_URL not set"
)
class TestControlCountStability:
    """Draft count must stay within expected range."""

    def test_draft_count_minimum(self, db_session):
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count} (expected >140k)"

    def test_draft_count_maximum(self, db_session):
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count < 200000, f"Draft count too high: {count} (expected <200k)"

    def test_no_null_titles(self, db_session):
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (title IS NULL OR title = '')"
        )).scalar()
        assert null_count == 0, f"{null_count} controls without title"

    def test_assertion_coverage(self, db_session):
        from sqlalchemy import text
        no_assertion = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL "
            "     OR generation_metadata->>'assertion' = '')"
        )).scalar()
        total = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        coverage = (total - no_assertion) / max(total, 1) * 100
        assert coverage > 99, f"Assertion coverage only {coverage:.1f}% (expected >99%)"


@pytest.mark.skipif(
    not os.getenv("DATABASE_URL"),
    reason="DATABASE_URL not set"
)
class TestDependencyGraphStability:
    """Dependency graph must be valid and within expected size."""

    def test_dependency_count_minimum(self, db_session):
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert count > 10000, f"Too few dependencies: {count} (expected >10k)"

    def test_no_self_dependencies(self, db_session):
        from sqlalchemy import text
        self_deps = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies "
            "WHERE source_control_id = target_control_id AND is_active = true"
        )).scalar()
        assert self_deps == 0, f"{self_deps} self-referencing dependencies"

    def test_no_orphan_dependencies(self, db_session):
|
||||
from sqlalchemy import text
|
||||
orphans = db_session.execute(text("""
|
||||
SELECT COUNT(*) FROM compliance.control_dependencies d
|
||||
WHERE d.is_active = true
|
||||
AND NOT EXISTS (
|
||||
SELECT 1 FROM compliance.canonical_controls c
|
||||
WHERE c.id = d.source_control_id AND c.release_state = 'draft'
|
||||
)
|
||||
""")).scalar()
|
||||
# Some orphans OK (pointing to deprecated/duplicate controls)
|
||||
assert orphans < 1000, f"Too many orphan dependencies: {orphans}"
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
not os.getenv("DATABASE_URL"),
|
||||
reason="DATABASE_URL not set"
|
||||
)
|
||||
class TestQualityMetrics:
|
||||
"""Quality metrics must stay within target ranges."""
|
||||
|
||||
def test_duplicate_rate(self, db_session):
|
||||
from sqlalchemy import text
|
||||
total = db_session.execute(text(
|
||||
"SELECT COUNT(DISTINCT generation_metadata->>'merge_group_hint') "
|
||||
"FROM compliance.canonical_controls "
|
||||
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
|
||||
"AND generation_metadata->>'merge_group_hint' IS NOT NULL"
|
||||
)).scalar()
|
||||
dups = db_session.execute(text("""
|
||||
SELECT COUNT(*) FROM (
|
||||
SELECT generation_metadata->>'merge_group_hint', COUNT(*)
|
||||
FROM compliance.canonical_controls
|
||||
WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
|
||||
AND generation_metadata->>'merge_group_hint' IS NOT NULL
|
||||
GROUP BY generation_metadata->>'merge_group_hint'
|
||||
HAVING COUNT(*) > 1
|
||||
) sub
|
||||
""")).scalar()
|
||||
rate = dups / max(total, 1) * 100
|
||||
assert rate < 5, f"Duplicate merge_key rate {rate:.1f}% exceeds 5% threshold"
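Review note: the rate checked above is the share of distinct `merge_group_hint` values that occur on more than one control. A minimal in-memory sketch of the same arithmetic follows, assuming the hints are already loaded into a list. The function name `duplicate_group_rate` is hypothetical and not part of this diff.

```python
from collections import Counter


def duplicate_group_rate(hints):
    """Return the percentage of distinct hints shared by more than one item.

    Hypothetical helper mirroring the two SQL queries in test_duplicate_rate:
    NULL hints are ignored, and the denominator is the number of distinct hints.
    """
    counts = Counter(h for h in hints if h is not None)
    total_groups = len(counts)
    duplicated_groups = sum(1 for n in counts.values() if n > 1)
    # max(..., 1) avoids division by zero on an empty table, as in the test.
    return duplicated_groups / max(total_groups, 1) * 100


if __name__ == "__main__":
    sample = ["a", "a", "b", "c", None, "d"]
    print(f"{duplicate_group_rate(sample):.1f}%")
```

The denominator counts distinct hints, not rows, so a single large group counts once however many controls share it.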
|
||||
@@ -0,0 +1,285 @@
|
||||
"""Tests for RegulationRegistry — DB-backed lookup with cache and fallback."""
|
||||
|
||||
import time
|
||||
from unittest.mock import patch, MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
from services.regulation_registry import (
|
||||
RegulationRegistry,
|
||||
_CACHE_TTL_SECONDS,
|
||||
)
|
||||
|
||||
|
||||
# ── Test data: simulates DB rows ──────────────────────────────────────────
|
||||
|
||||
_MOCK_DB_ROWS = [
|
||||
# (regulation_id, regulation_name_de, license_rule, license_type,
|
||||
# attribution, source_type, jurisdiction, status)
|
||||
("eu_2016_679", "DSGVO (EU) 2016/679", 1, "EU_LAW",
|
||||
None, "law", "EU", "active"),
|
||||
("nist_sp_800_53", "NIST SP 800-53 Rev. 5", 1, "NIST_PUBLIC_DOMAIN",
|
||||
None, "standard", "US", "active"),
|
||||
("owasp_asvs", "OWASP ASVS 4.0", 2, "CC-BY-SA-4.0",
|
||||
"OWASP Foundation, CC BY-SA 4.0", "standard", "INT", "active"),
|
||||
("bdsg", "Bundesdatenschutzgesetz (BDSG)", 1, "DE_LAW",
|
||||
None, "law", "DE", "active"),
|
||||
("at_dsg", "Österreichisches Datenschutzgesetz (DSG)", 1, "AT_LAW",
|
||||
None, "law", "AT", "active"),
|
||||
]
|
||||
|
||||
|
||||
def _mock_db_execute(query):
|
||||
"""Mock that returns our test rows."""
|
||||
mock_result = MagicMock()
|
||||
mock_result.fetchall.return_value = _MOCK_DB_ROWS
|
||||
return mock_result
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def registry():
|
||||
"""Create a registry with mocked DB."""
|
||||
reg = RegulationRegistry()
|
||||
with patch("services.regulation_registry.SessionLocal") as mock_session_cls:
|
||||
mock_session = MagicMock()
|
||||
mock_session.execute = _mock_db_execute
|
||||
mock_session_cls.return_value = mock_session
|
||||
reg._load()
|
||||
return reg
|
||||
|
||||
|
||||
# ── classify_regulation tests ─────────────────────────────────────────────
|
||||
|
||||
|
||||
class TestClassifyRegulation:
|
||||
def test_exact_match_eu_law(self, registry):
|
||||
result = registry.classify_regulation("eu_2016_679")
|
||||
assert result["rule"] == 1
|
||||
assert result["license"] == "EU_LAW"
|
||||
assert result["source_type"] == "law"
|
||||
assert result["name"] == "DSGVO (EU) 2016/679"
|
||||
|
||||
def test_exact_match_case_insensitive(self, registry):
|
||||
result = registry.classify_regulation("EU_2016_679")
|
||||
assert result["rule"] == 1
|
||||
assert result["name"] == "DSGVO (EU) 2016/679"
|
||||
|
||||
def test_exact_match_with_whitespace(self, registry):
|
||||
result = registry.classify_regulation(" eu_2016_679 ")
|
||||
assert result["rule"] == 1
|
||||
|
||||
def test_nist_standard(self, registry):
|
||||
result = registry.classify_regulation("nist_sp_800_53")
|
||||
assert result["rule"] == 1
|
||||
assert result["source_type"] == "standard"
|
||||
|
||||
def test_owasp_rule2(self, registry):
|
||||
result = registry.classify_regulation("owasp_asvs")
|
||||
assert result["rule"] == 2
|
||||
assert result["attribution"] == "OWASP Foundation, CC BY-SA 4.0"
|
||||
|
||||
def test_german_law(self, registry):
|
||||
result = registry.classify_regulation("bdsg")
|
||||
assert result["rule"] == 1
|
||||
assert result["source_type"] == "law"
|
||||
assert result["jurisdiction"] == "DE"
|
||||
|
||||
def test_austrian_law(self, registry):
|
||||
result = registry.classify_regulation("at_dsg")
|
||||
assert result["rule"] == 1
|
||||
assert result["jurisdiction"] == "AT"
|
||||
|
||||
def test_prefix_enisa_rule2(self, registry):
|
||||
result = registry.classify_regulation("enisa_supply_chain_2024")
|
||||
assert result["rule"] == 2
|
||||
assert result["source_type"] == "standard"
|
||||
assert "ENISA" in result["attribution"]
|
||||
|
||||
def test_prefix_bsi_rule3(self, registry):
|
||||
result = registry.classify_regulation("bsi_tr_03161")
|
||||
assert result["rule"] == 3
|
||||
assert result["source_type"] == "restricted"
|
||||
assert result["name"] == "INTERNAL_ONLY"
|
||||
|
||||
def test_prefix_iso_rule3(self, registry):
|
||||
result = registry.classify_regulation("iso_27001")
|
||||
assert result["rule"] == 3
|
||||
assert result["source_type"] == "restricted"
|
||||
|
||||
def test_prefix_etsi_rule3(self, registry):
|
||||
result = registry.classify_regulation("etsi_en_303_645")
|
||||
assert result["rule"] == 3
|
||||
|
||||
def test_unknown_defaults_to_restricted(self, registry):
|
||||
result = registry.classify_regulation("some_unknown_regulation")
|
||||
assert result["rule"] == 3
|
||||
assert result["source_type"] == "restricted"
|
||||
assert result["license"] == "UNKNOWN"
|
||||
|
||||
|
||||
# ── source_type_by_name tests ────────────────────────────────────────────
|
||||
|
||||
|
||||
class TestSourceTypeByName:
|
||||
def test_exact_match_law(self, registry):
|
||||
result = registry.source_type_by_name("DSGVO (EU) 2016/679")
|
||||
assert result == "law"
|
||||
|
||||
def test_exact_match_standard(self, registry):
|
||||
result = registry.source_type_by_name("NIST SP 800-53 Rev. 5")
|
||||
assert result == "standard"
|
||||
|
||||
def test_empty_returns_framework(self, registry):
|
||||
assert registry.source_type_by_name("") == "framework"
|
||||
assert registry.source_type_by_name(None) == "framework"
|
||||
|
||||
def test_heuristic_law(self, registry):
|
||||
assert registry.source_type_by_name("Verordnung XYZ") == "law"
|
||||
assert registry.source_type_by_name("Some EU Directive") == "law"
|
||||
|
||||
def test_heuristic_guideline(self, registry):
|
||||
assert registry.source_type_by_name("EDPB Leitlinie 99/2025") == "guideline"
|
||||
assert registry.source_type_by_name("BSI Standard 200-1") == "guideline"
|
||||
|
||||
def test_heuristic_framework(self, registry):
|
||||
# Use "ENISA Cloud Report" here: "ENISA Cloud Guidelines" would match "guideline" before "enisa" in the heuristic order
|
||||
assert registry.source_type_by_name("ENISA Cloud Report") == "framework"
|
||||
assert registry.source_type_by_name("OWASP Testing Guide") == "framework"
|
||||
|
||||
def test_unknown_returns_framework(self, registry):
|
||||
assert registry.source_type_by_name("Completely Unknown Document") == "framework"
|
||||
|
||||
|
||||
# ── is_open_source tests ──────────────────────────────────────────────────
|
||||
|
||||
|
||||
class TestIsOpenSource:
|
||||
def test_rule1_is_open(self, registry):
|
||||
assert registry.is_open_source("eu_2016_679") is True
|
||||
|
||||
def test_rule2_is_open(self, registry):
|
||||
assert registry.is_open_source("owasp_asvs") is True
|
||||
|
||||
def test_rule3_is_not_open(self, registry):
|
||||
assert registry.is_open_source("bsi_tr_03161") is False
|
||||
|
||||
def test_unknown_is_not_open(self, registry):
|
||||
assert registry.is_open_source("unknown_thing") is False
|
||||
|
||||
|
||||
# ── Cache behavior tests ──────────────────────────────────────────────────
|
||||
|
||||
|
||||
class TestCacheBehavior:
|
||||
def test_fresh_cache_not_stale(self, registry):
|
||||
assert registry._is_stale() is False
|
||||
|
||||
def test_old_cache_is_stale(self, registry):
|
||||
registry._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
|
||||
assert registry._is_stale() is True
|
||||
|
||||
def test_ensure_loaded_reloads_when_stale(self):
|
||||
reg = RegulationRegistry()
|
||||
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 100 # force stale
|
||||
|
||||
load_called = False
|
||||
|
||||
def tracking_load():
|
||||
nonlocal load_called
|
||||
load_called = True
|
||||
|
||||
reg._load = tracking_load
|
||||
reg._ensure_loaded()
|
||||
assert load_called, "_load should have been called when cache is stale"
|
||||
|
||||
def test_ensure_loaded_skips_when_fresh(self, registry):
|
||||
with patch.object(registry, "_load") as mock_load:
|
||||
registry._ensure_loaded()
|
||||
mock_load.assert_not_called()
|
||||
|
||||
|
||||
# ── Graceful degradation tests ────────────────────────────────────────────
|
||||
|
||||
|
||||
class TestGracefulDegradation:
|
||||
def test_db_failure_uses_stale_cache(self):
|
||||
"""If DB fails, stale cache entries are still usable."""
|
||||
reg = RegulationRegistry()
|
||||
|
||||
# First load succeeds
|
||||
with patch("services.regulation_registry.SessionLocal") as mock_cls:
|
||||
mock_session = MagicMock()
|
||||
mock_session.execute = _mock_db_execute
|
||||
mock_cls.return_value = mock_session
|
||||
reg._load()
|
||||
|
||||
# Force stale
|
||||
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
|
||||
|
||||
# Second load fails — DB error
|
||||
from sqlalchemy.exc import OperationalError
|
||||
with patch("services.regulation_registry.SessionLocal") as mock_cls:
|
||||
mock_cls.side_effect = OperationalError("connection refused", None, None)
|
||||
reg._ensure_loaded()
|
||||
|
||||
# Should still have cached data
|
||||
result = reg.classify_regulation("eu_2016_679")
|
||||
assert result["rule"] == 1
|
||||
|
||||
def test_empty_registry_returns_unknown(self):
|
||||
"""Unloaded registry returns safe defaults."""
|
||||
reg = RegulationRegistry()
|
||||
reg._loaded_at = time.monotonic() # pretend fresh but empty
|
||||
|
||||
result = reg.classify_regulation("eu_2016_679")
|
||||
assert result["rule"] == 3 # safe default
|
||||
assert result["license"] == "UNKNOWN"
|
||||
|
||||
|
||||
# ── Migration data consistency tests ──────────────────────────────────────
|
||||
|
||||
|
||||
class TestMigrationDataConsistency:
|
||||
"""Verify that the migration script produces valid data."""
|
||||
|
||||
def test_build_rows_produces_data(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
assert len(rows) > 100 # more than 100 entries
|
||||
|
||||
def test_all_rows_have_required_fields(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
for row in rows:
|
||||
assert row["regulation_id"], f"Missing regulation_id: {row}"
|
||||
assert row["regulation_name_de"], f"Missing name: {row}"
|
||||
assert row["license_rule"] in (1, 2, 3), f"Bad rule: {row}"
|
||||
assert row["source_type"] in (
|
||||
"law", "guideline", "standard", "framework", "restricted"
|
||||
), f"Bad source_type: {row}"
|
||||
assert row["jurisdiction"], f"Missing jurisdiction: {row}"
|
||||
assert row["status"] in ("active", "needs_review", "deprecated")
|
||||
|
||||
def test_no_duplicate_regulation_ids(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
ids = [r["regulation_id"] for r in rows]
|
||||
assert len(ids) == len(set(ids)), f"Duplicates: {[x for x in ids if ids.count(x) > 1]}"
|
||||
|
||||
def test_known_regulations_present(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
ids = {r["regulation_id"] for r in rows}
|
||||
assert "eu_2016_679" in ids # DSGVO
|
||||
assert "bdsg" in ids # BDSG
|
||||
assert "nist_sp_800_53" in ids # NIST
|
||||
assert "owasp_asvs" in ids # OWASP
|
||||
|
||||
def test_owasp_has_attribution(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
owasp = [r for r in rows if r["regulation_id"] == "owasp_asvs"][0]
|
||||
assert owasp["attribution"] is not None
|
||||
assert "OWASP" in owasp["attribution"]
|
||||
assert owasp["license_rule"] == 2
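Review note: the cache and degradation tests above exercise a TTL cache that keeps serving stale entries when a reload fails. The implementation of `RegulationRegistry` is not shown in this diff, so the following is only a sketch of that pattern under assumptions. The class `TtlCache`, its `loader` callback, and the injectable `clock` are hypothetical names, not the service's real API.

```python
import time


class TtlCache:
    """Hypothetical TTL cache that falls back to stale data if reloading fails."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader        # callable returning a dict of entries
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._data = {}
        self._loaded_at = None       # None means "never loaded"

    def _is_stale(self):
        if self._loaded_at is None:
            return True
        return self._clock() - self._loaded_at > self._ttl

    def _ensure_loaded(self):
        if not self._is_stale():
            return
        try:
            fresh = dict(self._loader())
        except Exception:
            # Reload failed: keep whatever was cached before.
            return
        self._data = fresh
        self._loaded_at = self._clock()

    def get(self, key, default=None):
        self._ensure_loaded()
        # Normalize like the lookups tested above (case and whitespace).
        return self._data.get(key.strip().lower(), default)
```

Passing the clock in as a parameter lets a test advance time without sleeping or touching private attributes.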
|
||||
@@ -162,8 +162,6 @@ services:
|
||||
profiles: ["disabled"]
|
||||
gitea-runner:
|
||||
profiles: ["disabled"]
|
||||
night-scheduler:
|
||||
profiles: ["disabled"]
|
||||
admin-core:
|
||||
profiles: ["disabled"]
|
||||
pitch-deck:
|
||||
|
||||
+23
-34
@@ -414,10 +414,10 @@ services:
|
||||
condition: service_healthy
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://127.0.0.1:8098/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 10s
|
||||
interval: 60s
|
||||
timeout: 30s
|
||||
retries: 10
|
||||
start_period: 30s
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- breakpilot-network
|
||||
@@ -434,7 +434,7 @@ services:
|
||||
EMBEDDING_BACKEND: ${EMBEDDING_BACKEND:-local}
|
||||
LOCAL_EMBEDDING_MODEL: ${LOCAL_EMBEDDING_MODEL:-BAAI/bge-m3}
|
||||
LOCAL_RERANKER_MODEL: ${LOCAL_RERANKER_MODEL:-cross-encoder/ms-marco-MiniLM-L-6-v2}
|
||||
PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-pymupdf}
|
||||
PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-auto}
|
||||
OPENAI_API_KEY: ${OPENAI_API_KEY:-}
|
||||
COHERE_API_KEY: ${COHERE_API_KEY:-}
|
||||
LOG_LEVEL: ${LOG_LEVEL:-INFO}
|
||||
@@ -490,9 +490,8 @@ services:
|
||||
volumes:
|
||||
- gitea_data:/var/lib/gitea
|
||||
- gitea_config:/etc/gitea
|
||||
- /etc/timezone:/etc/timezone:ro
|
||||
- /etc/localtime:/etc/localtime:ro
|
||||
environment:
|
||||
TZ: "Europe/Berlin"
|
||||
USER_UID: "1000"
|
||||
USER_GID: "1000"
|
||||
GITEA__database__DB_TYPE: postgres
|
||||
@@ -583,33 +582,6 @@ services:
|
||||
networks:
|
||||
- breakpilot-network
|
||||
|
||||
# =========================================================
|
||||
# NIGHT SCHEDULER
|
||||
# =========================================================
|
||||
night-scheduler:
|
||||
build:
|
||||
context: ./night-scheduler
|
||||
dockerfile: Dockerfile
|
||||
container_name: bp-core-night-scheduler
|
||||
ports:
|
||||
- "8096:8096"
|
||||
volumes:
|
||||
- /var/run/docker.sock:/var/run/docker.sock
|
||||
- ./night-scheduler/config:/config
|
||||
environment:
|
||||
COMPOSE_PROJECT_NAME: breakpilot-core
|
||||
CONTAINER_PATTERN: "bp-*"
|
||||
EXCLUDED_CONTAINERS: "bp-core-night-scheduler,bp-core-nginx,bp-core-postgres,bp-core-valkey"
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://127.0.0.1:8096/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
start_period: 10s
|
||||
retries: 3
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- breakpilot-network
|
||||
|
||||
# =========================================================
|
||||
# ADMIN CORE
|
||||
# =========================================================
|
||||
@@ -910,3 +882,20 @@ services:
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- breakpilot-network
|
||||
|
||||
# =========================================================
|
||||
# MARKETING WEBSITE - BreakPilot Produktwebsite
|
||||
# =========================================================
|
||||
marketing-website:
|
||||
build:
|
||||
context: ./marketing-website
|
||||
dockerfile: Dockerfile
|
||||
container_name: bp-core-marketing-website
|
||||
platform: linux/arm64
|
||||
ports:
|
||||
- "3014:3000"
|
||||
environment:
|
||||
NODE_ENV: production
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- breakpilot-network
|
||||
|
||||
Some files were not shown because too many files have changed in this diff.