Compare commits
25 Commits
f130c45ca8
...
aab8eeb335
| SHA1 |
|---|
| aab8eeb335 |
| 9437e029d0 |
| 4fd2bfefcd |
| fac9280716 |
| 118be3540d |
| a9671a572b |
| 2f4a3f2ea2 |
| 0b0eed27b0 |
| 97a7f6f264 |
| ff21bc258a |
| 3009f3d13a |
| 5a6e588641 |
| 41183ff93d |
| 75dda9ac92 |
| a459636bc4 |
| ddad58f607 |
| 93099b2770 |
| da21339e76 |
| 6ab10415d8 |
| d9c16fb914 |
| 6f58fdbaa5 |
| b8ff4e9290 |
| f2104768a0 |
| e8df15c0f8 |
| 7c5592b50e |
@@ -0,0 +1,115 @@
# Session Instructions: Block F, Hardcoded Knowledge Migration

**Date:** 2026-05-03
**For:** Next Claude session
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)

---

## NEXT STEP: Block F1, Regulation Registry

### What to do

1. **DB table:** create `compliance.regulation_registry` (migration script)
2. **Migrate data** from `control_generator.py` (135 entries) and `source_type_classification.py` (58 entries)
3. **Auto-create** in the RAG service on document upload (status='needs_review')
4. **Backend API** in the breakpilot-compliance backend (GET/POST/PUT /v1/regulations)
5. **Frontend** in the breakpilot-compliance admin under `/sdk/regulation-registry` (between roadmap and isms)
6. **Sync check** script (weekly: Qdrant regulation_ids vs. DB)
7. **Switch the code over** in control_generator.py (dict → DB query with cache)

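Step 7 (dict → DB query with cache) could look roughly like the sketch below. It is a minimal illustration, not the actual implementation: the class name `CachedRegistry`, the `loader` callable and the 300-second default TTL are assumptions made for this example, and `load_rows` is a hypothetical stand-in for the real query against `compliance.regulation_registry`.

```python
import time
from typing import Callable, Dict, Optional


class CachedRegistry:
    """Regulation lookup that reloads all rows once the TTL has expired."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, dict]],
        ttl_seconds: float = 300.0,  # assumed default, tune as needed
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._rows = self._loader()
            self._loaded_at = now

    def get(self, regulation_id: str) -> Optional[dict]:
        """Return the registry row for regulation_id, or None if unknown."""
        self._refresh_if_stale()
        return self._rows.get(regulation_id)


def load_rows() -> Dict[str, dict]:
    # Hypothetical loader: the real one would SELECT from the registry table.
    return {"eu_2016_679": {"license_rule": 1, "source_type": "law"}}


registry = CachedRegistry(load_rows)
print(registry.get("eu_2016_679"))
```

Injecting the loader and the clock keeps the cache testable without a database, which fits the rule that most tests in this repo run as pure Python.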

### Frontend requirements (breakpilot-compliance admin, port 3007)

- Nav position: between `/sdk/roadmap` and `/sdk/isms`
- Table of all regulations (sortable, filterable)
- Status badge: "Needs Review" (yellow), "Active" (green), "Deprecated" (grey)
- Counter in the nav for unreviewed entries
- Inline edit: license_rule, jurisdiction, source_type, names
- "Approve" button → status='active'
- Discrepancy display: regulation_ids that exist in Qdrant but not in the DB

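The discrepancy display and the weekly sync check both reduce to a set difference between the regulation_ids seen in Qdrant and those registered in the DB. A minimal sketch, assuming both sides have already been fetched into plain iterables (the function name `find_unregistered` is made up for this example):

```python
from typing import Iterable, List


def find_unregistered(qdrant_ids: Iterable[str], db_ids: Iterable[str]) -> List[str]:
    """Return regulation_ids that occur in Qdrant payloads but not in the DB."""
    known = set(db_ids)
    # Skip empty ids: chunks without a regulation_id are not a registry gap.
    return sorted({rid for rid in qdrant_ids if rid and rid not in known})


missing = find_unregistered(
    ["eu_2016_679", "nist_sp_800_53", "", "eu_2016_679"],
    ["eu_2016_679"],
)
print(missing)
```

Each id returned here would be a candidate for an auto-created row with status='needs_review'.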

### Critical files

| Repo | File | Action |
|------|------|--------|
| core | `control-pipeline/services/control_generator.py` lines 75-236 | EDIT: dict → DB |
| core | `control-pipeline/data/source_type_classification.py` | DELETE (after migration) |
| core | `rag-service/api/documents.py` | EDIT: auto-create on upload |
| compliance | `backend-compliance/compliance/api/regulations.py` | NEW: API endpoints |
| compliance | `admin-compliance/app/sdk/regulation-registry/` | NEW: frontend page |

### DB schema

```sql
CREATE TABLE compliance.regulation_registry (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    regulation_id VARCHAR(100) UNIQUE NOT NULL,
    regulation_name_de TEXT,
    regulation_name_en TEXT,
    regulation_short VARCHAR(50),
    license_rule INTEGER NOT NULL DEFAULT 1 CHECK (license_rule IN (1, 2, 3)),
    license_type VARCHAR(50),
    source_type VARCHAR(20) NOT NULL DEFAULT 'law',
    jurisdiction VARCHAR(10),
    category VARCHAR(50),
    celex VARCHAR(20),
    url TEXT,
    status VARCHAR(20) NOT NULL DEFAULT 'needs_review',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_reg_registry_status ON compliance.regulation_registry(status);
CREATE INDEX idx_reg_registry_jurisdiction ON compliance.regulation_registry(jurisdiction);
```

---

## OVERALL PLAN Block F (4 days)

| Phase | What | Effort | Status |
|-------|------|--------|--------|
| F1 | Regulation Registry (DB + API + frontend + auto-create) | 1 day | 🔥 NEXT |
| F2 | Action types + synonyms → DB | 1 day | Pending |
| F3 | Object synonyms → DB | 0.5 day | Pending |
| F4 | LLM synonym enrichment | 1 day | Pending |
| F5 | Validation + cleanup | 0.5 day | Pending |

---

## SESSION 02-03.05.2026 COMPLETED

- Block D5+: NIST/ENISA PDF quality (0% → 45%)
- Block D6: Citation backfill (3,651 controls)
- Block E2: 8 German laws (1,629 chunks)
- Block E3: 5 EU regulations (1,057 chunks)
- Block E4: GoBD, BAIT, VAIT (144 chunks)
- Block E6: 3 Swiss + 4 Austrian laws (3,881 chunks)
- Block E7: 9 court rulings as full text (709 chunks total)
  - Schrems II: 154, BVerfG Datenanalyse: 161, DSK OH Telemedien: 119
  - Meta: 101, BAG Zeiterfassung: 48, Planet49: 42, SCHUFA: 41
  - Schadenersatz: 29, Google Fonts: 14
- Infra: Qdrant snapshot, upload-before-delete, 99 tests

**Total new chunks this session: ~25,000+**

---

## TESTS

```bash
# Embedding service (99 tests)
cd embedding-service && python3 -m pytest test_chunking.py test_d4_bgb.py test_nist_normalization.py -v

# Control pipeline (387 tests)
PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v

# Qdrant snapshot
ssh macmini "cd ~/Projekte/breakpilot-core && bash scripts/qdrant-snapshot.sh"
```

---

## PLAN FILE

Block F detailed plan: `/Users/benjaminadmin/.claude/plans/humming-nibbling-sonnet.md`
@@ -0,0 +1,335 @@
# Instructions: Test Strategy Block C

**Repo:** `/Users/benjaminadmin/Projekte/breakpilot-core/`
**Directory:** `control-pipeline/tests/`
**Created:** 2026-05-01
**Estimated effort:** 2-3 days

## Starting point

- 221 existing tests in 7 files (do NOT change them!)
- 40 golden test cases (golden_controls.yaml)
- 24 demo cases (demo_cases.yaml)
- All tests are pure Python, no DB required
- Pipeline v1 completed: 151,675 unique controls, 15,291 dependencies

## Task 1: Real-World Benchmarks (C1)

### What to do

Manually review 10 real German e-commerce websites and create a ground-truth YAML file for each.

### Directory

```
control-pipeline/tests/benchmarks/
├── amazon_de.yaml
├── zalando_de.yaml
├── otto_de.yaml
├── lidl_de.yaml
├── check24_de.yaml
├── booking_de.yaml
├── thomann_de.yaml
├── aboutyou_de.yaml
├── mytheresa_com.yaml
└── kleiner_shop.yaml
```

### Format per website

```yaml
website: amazon.de
url: https://www.amazon.de
checked_at: "2026-05-XX"
checked_by: "Name"

ground_truth:
  impressum:
    present: true/false
    complete: true/false              # name, address, email, commercial register number, VAT ID
    within_2_clicks: true/false
    missing_fields: []                # e.g. ["USt-ID", "Handelsregister"]

  datenschutzerklaerung:
    present: true/false
    art13_complete: true/false
    missing_art13_fields: []          # e.g. ["Speicherdauer", "Empfaenger"]
    rechtsgrundlagen_korrekt: true/false
    wrong_legal_bases: []             # e.g. ["Analytics auf lit. f statt lit. a"]

  cookie_banner:
    present: true/false
    reject_equally_easy: true/false   # CNIL: rejecting must be equally prominent
    cookies_before_consent: true/false # Planet49: cookies set BEFORE consent?
    dark_patterns: []                 # e.g. ["Ablehnen-Button kleiner", "Ablehnen hinter Einstellungen"]

  widerrufsbelehrung:
    present: true/false
    matches_legal_template: true/false # statutory template

  agb:
    present: true/false
    checkout_button_text: "..."       # e.g. "Jetzt kaufen" (correct) vs "Weiter" (wrong)

  google_fonts_external: true/false
  google_analytics: true/false

  third_party_services:
    - name: "Google Analytics"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
    - name: "Facebook Pixel"
      detected: true
      consent_required: true
      consent_obtained_before_load: false

expected_findings:
  - "Cookie-Banner: Ablehnen nicht gleichwertig"
  - "Google Analytics ohne vorherige Einwilligung"
  - "DSE: Rechtsgrundlage fuer Analytics falsch"

expected_no_findings:
  - "Impressum fehlt"  # it is present, so it must not be flagged
```

### Test runner

```python
# control-pipeline/tests/test_benchmarks.py
"""
Real-world benchmark tests: compare agent findings against the manual ground truth.
Requires: the compliance agent must be running (https://macmini:3007/sdk/agent)
"""

import os

import pytest
import yaml

BENCHMARK_DIR = os.path.join(os.path.dirname(__file__), "benchmarks")


def load_benchmarks():
    """Load every ground-truth YAML file from the benchmarks directory."""
    cases = []
    for f in sorted(os.listdir(BENCHMARK_DIR)):
        if f.endswith(".yaml"):
            with open(os.path.join(BENCHMARK_DIR, f), encoding="utf-8") as fh:
                cases.append(yaml.safe_load(fh))
    return cases


class TestBenchmarks:
    """Measure precision/recall against the ground truth."""

    @pytest.mark.parametrize("case", load_benchmarks(), ids=lambda c: c["website"])
    def test_benchmark(self, case):
        # TODO: run the agent against the website
        # TODO: compare findings with expected_findings
        # TODO: compute precision + recall
        pass
```

### How the ground truth is created

1. Open the website in a browser
2. Check the Impressum (all mandatory fields under § 5 DDG)
3. Read the privacy policy (Art. 13 GDPR checklist)
4. Test the cookie banner (is rejecting equally easy? cookies before consent?)
5. Check the withdrawal notice (Widerrufsbelehrung) against the statutory template
6. Browser DevTools: Network tab → external requests before consent?
7. Document everything in YAML

**Target metrics:**
- Precision > 80% (few false positives)
- Recall > 70% (finds most of the real problems)

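The two target metrics can be computed per benchmark case along the following lines. This is a sketch under stated assumptions: findings are compared as exact strings (the real agent output may need normalisation or fuzzy matching first), and an empty set is scored as 1.0, which is a convention chosen here rather than something the plan specifies.

```python
from typing import Iterable, Tuple


def precision_recall(expected: Iterable[str], actual: Iterable[str]) -> Tuple[float, float]:
    """Compare agent findings (actual) with ground-truth findings (expected)."""
    expected_set, actual_set = set(expected), set(actual)
    true_positives = len(expected_set & actual_set)
    # Precision: share of reported findings that are real.
    precision = true_positives / len(actual_set) if actual_set else 1.0
    # Recall: share of real findings that were reported.
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return precision, recall


expected = [
    "Cookie-Banner: Ablehnen nicht gleichwertig",
    "Google Analytics ohne vorherige Einwilligung",
    "DSE: Rechtsgrundlage fuer Analytics falsch",
]
actual = expected[:2] + ["Impressum fehlt"]  # one false positive, one miss
precision, recall = precision_recall(expected, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Entries from `expected_no_findings` that show up in the agent output count as false positives and lower precision in the same way.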

---

## Task 2: Adversarial Tests (C2)

### What to do

Create 30 tricky test cases that challenge the agent and the controls.

### File

`control-pipeline/tests/adversarial_cases.yaml`

### Categories

**A. Wrong legal basis (8 cases):**
- Analytics based on lit. f instead of lit. a
- Marketing emails based on lit. b instead of lit. a
- Employee tracking based on lit. f instead of a works agreement (Betriebsvereinbarung)
- Biometric data based on lit. f instead of Art. 9
- Profiling based on lit. f instead of Art. 22
- Newsletter based on lit. b instead of lit. a
- Social login based on lit. b instead of lit. a
- Credit scoring based on lit. f instead of lit. a + Art. 22

**B. Dark patterns (6 cases):**
- Reject button exists but is 3px in size and grey
- "Alle akzeptieren" (accept all) is prominent, with "Einstellungen" (settings) shown in place of "Ablehnen" (reject)
- Cookie wall: content only visible after consent
- Pre-ticked checkboxes (Planet49)
- Confirm-shaming: "Nein, ich moechte keine sichere Verbindung" ("No, I do not want a secure connection")
- Rejecting takes 3 clicks, accepting only 1

**C. Almost-complete documents (6 cases):**
- Impressum complete except for the VAT ID
- Privacy policy without a retention period
- Privacy policy without DPO contact details
- Withdrawal notice with the wrong start of the withdrawal period
- Terms and conditions (AGB) without a place of jurisdiction
- Cookie policy without a list of all cookies

**D. Semantically similar but different (5 cases):**
- "Admin-MFA" vs "User-MFA" (different scopes!)
- "Delete data after contract termination" vs "Delete data after the retention period"
- "Rate Limiting API" vs "Rate Limiting Login"
- "Encryption at rest" vs "Encryption in transit"
- "Incident Response Plan" vs "Business Continuity Plan"

**E. Semantically different but identical wording (5 cases):**
- "Einwilligung" (consent under GDPR) vs "Einwilligung" (consent to advertising)
- "Verarbeitung" (data processing) vs "Verarbeitung" (food processing)
- "Risikobewertung" (GDPR DPIA) vs "Risikobewertung" (financial risk)
- "Audit" (data protection) vs "Audit" (finance)
- "Zertifizierung" (ISO 27001) vs "Zertifizierung" (CE marking)

### Format

```yaml
- id: ADV-LIT-001
  category: wrong_legal_basis
  input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
  context: "DSE-Abschnitt ueber Google Analytics"
  expected:
    finding: true
    finding_type: "wrong_legal_basis"
    correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
    reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
  difficulty: medium  # easy / medium / hard
```

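With 30 hand-written cases, a small structural check can catch typos before the cases reach the agent. The sketch below validates one case that has already been parsed from YAML into a dict. The key set and the three difficulty levels are taken from the format above, while the function name `validate_case` and the wording of the messages are assumptions made for this example.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "category", "input", "context", "expected", "difficulty")
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_case(case: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one adversarial case (empty if valid)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in case]
    if "difficulty" in case and case["difficulty"] not in DIFFICULTIES:
        problems.append(f"invalid difficulty: {case['difficulty']!r}")
    expected = case.get("expected")
    if "expected" in case and not (
        isinstance(expected, dict) and isinstance(expected.get("finding"), bool)
    ):
        problems.append("expected.finding must be a boolean")
    return problems


example = {
    "id": "ADV-LIT-001",
    "category": "wrong_legal_basis",
    "input": "Wir verarbeiten Ihre Daten fuer Webanalyse ...",
    "context": "DSE-Abschnitt ueber Google Analytics",
    "expected": {"finding": True, "finding_type": "wrong_legal_basis"},
    "difficulty": "medium",
}
print(validate_case(example))
```

Because it works on plain dicts, a check like this runs without a DB and without the agent, in line with the constraints for new tests.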

---

## Task 3: Regression Harness (C3)

### What to do

1. `conftest.py` with shared fixtures
2. `test_regression.py` with snapshot tests
3. CI/CD quality gate

### conftest.py

```python
# control-pipeline/tests/conftest.py
import os

import pytest


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests; skipped if DATABASE_URL is not set."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set")
    from db.session import SessionLocal
    db = SessionLocal()
    yield db
    db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load 100 random draft controls for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' as assertion
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
```

### test_regression.py

```python
# control-pipeline/tests/test_regression.py
"""
Regression tests: check whether pipeline updates change existing controls.
Requires: DATABASE_URL environment variable
"""


class TestControlStability:
    def test_draft_count_stable(self, db_session):
        """Draft count must stay within the expected range."""
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count}"
        assert count < 200000, f"Draft count too high: {count}"

    def test_no_null_assertions(self, db_session):
        """Nearly all draft controls must have an assertion (fewer than 1000 without)."""
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL OR generation_metadata->>'assertion' = '')"
        )).scalar()
        assert null_count < 1000, f"Too many controls without assertion: {null_count}"

    def test_dependency_graph_valid(self, db_session):
        """Dependency graph must have enough active edges (cycles are not checked yet)."""
        from sqlalchemy import text
        dependency_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert dependency_count > 10000, f"Too few dependencies: {dependency_count}"


class TestQualityGates:
    def test_duplicate_rate(self, db_session):
        pass  # TODO: implement, duplicate_rate < 5%

    def test_evidence_leak_rate(self, db_session):
        pass  # TODO: implement, evidence_leak < 2%
```

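The dependency test in the template only counts active edges. An actual cycle check could use Kahn's topological sort, as sketched below. The edge representation as `(source, target)` tuples is an assumption, since the column names of `compliance.control_dependencies` are not shown here; the rows would have to be mapped to such pairs first.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple


def has_cycle(edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if the directed graph given by (source, target) edges has a cycle."""
    successors: Dict[Hashable, Set[Hashable]] = {}
    indegree: Dict[Hashable, int] = {}
    for source, target in edges:
        indegree.setdefault(source, 0)
        indegree.setdefault(target, 0)
        targets = successors.setdefault(source, set())
        if target not in targets:  # ignore duplicate edges
            targets.add(target)
            indegree[target] += 1

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors.get(node, ()):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes that could never be removed sit on (or behind) a cycle.
    return removed != len(indegree)


print(has_cycle([("C-1", "C-2"), ("C-2", "C-3")]))
print(has_cycle([("C-1", "C-2"), ("C-2", "C-1")]))
```

For roughly 15,000 dependencies this runs in linear time, so it is cheap enough to sit inside the regression suite once the edges are loaded.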

### CI/CD Quality Gate

```yaml
# .gitea/workflows/quality-gate.yml
name: Control Pipeline Quality Gate
on:
  push:
    paths:
      - 'control-pipeline/**'

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Tests
        run: |
          cd control-pipeline
          pip install -r requirements.txt pytest pyyaml
          PYTHONPATH=. pytest tests/ -v --tb=short -x
      - name: Quality Metrics
        run: |
          # Only if the container is running
          curl -sf http://127.0.0.1:8098/v1/canonical/generate/quality-metrics || echo "Pipeline not running, skip metrics"
```

---

## IMPORTANT

- Do NOT change the 221 existing tests
- Do NOT deploy (do not restart containers)
- All new tests must run without a DB (except test_regression.py, which uses a skip marker)
- Create the ground-truth YAML manually (no LLM for the reference data!)
- If anything is unclear: read the memory under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`
@@ -2715,3 +2715,199 @@ async def get_quality_metrics(
        }
    finally:
        db.close()


# =============================================================================
# REVIEW CANDIDATE VERIFICATION (Block B: LLM decides DUPLIKAT/VERSCHIEDEN)
# =============================================================================

_REVIEW_VERIFY_SYSTEM = """Du vergleichst Paare von Compliance Controls und entscheidest ob sie Duplikate sind.
Antworte NUR mit einem JSON-Array. Fuer jedes Paar ein Objekt:
{"pair_id": "...", "decision": "DUPLIKAT" oder "VERSCHIEDEN", "reason": "kurze Begruendung"}
DUPLIKAT = gleiche Anforderung, nur anders formuliert.
VERSCHIEDEN = unterschiedliche Anforderungen, auch wenn aehnliche Woerter vorkommen."""


class ReviewVerifyRequest(BaseModel):
    limit: int = 0
    batch_size: int = 10
    dry_run: bool = True


_review_verify_status: dict = {}


async def _run_review_verify(req: ReviewVerifyRequest, job_id: str):
    from services.decomposition_pass import (
        create_anthropic_batch, fetch_batch_results, check_batch_status,
    )
    import asyncio as aio
    db = SessionLocal()
    try:
        _review_verify_status[job_id] = {"status": "loading"}

        query = """
            SELECT r.id::text, r.candidate_control_id, r.candidate_title,
                   r.matched_control_id, c2.title as matched_title,
                   r.similarity_score
            FROM control_dedup_reviews r
            LEFT JOIN canonical_controls c2 ON c2.id = r.matched_control_uuid
            WHERE r.review_status = 'pending'
            ORDER BY r.similarity_score DESC
        """
        if req.limit > 0:
            query += f" LIMIT {req.limit}"

        rows = db.execute(text(query)).fetchall()
        total = len(rows)
        _review_verify_status[job_id] = {"status": "preparing", "total": total}

        if total == 0:
            _review_verify_status[job_id] = {
                "status": "completed", "total": 0, "message": "No pending reviews",
            }
            return

        if req.dry_run:
            _review_verify_status[job_id] = {
                "status": "dry_run", "total": total,
                "estimated_requests": (total + req.batch_size - 1) // req.batch_size,
            }
            return

        # Build batch requests
        api_requests = []
        pair_map = {}
        for i in range(0, total, req.batch_size):
            batch = rows[i:i + req.batch_size]
            prompt = "Vergleiche diese Control-Paare:\n\n"
            batch_pairs = []
            for r in batch:
                pair_id = r[0][:8]
                prompt += (
                    f"Paar {pair_id}:\n"
                    f"  A: {r[1]} — {r[2]}\n"
                    f"  B: {r[3]} — {r[4]}\n"
                    f"  Similarity: {r[5]:.3f}\n\n"
                )
                batch_pairs.append({"review_id": r[0], "candidate_id": r[1]})

            batch_idx = i // req.batch_size
            custom_id = f"rv_b{batch_idx:05d}"
            pair_map[custom_id] = batch_pairs
            api_requests.append({
                "custom_id": custom_id,
                "params": {
                    "model": "claude-haiku-4-5-20251001",
                    "max_tokens": max(1024, len(batch) * 150),
                    "system": [{
                        "type": "text",
                        "text": _REVIEW_VERIFY_SYSTEM,
                        "cache_control": {"type": "ephemeral"},
                    }],
                    "messages": [{"role": "user", "content": prompt}],
                },
            })

        _review_verify_status[job_id] = {
            "status": "submitting", "total": total, "requests": len(api_requests),
        }
        batch_result = await create_anthropic_batch(api_requests)
        batch_id = batch_result.get("id", "")
        _review_verify_status[job_id] = {
            "status": "batch_submitted", "batch_id": batch_id,
            "total": total, "requests": len(api_requests),
        }

        # Poll for completion
        for _ in range(720):
            await aio.sleep(10)
            status = await check_batch_status(batch_id)
            if status.get("processing_status") == "ended":
                break

        # Process results
        results = await fetch_batch_results(batch_id)
        duplicates = 0
        different = 0
        errors = 0

        for result in results:
            custom_id = result.get("custom_id", "")
            result_data = result.get("result", {})
            if result_data.get("type") != "succeeded":
                errors += 1
                continue

            content = result_data.get("message", {}).get("content", [])
            text_content = content[0].get("text", "") if content else ""

            try:
                import json as jmod
                import re
                json_matches = re.findall(r'\{[^}]+\}', text_content)
                pairs = pair_map.get(custom_id, [])

                for j, match_str in enumerate(json_matches):
                    try:
                        parsed = jmod.loads(match_str)
                    except Exception:
                        continue

                    decision = parsed.get("decision", "").upper()
                    if j < len(pairs):
                        review_id = pairs[j]["review_id"]
                        if "DUPLIKAT" in decision:
                            db.execute(text("""
                                UPDATE control_dedup_reviews
                                SET review_status = 'duplicate', review_notes = :notes
                                WHERE id = CAST(:rid AS uuid)
                            """), {"rid": review_id, "notes": parsed.get("reason", "")})
                            duplicates += 1
                        else:
                            db.execute(text("""
                                UPDATE control_dedup_reviews
                                SET review_status = 'different', review_notes = :notes
                                WHERE id = CAST(:rid AS uuid)
                            """), {"rid": review_id, "notes": parsed.get("reason", "")})
                            different += 1

                db.commit()
            except Exception as e:
                logger.error("Review verify parse error: %s", e)
                errors += 1
                try:
                    db.rollback()
                except Exception:
                    pass

        _review_verify_status[job_id] = {
            "status": "completed", "batch_id": batch_id, "total": total,
            "duplicates": duplicates, "different": different, "errors": errors,
        }
    except Exception as e:
        logger.error("Review verify %s failed: %s", job_id, e)
        _review_verify_status[job_id] = {"status": "failed", "error": str(e)}
    finally:
        db.close()


@router.post("/generate/review-verify")
async def start_review_verify(req: ReviewVerifyRequest):
    """LLM-verify review candidates (DUPLIKAT/VERSCHIEDEN) via Haiku Batch."""
    import uuid as uuid_mod
    job_id = str(uuid_mod.uuid4())[:8]
    _review_verify_status[job_id] = {"status": "starting"}
    asyncio.create_task(_run_review_verify(req, job_id))
    return {
        "status": "running", "job_id": job_id,
        "message": f"Poll /generate/review-verify-status/{job_id}",
    }


@router.get("/generate/review-verify-status/{job_id}")
async def get_review_verify_status(job_id: str):
    status = _review_verify_status.get(job_id)
    if not status:
        raise HTTPException(status_code=404, detail="Review verify job not found")
    return status

@@ -165,21 +165,29 @@ def classify_source_regulation(source_regulation: str) -> str:
|
||||
"""
|
||||
Klassifiziert eine source_regulation als law, guideline oder framework.
|
||||
|
||||
Verwendet exaktes Matching gegen die Map. Bei unbekannten Quellen
|
||||
wird anhand von Schluesselwoertern geraten, Fallback ist 'framework'
|
||||
(konservativstes Ergebnis).
|
||||
Delegates to DB-backed RegulationRegistry (with 5min cache).
|
||||
Falls back to SOURCE_REGULATION_CLASSIFICATION dict + heuristic
|
||||
if DB is unavailable.
|
||||
"""
|
||||
if not source_regulation:
|
||||
return SOURCE_TYPE_FRAMEWORK
|
||||
|
||||
# Exaktes Match
|
||||
# Try DB-backed registry first
|
||||
try:
|
||||
from services.regulation_registry import classify_source_regulation as _db_classify
|
||||
result = _db_classify(source_regulation)
|
||||
if result:
|
||||
return result
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Fallback: local dict
|
||||
if source_regulation in SOURCE_REGULATION_CLASSIFICATION:
|
||||
return SOURCE_REGULATION_CLASSIFICATION[source_regulation]
|
||||
|
||||
# Heuristik fuer unbekannte Quellen
|
||||
lower = source_regulation.lower()
|
||||
|
||||
# Gesetze erkennen
|
||||
law_indicators = [
|
||||
"verordnung", "richtlinie", "gesetz", "directive", "regulation",
|
||||
"(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
|
||||
@@ -187,19 +195,16 @@ def classify_source_regulation(source_regulation: str) -> str:
|
||||
if any(ind in lower for ind in law_indicators):
|
||||
return SOURCE_TYPE_LAW
|
||||
|
||||
# Leitlinien erkennen
|
||||
guideline_indicators = [
|
||||
"edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
|
||||
]
|
||||
if any(ind in lower for ind in guideline_indicators):
|
||||
return SOURCE_TYPE_GUIDELINE
|
||||
|
||||
# Frameworks erkennen
|
||||
framework_indicators = [
|
||||
"enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
|
||||
]
|
||||
if any(ind in lower for ind in framework_indicators):
|
||||
return SOURCE_TYPE_FRAMEWORK
|
||||
|
||||
# Konservativ: unbekannt = framework (geringste Verbindlichkeit)
|
||||
return SOURCE_TYPE_FRAMEWORK
|
||||
|
||||
@@ -0,0 +1,72 @@
-- Migration 002: Regulation Registry (Block F1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/002_regulation_registry.sql

SET search_path TO compliance, public;

-- ========================================
-- regulation_registry
-- ========================================
-- Central registry for all regulations, laws, guidelines, and frameworks
-- referenced by the control pipeline. Replaces hardcoded Python dicts
-- (REGULATION_LICENSE_MAP, SOURCE_REGULATION_CLASSIFICATION).

CREATE TABLE IF NOT EXISTS regulation_registry (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- regulation_id: machine key (e.g. "eu_2016_679", "nist_sp_800_53")
    regulation_id       VARCHAR(100) UNIQUE NOT NULL,

    -- Display names
    regulation_name_de  TEXT,
    regulation_name_en  TEXT,
    regulation_short    VARCHAR(50),

    -- License classification (3-rule system)
    license_rule        INTEGER NOT NULL DEFAULT 1
                        CHECK (license_rule IN (1, 2, 3)),
    license_type        VARCHAR(50),   -- EU_LAW, DE_LAW, CC-BY-SA-4.0, etc.
    attribution         TEXT,          -- Required for Rule 2 (CC-BY)

    -- Source classification
    source_type         VARCHAR(20) NOT NULL DEFAULT 'law'
                        CHECK (source_type IN ('law', 'guideline', 'standard', 'framework', 'restricted')),

    -- Metadata
    jurisdiction        VARCHAR(10),   -- DE, EU, AT, CH, US, FR, ES, NL, IT, HU, INT
    category            VARCHAR(50),
    celex               VARCHAR(30),   -- EU CELEX number if applicable
    url                 TEXT,

    -- Lifecycle
    status              VARCHAR(20) NOT NULL DEFAULT 'active'
                        CHECK (status IN ('active', 'needs_review', 'deprecated')),

    created_at          TIMESTAMPTZ DEFAULT NOW(),
    updated_at          TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_reg_registry_status
    ON regulation_registry(status);
CREATE INDEX IF NOT EXISTS idx_reg_registry_jurisdiction
    ON regulation_registry(jurisdiction);
CREATE INDEX IF NOT EXISTS idx_reg_registry_source_type
    ON regulation_registry(source_type);
CREATE INDEX IF NOT EXISTS idx_reg_registry_license_rule
    ON regulation_registry(license_rule);

-- Updated-at trigger
CREATE OR REPLACE FUNCTION update_regulation_registry_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_regulation_registry_updated_at ON regulation_registry;
CREATE TRIGGER trg_regulation_registry_updated_at
    BEFORE UPDATE ON regulation_registry
    FOR EACH ROW
    EXECUTE FUNCTION update_regulation_registry_updated_at();
@@ -0,0 +1,498 @@
|
||||
#!/usr/bin/env python3
|
||||
"""D6 Citation Backfill — update ~291k controls with section metadata from Qdrant chunks.
|
||||
|
||||
Archives old source_citation in generation_metadata.old_citation.
|
||||
Updates source_citation.article, .paragraph, .page from matched Qdrant chunks.
|
||||
|
||||
3-tier matching:
|
||||
Tier 1: sha256(source_original_text) → exact chunk text match
|
||||
Tier 2: Parse [section] prefix from source_original_text
|
||||
Tier 3: Best text overlap within same regulation_id
|
||||
|
||||
Usage:
|
||||
python3 control-pipeline/scripts/d6_citation_backfill.py --dry-run --limit 100
|
||||
python3 control-pipeline/scripts/d6_citation_backfill.py --batch-size 1000
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
import httpx
|
||||
import psycopg2
|
||||
import psycopg2.extras
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
||||
)
|
||||
logger = logging.getLogger("d6-backfill")
|
||||
|
||||
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
|
||||
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
|
||||
|
||||
COLLECTIONS = [
|
||||
"bp_compliance_ce",
|
||||
"bp_compliance_gesetze",
|
||||
"bp_compliance_datenschutz",
|
||||
"bp_dsfa_corpus",
|
||||
"bp_legal_templates",
|
||||
]
|
||||
|
||||
# Parse [§ 312k Title] or [AC-1 POLICY] prefix from chunk text
|
||||
_SECTION_PREFIX_RE = re.compile(r'^\[([^\]]+)\]\s*')
|
||||
|
||||
|
||||
@dataclass
|
||||
class ChunkMeta:
|
||||
section: str
|
||||
section_title: str
|
||||
paragraph: str
|
||||
page: Optional[int]
|
||||
regulation_id: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class Stats:
|
||||
total: int = 0
|
||||
already_correct: int = 0
|
||||
matched_hash: int = 0
|
||||
matched_prefix: int = 0
|
||||
matched_overlap: int = 0
|
||||
unmatched: int = 0
|
||||
updated: int = 0
|
||||
errors: int = 0
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 1: Build Qdrant index
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def build_qdrant_index(qdrant_url: str) -> tuple[dict, dict]:
|
||||
"""Build hash index and regulation index from all Qdrant collections.
|
||||
|
||||
Returns:
|
||||
hash_index: {sha256(chunk_text) → ChunkMeta}, only for chunks that carry a section
|
||||
reg_index: {regulation_id → [(chunk_text[:500], ChunkMeta), ...]}
|
||||
"""
|
||||
hash_index: dict[str, ChunkMeta] = {}
|
||||
reg_index: dict[str, list[tuple[str, ChunkMeta]]] = {}
|
||||
total_chunks = 0
|
||||
|
||||
for coll in COLLECTIONS:
|
||||
offset = None
|
||||
coll_count = 0
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
while True:
|
||||
body = {
|
||||
"limit": 250,
|
||||
"with_payload": [
|
||||
"chunk_text", "section", "section_title",
|
||||
"paragraph", "page", "regulation_id",
|
||||
],
|
||||
"with_vector": False,
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{qdrant_url}/collections/{coll}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
|
||||
for pt in data["points"]:
|
||||
p = pt.get("payload", {})
|
||||
chunk_text = p.get("chunk_text", "")
|
||||
if not chunk_text or len(chunk_text.strip()) < 30:
|
||||
continue
|
||||
|
||||
meta = ChunkMeta(
|
||||
section=p.get("section", "") or "",
|
||||
section_title=p.get("section_title", "") or "",
|
||||
paragraph=p.get("paragraph", "") or "",
|
||||
page=p.get("page"),
|
||||
regulation_id=p.get("regulation_id", "") or "",
|
||||
)
|
||||
|
||||
# Hash index
|
||||
h = hashlib.sha256(chunk_text.encode()).hexdigest()
|
||||
if meta.section: # only index chunks WITH section data
|
||||
hash_index[h] = meta
|
||||
|
||||
# Regulation index (for text overlap matching)
|
||||
if meta.regulation_id and meta.section:
|
||||
reg_index.setdefault(meta.regulation_id, []).append(
|
||||
(chunk_text[:500], meta)
|
||||
)
|
||||
|
||||
coll_count += 1
|
||||
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
|
||||
total_chunks += coll_count
|
||||
logger.info(" [%s] %d chunks indexed", coll, coll_count)
|
||||
|
||||
logger.info("Qdrant index: %d total chunks, %d with section (hash), %d regulations",
|
||||
total_chunks, len(hash_index), len(reg_index))
|
||||
return hash_index, reg_index
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 2: Load controls
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def load_controls(db_url: str, limit: int = 0) -> list[dict]:
    """Load all controls needing citation update."""
    query = """
        SELECT id, control_id, source_citation, source_original_text,
               generation_metadata, license_rule
        FROM canonical_controls
        WHERE license_rule IN (1, 2)
          AND source_citation IS NOT NULL
        ORDER BY control_id
    """
    params: tuple = ()
    if limit > 0:
        # Bind LIMIT as a query parameter instead of formatting it into the SQL
        query += " LIMIT %s"
        params = (limit,)

    conn = psycopg2.connect(db_url)
    try:
        conn.set_session(autocommit=False)
        cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
        cur.execute("SET search_path TO compliance, core, public")
        cur.execute(query, params or None)
        rows = cur.fetchall()
    finally:
        # Close the connection even if the query fails
        conn.close()

    controls = []
    for row in rows:
        ctrl = dict(row)
        ctrl["id"] = str(ctrl["id"])
        # JSON columns may arrive as str, dict, or NULL; normalise to dict
        for jf in ("source_citation", "generation_metadata"):
            val = ctrl.get(jf)
            if isinstance(val, str):
                try:
                    ctrl[jf] = json.loads(val)
                except (json.JSONDecodeError, TypeError):
                    ctrl[jf] = {}
            elif val is None:
                ctrl[jf] = {}
        controls.append(ctrl)

    return controls
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 3: Matching
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def match_control(
    ctrl: dict,
    hash_index: dict[str, ChunkMeta],
    reg_index: dict[str, list[tuple[str, ChunkMeta]]],
) -> tuple[Optional[ChunkMeta], str]:
    """Match a control to a Qdrant chunk. Returns (meta, method) or (None, '')."""
    source_text = ctrl.get("source_original_text", "") or ""
    if not source_text:
        # Every tier needs the original text
        return None, ""

    # Tier 1: Hash match (exact chunk text)
    h = hashlib.sha256(source_text.encode()).hexdigest()
    meta = hash_index.get(h)
    if meta and meta.section:
        return meta, "hash"

    # Tier 2: Parse [section] prefix from source_original_text.
    # Accept only results that carry a section: the ALL-CAPS fallback returns
    # a title-only ChunkMeta, which the caller discards, so returning it here
    # would skip Tier 3 for no benefit.
    m = _SECTION_PREFIX_RE.match(source_text)
    if m:
        parsed = _parse_section_from_prefix(m.group(1).strip())
        if parsed and parsed.section:
            return parsed, "prefix"

    # Tier 3: Text overlap within same regulation
    gen_meta = ctrl.get("generation_metadata") or {}
    reg_id = gen_meta.get("source_regulation", "")
    if reg_id and reg_id in reg_index:
        best = _find_best_overlap(source_text, reg_index[reg_id])
        if best:
            return best, "overlap"

    return None, ""
|
||||
|
||||
def _parse_section_from_prefix(prefix: str) -> Optional[ChunkMeta]:
|
||||
"""Parse a section prefix like '§ 312k Kuendigungsbutton' or 'AC-1 POLICY'."""
|
||||
if not prefix:
|
||||
return None
|
||||
|
||||
# § pattern
|
||||
m = re.match(r'(§\s*\d+[a-z]*)\s*(.*)', prefix)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# Art./Artikel pattern
|
||||
m = re.match(r'(Art(?:ikel|\.)\s*\d+)\s*(.*)', prefix, re.IGNORECASE)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# NIST control pattern (AC-1, AU-2, etc.)
|
||||
m = re.match(r'([A-Z]{2,4}-\d+(?:\(\d+\))?)\s*(.*)', prefix)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# Numbered section (3.1 Title)
|
||||
m = re.match(r'(\d+(?:\.\d+)+)\s*(.*)', prefix)
|
||||
if m:
|
||||
return ChunkMeta(
|
||||
section=m.group(1).strip(),
|
||||
section_title=m.group(2).strip(),
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
# ALL-CAPS heading (fallback — use as section_title)
|
||||
if prefix == prefix.upper() and len(prefix) > 3:
|
||||
return ChunkMeta(
|
||||
section="", section_title=prefix,
|
||||
paragraph="", page=None, regulation_id="",
|
||||
)
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def _find_best_overlap(source_text: str, chunks: list[tuple[str, ChunkMeta]]) -> Optional[ChunkMeta]:
|
||||
"""Find chunk with best text overlap (simple word-set Jaccard)."""
|
||||
source_words = set(source_text.lower().split())
|
||||
if len(source_words) < 5:
|
||||
return None
|
||||
|
||||
best_score = 0.0
|
||||
best_meta = None
|
||||
|
||||
for chunk_text, meta in chunks:
|
||||
chunk_words = set(chunk_text.lower().split())
|
||||
if not chunk_words:
|
||||
continue
|
||||
intersection = len(source_words & chunk_words)
|
||||
union = len(source_words | chunk_words)
|
||||
jaccard = intersection / union if union > 0 else 0
|
||||
if jaccard > best_score and jaccard > 0.3: # 30% threshold
|
||||
best_score = jaccard
|
||||
best_meta = meta
|
||||
|
||||
return best_meta
|
||||
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Phase 4: Update controls
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def update_controls(
|
||||
db_url: str,
|
||||
controls: list[dict],
|
||||
hash_index: dict[str, ChunkMeta],
|
||||
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
|
||||
dry_run: bool = True,
|
||||
batch_size: int = 1000,
|
||||
) -> Stats:
|
||||
"""Match and update all controls."""
|
||||
stats = Stats(total=len(controls))
|
||||
|
||||
conn = psycopg2.connect(db_url)
|
||||
conn.set_session(autocommit=False)
|
||||
cur = conn.cursor()
|
||||
cur.execute("SET search_path TO compliance, core, public")
|
||||
|
||||
updates = []
|
||||
|
||||
for i, ctrl in enumerate(controls):
|
||||
if i > 0 and i % 5000 == 0:
|
||||
logger.info("Progress: %d/%d (hash=%d prefix=%d overlap=%d unmatched=%d)",
|
||||
i, stats.total, stats.matched_hash, stats.matched_prefix,
|
||||
stats.matched_overlap, stats.unmatched)
|
||||
|
||||
citation = ctrl.get("source_citation") or {}
|
||||
old_article = citation.get("article", "")
|
||||
gen_meta = ctrl.get("generation_metadata") or {}
|
||||
|
||||
# Match
|
||||
meta, method = match_control(ctrl, hash_index, reg_index)
|
||||
|
||||
if not meta or not meta.section:
|
||||
# No match — check if existing article is already good
|
||||
if old_article:
|
||||
stats.already_correct += 1
|
||||
else:
|
||||
stats.unmatched += 1
|
||||
continue
|
||||
|
||||
# Check if update is needed
|
||||
if old_article == meta.section:
|
||||
stats.already_correct += 1
|
||||
continue
|
||||
|
||||
# Track method
|
||||
if method == "hash":
|
||||
stats.matched_hash += 1
|
||||
elif method == "prefix":
|
||||
stats.matched_prefix += 1
|
||||
elif method == "overlap":
|
||||
stats.matched_overlap += 1
|
||||
|
||||
# Archive old citation
|
||||
if old_article or citation.get("paragraph"):
|
||||
gen_meta["old_citation"] = {
|
||||
"article": old_article,
|
||||
"paragraph": citation.get("paragraph", ""),
|
||||
"page": citation.get("page"),
|
||||
"archived_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
|
||||
}
|
||||
|
||||
# Update citation
|
||||
citation["article"] = meta.section
|
||||
if meta.paragraph:
|
||||
citation["paragraph"] = meta.paragraph
|
||||
if meta.page is not None:
|
||||
citation["page"] = meta.page
|
||||
|
||||
# Update generation_metadata
|
||||
gen_meta["source_article"] = meta.section
|
||||
if meta.paragraph:
|
||||
gen_meta["source_paragraph"] = meta.paragraph
|
||||
if meta.page is not None:
|
||||
gen_meta["source_page"] = meta.page
|
||||
gen_meta["backfill_method"] = method
|
||||
gen_meta["backfill_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
|
||||
updates.append((
|
||||
json.dumps(citation, ensure_ascii=False),
|
||||
json.dumps(gen_meta, ensure_ascii=False, default=str),
|
||||
ctrl["id"],
|
||||
))
|
||||
|
||||
# Batch commit
|
||||
if len(updates) >= batch_size and not dry_run:
|
||||
_execute_batch(cur, updates)
|
||||
conn.commit()
|
||||
stats.updated += len(updates)
|
||||
logger.info("Committed batch: %d updates (total %d)", len(updates), stats.updated)
|
||||
updates = []
|
||||
|
||||
# Final batch
|
||||
if updates and not dry_run:
|
||||
_execute_batch(cur, updates)
|
||||
conn.commit()
|
||||
stats.updated += len(updates)
|
||||
logger.info("Committed final batch: %d updates (total %d)", len(updates), stats.updated)
|
||||
elif updates and dry_run:
|
||||
stats.updated = len(updates) # would-be updates
|
||||
|
||||
conn.close()
|
||||
return stats
|
||||
|
||||
|
||||
def _execute_batch(cur, updates: list[tuple]):
    """Execute batch UPDATE statements with fewer server round trips."""
    # execute_batch groups the statements into pages instead of sending
    # one round trip per row; each tuple is (citation_json, meta_json, ctrl_id).
    psycopg2.extras.execute_batch(
        cur,
        """UPDATE canonical_controls
           SET source_citation = %s::jsonb,
               generation_metadata = %s::jsonb,
               updated_at = NOW()
           WHERE id = %s::uuid""",
        updates,
        page_size=500,
    )
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Main
|
||||
# -------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="D6 Citation Backfill")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Don't write to DB")
|
||||
parser.add_argument("--limit", type=int, default=0, help="Limit controls (0=all)")
|
||||
parser.add_argument("--batch-size", type=int, default=1000)
|
||||
parser.add_argument("--db-url", default=DB_URL)
|
||||
parser.add_argument("--qdrant-url", default=QDRANT_URL)
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("D6 Citation Backfill")
|
||||
logger.info(" DB: %s", args.db_url.split("@")[-1])
|
||||
logger.info(" Qdrant: %s", args.qdrant_url)
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info(" Limit: %s", args.limit or "ALL")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Phase 1: Build Qdrant index
|
||||
logger.info("\nPhase 1: Building Qdrant index...")
|
||||
t0 = time.time()
|
||||
hash_index, reg_index = build_qdrant_index(args.qdrant_url)
|
||||
logger.info("Index built in %.1fs", time.time() - t0)
|
||||
|
||||
# Phase 2: Load controls
|
||||
logger.info("\nPhase 2: Loading controls...")
|
||||
controls = load_controls(args.db_url, args.limit)
|
||||
logger.info("Loaded %d controls", len(controls))
|
||||
|
||||
if not controls:
|
||||
logger.info("No controls to process")
|
||||
return
|
||||
|
||||
# Phase 3+4: Match and update
|
||||
logger.info("\nPhase 3+4: Matching and updating...")
|
||||
t0 = time.time()
|
||||
stats = update_controls(
|
||||
args.db_url, controls, hash_index, reg_index,
|
||||
dry_run=args.dry_run, batch_size=args.batch_size,
|
||||
)
|
||||
elapsed = time.time() - t0
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
logger.info(" Total controls: %d", stats.total)
|
||||
logger.info(" Already correct: %d (%.1f%%)", stats.already_correct,
|
||||
stats.already_correct / max(stats.total, 1) * 100)
|
||||
logger.info(" Matched (hash): %d (%.1f%%)", stats.matched_hash,
|
||||
stats.matched_hash / max(stats.total, 1) * 100)
|
||||
logger.info(" Matched (prefix): %d (%.1f%%)", stats.matched_prefix,
|
||||
stats.matched_prefix / max(stats.total, 1) * 100)
|
||||
logger.info(" Matched (overlap): %d (%.1f%%)", stats.matched_overlap,
|
||||
stats.matched_overlap / max(stats.total, 1) * 100)
|
||||
logger.info(" Unmatched: %d (%.1f%%)", stats.unmatched,
|
||||
stats.unmatched / max(stats.total, 1) * 100)
|
||||
logger.info(" Updated: %d", stats.updated)
|
||||
logger.info(" Errors: %d", stats.errors)
|
||||
logger.info(" Time: %.1fs (%.0f controls/sec)", elapsed,
|
||||
stats.total / max(elapsed, 1))
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("\nDRY RUN — no changes written. Run without --dry-run to apply.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
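Review note on the backfill script above: the three matching tiers (exact hash, `[section]` prefix, word-set Jaccard overlap) can be sketched in isolation as follows. This is a simplified illustration, not the script's code. The names `jaccard` and `match` are hypothetical, sections are plain strings rather than `ChunkMeta` objects, and the prefix is returned unparsed. The 0.3 threshold is taken from `_find_best_overlap`.

```python
import hashlib
import re
from typing import Optional

# Same shape as the script's prefix pattern: a leading "[...]" block
SECTION_PREFIX = re.compile(r"^\[([^\]]+)\]\s*")


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity of two texts (case-insensitive)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def match(
    source: str,
    hash_index: dict[str, str],
    candidates: list[tuple[str, str]],
    threshold: float = 0.3,
) -> tuple[Optional[str], str]:
    """Return (section, method). candidates holds (chunk_text, section) pairs."""
    # Tier 1: exact text match via SHA-256 digest
    digest = hashlib.sha256(source.encode()).hexdigest()
    if digest in hash_index:
        return hash_index[digest], "hash"

    # Tier 2: leading "[...]" prefix (hypothetical simplification: no parsing)
    m = SECTION_PREFIX.match(source)
    if m:
        return m.group(1).strip(), "prefix"

    # Tier 3: best-scoring candidate strictly above the threshold
    best_section, best_score = None, threshold
    for chunk_text, section in candidates:
        score = jaccard(source, chunk_text)
        if score > best_score:
            best_section, best_score = section, score
    if best_section is not None:
        return best_section, "overlap"
    return None, ""
```

The tiers run from cheapest and most reliable to most expensive and least reliable, so a fuzzy overlap match is only attempted when neither exact evidence nor an explicit prefix is available.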
@@ -0,0 +1,280 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Extract large NIST PDFs locally, then upload as .txt to RAG service.
|
||||
|
||||
Workaround for embedding-service container crashing on large PDFs (>5 MB).
|
||||
Runs pdfplumber + normalization locally, uploads extracted text as .txt.
|
||||
|
||||
Usage (on Mac Mini):
|
||||
python3 control-pipeline/scripts/extract_and_upload_nist.py
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import tempfile
|
||||
import unicodedata
|
||||
|
||||
import httpx
|
||||
import pdfplumber
|
||||
|
||||
RAG_URL = "https://localhost:8097"
|
||||
QDRANT_URL = "http://localhost:6333"
|
||||
|
||||
DOCS = [
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "NIST_SP_800_53r5.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_53r5",
|
||||
"source_id": "nist",
|
||||
"doc_type": "controls_catalog",
|
||||
"guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
|
||||
"license": "public_domain_us_gov",
|
||||
"attribution": "NIST",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "nist_sp_800_82r3.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_82r3",
|
||||
"regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
|
||||
"regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
|
||||
"regulation_short": "NIST SP 800-82",
|
||||
"category": "ot_security",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
|
||||
"collection": "bp_compliance_ce",
|
||||
"filename": "nist_sp_800_160v1r1.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_160v1r1",
|
||||
"regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
|
||||
"regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
|
||||
"regulation_short": "NIST SP 800-160",
|
||||
"category": "security_engineering",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"filename": "NIST_SP_800_207.txt",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_207",
|
||||
"source_id": "nist",
|
||||
"doc_type": "architecture",
|
||||
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
|
||||
"license": "public_domain_us_gov",
|
||||
"attribution": "NIST",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def normalize_pdf_text(text: str) -> str:
|
||||
"""Fix broken spacing from multi-column PDF extraction."""
|
||||
text = unicodedata.normalize('NFKC', text)
|
||||
text = text.replace('\u00ad', '').replace('\u200b', '')
|
||||
prev = None
|
||||
while prev != text:
|
||||
prev = text
|
||||
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
|
||||
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
|
||||
text = re.sub(
|
||||
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
|
||||
)
|
||||
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
|
||||
text = re.sub(r'[^\S\n]{2,}', ' ', text)
|
||||
return text
|
||||
|
||||
|
||||
def extract_pdf_locally(pdf_bytes: bytes) -> str:
|
||||
"""Extract text from PDF using pdfplumber with normalization."""
|
||||
import io
|
||||
text_parts = []
|
||||
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
|
||||
print(f" Pages: {len(pdf.pages)}")
|
||||
for i, page in enumerate(pdf.pages):
|
||||
text = page.extract_text(x_tolerance=3, y_tolerance=4)
|
||||
if text:
|
||||
text_parts.append(text)
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f" Extracted {i + 1}/{len(pdf.pages)} pages...")
|
||||
raw = "\n\n".join(text_parts)
|
||||
return normalize_pdf_text(raw)
|
||||
|
||||
|
||||
def download_from_minio(object_name: str) -> bytes:
|
||||
"""Download file from MinIO via RAG service."""
|
||||
with httpx.Client(timeout=60.0, verify=False) as c:
|
||||
resp = c.get(f"{RAG_URL}/api/v1/documents/download/{object_name}")
|
||||
resp.raise_for_status()
|
||||
url = resp.json()["url"]
|
||||
with httpx.Client(timeout=300.0, verify=False) as c:
|
||||
resp = c.get(url)
|
||||
resp.raise_for_status()
|
||||
return resp.content
|
||||
|
||||
|
||||
def upload_text(
|
||||
text: str, filename: str, collection: str, extra_metadata: dict,
|
||||
) -> dict:
|
||||
"""Upload extracted text to RAG service as .txt."""
|
||||
form_data = {
|
||||
"collection": collection,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "recursive",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
|
||||
}
|
||||
text_bytes = text.encode("utf-8")
|
||||
with httpx.Client(timeout=1800.0, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, text_bytes, "text/plain")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def count_chunks(collection: str, regulation_id: str) -> int:
|
||||
"""Count chunks for a regulation in Qdrant."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/count",
|
||||
json={
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "regulation_id",
|
||||
"match": {"value": regulation_id},
|
||||
}]
|
||||
},
|
||||
"exact": True,
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def check_section_rate(collection: str, regulation_id: str) -> tuple:
|
||||
"""Returns (total_chunks, chunks_with_section)."""
|
||||
total = 0
|
||||
with_sec = 0
|
||||
offset = None
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "regulation_id",
|
||||
"match": {"value": regulation_id},
|
||||
}]
|
||||
},
|
||||
"limit": 100,
|
||||
"with_payload": ["section"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
total += 1
|
||||
s = pt.get("payload", {}).get("section", "")
|
||||
if s and s.strip():
|
||||
with_sec += 1
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return total, with_sec
|
||||
|
||||
|
||||
def main():
|
||||
print("=" * 60)
|
||||
print("NIST PDF Local Extraction + Upload")
|
||||
print("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, doc in enumerate(DOCS, 1):
|
||||
reg_id = doc["extra_metadata"]["regulation_id"]
|
||||
print(f"\n[{i}/{len(DOCS)}] {doc['filename']} → {doc['collection']}")
|
||||
|
||||
# 1. Check current state
|
||||
existing = count_chunks(doc["collection"], reg_id)
|
||||
print(f" Existing chunks: {existing}")
|
||||
|
||||
# 2. Download PDF from MinIO
|
||||
print(f" Downloading from MinIO...")
|
||||
pdf_bytes = download_from_minio(doc["object_name"])
|
||||
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
|
||||
|
||||
# 3. Extract text locally with pdfplumber
|
||||
print(f" Extracting text locally...")
|
||||
text = extract_pdf_locally(pdf_bytes)
|
||||
print(f" Extracted {len(text):,} chars, {text.count(chr(10)):,} lines")
|
||||
|
||||
# 4. Save extracted text temporarily (for debugging)
|
||||
tmp_path = os.path.join(tempfile.gettempdir(), f"nist_{reg_id}.txt")
|
||||
with open(tmp_path, "w", encoding="utf-8") as f:
|
||||
f.write(text)
|
||||
print(f" Saved to {tmp_path}")
|
||||
|
||||
# 5. Upload as .txt
|
||||
print(f" Uploading as .txt to RAG service...")
|
||||
result = upload_text(text, doc["filename"], doc["collection"],
|
||||
doc["extra_metadata"])
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
new_doc_id = result.get("document_id", "")
|
||||
print(f" Uploaded: {new_chunks} chunks (doc_id={new_doc_id})")
|
||||
|
||||
# 6. Check section rate
|
||||
if new_chunks > 0:
|
||||
total, with_sec = check_section_rate(doc["collection"], reg_id)
|
||||
pct = (with_sec / total * 100) if total > 0 else 0
|
||||
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
|
||||
else:
|
||||
pct = 0
|
||||
print(" WARNING: 0 chunks created!")
|
||||
|
||||
results.append({
|
||||
"file": doc["filename"],
|
||||
"old": existing,
|
||||
"new": new_chunks,
|
||||
"section_rate": round(pct, 1),
|
||||
})
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("RESULTS")
|
||||
print("=" * 60)
|
||||
for r in results:
|
||||
print(f" {r['file']:<40} old={r['old']} new={r['new']} sect={r['section_rate']}%")
|
||||
|
||||
total_new = sum(r["new"] for r in results)
|
||||
print(f"\nTotal new chunks: {total_new}")
|
||||
|
||||
if any(r["new"] == 0 for r in results):
|
||||
print("\nWARNING: Some documents produced 0 chunks!")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
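Review note on `normalize_pdf_text` above: the function re-applies its substitutions until the text stops changing. The reason is that `re.sub` replaces non-overlapping matches only, so a broken identifier such as `3 . 1 . 2` needs a second pass. The sketch below reproduces a subset of the script's patterns in a hypothetical rule table (`RULES`, `normalize`) to show that fixed-point loop. It is an illustration under those assumptions, not a replacement for the script's function.

```python
import re
import unicodedata

# Subset of the script's repair patterns, as (compiled pattern, replacement)
RULES = [
    (re.compile(r"(\d+)\s+\.\s+(\d+)"), r"\1.\2"),             # "3 . 1"  -> "3.1"
    (re.compile(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b"), r"\1-\2"),   # "AC - 1" -> "AC-1"
    (re.compile(r"\(\s+(\d+)\s+\)"), r"(\1)"),                 # "( 3 )"  -> "(3)"
    (re.compile(r"[^\S\n]{2,}"), " "),                         # collapse runs of blanks
]


def normalize(text: str) -> str:
    """Apply all rules repeatedly until a pass leaves the text unchanged."""
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens and zero-width spaces left over from PDF extraction
    text = text.replace("\u00ad", "").replace("\u200b", "")
    prev = None
    while prev != text:
        prev = text
        for pattern, repl in RULES:
            text = pattern.sub(repl, text)
    return text


if __name__ == "__main__":
    for sample in ["AC - 1  POLICY", "Section 3 . 1 . 2", "AU-2( 3 )"]:
        print(repr(sample), "->", repr(normalize(sample)))
```

Newlines are deliberately excluded from the whitespace-collapsing rule, so paragraph breaks in the extracted text survive normalization.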
@@ -0,0 +1,247 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
F1 Migration: Populate regulation_registry from hardcoded Python dicts.
|
||||
|
||||
Sources:
|
||||
- REGULATION_LICENSE_MAP (control_generator.py) — 135 entries keyed by regulation_id
|
||||
- SOURCE_REGULATION_CLASSIFICATION (source_type_classification.py) — 58 entries keyed by name
|
||||
|
||||
Usage:
|
||||
# Dry run (prints SQL, no DB write):
|
||||
python3 scripts/f1_migrate_regulation_registry.py --dry-run
|
||||
|
||||
# Against Mac Mini:
|
||||
python3 scripts/f1_migrate_regulation_registry.py --db-host macmini
|
||||
|
||||
# Against local Docker:
|
||||
python3 scripts/f1_migrate_regulation_registry.py --db-host localhost
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Add parent so we can import from services/data
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
from services.control_generator import REGULATION_LICENSE_MAP  # noqa: E402
|
||||
from data.source_type_classification import SOURCE_REGULATION_CLASSIFICATION # noqa: E402
|
||||
|
||||
# Derive jurisdiction from license_type
|
||||
_LICENSE_TO_JURISDICTION = {
|
||||
"EU_LAW": "EU",
|
||||
"EU_PUBLIC": "EU",
|
||||
"DE_LAW": "DE",
|
||||
"DE_PUBLIC": "DE",
|
||||
"AT_LAW": "AT",
|
||||
"CH_LAW": "CH",
|
||||
"FR_LAW": "FR",
|
||||
"ES_LAW": "ES",
|
||||
"NL_LAW": "NL",
|
||||
"IT_LAW": "IT",
|
||||
"HU_LAW": "HU",
|
||||
"NIST_PUBLIC_DOMAIN": "US",
|
||||
"US_GOV_PUBLIC": "US",
|
||||
"CC-BY-SA-4.0": "INT",
|
||||
"CC-BY-4.0": "INT",
|
||||
"OECD_PUBLIC": "INT",
|
||||
}
|
||||
|
||||
|
||||
def _derive_jurisdiction(license_type: str) -> str:
|
||||
"""Map license_type to jurisdiction code."""
|
||||
return _LICENSE_TO_JURISDICTION.get(license_type, "INT")
|
||||
|
||||
|
||||
def build_rows() -> list[dict]:
|
||||
"""Merge REGULATION_LICENSE_MAP + SOURCE_REGULATION_CLASSIFICATION into rows."""
|
||||
rows = []
|
||||
# Track names we've seen (for dedup against SOURCE_REGULATION_CLASSIFICATION)
|
||||
seen_names: set[str] = set()
|
||||
|
||||
# 1) Primary source: REGULATION_LICENSE_MAP (has regulation_id as key)
|
||||
for reg_id, info in REGULATION_LICENSE_MAP.items():
|
||||
name = info.get("name", reg_id)
|
||||
seen_names.add(name)
|
||||
|
||||
rows.append({
|
||||
"regulation_id": reg_id.lower().strip(),
|
||||
"regulation_name_de": name,
|
||||
"license_rule": info["rule"],
|
||||
"license_type": info.get("license", ""),
|
||||
"attribution": info.get("attribution"),
|
||||
"source_type": info.get("source_type", "law"),
|
||||
"jurisdiction": _derive_jurisdiction(info.get("license", "")),
|
||||
"status": "active",
|
||||
})
|
||||
|
||||
# 2) Secondary: SOURCE_REGULATION_CLASSIFICATION entries not already covered
|
||||
# These are keyed by name, not by regulation_id. We create synthetic IDs.
|
||||
for name, source_type in SOURCE_REGULATION_CLASSIFICATION.items():
|
||||
if name in seen_names:
|
||||
continue
|
||||
# Generate a regulation_id from the name
|
||||
synthetic_id = (
|
||||
name.lower()
|
||||
.replace(" ", "_")
|
||||
.replace("(", "")
|
||||
.replace(")", "")
|
||||
.replace("/", "_")
|
||||
.replace("-", "_")
|
||||
.replace(".", "")
|
||||
.replace(",", "")
|
||||
.replace("ä", "ae")
|
||||
.replace("ö", "oe")
|
||||
.replace("ü", "ue")
|
||||
.replace("á", "a")
|
||||
.replace("é", "e")
|
||||
.replace("ó", "o")
|
||||
.strip("_")
|
||||
)[:100]
|
||||
|
||||
# Guess jurisdiction from name content
|
||||
jurisdiction = "INT"
|
||||
name_lower = name.lower()
|
||||
if any(x in name_lower for x in ["edpb", "edps", "(eu)", "eu ", "wp2"]):
|
||||
jurisdiction = "EU"
|
||||
elif any(x in name_lower for x in ["bsi", "bdsg", "bundes", "gwg"]):
|
||||
jurisdiction = "DE"
|
||||
elif "nist" in name_lower or "cisa" in name_lower:
|
||||
jurisdiction = "US"
|
||||
elif "österreich" in name_lower:
|
||||
jurisdiction = "AT"
|
||||
elif "schweiz" in name_lower:
|
||||
jurisdiction = "CH"
|
||||
elif "spanien" in name_lower:
|
||||
jurisdiction = "ES"
|
||||
elif "frankreich" in name_lower:
|
||||
jurisdiction = "FR"
|
||||
elif "ungarn" in name_lower:
|
||||
jurisdiction = "HU"
|
||||
|
||||
# Map source_type_classification's "framework" to our "standard"
|
||||
# (source_type_classification uses law/guideline/framework)
|
||||
mapped_source_type = source_type
|
||||
if source_type == "framework":
|
||||
mapped_source_type = "standard"
|
||||
|
||||
rows.append({
|
||||
"regulation_id": synthetic_id,
|
||||
"regulation_name_de": name,
|
||||
"license_rule": 1, # default: conservative
|
||||
"license_type": "",
|
||||
"attribution": None,
|
||||
"source_type": mapped_source_type,
|
||||
"jurisdiction": jurisdiction,
|
||||
"status": "needs_review", # needs manual review since we guessed
|
||||
})
|
||||
|
||||
return rows
|
||||
|
||||
|
||||
def generate_sql(rows: list[dict]) -> str:
    """Generate INSERT SQL for all rows."""
    lines = [
        "SET search_path TO compliance, public;",
        "",
        "-- Auto-generated by f1_migrate_regulation_registry.py",
        f"-- {len(rows)} rows total",
        "",
    ]

    def lit(val) -> str:
        """Render a value as a quoted, escaped SQL string literal (or NULL)."""
        return "NULL" if val is None else f"'{_escape_sql(str(val))}'"

    for row in rows:
        # Every string value is escaped, not only the name: attribution or
        # license text containing a single quote would otherwise break the SQL.
        # An empty attribution is written as NULL, as before.
        attr = lit(row["attribution"] or None)
        lines.append(
            f"INSERT INTO regulation_registry "
            f"(regulation_id, regulation_name_de, license_rule, license_type, "
            f"attribution, source_type, jurisdiction, status) "
            f"VALUES ("
            f"{lit(row['regulation_id'])}, "
            f"{lit(row['regulation_name_de'])}, "
            f"{row['license_rule']}, "
            f"{lit(row['license_type'])}, "
            f"{attr}, "
            f"{lit(row['source_type'])}, "
            f"{lit(row['jurisdiction'])}, "
            f"{lit(row['status'])}"
            f") ON CONFLICT (regulation_id) DO UPDATE SET "
            f"regulation_name_de = EXCLUDED.regulation_name_de, "
            f"license_rule = EXCLUDED.license_rule, "
            f"license_type = EXCLUDED.license_type, "
            f"attribution = EXCLUDED.attribution, "
            f"source_type = EXCLUDED.source_type, "
            f"jurisdiction = EXCLUDED.jurisdiction;"
        )

    return "\n".join(lines)
|
||||
|
||||
def _escape_sql(val: str) -> str:
|
||||
"""Escape single quotes for SQL."""
|
||||
return val.replace("'", "''")
|
||||
|
||||
|
||||
def insert_via_sqlalchemy(rows: list[dict], db_host: str) -> int:
    """Insert rows using SQLAlchemy (same pattern as control-pipeline)."""
    from sqlalchemy import create_engine, text

    url = f"postgresql://breakpilot:breakpilot123@{db_host}:5432/breakpilot_db"
    engine = create_engine(url)

    upsert = text("""
        INSERT INTO regulation_registry
            (regulation_id, regulation_name_de, license_rule, license_type,
             attribution, source_type, jurisdiction, status)
        VALUES
            (:regulation_id, :regulation_name_de, :license_rule, :license_type,
             :attribution, :source_type, :jurisdiction, :status)
        ON CONFLICT (regulation_id) DO UPDATE SET
            regulation_name_de = EXCLUDED.regulation_name_de,
            license_rule = EXCLUDED.license_rule,
            license_type = EXCLUDED.license_type,
            attribution = EXCLUDED.attribution,
            source_type = EXCLUDED.source_type,
            jurisdiction = EXCLUDED.jurisdiction
    """)

    inserted = 0
    # engine.begin() commits on success and rolls back on error, so a failed
    # row does not leave a half-applied migration behind.
    with engine.begin() as conn:
        conn.execute(text("SET search_path TO compliance, public"))
        for row in rows:
            conn.execute(upsert, row)
            inserted += 1

    return inserted
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Migrate regulation registry data")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Print SQL only")
|
||||
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
|
||||
args = parser.parse_args()
|
||||
|
||||
rows = build_rows()
|
||||
print(f"Built {len(rows)} rows from hardcoded dicts")
|
||||
|
||||
# Stats
|
||||
by_rule = {}
|
||||
by_status = {}
|
||||
for r in rows:
|
||||
by_rule[r["license_rule"]] = by_rule.get(r["license_rule"], 0) + 1
|
||||
by_status[r["status"]] = by_status.get(r["status"], 0) + 1
|
||||
print(f" By license_rule: {by_rule}")
|
||||
print(f" By status: {by_status}")
|
||||
|
||||
if args.dry_run:
|
||||
print("\n--- DRY RUN (SQL output) ---\n")
|
||||
print(generate_sql(rows))
|
||||
return
|
||||
|
||||
inserted = insert_via_sqlalchemy(rows, args.db_host)
|
||||
print(f"Inserted/updated {inserted} rows into regulation_registry")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
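Review note on the upsert above: `status` is part of the INSERT column list but not of the `DO UPDATE SET` list, so re-running the migration should leave an already reviewed row's status alone while refreshing the other columns. The sketch below illustrates that behaviour with an in-memory SQLite database standing in for PostgreSQL. The reduced table schema, the `upsert_rows` helper and the sample values are assumptions made for illustration and are not taken from the migration script.

```python
import sqlite3

# Hypothetical, reduced version of the registry table (the real schema lives in the migration).
SCHEMA = """
CREATE TABLE regulation_registry (
    regulation_id TEXT PRIMARY KEY,
    regulation_name_de TEXT,
    license_rule INTEGER,
    status TEXT
)
"""

# Same shape as the statement in insert_via_sqlalchemy(): status is inserted, never updated.
UPSERT = """
INSERT INTO regulation_registry (regulation_id, regulation_name_de, license_rule, status)
VALUES (:regulation_id, :regulation_name_de, :license_rule, :status)
ON CONFLICT (regulation_id) DO UPDATE SET
    regulation_name_de = excluded.regulation_name_de,
    license_rule = excluded.license_rule
"""


def upsert_rows(conn: sqlite3.Connection, rows: list) -> int:
    """Insert or update each row with bound parameters; return the number of rows processed."""
    for row in rows:
        conn.execute(UPSERT, row)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

first = {
    "regulation_id": "de_arbzg",
    "regulation_name_de": "Arbeitszeitgesetz",
    "license_rule": 1,
    "status": "needs_review",
}
upsert_rows(conn, [first])

# A reviewer approves the entry between two migration runs.
conn.execute("UPDATE regulation_registry SET status = 'active' WHERE regulation_id = 'de_arbzg'")

# The second run carries a corrected name and the default status again.
second = dict(first, regulation_name_de="Arbeitszeitgesetz (ArbZG)")
upsert_rows(conn, [second])

stored = conn.execute(
    "SELECT regulation_name_de, status FROM regulation_registry"
).fetchall()
```

Because the values are bound as parameters, no manual quote escaping is needed on this path; `_escape_sql` only matters for the `--dry-run` SQL text.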
@@ -0,0 +1,240 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Ingest missing German laws from gesetze-im-internet.de.
|
||||
|
||||
Downloads full HTML, strips to text, uploads with legal chunking strategy.
|
||||
Handles ISO-8859-1 charset typical for gesetze-im-internet.de.
|
||||
|
||||
Usage (on Mac Mini):
|
||||
python3 control-pipeline/scripts/ingest_de_laws.py --dry-run
|
||||
python3 control-pipeline/scripts/ingest_de_laws.py
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
from typing import Optional
|
||||
|
||||
import httpx
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("ingest-laws")
|
||||
|
||||
RAG_URL = "https://localhost:8097"
|
||||
QDRANT_URL = "http://localhost:6333"
|
||||
COLLECTION = "bp_compliance_gesetze"
|
||||
|
||||
# ---- Laws to ingest ----
|
||||
# Format: (slug on gesetze-im-internet.de, regulation_id, display_name)
|
||||
# URL pattern: https://www.gesetze-im-internet.de/{slug}/BJNR*.html (full text)
|
||||
|
||||
LAWS = [
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/arbzg/BJNR117100994.html",
|
||||
"regulation_id": "de_arbzg",
|
||||
"name": "Arbeitszeitgesetz (ArbZG)",
|
||||
"short": "ArbZG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/muschg_2018/BJNR122810017.html",
|
||||
"regulation_id": "de_muschg",
|
||||
"name": "Mutterschutzgesetz (MuSchG)",
|
||||
"short": "MuSchG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/nachwg/BJNR094610995.html",
|
||||
"regulation_id": "de_nachwg",
|
||||
"name": "Nachweisgesetz (NachwG)",
|
||||
"short": "NachwG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/milog/BJNR134810014.html",
|
||||
"regulation_id": "de_milog",
|
||||
"name": "Mindestlohngesetz (MiLoG)",
|
||||
"short": "MiLoG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/gmbhg/BJNR004770892.html",
|
||||
"regulation_id": "de_gmbhg",
|
||||
"name": "GmbH-Gesetz (GmbHG)",
|
||||
"short": "GmbHG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/aktg/BJNR010890965.html",
|
||||
"regulation_id": "de_aktg",
|
||||
"name": "Aktiengesetz (AktG)",
|
||||
"short": "AktG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/inso/BJNR286600994.html",
|
||||
"regulation_id": "de_inso",
|
||||
"name": "Insolvenzordnung (InsO)",
|
||||
"short": "InsO",
|
||||
},
|
||||
# BEG IV is an amending act; it has no standalone text on gesetze-im-internet.de
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/verpflg/BJNR009690974.html",
|
||||
"regulation_id": "de_verpflichtungsgesetz",
|
||||
"name": "Verpflichtungsgesetz",
|
||||
"short": "VerpflG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/burlg/BJNR000020963.html",
|
||||
"regulation_id": "de_burlg",
|
||||
"name": "Bundesurlaubsgesetz (BUrlG)",
|
||||
"short": "BUrlG",
|
||||
},
|
||||
{
|
||||
"url": "https://www.gesetze-im-internet.de/entgfg/BJNR118010994.html",
|
||||
"regulation_id": "de_entgfg",
|
||||
"name": "Entgeltfortzahlungsgesetz (EntgFG)",
|
||||
"short": "EntgFG",
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def download_law(url: str) -> Optional[str]:
|
||||
"""Download law HTML from gesetze-im-internet.de, handle charset."""
|
||||
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
|
||||
resp = c.get(url)
|
||||
if resp.status_code != 200:
|
||||
logger.error(" HTTP %d for %s", resp.status_code, url)
|
||||
return None
|
||||
|
||||
# gesetze-im-internet.de uses ISO-8859-1
|
||||
content_type = resp.headers.get("content-type", "")
|
||||
if "charset" in content_type:
|
||||
# Use declared charset
|
||||
html = resp.text
|
||||
else:
|
||||
# Try UTF-8 first, fall back to ISO-8859-1
|
||||
try:
|
||||
html = resp.content.decode("utf-8")
|
||||
if "\ufffd" in html:
|
||||
raise UnicodeDecodeError("utf-8", b"", 0, 1, "replacement chars")
|
||||
except (UnicodeDecodeError, ValueError):
|
||||
html = resp.content.decode("iso-8859-1")
|
||||
|
||||
return html
|
||||
|
||||
|
||||
def upload_html(
|
||||
html: str,
|
||||
filename: str,
|
||||
regulation_id: str,
|
||||
name: str,
|
||||
short: str,
|
||||
dry_run: bool = False,
|
||||
) -> Optional[dict]:
|
||||
"""Upload HTML to RAG service with legal chunking."""
|
||||
if dry_run:
|
||||
logger.info(" DRY RUN — would upload %d chars", len(html))
|
||||
return {"chunks_count": 0, "document_id": "dry-run"}
|
||||
|
||||
meta = {
|
||||
"regulation_id": regulation_id,
|
||||
"regulation_name_de": name,
|
||||
"regulation_short": short,
|
||||
"source": "gesetze-im-internet.de",
|
||||
"license": "public_domain_de_law",
|
||||
"jurisdiction": "DE",
|
||||
"source_type": "law",
|
||||
}
|
||||
form_data = {
|
||||
"collection": COLLECTION,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "legal",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(meta, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=600.0, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, html.encode("utf-8"), "text/html")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def count_existing(regulation_id: str) -> int:
|
||||
"""Check if regulation already exists in Qdrant."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
|
||||
json={
|
||||
"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]},
|
||||
"exact": True,
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Ingest DE laws from gesetze-im-internet.de")
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("Ingest German Laws")
|
||||
logger.info(" Laws: %d", len(LAWS))
|
||||
logger.info(" Collection: %s", COLLECTION)
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, law in enumerate(LAWS, 1):
|
||||
logger.info("\n[%d/%d] %s (%s)", i, len(LAWS), law["name"], law["regulation_id"])
|
||||
|
||||
# Check if already exists
|
||||
existing = count_existing(law["regulation_id"])
|
||||
if existing > 0:
|
||||
logger.info(" Already exists: %d chunks — SKIPPING", existing)
|
||||
results.append({"law": law["short"], "status": "exists", "chunks": existing})
|
||||
continue
|
||||
|
||||
# Download
|
||||
logger.info(" Downloading: %s", law["url"])
|
||||
html = download_law(law["url"])
|
||||
if not html:
|
||||
results.append({"law": law["short"], "status": "download_failed", "chunks": 0})
|
||||
continue
|
||||
logger.info(" Downloaded: %d chars", len(html))
|
||||
|
||||
# Upload
|
||||
filename = f"{law['regulation_id']}.html"
|
||||
try:
|
||||
result = upload_html(
|
||||
html, filename, law["regulation_id"],
|
||||
law["name"], law["short"], args.dry_run,
|
||||
)
|
||||
chunks = result.get("chunks_count", 0) if result else 0
|
||||
logger.info(" Uploaded: %d chunks", chunks)
|
||||
results.append({"law": law["short"], "status": "ok", "chunks": chunks})
|
||||
except Exception as e:
|
||||
logger.error(" Upload FAILED: %s", e)
|
||||
results.append({"law": law["short"], "status": "error", "chunks": 0})
|
||||
|
||||
if i < len(LAWS):
|
||||
time.sleep(1)
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
for r in results:
|
||||
logger.info(" %-10s %s chunks=%d", r["law"], r["status"].upper(), r["chunks"])
|
||||
|
||||
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
|
||||
logger.info("\nTotal new chunks: %d", total_new)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
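Review note on `download_law`: the charset handling can be isolated into a pure function that works on raw bytes, which makes the fallback order easy to test without a network call. The following is a minimal sketch of that idea, assuming the same order as the script (declared charset, then UTF-8, then ISO-8859-1). The function name `decode_law_html` and its parameters are hypothetical.

```python
from typing import Optional


def decode_law_html(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode downloaded HTML bytes.

    Order: declared charset (if usable), strict UTF-8, then ISO-8859-1.
    ISO-8859-1 maps every byte to a character, so the last step cannot fail.
    """
    if declared_charset:
        try:
            return raw.decode(declared_charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not match the declaration.
            pass
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

A strict UTF-8 decode raises on invalid bytes instead of inserting replacement characters, so the `except` branch alone is enough to trigger the ISO-8859-1 fallback.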
@@ -0,0 +1,201 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Ingest missing EU regulations from EUR-Lex (HTML).
|
||||
|
||||
Downloads German HTML from EUR-Lex via CELEX number, uploads with legal chunking.
|
||||
|
||||
Usage (on Mac Mini):
|
||||
python3 control-pipeline/scripts/ingest_eu_regulations.py --dry-run
|
||||
python3 control-pipeline/scripts/ingest_eu_regulations.py
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
logger = logging.getLogger("ingest-eu")
|
||||
|
||||
RAG_URL = "https://localhost:8097"
|
||||
QDRANT_URL = "http://localhost:6333"
|
||||
COLLECTION = "bp_compliance_ce"
|
||||
|
||||
EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
|
||||
|
||||
# ---- EU Regulations to ingest ----
|
||||
REGULATIONS = [
|
||||
{
|
||||
"celex": "32022L2464",
|
||||
"regulation_id": "csrd_2022",
|
||||
"name": "Corporate Sustainability Reporting Directive (CSRD)",
|
||||
"short": "CSRD",
|
||||
"category": "sustainability",
|
||||
},
|
||||
{
|
||||
"celex": "32024L1760",
|
||||
"regulation_id": "csddd_2024",
|
||||
"name": "Corporate Sustainability Due Diligence Directive (CSDDD)",
|
||||
"short": "CSDDD",
|
||||
"category": "sustainability",
|
||||
},
|
||||
{
|
||||
"celex": "32020R0852",
|
||||
"regulation_id": "eu_taxonomy_2020",
|
||||
"name": "EU-Taxonomie-Verordnung",
|
||||
"short": "EU Taxonomy",
|
||||
"category": "sustainability",
|
||||
},
|
||||
{
|
||||
"celex": "32024R1183",
|
||||
"regulation_id": "eidas_2_0_2024",
|
||||
"name": "eIDAS 2.0 Verordnung (EU Digital Identity)",
|
||||
"short": "eIDAS 2.0",
|
||||
"category": "digital_identity",
|
||||
},
|
||||
{
|
||||
"celex": "32023L0970",
|
||||
"regulation_id": "pay_transparency_2023",
|
||||
"name": "Entgelttransparenz-Richtlinie",
|
||||
"short": "Pay Transparency",
|
||||
"category": "employment",
|
||||
},
|
||||
{
|
||||
"celex": "32022R2065",
|
||||
"regulation_id": "dsa_2022_updated",
|
||||
"name": "Digital Services Act (DSA) — aktualisiert",
|
||||
"short": "DSA",
|
||||
"category": "digital_services",
|
||||
"skip_if_exists": "dsa_2022", # already exists under different ID
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def download_eurlex(celex: str) -> str:
|
||||
"""Download EU regulation HTML from EUR-Lex."""
|
||||
url = EURLEX_URL.format(celex=celex)
|
||||
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
|
||||
resp = c.get(url)
|
||||
resp.raise_for_status()
|
||||
return resp.text
|
||||
|
||||
|
||||
def upload_html(html: str, filename: str, reg: dict, dry_run: bool = False):
|
||||
"""Upload HTML to RAG service."""
|
||||
if dry_run:
|
||||
logger.info(" DRY RUN — would upload %d chars", len(html))
|
||||
return {"chunks_count": 0}
|
||||
|
||||
meta = {
|
||||
"regulation_id": reg["regulation_id"],
|
||||
"regulation_name_de": reg["name"],
|
||||
"regulation_short": reg["short"],
|
||||
"celex": reg["celex"],
|
||||
"category": reg["category"],
|
||||
"source": "EUR-Lex",
|
||||
"license": "EU_law",
|
||||
"jurisdiction": "EU",
|
||||
"source_type": "law",
|
||||
}
|
||||
form_data = {
|
||||
"collection": COLLECTION,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "legal",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(meta, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=600.0, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, html.encode("utf-8"), "text/html")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def count_existing(regulation_id: str) -> int:
|
||||
with httpx.Client(timeout=60.0) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
|
||||
json={"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]}, "exact": True},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("Ingest EU Regulations from EUR-Lex")
|
||||
logger.info(" Regulations: %d", len(REGULATIONS))
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, reg in enumerate(REGULATIONS, 1):
|
||||
logger.info("\n[%d/%d] %s (CELEX: %s)", i, len(REGULATIONS), reg["name"], reg["celex"])
|
||||
|
||||
# Skip if variant already exists
|
||||
skip_id = reg.get("skip_if_exists")
|
||||
if skip_id:
|
||||
existing = count_existing(skip_id)
|
||||
if existing > 0:
|
||||
logger.info(" Already exists as '%s' (%d chunks) — SKIPPING", skip_id, existing)
|
||||
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
|
||||
continue
|
||||
|
||||
# Check if this exact ID exists
|
||||
existing = count_existing(reg["regulation_id"])
|
||||
if existing > 0:
|
||||
logger.info(" Already exists: %d chunks — SKIPPING", existing)
|
||||
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
|
||||
continue
|
||||
|
||||
# Download from EUR-Lex
|
||||
logger.info(" Downloading from EUR-Lex...")
|
||||
try:
|
||||
html = download_eurlex(reg["celex"])
|
||||
logger.info(" Downloaded: %d chars", len(html))
|
||||
except Exception as e:
|
||||
logger.error(" Download FAILED: %s", e)
|
||||
results.append({"reg": reg["short"], "status": "download_failed", "chunks": 0})
|
||||
continue
|
||||
|
||||
# Upload
|
||||
filename = f"{reg['regulation_id']}.html"
|
||||
try:
|
||||
result = upload_html(html, filename, reg, args.dry_run)
|
||||
chunks = result.get("chunks_count", 0)
|
||||
logger.info(" Uploaded: %d chunks", chunks)
|
||||
results.append({"reg": reg["short"], "status": "ok", "chunks": chunks})
|
||||
except Exception as e:
|
||||
logger.error(" Upload FAILED: %s", e)
|
||||
results.append({"reg": reg["short"], "status": "error", "chunks": 0})
|
||||
|
||||
if i < len(REGULATIONS):
|
||||
time.sleep(2)
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
for r in results:
|
||||
logger.info(" %-20s %s chunks=%d", r["reg"], r["status"].upper(), r["chunks"])
|
||||
|
||||
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
|
||||
logger.info("\nTotal new chunks: %d", total_new)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
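Review note on the `REGULATIONS` list: every entry is addressed by its CELEX number, so a typo there only surfaces as a failed download. A small validation step before the loop could catch that earlier. The sketch below assumes the common CELEX layout for sector 3 (legislation): one sector digit, a four-digit year, a one- or two-letter document type, and a four-digit number. The names `Celex`, `parse_celex` and `eurlex_url` are hypothetical; only the URL template is taken from the script.

```python
import re
from typing import NamedTuple

# URL template copied from ingest_eu_regulations.py.
EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"

# Simplified pattern: covers identifiers such as 32022L2464, not every CELEX variant.
CELEX_RE = re.compile(r"^(?P<sector>\d)(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<number>\d{4})$")

# Only the two document types used in REGULATIONS are named here.
DOC_TYPES = {"L": "directive", "R": "regulation"}


class Celex(NamedTuple):
    sector: str
    year: int
    doc_type: str
    number: int


def parse_celex(celex: str) -> Celex:
    """Split a CELEX number into its parts; raise ValueError if it does not match."""
    m = CELEX_RE.match(celex)
    if m is None:
        raise ValueError(f"Not a recognised CELEX number: {celex!r}")
    return Celex(m["sector"], int(m["year"]), m["doc_type"], int(m["number"]))


def eurlex_url(celex: str) -> str:
    """Validate the CELEX number, then build the German HTML URL."""
    parse_celex(celex)
    return EURLEX_URL.format(celex=celex)


csrd = parse_celex("32022L2464")
kind = DOC_TYPES.get(csrd.doc_type, "unknown")
```

Calling `parse_celex` for every entry at startup would turn a mistyped identifier into an immediate error rather than a `download_failed` result at the end of the run.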
@@ -0,0 +1,303 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
E2E Quality Report: Verify controls have correct source citations.
|
||||
|
||||
Loads N random controls from PostgreSQL, cross-references with Qdrant chunks,
|
||||
and reports mismatches between source_citation and actual chunk metadata.
|
||||
|
||||
Usage:
|
||||
# Against Mac Mini
|
||||
python3 scripts/quality_report.py --db-host macmini --qdrant-url http://macmini:6333
|
||||
|
||||
# Smaller sample
|
||||
python3 scripts/quality_report.py --db-host macmini --sample 100
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import create_engine, text
|
||||
from sqlalchemy.orm import sessionmaker
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger("quality-report")
|
||||
|
||||
COLLECTIONS = [
|
||||
"bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
|
||||
"bp_dsfa_corpus", "bp_legal_templates",
|
||||
]
|
||||
|
||||
|
||||
def load_controls(db_url: str, sample_size: int) -> list[dict]:
|
||||
"""Load random controls with source_citation from PostgreSQL."""
|
||||
engine = create_engine(db_url)
|
||||
Session = sessionmaker(bind=engine)
|
||||
|
||||
with Session() as db:
|
||||
rows = db.execute(text("""
|
||||
SELECT id::text, control_id, title,
|
||||
source_citation::text, source_original_text,
|
||||
generation_metadata::text, release_state
|
||||
FROM compliance.canonical_controls
|
||||
WHERE source_citation IS NOT NULL
|
||||
AND source_original_text IS NOT NULL
|
||||
AND release_state = 'draft'
|
||||
ORDER BY RANDOM()
|
||||
LIMIT :n
|
||||
"""), {"n": sample_size}).fetchall()
|
||||
|
||||
controls = []
|
||||
for row in rows:
|
||||
citation = json.loads(row[3]) if row[3] else {}
|
||||
metadata = json.loads(row[5]) if row[5] else {}
|
||||
controls.append({
|
||||
"id": row[0],
|
||||
"control_id": row[1],
|
||||
"title": row[2],
|
||||
"citation": citation,
|
||||
"source_text": row[4],
|
||||
"metadata": metadata,
|
||||
"release_state": row[6],
|
||||
})
|
||||
return controls
|
||||
|
||||
|
||||
def build_qdrant_index(qdrant_url: str) -> dict:
|
||||
"""Build regulation_id → list[chunk] index from Qdrant.
|
||||
|
||||
Controls were generated from OLD chunks (512 chars). Qdrant now has
|
||||
NEW chunks (1500 chars). Hash matching won't work — use regulation +
|
||||
section matching instead.
|
||||
"""
|
||||
logger.info("Building Qdrant chunk index by regulation_id...")
|
||||
index = {} # regulation_id → [{"section": ..., "text_snippet": ..., ...}]
|
||||
client = httpx.Client(timeout=60.0)
|
||||
|
||||
for coll in COLLECTIONS:
|
||||
offset = None
|
||||
for _ in range(600):
|
||||
body = {"limit": 250, "with_payload": True, "with_vector": False}
|
||||
if offset:
|
||||
body["offset"] = offset
|
||||
r = client.post(f"{qdrant_url}/collections/{coll}/points/scroll", json=body)
|
||||
if r.status_code != 200:
|
||||
break
|
||||
data = r.json()["result"]
|
||||
for pt in data["points"]:
|
||||
reg_id = pt["payload"].get("regulation_id", "")
|
||||
if not reg_id:
|
||||
continue
|
||||
chunk = {
|
||||
"section": pt["payload"].get("section", ""),
|
||||
"section_title": pt["payload"].get("section_title", ""),
|
||||
"paragraph": pt["payload"].get("paragraph", ""),
|
||||
"text_snippet": pt["payload"].get("chunk_text", "")[:200],
|
||||
"filename": pt["payload"].get("filename", ""),
|
||||
"collection": coll,
|
||||
}
|
||||
index.setdefault(reg_id, []).append(chunk)
|
||||
offset = data.get("next_page_offset")
|
||||
if not offset:
|
||||
break
|
||||
|
||||
client.close()
|
||||
total = sum(len(v) for v in index.values())
|
||||
logger.info("Qdrant index: %d regulations, %d chunks", len(index), total)
|
||||
return index
|
||||
|
||||
|
||||
def check_control(ctrl: dict, qdrant_index: dict) -> dict:
|
||||
"""Check a single control's source_citation against Qdrant chunks.
|
||||
|
||||
Strategy: Find chunks by regulation_id from generation_metadata,
|
||||
then check if any chunk has a matching section/article.
|
||||
"""
|
||||
result = {
|
||||
"control_id": ctrl["control_id"],
|
||||
"title": (ctrl["title"] or "")[:60],
|
||||
"citation_source": ctrl["citation"].get("source", ""),
|
||||
"citation_article": ctrl["citation"].get("article", ""),
|
||||
"citation_paragraph": ctrl["citation"].get("paragraph", ""),
|
||||
"citation_page": ctrl["citation"].get("page"),
|
||||
"issues": [],
|
||||
}
|
||||
|
||||
# Get regulation_id from generation_metadata
|
||||
reg_code = ctrl["metadata"].get("source_regulation", "")
|
||||
citation_article = ctrl["citation"].get("article", "")
|
||||
|
||||
# Check 1: Does the control have a regulation reference?
|
||||
if not reg_code:
|
||||
result["issues"].append("NO_REGULATION_CODE")
|
||||
return result
|
||||
|
||||
# Check 2: Does this regulation exist in Qdrant?
|
||||
chunks = qdrant_index.get(reg_code, [])
|
||||
if not chunks:
|
||||
result["issues"].append(f"REGULATION_NOT_IN_QDRANT: {reg_code}")
|
||||
result["reg_found"] = False
|
||||
return result
|
||||
|
||||
result["reg_found"] = True
|
||||
result["reg_chunks"] = len(chunks)
|
||||
|
||||
# Check 3: Does the control have an article citation?
|
||||
if not citation_article:
|
||||
result["issues"].append("NO_ARTICLE_IN_CITATION")
|
||||
# Still check if chunks have section metadata at all
|
||||
has_section = any(c["section"] for c in chunks)
|
||||
if has_section:
|
||||
result["issues"].append("CHUNKS_HAVE_SECTIONS_BUT_CONTROL_MISSING")
|
||||
return result
|
||||
|
||||
# Check 4: Is the cited article found in any chunk's section?
|
||||
norm_article = citation_article.strip().lower()
|
||||
matching_chunks = [
|
||||
c for c in chunks
|
||||
if c["section"] and (
|
||||
norm_article == c["section"].strip().lower()
|
||||
or norm_article in c["section"].strip().lower()
|
||||
or c["section"].strip().lower() in norm_article
|
||||
)
|
||||
]
|
||||
|
||||
if matching_chunks:
|
||||
result["article_match"] = True
|
||||
result["matched_section"] = matching_chunks[0]["section"]
|
||||
else:
|
||||
# Check if ANY chunk has sections (the article might just not match)
|
||||
sections_in_regulation = sorted(set(c["section"] for c in chunks if c["section"]))
|
||||
if sections_in_regulation:
|
||||
result["issues"].append(
|
||||
f"ARTICLE_NOT_FOUND_IN_CHUNKS: '{citation_article}' not in {sections_in_regulation[:5]}"
|
||||
)
|
||||
else:
|
||||
result["issues"].append("NO_SECTIONS_IN_REGULATION_CHUNKS")
|
||||
|
||||
# Check 5: Does source_original_text contain the cited article?
|
||||
source_text = ctrl["source_text"] or ""
|
||||
if citation_article and source_text:
|
||||
if citation_article.lower() not in source_text.lower():
|
||||
if f"[{citation_article}" not in source_text:
|
||||
result["issues"].append("ARTICLE_NOT_IN_SOURCE_TEXT")
|
||||
|
||||
if not result["issues"]:
|
||||
result["issues"] = ["OK"]
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def generate_report(results: list[dict]):
    """Print the quality report."""
    total = len(results)

    def has_issue(r: dict, code: str) -> bool:
        # check_control() appends details to some codes ("CODE: detail"),
        # so match on the code prefix instead of the full string.
        return any(i.startswith(code) for i in r["issues"])

    def pct(n: int) -> int:
        return n * 100 // max(total, 1)

    # Counters use the issue codes that check_control() actually emits.
    ok = sum(1 for r in results if r["issues"] == ["OK"])
    reg_found = sum(1 for r in results if r.get("reg_found", False))
    not_found = [r for r in results if has_issue(r, "REGULATION_NOT_IN_QDRANT")]
    no_article = sum(1 for r in results if has_issue(r, "NO_ARTICLE_IN_CITATION"))
    no_section = sum(1 for r in results if has_issue(r, "NO_SECTIONS_IN_REGULATION_CHUNKS"))
    mismatches = [r for r in results if has_issue(r, "ARTICLE_NOT_FOUND_IN_CHUNKS")]
    not_in_text = sum(1 for r in results if has_issue(r, "ARTICLE_NOT_IN_SOURCE_TEXT"))

    print("\n" + "=" * 100)
    print("QUALITY REPORT: CONTROL SOURCE CITATION VERIFICATION")
    print("=" * 100)

    print(f"\nSample: {total} controls")
    print(f"\n{'Metric':<45} {'Count':>8} {'Share':>8}")
    print("-" * 65)
    metrics = [
        ("OK (no issues)", ok),
        ("Regulation found in Qdrant", reg_found),
        ("Regulation NOT found in Qdrant", len(not_found)),
        ("No article in source_citation", no_article),
        ("No section in Qdrant chunks", no_section),
        ("Article/section mismatch", len(mismatches)),
        ("Article not in source text", not_in_text),
    ]
    for label, n in metrics:
        print(f"{label:<45} {n:>8} {pct(n):>7}%")

    # Show sample mismatches
    if mismatches:
        print("\n=== MISMATCHES (first 10) ===\n")
        for r in mismatches[:10]:
            print(f"  {r['control_id']:20s} {r['title'][:40]:40s}")
            for i in r["issues"]:
                if i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS"):
                    print(f"    → {i}")

    # Show sample controls whose regulation is missing in Qdrant
    if not_found:
        print("\n=== REGULATION NOT FOUND (first 10) ===\n")
        for r in not_found[:10]:
            src = r.get("citation_source") or "?"
            art = r.get("citation_article") or "?"
            print(f"  {r['control_id']:20s} {src[:25]:25s} {art}")

    # Distribution by source
    print("\n=== BY SOURCE ===\n")
    source_stats = {}
    for r in results:
        src = (r.get("citation_source") or "?")[:30]
        stats = source_stats.setdefault(
            src, {"total": 0, "ok": 0, "no_reg": 0, "no_section": 0}
        )
        stats["total"] += 1
        if r["issues"] == ["OK"]:
            stats["ok"] += 1
        if has_issue(r, "REGULATION_NOT_IN_QDRANT"):
            stats["no_reg"] += 1
        if has_issue(r, "NO_SECTIONS_IN_REGULATION_CHUNKS"):
            stats["no_section"] += 1

    print(f"  {'Source':<32} {'Total':>6} {'OK':>6} {'OK%':>6} {'NoReg':>8} {'NoSect':>8}")
    print(f"  {'-'*72}")
    for src in sorted(source_stats.keys(), key=lambda s: -source_stats[s]["total"]):
        s = source_stats[src]
        src_pct = s["ok"] * 100 // max(s["total"], 1)
        print(f"  {src:<32} {s['total']:>6} {s['ok']:>6} {src_pct:>5}% {s['no_reg']:>8} {s['no_section']:>8}")

    print(f"\n{'='*100}")
    verdict = "PASS" if pct(ok) >= 50 else "NEEDS IMPROVEMENT"
    print(f"RESULT: {verdict}: {ok}/{total} controls ({pct(ok)}%) fully correct")
    print(f"{'='*100}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Control Source Citation Quality Report")
|
||||
parser.add_argument("--db-host", default="macmini")
|
||||
parser.add_argument("--db-port", type=int, default=5432)
|
||||
parser.add_argument("--db-name", default="breakpilot_db")
|
||||
parser.add_argument("--db-user", default="breakpilot")
|
||||
parser.add_argument("--db-pass", default="breakpilot123")
|
||||
parser.add_argument("--qdrant-url", default="http://macmini:6333")
|
||||
parser.add_argument("--sample", type=int, default=500)
|
||||
args = parser.parse_args()
|
||||
|
||||
db_url = f"postgresql://{args.db_user}:{args.db_pass}@{args.db_host}:{args.db_port}/{args.db_name}"
|
||||
|
||||
# Load controls
|
||||
logger.info("Loading %d random controls from DB...", args.sample)
|
||||
controls = load_controls(db_url, args.sample)
|
||||
logger.info("Loaded %d controls with source_citation", len(controls))
|
||||
|
||||
if not controls:
|
||||
print("ERROR: No controls found with source_citation")
|
||||
sys.exit(1)
|
||||
|
||||
# Build Qdrant index
|
||||
qdrant_index = build_qdrant_index(args.qdrant_url)
|
||||
|
||||
# Check each control
|
||||
logger.info("Checking %d controls against Qdrant...", len(controls))
|
||||
results = []
|
||||
for ctrl in controls:
|
||||
result = check_control(ctrl, qdrant_index)
|
||||
results.append(result)
|
||||
|
||||
# Report
|
||||
generate_report(results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
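Review note on the report counters: `check_control` emits some issues as `CODE: detail`, so exact string comparisons in the report will miss them. One way to keep the two functions in sync is to tally by code prefix. The sketch below shows that approach on hand-written sample results; `issue_code` and `tally_issues` are hypothetical helpers and the sample data is invented for illustration.

```python
from collections import Counter


def issue_code(issue: str) -> str:
    """Return the code part of an issue string such as 'CODE: detail'."""
    return issue.split(":", 1)[0]


def tally_issues(results: list) -> Counter:
    """Count how many controls carry each issue code (each control counted once per code)."""
    counts = Counter()
    for r in results:
        for code in {issue_code(i) for i in r["issues"]}:
            counts[code] += 1
    return counts


# Invented sample in the shape produced by check_control().
sample = [
    {"issues": ["OK"]},
    {"issues": ["REGULATION_NOT_IN_QDRANT: de_arbzg"]},
    {"issues": [
        "ARTICLE_NOT_FOUND_IN_CHUNKS: 'Art. 5' not in ['Art. 1']",
        "ARTICLE_NOT_IN_SOURCE_TEXT",
    ]},
]
counts = tally_issues(sample)
```

With a tally like this, the report can print whatever codes occur instead of hard-coding a list that may drift from `check_control`.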
@@ -0,0 +1,486 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
D5 Re-Ingestion: Re-chunk all ~297 legal sources with structural metadata.
|
||||
|
||||
Usage:
|
||||
# Dry-run: build manifest, no changes
|
||||
python3 scripts/reingest_d5.py --dry-run
|
||||
|
||||
# Re-ingest one collection (test)
|
||||
python3 scripts/reingest_d5.py --collection bp_compliance_gesetze
|
||||
|
||||
# Re-ingest all collections (resume-capable)
|
||||
python3 scripts/reingest_d5.py --resume
|
||||
|
||||
# Custom URLs
|
||||
python3 scripts/reingest_d5.py --rag-url https://macmini:8097 --qdrant-url http://macmini:6333
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
|
||||
import httpx
|
||||
|
||||
from reingest_d5_config import (
|
||||
CHUNK_OVERLAP,
|
||||
CHUNK_SIZE,
|
||||
CHUNK_STRATEGY,
|
||||
DEFAULT_QDRANT_URL,
|
||||
DEFAULT_RAG_URL,
|
||||
MANIFEST_FILE,
|
||||
TARGET_COLLECTIONS,
|
||||
content_type_from_filename,
|
||||
doc_key,
|
||||
extract_doc_metadata,
|
||||
load_progress,
|
||||
save_progress,
|
||||
)
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger("d5-reingest")
|
||||
|
||||
UPLOAD_TIMEOUT = httpx.Timeout(timeout=3600.0, connect=30.0)
|
||||
SCROLL_TIMEOUT = httpx.Timeout(timeout=60.0, connect=10.0)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 0: Preflight
|
||||
# ---------------------------------------------------------------------------
|
||||
def preflight_checks(rag_url: str, qdrant_url: str) -> dict:
|
||||
"""Verify services are reachable and record baseline chunk counts."""
|
||||
logger.info("Phase 0: Preflight checks...")
|
||||
|
||||
with httpx.Client(timeout=10.0, verify=False) as c:
|
||||
r = c.get(f"{rag_url}/health")
|
||||
r.raise_for_status()
|
||||
logger.info(" RAG service: OK")
|
||||
|
||||
with httpx.Client(timeout=10.0) as c:
|
||||
r = c.get(f"{qdrant_url}/collections")
|
||||
r.raise_for_status()
|
||||
logger.info(" Qdrant: OK")
|
||||
|
||||
before_counts = {}
|
||||
with httpx.Client(timeout=10.0) as c:
|
||||
for coll in TARGET_COLLECTIONS:
|
||||
try:
|
||||
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
|
||||
json={"exact": True})
|
||||
r.raise_for_status()
|
||||
count = r.json()["result"]["count"]
|
||||
except Exception:
|
||||
count = 0
|
||||
before_counts[coll] = count
|
||||
logger.info(" %s: %d chunks", coll, count)
|
||||
|
||||
return before_counts
|
||||
|
||||
|
# ---------------------------------------------------------------------------
# Phase 1: Build manifest
# ---------------------------------------------------------------------------
def build_manifest(qdrant_url: str, collections: list[str]) -> list[dict]:
    """Scroll Qdrant and build a deduplicated document manifest."""
    logger.info("Phase 1: Building document manifest...")
    documents: dict[str, dict] = {}  # keyed by doc_key(object_name, collection)

    with httpx.Client(timeout=SCROLL_TIMEOUT) as client:
        for coll in collections:
            logger.info(" Scrolling %s...", coll)
            offset = None
            points_seen = 0

            while True:
                body: dict = {
                    "limit": 250,
                    "with_payload": True,
                    "with_vector": False,
                }
                if offset:
                    body["offset"] = offset

                resp = client.post(
                    f"{qdrant_url}/collections/{coll}/points/scroll",
                    json=body,
                )
                resp.raise_for_status()
                data = resp.json()["result"]
                points = data["points"]

                for pt in points:
                    payload = pt.get("payload", {})
                    obj_name = payload.get("object_name", "")
                    if not obj_name:
                        continue

                    key = doc_key(obj_name, coll)
                    if key not in documents:
                        meta = extract_doc_metadata(payload)
                        documents[key] = {
                            "object_name": obj_name,
                            "collection": coll,
                            "filename": payload.get("filename", obj_name.split("/")[-1]),
                            "form": meta["form"],
                            "extra_metadata": meta["extra"],
                            "old_chunk_count": 0,
                        }
                    documents[key]["old_chunk_count"] += 1

                points_seen += len(points)
                offset = data.get("next_page_offset")
                if not offset:
                    break

            logger.info(" %d points → %d unique docs",
                        points_seen,
                        sum(1 for d in documents.values() if d["collection"] == coll))

    manifest = list(documents.values())
    logger.info(" Total: %d unique documents across %d collections",
                len(manifest), len(collections))
    return manifest


# ---------------------------------------------------------------------------
# Phase 2: Per-document re-ingestion
# ---------------------------------------------------------------------------
def download_file(rag_url: str, object_name: str) -> bytes:
    """Download file bytes via MinIO presigned URL."""
    with httpx.Client(timeout=60.0, verify=False) as c:
        resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
        resp.raise_for_status()
        presigned_url = resp.json()["url"]

    with httpx.Client(timeout=120.0, verify=False) as c:
        resp = c.get(presigned_url)
        resp.raise_for_status()
        return resp.content


def delete_old_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
    """Delete all chunks for a document from Qdrant.

    Always returns 0: the Qdrant delete response does not report how many
    points were removed.
    """
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{qdrant_url}/collections/{collection}/points/delete",
            json={
                "filter": {
                    "must": [{
                        "key": "object_name",
                        "match": {"value": object_name},
                    }]
                }
            },
        )
        resp.raise_for_status()
    return 0  # Qdrant delete doesn't return count


def _delete_old_chunks_safe(
    qdrant_url: str, collection: str, object_name: str, keep_doc_id: str,
) -> None:
    """Delete old chunks for a document, keeping chunks with keep_doc_id."""
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{qdrant_url}/collections/{collection}/points/delete",
            json={
                "filter": {
                    "must": [{
                        "key": "object_name",
                        "match": {"value": object_name},
                    }],
                    "must_not": [{
                        "key": "document_id",
                        "match": {"value": keep_doc_id},
                    }],
                }
            },
        )
        resp.raise_for_status()


def reupload_document(
    rag_url: str,
    file_bytes: bytes,
    filename: str,
    collection: str,
    form_fields: dict,
    extra_metadata: dict,
) -> dict:
    """Upload document to RAG service with new chunking parameters."""
    ct = content_type_from_filename(filename)

    form_data = {
        "collection": collection,
        "data_type": form_fields.get("data_type", "compliance"),
        "bundesland": form_fields.get("bundesland", "bund"),
        "use_case": form_fields.get("use_case", "compliance"),
        "year": form_fields.get("year", "2026"),
        "chunk_strategy": CHUNK_STRATEGY,
        "chunk_size": str(CHUNK_SIZE),
        "chunk_overlap": str(CHUNK_OVERLAP),
        "metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
    }

    with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
        resp = c.post(
            f"{rag_url}/api/v1/documents/upload",
            files={"file": (filename, file_bytes, ct)},
            data=form_data,
        )
        resp.raise_for_status()
        return resp.json()


def process_document(
    doc: dict,
    rag_url: str,
    qdrant_url: str,
    progress: dict,
    max_retries: int = 2,
) -> bool:
    """Process a single document: download → upload → verify → delete old.

    Safe order: new chunks are created FIRST, old chunks deleted only after
    successful verification (upload-before-delete pattern).
    """
    key = doc_key(doc["object_name"], doc["collection"])

    # Skip if already done
    if progress.get("documents", {}).get(key, {}).get("status") == "done":
        return True

    for attempt in range(max_retries + 1):
        try:
            # 1. Download
            file_bytes = download_file(rag_url, doc["object_name"])
            if not file_bytes:
                logger.warning(" Empty file: %s — skipping", doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "skipped", "reason": "empty_file"}
                return False

            # 2. Upload FIRST (creates new chunks alongside old ones)
            result = reupload_document(
                rag_url, file_bytes, doc["filename"],
                doc["collection"], doc["form"], doc["extra_metadata"],
            )

            new_chunks = result.get("chunks_count", 0)
            new_doc_id = result.get("document_id", "")
            if new_chunks == 0:
                logger.error(" Upload produced 0 chunks — keeping old data: %s",
                             doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": "0 new chunks"}
                return False
            if not new_doc_id:
                # An empty document_id would make the must_not clause in step 3
                # exclude nothing, so the delete could also remove the new chunks.
                logger.error(" Upload returned no document_id, keeping old data: %s",
                             doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": "missing document_id"}
                return False

            # 3. Delete OLD chunks only (exclude the new document_id)
            _delete_old_chunks_safe(
                qdrant_url, doc["collection"],
                doc["object_name"], new_doc_id,
            )

            # 4. Record success
            progress.setdefault("documents", {})[key] = {
                "status": "done",
                "old_chunks": doc["old_chunk_count"],
                "new_chunks": new_chunks,
                "new_document_id": new_doc_id,
                "completed_at": datetime.now(timezone.utc).isoformat(),
            }
            return True

        except httpx.HTTPStatusError as e:
            if e.response.status_code == 404:
                logger.warning(" File not in MinIO (404): %s — skipping", doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "skipped", "reason": "not_in_minio"}
                return False
            if attempt < max_retries:
                wait = 5 * (attempt + 1)
                logger.warning(" HTTP %d on attempt %d, retrying in %ds...",
                               e.response.status_code, attempt + 1, wait)
                time.sleep(wait)
            else:
                logger.error(" FAILED after %d retries: %s", max_retries, e)
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": str(e), "retries": max_retries}
                return False

        except Exception as e:
            if attempt < max_retries:
                wait = 10 * (attempt + 1)
                logger.warning(" Error on attempt %d: %s — retrying in %ds",
                               attempt + 1, e, wait)
                time.sleep(wait)
            else:
                logger.error(" FAILED after %d retries: %s", max_retries, e)
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": str(e), "retries": max_retries}
                return False

    return False


# ---------------------------------------------------------------------------
# Phase 3: Verification
# ---------------------------------------------------------------------------
def verify_results(
    qdrant_url: str,
    before_counts: dict,
    collections: list[str],
    manifest: list[dict],
):
    """Compare before/after counts and spot-check metadata."""
    logger.info("Phase 3: Verification...")

    print("\n" + "=" * 65)
    print("D5 RE-INGESTION VERIFICATION REPORT")
    print("=" * 65)

    after_counts = {}
    with httpx.Client(timeout=10.0) as c:
        for coll in collections:
            try:
                r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
                           json={"exact": True})
                r.raise_for_status()
                after_counts[coll] = r.json()["result"]["count"]
            except Exception:
                after_counts[coll] = -1

    print(f"\n{'Collection':<35} {'Before':>8} {'After':>8} {'Delta':>8}")
    print("-" * 65)
    for coll in collections:
        before = before_counts.get(coll, 0)
        after = after_counts.get(coll, -1)
        delta = after - before if after >= 0 else "?"
        print(f"{coll:<35} {before:>8} {after:>8} {str(delta):>8}")

    # Spot-check: pick 3 random docs and verify metadata
    print("\nSpot-check (3 random docs):")
    sample = random.sample(manifest, min(3, len(manifest)))

    with httpx.Client(timeout=30.0) as c:
        for doc in sample:
            resp = c.post(
                f"{qdrant_url}/collections/{doc['collection']}/points/scroll",
                json={
                    "limit": 3,
                    "with_payload": True,
                    "with_vector": False,
                    "filter": {
                        "must": [{
                            "key": "object_name",
                            "match": {"value": doc["object_name"]},
                        }]
                    },
                },
            )
            if resp.status_code != 200:
                print(f" {doc['object_name']}: QUERY FAILED")
                continue

            points = resp.json()["result"]["points"]
            if not points:
                print(f" {doc['object_name']}: NO CHUNKS FOUND")
                continue

            has_section = sum(1 for p in points if p["payload"].get("section"))
            has_para = sum(1 for p in points if p["payload"].get("paragraph"))
            print(f" {doc['filename'][:40]:<42} "
                  f"chunks={len(points):>3} "
                  f"with_section={has_section}/{len(points)} "
                  f"with_para={has_para}/{len(points)}")

    print()


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
    parser = argparse.ArgumentParser(description="D5 Re-Ingestion Script")
    parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
    parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
    parser.add_argument("--dry-run", action="store_true",
                        help="Build manifest only, no changes")
    parser.add_argument("--collection", default=None,
                        help="Process only this collection")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from progress file")
    args = parser.parse_args()

    collections = [args.collection] if args.collection else TARGET_COLLECTIONS

    # Phase 0
    before_counts = preflight_checks(args.rag_url, args.qdrant_url)

    # Phase 1
    manifest = build_manifest(args.qdrant_url, collections)

    # Save manifest for inspection
    with open(MANIFEST_FILE, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, ensure_ascii=False)
    logger.info("Manifest saved to %s", MANIFEST_FILE)

    if args.dry_run:
        print(f"\nDRY RUN: {len(manifest)} documents found. See {MANIFEST_FILE}")
        for doc in manifest[:10]:
            reg = doc["extra_metadata"].get("regulation_code", "?")
            print(f" {reg:<30} {doc['collection']:<35} chunks={doc['old_chunk_count']}")
        if len(manifest) > 10:
            print(f" ... and {len(manifest) - 10} more")
        sys.exit(0)

    # Phase 2
    progress = load_progress() if args.resume else {"documents": {}}
    progress["started_at"] = datetime.now(timezone.utc).isoformat()
    progress["before_counts"] = before_counts

    done = 0
    skipped = 0
    failed = 0

    for i, doc in enumerate(manifest, 1):
        key = doc_key(doc["object_name"], doc["collection"])
        reg = doc["extra_metadata"].get("regulation_code", "?")

        if progress.get("documents", {}).get(key, {}).get("status") == "done":
            done += 1
            continue

        logger.info("[%d/%d] %s (%s) — %d old chunks",
                    i, len(manifest), reg, doc["collection"], doc["old_chunk_count"])

        ok = process_document(doc, args.rag_url, args.qdrant_url, progress)
        if ok:
            done += 1
            new_chunks = progress["documents"][key].get("new_chunks", "?")
            logger.info(" OK: %d old → %s new chunks", doc["old_chunk_count"], new_chunks)
        elif progress["documents"][key].get("status") == "skipped":
            skipped += 1
        else:
            failed += 1

        save_progress(progress)
        time.sleep(2)

    logger.info("Phase 2 complete: %d done, %d skipped, %d failed", done, skipped, failed)

    # Phase 3
    verify_results(args.qdrant_url, before_counts, collections, manifest)

    print(f"Summary: {done} done, {skipped} skipped, {failed} failed")
    if failed:
        print(f"Re-run with --resume to retry {failed} failed documents")
        sys.exit(1)


if __name__ == "__main__":
    main()
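The upload-before-delete step in the script above relies on a Qdrant filter that matches every chunk of an `object_name` except those carrying the new `document_id`. The following is a minimal sketch of how that filter could be built with a guard against an empty ID; `build_safe_delete_filter` and the sample values are hypothetical names used for illustration, not part of the commit.

```python
def build_safe_delete_filter(object_name: str, keep_doc_id: str) -> dict:
    """Build a filter for a document's chunks, excluding chunks of keep_doc_id.

    Hypothetical helper: raises instead of returning a filter whose
    must_not clause would exclude nothing.
    """
    if not keep_doc_id:
        raise ValueError("keep_doc_id must be non-empty")
    return {
        "must": [{"key": "object_name", "match": {"value": object_name}}],
        "must_not": [{"key": "document_id", "match": {"value": keep_doc_id}}],
    }


# Illustrative values only.
example_filter = build_safe_delete_filter("docs/example.pdf", "doc-123")
print(example_filter)
```

The returned dict has the same shape as the `filter` body that `_delete_old_chunks_safe` posts to `/points/delete`, so the guard could sit in front of that call if the RAG service can return an upload result without a `document_id`.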
@@ -0,0 +1,92 @@
"""D5 Re-Ingestion: Constants, helpers, progress tracking."""

import json
import logging
import os

logger = logging.getLogger("d5-reingest")

# ---------------------------------------------------------------------------
# Defaults (overridable via CLI args)
# ---------------------------------------------------------------------------
DEFAULT_RAG_URL = "https://macmini:8097"
DEFAULT_QDRANT_URL = "http://macmini:6333"

TARGET_COLLECTIONS = [
    "bp_compliance_ce",
    "bp_compliance_gesetze",
    "bp_compliance_datenschutz",
    "bp_dsfa_corpus",
    "bp_legal_templates",
    "bp_compliance_schulrecht",
]

# New chunking parameters (D1-D4 validated)
CHUNK_STRATEGY = "recursive"
CHUNK_SIZE = 1500
CHUNK_OVERLAP = 100

PROGRESS_FILE = "d5_reingest_progress.json"
MANIFEST_FILE = "d5_manifest.json"

# Per-chunk fields (NOT carried as extra metadata during re-upload)
PER_CHUNK_FIELDS = frozenset({
    "chunk_text", "chunk_index", "document_id", "object_name",
    "filename", "data_type", "bundesland", "use_case", "year",
    "section", "section_title", "paragraph", "paragraph_num", "page",
})

# Upload form fields that come from the payload (not metadata_json)
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})


# ---------------------------------------------------------------------------
# Progress tracking
# ---------------------------------------------------------------------------
def load_progress(path: str = PROGRESS_FILE) -> dict:
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"documents": {}}


def save_progress(data: dict, path: str = PROGRESS_FILE):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False, default=str)


# ---------------------------------------------------------------------------
# Metadata extraction
# ---------------------------------------------------------------------------
def extract_doc_metadata(payload: dict) -> dict:
    """Split Qdrant payload into form fields + extra metadata.

    Returns: {"form": {data_type, bundesland, ...}, "extra": {regulation_code, ...}}
    """
    form = {}
    extra = {}
    for k, v in payload.items():
        # FORM_FIELDS is a subset of PER_CHUNK_FIELDS, so it must be checked
        # first; otherwise "form" would always stay empty.
        if k in FORM_FIELDS:
            form[k] = v
        elif k in PER_CHUNK_FIELDS:
            continue
        else:
            extra[k] = v
    return {"form": form, "extra": extra}


def doc_key(object_name: str, collection: str) -> str:
    """Unique key for a document in the progress file."""
    return f"{object_name}|{collection}"


def content_type_from_filename(filename: str) -> str:
    """Infer MIME type from file extension."""
    ext = os.path.splitext(filename)[1].lower()
    return {
        ".pdf": "application/pdf",
        ".html": "text/html",
        ".htm": "text/html",
        ".md": "text/markdown",
        ".txt": "text/plain",
    }.get(ext, "application/octet-stream")
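The docstring of `extract_doc_metadata` promises a `form` dict populated with `data_type`, `bundesland`, and so on. Because `FORM_FIELDS` is a subset of `PER_CHUNK_FIELDS` in the config above, the order of the membership checks decides whether that dict is ever filled. The sketch below illustrates the intended split on a made-up payload; `split_payload` and the sample values are hypothetical and only mirror the constants shown in the diff.

```python
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})
PER_CHUNK_FIELDS = frozenset({
    "chunk_text", "chunk_index", "document_id", "object_name",
    "filename", "section", "section_title", "paragraph", "paragraph_num", "page",
}) | FORM_FIELDS


def split_payload(payload: dict) -> dict:
    """Hypothetical re-statement of the split: form fields are checked first."""
    form, extra = {}, {}
    for key, value in payload.items():
        if key in FORM_FIELDS:
            form[key] = value          # goes into the upload form
        elif key not in PER_CHUNK_FIELDS:
            extra[key] = value         # goes into metadata_json
    return {"form": form, "extra": extra}


# Made-up payload for illustration.
sample_payload = {
    "chunk_text": "Article 1 text",
    "chunk_index": 7,
    "object_name": "compliance/bund/compliance/2026/example.pdf",
    "data_type": "compliance",
    "year": "2025",
    "regulation_code": "example_reg",
}
result = split_payload(sample_payload)
print(result)
```

With the per-chunk check first, `year` would be dropped and the re-upload would silently fall back to the `"2026"` default in `reupload_document`.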
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""Safe re-ingestion of NIST/BSI/ENISA PDFs from MinIO.

Uses upload-before-delete pattern: new chunks are created FIRST,
old chunks are only deleted after successful verification.

Usage:
    python3 control-pipeline/scripts/reingest_nist.py [--dry-run]
    python3 control-pipeline/scripts/reingest_nist.py --only-missing
"""

import argparse
import json
import logging
import sys
import time

import httpx

sys.path.insert(0, "control-pipeline/scripts")
from reingest_d5_config import (  # noqa: E402
    CHUNK_OVERLAP,
    CHUNK_SIZE,
    CHUNK_STRATEGY,
    DEFAULT_QDRANT_URL,
    DEFAULT_RAG_URL,
    content_type_from_filename,
)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("reingest-nist")

UPLOAD_TIMEOUT = 1800.0  # 30 min for large PDFs

# -------------------------------------------------------------------
# Documents to re-ingest
# -------------------------------------------------------------------

# 4 documents with 0 chunks (deleted by D5, upload failed)
MISSING_DOCS = [
    {
        "object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "NIST_SP_800_53r5.pdf",
        "extra_metadata": {
            "regulation_id": "nist_sp800_53r5",
            "source_id": "nist",
            "doc_type": "controls_catalog",
            "guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
            "license": "public_domain_us_gov",
            "attribution": "NIST",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
        "collection": "bp_compliance_ce",
        "filename": "nist_sp_800_82r3.pdf",
        "extra_metadata": {
            "regulation_id": "nist_sp_800_82r3",
            "regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
            "regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
            "regulation_short": "NIST SP 800-82",
            "category": "ot_security",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
        "collection": "bp_compliance_ce",
        "filename": "nist_sp_800_160v1r1.pdf",
        "extra_metadata": {
            "regulation_id": "nist_sp_800_160v1r1",
            "regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
            "regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
            "regulation_short": "NIST SP 800-160",
            "category": "security_engineering",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "NIST_SP_800_207.pdf",
        "extra_metadata": {
            "regulation_id": "nist_sp800_207",
            "source_id": "nist",
            "doc_type": "architecture",
            "guideline_name": "NIST SP 800-207 Zero Trust Architecture",
            "license": "public_domain_us_gov",
            "attribution": "NIST",
            "source": "nist.gov",
        },
    },
]

# Additional NIST/BSI/ENISA docs with <10% section rate (re-ingest for quality)
LOW_QUALITY_DOCS = [
    {
        "object_name": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "nist_csf_2_0.pdf",
        "extra_metadata": {
            "regulation_id": "nist_csf_2_0",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nistir_8259a.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "nistir_8259a.pdf",
        "extra_metadata": {
            "regulation_id": "nistir_8259a",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
        "collection": "bp_compliance_datenschutz",
        "filename": "nist_ai_rmf.pdf",
        "extra_metadata": {
            "regulation_id": "nist_ai_rmf",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
        "collection": "bp_compliance_ce",
        "filename": "nist_sp_800_30r1.pdf",
        "extra_metadata": {
            "regulation_id": "nist_sp_800_30r1",
            "license": "public_domain_us",
            "source": "nist.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
        "collection": "bp_compliance_ce",
        "filename": "enisa_supply_chain_good_practices.pdf",
        "extra_metadata": {
            "regulation_id": "enisa_supply_chain_good_practices",
            "license": "reuse_with_attribution",
            "source": "enisa.europa.eu",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
        "collection": "bp_compliance_ce",
        "filename": "enisa_ics_scada.pdf",
        "extra_metadata": {
            "regulation_id": "enisa_ics_scada_dependencies",
            "license": "reuse_with_attribution",
            "source": "enisa.europa.eu",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
        "collection": "bp_compliance_ce",
        "filename": "enisa_supply_chain_security.pdf",
        "extra_metadata": {
            "regulation_id": "enisa_threat_landscape_supply_chain",
            "license": "reuse_with_attribution",
            "source": "enisa.europa.eu",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
        "collection": "bp_compliance_ce",
        "filename": "cisa_secure_by_design.pdf",
        "extra_metadata": {
            "regulation_id": "cisa_secure_by_design",
            "license": "public_domain_us",
            "source": "cisa.gov",
        },
    },
    {
        "object_name": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
        "collection": "bp_compliance_ce",
        "filename": "cvss_v4_0.pdf",
        "extra_metadata": {
            "regulation_id": "cvss_v4_0",
            "license": "public_domain_us",
            "source": "first.org",
        },
    },
]


# -------------------------------------------------------------------
# Qdrant helpers
# -------------------------------------------------------------------
def count_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
    """Count existing chunks for a document in Qdrant."""
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{qdrant_url}/collections/{collection}/points/count",
            json={
                "filter": {
                    "must": [{
                        "key": "object_name",
                        "match": {"value": object_name},
                    }]
                },
                "exact": True,
            },
        )
        resp.raise_for_status()
        return resp.json()["result"]["count"]


def get_old_document_ids(
    qdrant_url: str, collection: str, object_name: str,
) -> set:
    """Get all document_ids for existing chunks of this document."""
    doc_ids = set()
    offset = None
    with httpx.Client(timeout=60.0) as c:
        while True:
            body = {
                "filter": {
                    "must": [{
                        "key": "object_name",
                        "match": {"value": object_name},
                    }]
                },
                "limit": 100,
                "with_payload": ["document_id"],
            }
            if offset is not None:
                body["offset"] = offset
            resp = c.post(
                f"{qdrant_url}/collections/{collection}/points/scroll",
                json=body,
            )
            resp.raise_for_status()
            data = resp.json()["result"]
            for pt in data["points"]:
                did = pt.get("payload", {}).get("document_id")
                if did:
                    doc_ids.add(did)
            offset = data.get("next_page_offset")
            if offset is None:
                break
    return doc_ids


def delete_by_document_ids(
    qdrant_url: str, collection: str, doc_ids: set,
) -> None:
    """Delete chunks matching specific document_ids."""
    for did in doc_ids:
        with httpx.Client(timeout=30.0) as c:
            c.post(
                f"{qdrant_url}/collections/{collection}/points/delete",
                json={
                    "filter": {
                        "must": [{
                            "key": "document_id",
                            "match": {"value": did},
                        }]
                    }
                },
            ).raise_for_status()


def check_section_rate(
    qdrant_url: str, collection: str, object_name: str,
) -> tuple:
    """Check section rate for a document's chunks. Returns (total, with_section)."""
    total = 0
    with_section = 0
    offset = None
    with httpx.Client(timeout=60.0) as c:
        while True:
            body = {
                "filter": {
                    "must": [{
                        "key": "object_name",
                        "match": {"value": object_name},
                    }]
                },
                "limit": 100,
                "with_payload": ["section"],
            }
            if offset is not None:
                body["offset"] = offset
            resp = c.post(
                f"{qdrant_url}/collections/{collection}/points/scroll",
                json=body,
            )
            resp.raise_for_status()
            data = resp.json()["result"]
            for pt in data["points"]:
                total += 1
                sec = pt.get("payload", {}).get("section", "")
                if sec and sec.strip():
                    with_section += 1
            offset = data.get("next_page_offset")
            if offset is None:
                break
    return total, with_section


# -------------------------------------------------------------------
# Upload
# -------------------------------------------------------------------
def download_from_minio(rag_url: str, object_name: str) -> bytes:
    """Download file from MinIO via RAG service presigned URL."""
    with httpx.Client(timeout=60.0, verify=False) as c:
        resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
        resp.raise_for_status()
        presigned_url = resp.json()["url"]

    with httpx.Client(timeout=300.0, verify=False) as c:
        resp = c.get(presigned_url)
        resp.raise_for_status()
        return resp.content


def upload_document(
    rag_url: str,
    file_bytes: bytes,
    filename: str,
    collection: str,
    extra_metadata: dict,
) -> dict:
    """Upload document to RAG service."""
    ct = content_type_from_filename(filename)
    form_data = {
        "collection": collection,
        "data_type": "compliance",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        "chunk_strategy": CHUNK_STRATEGY,
        "chunk_size": str(CHUNK_SIZE),
        "chunk_overlap": str(CHUNK_OVERLAP),
        "metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
    }
    with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
        resp = c.post(
            f"{rag_url}/api/v1/documents/upload",
            files={"file": (filename, file_bytes, ct)},
            data=form_data,
        )
        resp.raise_for_status()
        return resp.json()


# -------------------------------------------------------------------
# Main processing
# -------------------------------------------------------------------
def process_document(
    doc: dict,
    rag_url: str,
    qdrant_url: str,
    dry_run: bool = False,
) -> dict:
    """Safe re-ingest: upload first, then delete old. Returns result dict."""
    obj = doc["object_name"]
    coll = doc["collection"]
    fname = doc["filename"]

    # 1. Check existing state
    old_count = count_chunks(qdrant_url, coll, obj)
    old_doc_ids = get_old_document_ids(qdrant_url, coll, obj) if old_count > 0 else set()
    logger.info(" [%s] existing: %d chunks, %d document_ids",
                fname, old_count, len(old_doc_ids))

    if dry_run:
        logger.info(" [%s] DRY RUN — would download + upload + delete old", fname)
        return {"status": "dry_run", "old_chunks": old_count}

    # 2. Download from MinIO
    logger.info(" [%s] downloading from MinIO...", fname)
    file_bytes = download_from_minio(rag_url, obj)
    size_mb = len(file_bytes) / (1024 * 1024)
    logger.info(" [%s] downloaded %.1f MB", fname, size_mb)

    # 3. Upload FIRST (creates new chunks)
    logger.info(" [%s] uploading to RAG service...", fname)
    result = upload_document(rag_url, file_bytes, fname, coll, doc["extra_metadata"])
    new_chunks = result.get("chunks_count", 0)
    new_doc_id = result.get("document_id", "")
    logger.info(" [%s] uploaded: %d new chunks (doc_id=%s)", fname, new_chunks, new_doc_id)

    # 4. Verify new chunks exist
    if new_chunks == 0:
        logger.error(" [%s] UPLOAD PRODUCED 0 CHUNKS — keeping old data!", fname)
        return {"status": "error", "error": "0 new chunks", "old_chunks": old_count}

    # 5. Delete old chunks (only if there were any)
    # Defensive: never delete the ID that the upload just returned, in case
    # the RAG service reuses a document_id for the same file.
    old_doc_ids.discard(new_doc_id)
    if old_doc_ids:
        logger.info(" [%s] deleting %d old document_ids...", fname, len(old_doc_ids))
        delete_by_document_ids(qdrant_url, coll, old_doc_ids)
        logger.info(" [%s] old chunks deleted", fname)

    # 6. Check section rate
    total, with_sec = check_section_rate(qdrant_url, coll, obj)
    pct = (with_sec / total * 100) if total > 0 else 0
    logger.info(" [%s] section rate: %d/%d (%.0f%%)", fname, with_sec, total, pct)

    return {
        "status": "ok",
        "old_chunks": old_count,
        "new_chunks": new_chunks,
        "new_document_id": new_doc_id,
        "section_rate": round(pct, 1),
    }


||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Safe NIST/BSI/ENISA re-ingestion")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
|
||||
parser.add_argument("--only-missing", action="store_true",
|
||||
help="Only re-ingest the 4 missing docs (skip low-quality)")
|
||||
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
|
||||
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
|
||||
args = parser.parse_args()
|
||||
|
||||
docs = list(MISSING_DOCS)
|
||||
if not args.only_missing:
|
||||
docs.extend(LOW_QUALITY_DOCS)
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("NIST/BSI/ENISA Safe Re-Ingestion")
|
||||
logger.info(" Documents: %d (%d missing + %d low-quality)",
|
||||
len(docs), len(MISSING_DOCS),
|
||||
0 if args.only_missing else len(LOW_QUALITY_DOCS))
|
||||
logger.info(" RAG: %s", args.rag_url)
|
||||
logger.info(" Qdrant: %s", args.qdrant_url)
|
||||
logger.info(" Dry run: %s", args.dry_run)
|
||||
logger.info("=" * 60)
|
||||
|
||||
results = {}
|
||||
ok = 0
|
||||
errors = 0
|
||||
|
||||
for i, doc in enumerate(docs, 1):
|
||||
logger.info("[%d/%d] %s → %s", i, len(docs), doc["filename"], doc["collection"])
|
||||
try:
|
||||
r = process_document(doc, args.rag_url, args.qdrant_url, args.dry_run)
|
||||
results[doc["filename"]] = r
|
||||
if r["status"] == "ok":
|
||||
ok += 1
|
||||
elif r["status"] == "error":
|
||||
errors += 1
|
||||
except Exception as e:
|
||||
logger.error(" FAILED: %s", e)
|
||||
results[doc["filename"]] = {"status": "error", "error": str(e)}
|
||||
errors += 1
|
||||
|
||||
if i < len(docs):
|
||||
time.sleep(2)
|
||||
|
||||
# Summary
|
||||
logger.info("")
|
||||
logger.info("=" * 60)
|
||||
logger.info("RESULTS")
|
||||
logger.info("=" * 60)
|
||||
for fname, r in results.items():
|
||||
status = r["status"].upper()
|
||||
old = r.get("old_chunks", "?")
|
||||
new = r.get("new_chunks", "?")
|
||||
sec = r.get("section_rate", "?")
|
||||
logger.info(" %-40s %s old=%s new=%s sect=%.0f%%",
|
||||
fname, status, old, new, sec if isinstance(sec, float) else 0)
|
||||
|
||||
logger.info("")
|
||||
logger.info("OK: %d, Errors: %d, Total: %d", ok, errors, len(docs))
|
||||
|
||||
if errors > 0:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
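The re-ingestion scripts in this diff share one ordering rule: upload the replacement first, then delete the old `document_id`s, and skip the delete if the upload produced no chunks. A minimal sketch of that rule follows. `safe_swap` and its callable parameters are hypothetical names chosen for illustration; the real scripts call the RAG service and Qdrant directly.

```python
from typing import Callable, Iterable, Set


def safe_swap(
    old_doc_ids: Iterable[str],
    upload: Callable[[], dict],
    delete: Callable[[Set[str]], None],
) -> dict:
    """Upload the replacement first; delete old document_ids only if it produced chunks."""
    result = upload()
    new_chunks = result.get("chunks_count", 0)
    new_doc_id = result.get("document_id", "")
    if new_chunks == 0:
        # Nothing usable was indexed, so the old chunks stay in place.
        return {"status": "error", "deleted": set(), "new_chunks": 0}
    # Never delete the document that was just uploaded.
    to_delete = set(old_doc_ids) - {new_doc_id}
    if to_delete:
        delete(to_delete)
    return {"status": "ok", "deleted": to_delete, "new_chunks": new_chunks}


# Stand-ins for the upload and delete calls (assumptions, not the real services).
deleted_log = []
outcome = safe_swap(
    old_doc_ids={"doc-a", "doc-b"},
    upload=lambda: {"chunks_count": 12, "document_id": "doc-b"},
    delete=deleted_log.append,
)
print(outcome["status"], sorted(outcome["deleted"]))
```

Passing the upload and delete steps in as callables keeps the ordering logic testable without a running RAG service or Qdrant instance.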
@@ -0,0 +1,213 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Replace EU regulation PDFs with clean HTML from EUR-Lex.
|
||||
|
||||
Downloads HTML versions of EU regulations (using CELEX numbers),
|
||||
deletes old PDF chunks from Qdrant, uploads HTML via RAG service.
|
||||
|
||||
Usage:
|
||||
python3 scripts/replace_eu_pdfs_with_html.py --dry-run
|
||||
python3 scripts/replace_eu_pdfs_with_html.py
|
||||
python3 scripts/replace_eu_pdfs_with_html.py --celex 32016R0679 # single doc
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger("eurlex-replace")
|
||||
|
||||
DEFAULT_RAG_URL = "https://macmini:8097"
|
||||
DEFAULT_QDRANT_URL = "http://macmini:6333"
|
||||
|
||||
EURLEX_HTML_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
|
||||
|
||||
# EU regulations with CELEX numbers and their current collection + metadata
|
||||
EU_REGULATIONS = [
|
||||
{"celex": "32024R1689", "reg_id": "ai_act_2024", "name": "AI Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32024R2847", "reg_id": "cra_2024", "name": "Cyber Resilience Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022L2555", "reg_id": "nis2_2022", "name": "NIS2-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32016R0679", "reg_id": "dsgvo_2016", "name": "DSGVO", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32024R1624", "reg_id": "amlr_2024", "name": "Anti-Geldwaesche-VO", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32017R0745", "reg_id": "eu_mdr_2017", "name": "Medical Device Regulation", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R2065", "reg_id": "dsa_2022", "name": "Digital Services Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R1925", "reg_id": "dma_2022", "name": "Digital Markets Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R2554", "reg_id": "dora_2022", "name": "DORA", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32022R0868", "reg_id": "dga_2022", "name": "Data Governance Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R2854", "reg_id": "dataact_2023", "name": "Data Act", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R0988", "reg_id": "gpsr_2023", "name": "General Product Safety Regulation", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R1230", "reg_id": "machinery_2023", "name": "Maschinenverordnung", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023R1803", "reg_id": "ifrs_2023", "name": "IFRS Regulation", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32023D1795", "reg_id": "dpf_2023", "name": "Data Privacy Framework", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32019L2161", "reg_id": "omnibus_2019", "name": "Omnibus-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32019L0790", "reg_id": "dsm_2019", "name": "DSM-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32019L0770", "reg_id": "digital_content_2019", "name": "Digital Content Directive", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32002L0058", "reg_id": "eprivacy_2002", "name": "ePrivacy-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
{"celex": "32000L0031", "reg_id": "ecommerce_2000", "name": "E-Commerce-Richtlinie", "coll": "bp_compliance_ce"},
|
||||
]
|
||||
|
||||
|
||||
def download_eurlex_html(celex: str) -> bytes:
|
||||
"""Download HTML from EUR-Lex for a given CELEX number."""
|
||||
url = EURLEX_HTML_URL.format(celex=celex)
|
||||
with httpx.Client(timeout=60.0, follow_redirects=True) as c:
|
||||
r = c.get(url)
|
||||
r.raise_for_status()
|
||||
return r.content
|
||||
|
||||
|
||||
def delete_old_chunks(qdrant_url: str, collection: str, reg_id: str):
    """Delete all chunks whose regulation_id payload equals reg_id (exact match)."""
    with httpx.Client(timeout=30.0) as c:
        r = c.post(f"{qdrant_url}/collections/{collection}/points/delete", json={
            "filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]}
        })
        if r.status_code != 200:
            # Surface the failure; otherwise old PDF chunks silently remain next to the new HTML chunks.
            logger.warning("  Delete failed for %s: HTTP %d", reg_id, r.status_code)
|
||||
|
||||
|
||||
def find_old_chunks_by_filename(qdrant_url: str, collection: str, filename_pattern: str) -> int:
    """Count existing chunks whose regulation_id equals filename_pattern (exact match, not a pattern)."""
|
||||
with httpx.Client(timeout=30.0) as c:
|
||||
r = c.post(f"{qdrant_url}/collections/{collection}/points/count", json={
|
||||
"exact": True,
|
||||
"filter": {"must": [{"key": "regulation_id", "match": {"value": filename_pattern}}]}
|
||||
})
|
||||
if r.status_code == 200:
|
||||
return r.json()["result"]["count"]
|
||||
return 0
|
||||
|
||||
|
||||
def upload_html(rag_url: str, html_bytes: bytes, reg: dict) -> dict:
|
||||
"""Upload HTML to RAG service."""
|
||||
filename = f"{reg['reg_id']}.html"
|
||||
metadata = json.dumps({
|
||||
"regulation_id": reg["reg_id"],
|
||||
"regulation_name_de": reg["name"],
|
||||
"celex": reg["celex"],
|
||||
"source": "EUR-Lex",
|
||||
"license": "EU_law",
|
||||
"source_type": "law",
|
||||
"category": "eu_regulation",
|
||||
}, ensure_ascii=False)
|
||||
|
||||
with httpx.Client(timeout=3600.0, verify=False) as c:
|
||||
r = c.post(f"{rag_url}/api/v1/documents/upload",
|
||||
files={"file": (filename, html_bytes, "text/html")},
|
||||
data={
|
||||
"collection": reg["coll"],
|
||||
"data_type": "compliance",
|
||||
"bundesland": "eu",
|
||||
"use_case": "regulation",
|
||||
"year": reg["celex"][1:5],
|
||||
"chunk_strategy": "recursive",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": metadata,
|
||||
},
|
||||
)
|
||||
r.raise_for_status()
|
||||
return r.json()
|
||||
|
||||
|
||||
def check_section_rate(qdrant_url: str, collection: str, reg_id: str) -> tuple:
    """Check section rate for a regulation. Returns (total, with_section)."""
    total = 0
    with_section = 0
    offset = None
    with httpx.Client(timeout=30.0) as c:
        # Follow next_page_offset so regulations with more than 100 chunks are counted in full.
        while True:
            body = {
                "limit": 100, "with_payload": True, "with_vector": False,
                "filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]},
            }
            if offset is not None:
                body["offset"] = offset
            r = c.post(f"{qdrant_url}/collections/{collection}/points/scroll", json=body)
            if r.status_code != 200:
                break
            result = r.json()["result"]
            pts = result["points"]
            total += len(pts)
            with_section += sum(1 for p in pts if (p.get("payload") or {}).get("section"))
            offset = result.get("next_page_offset")
            if offset is None:
                break
    return total, with_section
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Replace EU PDFs with EUR-Lex HTML")
|
||||
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
|
||||
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--celex", default=None, help="Process only this CELEX number")
|
||||
args = parser.parse_args()
|
||||
|
||||
regs = EU_REGULATIONS
|
||||
if args.celex:
|
||||
regs = [r for r in regs if r["celex"] == args.celex]
|
||||
if not regs:
|
||||
print(f"CELEX {args.celex} not found in list")
|
||||
return
|
||||
|
||||
results = []
|
||||
|
||||
for reg in regs:
|
||||
logger.info("[%s] %s (%s)", reg["celex"], reg["name"], reg["reg_id"])
|
||||
|
||||
# Download HTML
|
||||
try:
|
||||
html_bytes = download_eurlex_html(reg["celex"])
|
||||
logger.info(" Downloaded: %d bytes", len(html_bytes))
|
||||
except Exception as e:
|
||||
logger.error(" Download FAILED: %s", e)
|
||||
results.append({"reg": reg, "status": "download_failed", "error": str(e)})
|
||||
continue
|
||||
|
||||
if args.dry_run:
|
||||
results.append({"reg": reg, "status": "dry_run", "html_size": len(html_bytes)})
|
||||
continue
|
||||
|
||||
# Delete old chunks
|
||||
old_count = find_old_chunks_by_filename(args.qdrant_url, reg["coll"], reg["reg_id"])
|
||||
delete_old_chunks(args.qdrant_url, reg["coll"], reg["reg_id"])
|
||||
logger.info(" Deleted %d old chunks", old_count)
|
||||
|
||||
# Upload HTML
|
||||
try:
|
||||
result = upload_html(args.rag_url, html_bytes, reg)
|
||||
new_chunks = result.get("chunks_count", 0)
|
||||
logger.info(" Uploaded: %d new chunks", new_chunks)
|
||||
except Exception as e:
|
||||
logger.error(" Upload FAILED: %s", e)
|
||||
results.append({"reg": reg, "status": "upload_failed", "error": str(e)})
|
||||
time.sleep(2)
|
||||
continue
|
||||
|
||||
# Check quality
|
||||
time.sleep(2)
|
||||
total, with_sec = check_section_rate(args.qdrant_url, reg["coll"], reg["reg_id"])
|
||||
pct = with_sec * 100 // max(total, 1)
|
||||
logger.info(" Section rate: %d/%d = %d%%", with_sec, total, pct)
|
||||
|
||||
results.append({
|
||||
"reg": reg, "status": "ok",
|
||||
"old_chunks": old_count, "new_chunks": new_chunks,
|
||||
"section_rate": pct,
|
||||
})
|
||||
time.sleep(2)
|
||||
|
||||
# Report
|
||||
print("\n" + "=" * 90)
|
||||
print("EUR-LEX REPLACEMENT REPORT")
|
||||
print("=" * 90)
|
||||
print(f"{'CELEX':<15} {'Name':<30} {'Status':<10} {'Old':>5} {'New':>5} {'Sect%':>6}")
|
||||
print("-" * 90)
|
||||
for r in results:
|
||||
reg = r["reg"]
|
||||
status = r["status"]
|
||||
old = r.get("old_chunks", "")
|
||||
new = r.get("new_chunks", r.get("html_size", ""))
|
||||
sect = f"{r.get('section_rate', '')}%" if "section_rate" in r else ""
|
||||
print(f"{reg['celex']:<15} {reg['name'][:30]:<30} {status:<10} {str(old):>5} {str(new):>5} {sect:>6}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
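`upload_html` derives the `year` form field with `reg["celex"][1:5]`, which relies on the CELEX layout used by every entry in `EU_REGULATIONS`: one sector digit, a four-digit year, a type letter, then the document number. The sketch below makes that assumption explicit and rejects anything that does not fit. `parse_celex`, `Celex` and `CELEX_RE` are hypothetical helpers, not part of the script, and the pattern deliberately covers only plain identifiers of this shape (no corrigendum suffixes, no letter sectors).

```python
import re
from typing import NamedTuple

# Assumed layout: sector digit, four-digit year, 1-2 letter type code, document number.
CELEX_RE = re.compile(r"^(\d)(\d{4})([A-Z]{1,2})(\d+)$")


class Celex(NamedTuple):
    sector: str
    year: str
    doc_type: str
    number: str


def parse_celex(celex: str) -> Celex:
    """Split a CELEX identifier into its parts, or raise ValueError if it does not match."""
    m = CELEX_RE.match(celex)
    if m is None:
        raise ValueError(f"unexpected CELEX format: {celex!r}")
    return Celex(*m.groups())


gdpr = parse_celex("32016R0679")
print(gdpr.year, gdpr.doc_type)
```

For the identifiers in the list, `parse_celex(celex).year` yields the same string as the slice, while a malformed `--celex` argument fails early instead of producing a wrong year.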
@@ -0,0 +1,437 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Re-upload NIST/BSI/ENISA docs with chunk_strategy='legal' for section metadata.
|
||||
|
||||
The docs were already uploaded with 'recursive' strategy (no section detection).
|
||||
This script re-uploads with 'legal' strategy, then deletes old recursive chunks.
|
||||
|
||||
Usage (on Mac Mini):
|
||||
python3 control-pipeline/scripts/reupload_legal_strategy.py
|
||||
python3 control-pipeline/scripts/reupload_legal_strategy.py --dry-run
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import io
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import unicodedata
|
||||
|
||||
import httpx
|
||||
import pdfplumber
|
||||
|
||||
RAG_URL = "https://localhost:8097"
|
||||
QDRANT_URL = "http://localhost:6333"
|
||||
UPLOAD_TIMEOUT = 1800.0
|
||||
|
||||
# ---- Documents to process ----
|
||||
|
||||
DOCS = [
|
||||
# 4 NIST docs already extracted at /tmp/nist_*.txt
|
||||
{
|
||||
"regulation_id": "nist_sp800_53r5",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "NIST_SP_800_53r5.txt",
|
||||
"local_txt": "/tmp/nist_nist_sp800_53r5.txt",
|
||||
"minio_pdf": None, # already extracted
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_53r5",
|
||||
"source_id": "nist",
|
||||
"doc_type": "controls_catalog",
|
||||
"guideline_name": "NIST SP 800-53 Rev. 5",
|
||||
"license": "public_domain_us_gov",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp_800_82r3",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "nist_sp_800_82r3.txt",
|
||||
"local_txt": "/tmp/nist_nist_sp_800_82r3.txt",
|
||||
"minio_pdf": None,
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_82r3",
|
||||
"regulation_short": "NIST SP 800-82",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp_800_160v1r1",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "nist_sp_800_160v1r1.txt",
|
||||
"local_txt": "/tmp/nist_160.txt",
|
||||
"minio_pdf": None,
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_160v1r1",
|
||||
"regulation_short": "NIST SP 800-160",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp800_207",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "NIST_SP_800_207.txt",
|
||||
"local_txt": None, # needs extraction
|
||||
"minio_pdf": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp800_207",
|
||||
"source_id": "nist",
|
||||
"doc_type": "architecture",
|
||||
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
|
||||
"license": "public_domain_us_gov",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
# Additional low-quality docs (need extraction from MinIO)
|
||||
{
|
||||
"regulation_id": "nist_csf_2_0",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "nist_csf_2_0.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_csf_2_0",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nistir_8259a",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "nistir_8259a.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nistir_8259a.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nistir_8259a",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_ai_rmf",
|
||||
"collection": "bp_compliance_datenschutz",
|
||||
"upload_filename": "nist_ai_rmf.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_ai_rmf",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "nist_sp_800_30r1",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "nist_sp_800_30r1.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "nist_sp_800_30r1",
|
||||
"license": "public_domain_us",
|
||||
"source": "nist.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "cisa_secure_by_design",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "cisa_secure_by_design.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "cisa_secure_by_design",
|
||||
"license": "public_domain_us",
|
||||
"source": "cisa.gov",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "cvss_v4_0",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "cvss_v4_0.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "cvss_v4_0",
|
||||
"license": "public_domain_us",
|
||||
"source": "first.org",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_ics_scada_dependencies",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "enisa_ics_scada.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_ics_scada_dependencies",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_threat_landscape_supply_chain",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "enisa_supply_chain_security.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_threat_landscape_supply_chain",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
{
|
||||
"regulation_id": "enisa_supply_chain_good_practices",
|
||||
"collection": "bp_compliance_ce",
|
||||
"upload_filename": "enisa_supply_chain_good_practices.txt",
|
||||
"local_txt": None,
|
||||
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
|
||||
"extra_metadata": {
|
||||
"regulation_id": "enisa_supply_chain_good_practices",
|
||||
"license": "reuse_with_attribution",
|
||||
"source": "enisa.europa.eu",
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def normalize_pdf_text(text):
|
||||
text = unicodedata.normalize('NFKC', text)
|
||||
text = text.replace('\u00ad', '').replace('\u200b', '')
|
||||
prev = None
|
||||
while prev != text:
|
||||
prev = text
|
||||
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
|
||||
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
|
||||
text = re.sub(
|
||||
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
|
||||
)
|
||||
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
|
||||
text = re.sub(r'[^\S\n]{2,}', ' ', text)
|
||||
return text
|
||||
|
||||
|
||||
def get_text(doc):
|
||||
"""Get document text: from local file or extract from MinIO PDF."""
|
||||
if doc["local_txt"]:
|
||||
print(f" Reading local: {doc['local_txt']}")
|
||||
with open(doc["local_txt"], encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
print(f" Downloading from MinIO: {doc['minio_pdf']}")
|
||||
with httpx.Client(timeout=60, verify=False) as c:
|
||||
resp = c.get(f"{RAG_URL}/api/v1/documents/download/{doc['minio_pdf']}")
|
||||
resp.raise_for_status()
|
||||
url = resp.json()["url"]
|
||||
    with httpx.Client(timeout=300, verify=False) as c:
        pdf_resp = c.get(url)
        pdf_resp.raise_for_status()  # fail here rather than hand an error page to pdfplumber
        pdf_bytes = pdf_resp.content
|
||||
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
|
||||
|
||||
print(" Extracting with pdfplumber...")
|
||||
parts = []
|
||||
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
|
||||
for i, page in enumerate(pdf.pages):
|
||||
t = page.extract_text(x_tolerance=3, y_tolerance=4)
|
||||
if t:
|
||||
parts.append(t)
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f" {i + 1}/{len(pdf.pages)} pages...")
|
||||
text = "\n\n".join(parts)
|
||||
text = normalize_pdf_text(text)
|
||||
print(f" Extracted {len(text):,} chars")
|
||||
return text
|
||||
|
||||
|
||||
def get_old_doc_ids(collection, regulation_id):
|
||||
"""Get all document_ids for existing chunks."""
|
||||
doc_ids = set()
|
||||
offset = None
|
||||
with httpx.Client(timeout=60) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]},
|
||||
"limit": 100,
|
||||
"with_payload": ["document_id"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
did = pt.get("payload", {}).get("document_id")
|
||||
if did:
|
||||
doc_ids.add(did)
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return doc_ids
|
||||
|
||||
|
||||
def upload_text_legal(text, filename, collection, extra_metadata):
|
||||
"""Upload with chunk_strategy='legal'."""
|
||||
form_data = {
|
||||
"collection": collection,
|
||||
"data_type": "compliance",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "legal",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
|
||||
}
|
||||
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
|
||||
resp = c.post(
|
||||
f"{RAG_URL}/api/v1/documents/upload",
|
||||
files={"file": (filename, text.encode("utf-8"), "text/plain")},
|
||||
data=form_data,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def delete_by_doc_ids(collection, doc_ids):
|
||||
"""Delete chunks matching specific document_ids."""
|
||||
with httpx.Client(timeout=30) as c:
|
||||
for did in doc_ids:
|
||||
c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/delete",
|
||||
json={"filter": {"must": [
|
||||
{"key": "document_id", "match": {"value": did}}
|
||||
]}},
|
||||
).raise_for_status()
|
||||
|
||||
|
||||
def count_chunks(collection, regulation_id):
|
||||
with httpx.Client(timeout=30) as c:
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/count",
|
||||
json={"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]}, "exact": True},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["result"]["count"]
|
||||
|
||||
|
||||
def check_section_rate(collection, regulation_id):
|
||||
total = 0
|
||||
with_sec = 0
|
||||
offset = None
|
||||
with httpx.Client(timeout=60) as c:
|
||||
while True:
|
||||
body = {
|
||||
"filter": {"must": [
|
||||
{"key": "regulation_id", "match": {"value": regulation_id}}
|
||||
]},
|
||||
"limit": 100,
|
||||
"with_payload": ["section"],
|
||||
}
|
||||
if offset is not None:
|
||||
body["offset"] = offset
|
||||
resp = c.post(
|
||||
f"{QDRANT_URL}/collections/{collection}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
for pt in data["points"]:
|
||||
total += 1
|
||||
s = pt.get("payload", {}).get("section", "")
|
||||
if s and s.strip():
|
||||
with_sec += 1
|
||||
offset = data.get("next_page_offset")
|
||||
if offset is None:
|
||||
break
|
||||
return total, with_sec
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
print("=" * 60)
|
||||
print("Re-upload with chunk_strategy='legal'")
|
||||
print(f"Documents: {len(DOCS)}, Dry run: {args.dry_run}")
|
||||
print("=" * 60)
|
||||
|
||||
results = []
|
||||
for i, doc in enumerate(DOCS, 1):
|
||||
reg_id = doc["regulation_id"]
|
||||
coll = doc["collection"]
|
||||
print(f"\n[{i}/{len(DOCS)}] {doc['upload_filename']} → {coll}")
|
||||
|
||||
# 1. Check existing
|
||||
old_count = count_chunks(coll, reg_id)
|
||||
old_doc_ids = get_old_doc_ids(coll, reg_id) if old_count > 0 else set()
|
||||
print(f" Old: {old_count} chunks, {len(old_doc_ids)} doc_ids")
|
||||
|
||||
if args.dry_run:
|
||||
print(" DRY RUN — skipping")
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": "?", "sect": "?"})
|
||||
continue
|
||||
|
||||
# 2. Get text
|
||||
try:
|
||||
text = get_text(doc)
|
||||
except Exception as e:
|
||||
print(f" ERROR extracting text: {e}")
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": 0, "sect": 0})
|
||||
continue
|
||||
|
||||
        # 3. Upload with legal strategy
        print("  Uploading with strategy='legal'...")
        try:
            result = upload_text_legal(
                text, doc["upload_filename"], coll, doc["extra_metadata"])
        except Exception as e:
            # Keep the old chunks and continue with the next document.
            print(f"  ERROR uploading: {e}")
            results.append({"file": doc["upload_filename"], "old": old_count,
                            "new": 0, "sect": 0})
            continue
        new_chunks = result.get("chunks_count", 0)
        new_doc_id = result.get("document_id", "")
        print(f"  New: {new_chunks} chunks (doc_id={new_doc_id})")
|
||||
|
||||
if new_chunks == 0:
|
||||
print(" ERROR: 0 chunks — keeping old!")
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": 0, "sect": 0})
|
||||
continue
|
||||
|
||||
# 4. Delete old chunks (safe: new ones already exist)
|
||||
if old_doc_ids:
|
||||
# Exclude the new document_id just in case
|
||||
old_doc_ids.discard(new_doc_id)
|
||||
if old_doc_ids:
|
||||
print(f" Deleting {len(old_doc_ids)} old doc_ids...")
|
||||
delete_by_doc_ids(coll, old_doc_ids)
|
||||
|
||||
# 5. Check section rate
|
||||
total, with_sec = check_section_rate(coll, reg_id)
|
||||
pct = (with_sec / total * 100) if total > 0 else 0
|
||||
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
|
||||
|
||||
results.append({"file": doc["upload_filename"], "old": old_count,
|
||||
"new": new_chunks, "sect": round(pct, 1)})
|
||||
|
||||
if i < len(DOCS):
|
||||
time.sleep(2)
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("RESULTS")
|
||||
print("=" * 60)
|
||||
for r in results:
|
||||
print(f" {r['file']:<45} old={r['old']:<6} new={r['new']:<6} sect={r['sect']}%")
|
||||
|
||||
total_new = sum(r["new"] for r in results if isinstance(r["new"], int))
|
||||
print(f"\nTotal new chunks: {total_new}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
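`get_old_doc_ids` and `check_section_rate` in the script above both page through Qdrant's scroll endpoint by feeding `next_page_offset` back in as `offset` until it comes back as `None`. The loop can be separated from the HTTP call, as in this sketch. `iter_points`, `section_rate` and the `PAGES` fixture are hypothetical names, and the fake pages only imitate the `points` / `next_page_offset` shape of a scroll result.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_points(fetch_page: Callable[[Optional[object]], dict]) -> Iterator[dict]:
    """Yield every point by following next_page_offset until it is None."""
    offset = None
    while True:
        page = fetch_page(offset)
        yield from page["points"]
        offset = page.get("next_page_offset")
        if offset is None:
            break


def section_rate(points: Iterable[dict]) -> Tuple[int, int]:
    """Return (total, with_section), counting only non-blank section values."""
    total = 0
    with_sec = 0
    for pt in points:
        total += 1
        section = (pt.get("payload") or {}).get("section") or ""
        if section.strip():
            with_sec += 1
    return total, with_sec


# Two fake pages standing in for consecutive scroll responses (illustrative data only).
PAGES = {
    None: {
        "points": [{"payload": {"section": "AC-1"}}, {"payload": {"section": ""}}],
        "next_page_offset": 2,
    },
    2: {"points": [{"payload": {"section": "AC-2"}}], "next_page_offset": None},
}

total, with_sec = section_rate(iter_points(lambda offset: PAGES[offset]))
print(f"{with_sec}/{total}")
```

In the real script, `fetch_page` would wrap the `POST .../points/scroll` request and return `resp.json()["result"]`.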
@@ -0,0 +1,268 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
D4 Integration Test: Upload BGB excerpt → verify Qdrant payloads.
|
||||
|
||||
Usage:
|
||||
# Dry-run (local chunking only, no services needed)
|
||||
python3 scripts/test_d4_integration.py --dry-run
|
||||
|
||||
# Against Mac Mini
|
||||
python3 scripts/test_d4_integration.py \
|
||||
--rag-url https://macmini:8097 \
|
||||
--qdrant-url http://macmini:6333
|
||||
|
||||
# Against production
|
||||
python3 scripts/test_d4_integration.py \
|
||||
--rag-url https://rag-prod:8097 \
|
||||
--qdrant-url http://qdrant-prod:6333
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
|
||||
import httpx
|
||||
|
||||
FIXTURE_PATH = os.path.join(
|
||||
os.path.dirname(__file__), "..", "..", "embedding-service",
|
||||
"tests", "fixtures", "bgb_312_excerpt.txt",
|
||||
)
|
||||
COLLECTION = "bp_compliance_gesetze"
|
||||
REG_CODE = "BGB_D4_TEST"
|
||||
|
||||
# Expected sections in the BGB excerpt
|
||||
EXPECTED_SECTIONS = {"§ 312", "§ 312a", "§ 312g", "§ 312k"}
|
||||
|
||||
|
||||
def load_fixture() -> str:
|
||||
with open(FIXTURE_PATH, encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
|
||||
def upload_document(rag_url: str, text: str) -> dict:
|
||||
"""Upload BGB excerpt to RAG service."""
|
||||
metadata = json.dumps({
|
||||
"regulation_code": REG_CODE,
|
||||
"regulation_name_de": "BGB (D4 Test)",
|
||||
"source_type": "law",
|
||||
})
|
||||
|
||||
with httpx.Client(timeout=60.0, verify=False) as client:
|
||||
resp = client.post(
|
||||
f"{rag_url}/api/v1/documents/upload",
|
||||
files={"file": ("bgb_312_test.txt", text.encode(), "text/plain")},
|
||||
data={
|
||||
"collection": COLLECTION,
|
||||
"data_type": "law",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
"chunk_strategy": "recursive",
|
||||
"chunk_size": "1500",
|
||||
"chunk_overlap": "100",
|
||||
"metadata_json": metadata,
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def scroll_chunks(qdrant_url: str, document_id: str) -> list[dict]:
|
||||
"""Scroll Qdrant for chunks matching this document_id."""
|
||||
all_points = []
|
||||
offset = None
|
||||
|
||||
with httpx.Client(timeout=30.0) as client:
|
||||
while True:
|
||||
body: dict = {
|
||||
"limit": 100,
|
||||
"with_payload": True,
|
||||
"with_vector": False,
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "document_id",
|
||||
"match": {"value": document_id},
|
||||
}]
|
||||
},
|
||||
}
|
||||
if offset:
|
||||
body["offset"] = offset
|
||||
|
||||
resp = client.post(
|
||||
f"{qdrant_url}/collections/{COLLECTION}/points/scroll",
|
||||
json=body,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()["result"]
|
||||
all_points.extend(data["points"])
|
||||
offset = data.get("next_page_offset")
|
||||
if not offset:
|
||||
break
|
||||
|
||||
return all_points
|
||||
|
||||
|
||||
def delete_test_data(qdrant_url: str, document_id: str):
|
||||
"""Clean up test chunks from Qdrant."""
|
||||
with httpx.Client(timeout=30.0) as client:
|
||||
resp = client.post(
|
||||
f"{qdrant_url}/collections/{COLLECTION}/points/delete",
|
||||
json={
|
||||
"filter": {
|
||||
"must": [{
|
||||
"key": "document_id",
|
||||
"match": {"value": document_id},
|
||||
}]
|
||||
}
|
||||
},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
|
||||
|
||||
def verify_chunks(points: list[dict]) -> dict:
|
||||
"""Analyze chunks and return a verification report."""
|
||||
report = {
|
||||
"total_chunks": len(points),
|
||||
"sections_found": set(),
|
||||
"chunks_with_section": 0,
|
||||
"chunks_with_paragraph": 0,
|
||||
"chunks_with_page": 0,
|
||||
"section_details": [],
|
||||
"issues": [],
|
||||
}
|
||||
|
||||
for pt in points:
|
||||
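        # Treat missing and null payload fields alike so slicing and sorting below cannot fail.
        payload = pt.get("payload") or {}
        section = payload.get("section") or ""
        section_title = payload.get("section_title") or ""
        paragraph = payload.get("paragraph") or ""
        paragraph_num = payload.get("paragraph_num")
        page = payload.get("page")
        chunk_idx = payload.get("chunk_index")
        if chunk_idx is None:
            chunk_idx = -1  # sortable placeholder; a "?" string would break the sort in print_report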
|
||||
|
||||
if section:
|
||||
report["sections_found"].add(section)
|
||||
report["chunks_with_section"] += 1
|
||||
if paragraph:
|
||||
report["chunks_with_paragraph"] += 1
|
||||
if page is not None:
|
||||
report["chunks_with_page"] += 1
|
||||
|
||||
report["section_details"].append({
|
||||
"chunk_index": chunk_idx,
|
||||
"section": section,
|
||||
"section_title": section_title[:40],
|
||||
"paragraph": paragraph,
|
||||
"paragraph_num": paragraph_num,
|
||||
"page": page,
|
||||
"text_preview": payload.get("chunk_text", "")[:60],
|
||||
})
|
||||
|
||||
# Checks
|
||||
missing = EXPECTED_SECTIONS - report["sections_found"]
|
||||
if missing:
|
||||
report["issues"].append(f"Missing sections: {missing}")
|
||||
|
||||
if "§ 312k" not in report["sections_found"]:
|
||||
report["issues"].append("CRITICAL: § 312k not found!")
|
||||
|
||||
section_ratio = report["chunks_with_section"] / max(report["total_chunks"], 1)
|
||||
if section_ratio < 0.9:
|
||||
report["issues"].append(
|
||||
f"Only {section_ratio:.0%} chunks have section metadata (expected >= 90%)"
|
||||
)
|
||||
|
||||
return report
|
||||
|
||||
|
||||
def print_report(report: dict):
|
||||
"""Print verification report."""
|
||||
print("\n" + "=" * 60)
|
||||
print("D4 VALIDATION REPORT")
|
||||
print("=" * 60)
|
||||
print(f"Total chunks: {report['total_chunks']}")
|
||||
print(f"With section: {report['chunks_with_section']}")
|
||||
print(f"With paragraph: {report['chunks_with_paragraph']}")
|
||||
print(f"With page: {report['chunks_with_page']}")
|
||||
print(f"Sections found: {sorted(report['sections_found'])}")
|
||||
|
||||
print("\nChunk details:")
|
||||
for d in sorted(report["section_details"], key=lambda x: x["chunk_index"]):
|
||||
print(
|
||||
f" [{d['chunk_index']:2}] "
|
||||
f"section={d['section']!r:12s} "
|
||||
f"title={d['section_title']!r:30s} "
|
||||
f"para={d['paragraph']!r:8s}"
|
||||
)
|
||||
|
||||
if report["issues"]:
|
||||
print(f"\nISSUES ({len(report['issues'])}):")
|
||||
for issue in report["issues"]:
|
||||
print(f" - {issue}")
|
||||
print("\nRESULT: FAIL")
|
||||
else:
|
||||
print("\nRESULT: PASS — all sections detected, metadata quality OK")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="D4 Integration Test")
|
||||
parser.add_argument("--rag-url", default="https://macmini:8097")
|
||||
parser.add_argument("--qdrant-url", default="http://macmini:6333")
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Only test local chunking, no upload")
|
||||
parser.add_argument("--keep", action="store_true",
|
||||
help="Don't delete test data after verification")
|
||||
args = parser.parse_args()
|
||||
|
||||
text = load_fixture()
|
||||
print(f"Loaded BGB excerpt: {len(text)} chars")
|
||||
|
||||
if args.dry_run:
|
||||
# Import chunking directly
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "embedding-service"))
|
||||
from main import chunk_text_legal_structured
|
||||
chunks = chunk_text_legal_structured(text, 1500, 100)
|
||||
# Build fake points for verification
|
||||
points = [{"payload": {
|
||||
"chunk_index": c["index"],
|
||||
"chunk_text": c["text"],
|
||||
"section": c["section"],
|
||||
"section_title": c["section_title"],
|
||||
"paragraph": c["paragraph"],
|
||||
"paragraph_num": c["paragraph_num"],
|
||||
"page": c["page"],
|
||||
}} for c in chunks]
|
||||
report = verify_chunks(points)
|
||||
print_report(report)
|
||||
sys.exit(1 if report["issues"] else 0)
|
||||
|
||||
# Full integration test
|
||||
print(f"Uploading to {args.rag_url} → collection={COLLECTION}...")
|
||||
result = upload_document(args.rag_url, text)
|
||||
doc_id = result["document_id"]
|
||||
print(f" document_id: {doc_id}")
|
||||
print(f" chunks_count: {result['chunks_count']}")
|
||||
print(f" vectors_indexed: {result['vectors_indexed']}")
|
||||
|
||||
print("Waiting 2s for indexing...")
|
||||
time.sleep(2)
|
||||
|
||||
print(f"Scrolling Qdrant at {args.qdrant_url}...")
|
||||
points = scroll_chunks(args.qdrant_url, doc_id)
|
||||
print(f" Found {len(points)} points")
|
||||
|
||||
report = verify_chunks(points)
|
||||
print_report(report)
|
||||
|
||||
if not args.keep:
|
||||
print(f"\nCleaning up test data (document_id={doc_id})...")
|
||||
delete_test_data(args.qdrant_url, doc_id)
|
||||
print(" Deleted.")
|
||||
|
||||
sys.exit(1 if report["issues"] else 0)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
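The metadata coverage check near the end of the verification step can be looked at in isolation. The following is a minimal sketch, assuming each Qdrant point is a dict with a `payload` mapping as in the dry-run branch above; the helper names `section_coverage` and `coverage_issues` are hypothetical and do not appear in the script.

```python
def section_coverage(points: list[dict]) -> float:
    """Return the share of points whose payload carries a non-empty section."""
    total = len(points)
    with_section = sum(1 for p in points if p.get("payload", {}).get("section"))
    # max(total, 1) mirrors the guard in the script and avoids division by zero
    return with_section / max(total, 1)


def coverage_issues(points: list[dict], threshold: float = 0.9) -> list[str]:
    """Return a one-element issue list when coverage is below the threshold."""
    ratio = section_coverage(points)
    if ratio < threshold:
        return [
            f"Only {ratio:.0%} chunks have section metadata "
            f"(expected >= {threshold:.0%})"
        ]
    return []
```

An empty result set yields a ratio of 0.0 and is therefore reported as an issue, which is probably the desired outcome for an integration test that expects chunks to exist.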
@@ -17,9 +17,6 @@ import httpx
|
||||
|
||||
from .control_generator import (
|
||||
GeneratedControl,
|
||||
REGULATION_LICENSE_MAP,
|
||||
_RULE2_PREFIXES,
|
||||
_RULE3_PREFIXES,
|
||||
_classify_regulation,
|
||||
)
|
||||
|
||||
|
||||
@@ -346,13 +346,40 @@ class BatchDedupRunner:
|
||||
|
||||
self._progress_total = total
|
||||
self._progress_count = 0
|
||||
logger.info("BatchDedup Cross-group: %d masters to check", total)
|
||||
cross_linked = 0
|
||||
cross_review = 0
|
||||
|
||||
# Paginated processing — 100 rows per DB query
|
||||
# Checkpoint: resume from last processed control_id
|
||||
DB_PAGE = 100
|
||||
last_control_id = ""
|
||||
# Checkpoint: resume from last processed control_id (survives container restart)
|
||||
checkpoint_row = self.db.execute(text("""
|
||||
SELECT config FROM canonical_generation_jobs
|
||||
WHERE status = 'dedup_phase2_checkpoint'
|
||||
LIMIT 1
|
||||
""")).fetchone()
|
||||
last_control_id = checkpoint_row[0] if checkpoint_row else ""
|
||||
|
||||
if last_control_id:
|
||||
skip_row = self.db.execute(text("""
|
||||
SELECT COUNT(*) FROM canonical_controls
|
||||
WHERE decomposition_method = 'pass0b'
|
||||
AND release_state != 'duplicate'
|
||||
AND release_state != 'deprecated'
|
||||
AND control_id <= :last_id
|
||||
"""), {"last_id": last_control_id}).fetchone()
|
||||
skipped = skip_row[0] if skip_row else 0
|
||||
self._progress_count = skipped
|
||||
logger.info("BatchDedup Cross-group: RESUMING from %s (skipping %d already processed)",
|
||||
last_control_id, skipped)
|
||||
else:
|
||||
self.db.execute(text("""
|
||||
INSERT INTO canonical_generation_jobs (id, status, config)
|
||||
VALUES (gen_random_uuid(), 'dedup_phase2_checkpoint', '')
|
||||
"""))
|
||||
self.db.commit()
|
||||
|
||||
logger.info("BatchDedup Cross-group: %d masters to check (starting from %s)",
|
||||
total, last_control_id or "beginning")
|
||||
|
||||
while True:
|
||||
rows = self.db.execute(text("""
|
||||
@@ -461,11 +488,34 @@ class BatchDedupRunner:
|
||||
|
||||
self._progress_count += 1
|
||||
|
||||
# Log progress every page
|
||||
# Save checkpoint + log progress every page
|
||||
try:
|
||||
self.db.execute(text("""
|
||||
UPDATE canonical_generation_jobs
|
||||
SET config = :cid
|
||||
WHERE status = 'dedup_phase2_checkpoint'
|
||||
"""), {"cid": last_control_id})
|
||||
self.db.commit()
|
||||
except Exception:
|
||||
try:
|
||||
self.db.rollback()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
processed = self._progress_count
|
||||
if processed % 500 < DB_PAGE:
|
||||
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review",
|
||||
processed, len(rows), cross_linked, cross_review)
|
||||
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review (checkpoint: %s)",
|
||||
processed, total, cross_linked, cross_review, last_control_id)
|
||||
|
||||
# Clear checkpoint on completion
|
||||
try:
|
||||
self.db.execute(text("""
|
||||
DELETE FROM canonical_generation_jobs
|
||||
WHERE status = 'dedup_phase2_checkpoint'
|
||||
"""))
|
||||
self.db.commit()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
self.stats["cross_group_linked"] = cross_linked
|
||||
self.stats["cross_group_review"] = cross_review
|
||||
|
||||
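The checkpoint logic in this hunk (read the last processed id, page forward from it, persist the id after each page, clear it on completion) can be sketched independently of the pipeline. This is an illustration only: it uses an in-memory SQLite database and a dedicated `checkpoint` table instead of PostgreSQL and `canonical_generation_jobs`, and the names `process_with_checkpoint`, `make_db` and `PAGE_SIZE` are hypothetical.

```python
import sqlite3
from typing import Optional

PAGE_SIZE = 2  # deliberately small so the example spans several pages


def process_with_checkpoint(conn: sqlite3.Connection,
                            stop_after_pages: Optional[int] = None) -> list[str]:
    """Process controls in id order, saving the last processed id after each page."""
    row = conn.execute("SELECT last_id FROM checkpoint").fetchone()
    if row is None:
        # No checkpoint row yet: create one and start from the beginning
        conn.execute("INSERT INTO checkpoint (last_id) VALUES ('')")
        conn.commit()
        last_id = ""
    else:
        last_id = row[0]

    processed: list[str] = []
    pages = 0
    while True:
        # Keyset pagination: continue strictly after the last processed id
        rows = conn.execute(
            "SELECT control_id FROM controls WHERE control_id > ? "
            "ORDER BY control_id LIMIT ?",
            (last_id, PAGE_SIZE),
        ).fetchall()
        if not rows:
            break
        for (control_id,) in rows:
            processed.append(control_id)
            last_id = control_id
        conn.execute("UPDATE checkpoint SET last_id = ?", (last_id,))
        conn.commit()
        pages += 1
        if stop_after_pages is not None and pages >= stop_after_pages:
            return processed  # simulated interruption: the checkpoint row stays

    conn.execute("DELETE FROM checkpoint")  # finished: clear the checkpoint
    conn.commit()
    return processed


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with five sample controls."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE controls (control_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE checkpoint (last_id TEXT NOT NULL)")
    conn.executemany("INSERT INTO controls VALUES (?)",
                     [(f"C-{i:03d}",) for i in range(1, 6)])
    conn.commit()
    return conn
```

The sketch tests for a missing row rather than an empty id, so a run interrupted before its first page does not insert a second checkpoint row. Whether the code in the diff needs the same distinction depends on how `canonical_generation_jobs` is used elsewhere.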
@@ -25,8 +25,7 @@ import re
|
||||
import uuid
|
||||
from collections import defaultdict
|
||||
from dataclasses import dataclass, field, asdict
|
||||
from datetime import datetime, timezone
|
||||
from typing import Dict, List, Optional, Set
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
import httpx
|
||||
from pydantic import BaseModel
|
||||
@@ -34,7 +33,8 @@ from sqlalchemy import text
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from .rag_client import ComplianceRAGClient, RAGSearchResult, get_rag_client
|
||||
from .similarity_detector import check_similarity, SimilarityReport
|
||||
from .regulation_registry import get_registry as _get_regulation_registry
|
||||
from .similarity_detector import check_similarity
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -246,28 +246,21 @@ def _classify_regulation(regulation_code: str) -> dict:
|
||||
|
||||
Returns dict with keys: license, rule, name, source_type.
|
||||
source_type is one of: law, guideline, standard, restricted.
|
||||
|
||||
Delegates to DB-backed RegulationRegistry (with 5min cache).
|
||||
Falls back to REGULATION_LICENSE_MAP if DB is unavailable.
|
||||
"""
|
||||
code = regulation_code.lower().strip()
|
||||
registry = _get_regulation_registry()
|
||||
result = registry.classify_regulation(regulation_code)
|
||||
|
||||
# Exact match first
|
||||
if code in REGULATION_LICENSE_MAP:
|
||||
return REGULATION_LICENSE_MAP[code]
|
||||
# If registry returned the unknown fallback AND we have a local match,
|
||||
# prefer the local dict (graceful degradation during migration)
|
||||
if result.get("license") == "UNKNOWN":
|
||||
code = regulation_code.lower().strip()
|
||||
if code in REGULATION_LICENSE_MAP:
|
||||
return REGULATION_LICENSE_MAP[code]
|
||||
|
||||
# Prefix match for Rule 2 (ENISA = standard)
|
||||
for prefix in _RULE2_PREFIXES:
|
||||
if code.startswith(prefix):
|
||||
return {"license": "CC-BY-4.0", "rule": 2, "source_type": "standard",
|
||||
"name": "ENISA", "attribution": "ENISA, CC BY 4.0"}
|
||||
|
||||
# Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
|
||||
for prefix in _RULE3_PREFIXES:
|
||||
if code.startswith(prefix):
|
||||
return {"license": f"{prefix.rstrip('_').upper()}_RESTRICTED", "rule": 3,
|
||||
"source_type": "restricted", "name": "INTERNAL_ONLY"}
|
||||
|
||||
# Unknown → treat as restricted (safe default)
|
||||
logger.warning("Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code)
|
||||
return {"license": "UNKNOWN", "rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}
|
||||
return result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
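The delegation described in the updated docstring (registry first, local dict only when the registry reports an unknown code) can be sketched as follows. `LOCAL_MAP`, its single sample entry and the `lookup_registry` callable are hypothetical stand-ins for `REGULATION_LICENSE_MAP` and the registry instance, chosen so the example runs without a database.

```python
from typing import Callable

UNKNOWN = {"license": "UNKNOWN", "rule": 3,
           "source_type": "restricted", "name": "INTERNAL_ONLY"}

# Hypothetical stand-in for the hardcoded map kept during the migration
LOCAL_MAP = {
    "example_law": {"license": "EXAMPLE_PUBLIC", "rule": 1,
                    "source_type": "law", "name": "Example Law"},
}


def classify(code: str, lookup_registry: Callable[[str], dict]) -> dict:
    """Ask the registry first; use the local map only on an UNKNOWN result."""
    result = lookup_registry(code)
    if result.get("license") == "UNKNOWN":
        key = code.lower().strip()
        if key in LOCAL_MAP:
            return LOCAL_MAP[key]
    return result
```

With this ordering a registry entry always wins over the local map, so the local map can be deleted once every code it contains has been migrated.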
@@ -1019,11 +1012,12 @@ class ControlGeneratorPipeline:
|
||||
regulation_name=reg_name,
|
||||
regulation_short=reg_short,
|
||||
category=payload.get("category", "") or payload.get("data_type", ""),
|
||||
article=payload.get("article", "") or payload.get("section_title", "") or payload.get("section", ""),
|
||||
article=payload.get("section", "") or payload.get("article", "") or payload.get("section_title", ""),
|
||||
paragraph=payload.get("paragraph", ""),
|
||||
source_url=payload.get("source_url", "") or payload.get("source", "") or payload.get("url", ""),
|
||||
score=0.0,
|
||||
collection=collection,
|
||||
page=payload.get("page"),
|
||||
)
|
||||
all_results.append(chunk)
|
||||
collection_new += 1
|
||||
@@ -1127,6 +1121,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
|
||||
"source": canonical_source,
|
||||
"article": effective_article,
|
||||
"paragraph": effective_paragraph,
|
||||
"page": chunk.page,
|
||||
"license": license_info.get("license", ""),
|
||||
"source_type": license_info.get("source_type", "law"),
|
||||
"url": chunk.source_url or "",
|
||||
@@ -1141,6 +1136,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
|
||||
"source_regulation": chunk.regulation_code,
|
||||
"source_article": effective_article,
|
||||
"source_paragraph": effective_paragraph,
|
||||
"source_page": chunk.page,
|
||||
}
|
||||
return control
|
||||
|
||||
@@ -1194,6 +1190,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
|
||||
"source": canonical_source,
|
||||
"article": effective_article,
|
||||
"paragraph": effective_paragraph,
|
||||
"page": chunk.page,
|
||||
"license": license_info.get("license", ""),
|
||||
"license_notice": attribution,
|
||||
"source_type": license_info.get("source_type", "standard"),
|
||||
@@ -1209,6 +1206,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
|
||||
"source_regulation": chunk.regulation_code,
|
||||
"source_article": effective_article,
|
||||
"source_paragraph": effective_paragraph,
|
||||
"source_page": chunk.page,
|
||||
}
|
||||
return control
|
||||
|
||||
@@ -1368,6 +1366,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
|
||||
"source": canonical_source,
|
||||
"article": effective_article,
|
||||
"paragraph": effective_paragraph,
|
||||
"page": chunk.page,
|
||||
"license": lic.get("license", ""),
|
||||
"license_notice": lic.get("attribution", ""),
|
||||
"source_type": lic.get("source_type", "law"),
|
||||
@@ -1384,6 +1383,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
|
||||
"source_regulation": chunk.regulation_code,
|
||||
"source_article": effective_article,
|
||||
"source_paragraph": effective_paragraph,
|
||||
"source_page": chunk.page,
|
||||
"batch_size": len(chunks),
|
||||
"document_grouped": same_doc,
|
||||
}
|
||||
@@ -1479,14 +1479,14 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Aspekte ohne
|
||||
) -> list[Optional[GeneratedControl]]:
|
||||
"""Process a batch of (chunk, license_info) through stages 3-5."""
|
||||
# Split by license rule: Rule 1+2 → structure, Rule 3 → reform
|
||||
structure_items = [(c, l) for c, l in batch_items if l["rule"] in (1, 2)]
|
||||
reform_items = [(c, l) for c, l in batch_items if l["rule"] == 3]
|
||||
structure_items = [(c, lic) for c, lic in batch_items if lic["rule"] in (1, 2)]
|
||||
reform_items = [(c, lic) for c, lic in batch_items if lic["rule"] == 3]
|
||||
|
||||
all_controls: dict[int, Optional[GeneratedControl]] = {}
|
||||
|
||||
if structure_items:
|
||||
s_chunks = [c for c, _ in structure_items]
|
||||
s_lics = [l for _, l in structure_items]
|
||||
s_lics = [lic for _, lic in structure_items]
|
||||
try:
|
||||
s_controls = await self._structure_batch(s_chunks, s_lics)
|
||||
except Exception as e:
|
||||
|
||||
@@ -24,7 +24,6 @@ import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import uuid
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Optional
|
||||
|
||||
@@ -56,7 +55,7 @@ ANTHROPIC_API_URL = "https://api.anthropic.com/v1"
|
||||
# Patterns are defined in normative_patterns.py and imported here
|
||||
# with local aliases for backward compatibility.
|
||||
|
||||
from .normative_patterns import (
|
||||
from .normative_patterns import ( # noqa: E402
|
||||
PFLICHT_RE as _PFLICHT_RE,
|
||||
EMPFEHLUNG_RE as _EMPFEHLUNG_RE,
|
||||
KANN_RE as _KANN_RE,
|
||||
@@ -3472,7 +3471,7 @@ class DecompositionPass:
|
||||
"category": atomic.category,
|
||||
"parent_uuid": parent_uuid,
|
||||
"gen_meta": json.dumps({
|
||||
"decomposition_source": candidate_id,
|
||||
"decomposition_source_id": candidate_id,
|
||||
"decomposition_method": "pass0b",
|
||||
"engine_version": "v2",
|
||||
"action_object_class": getattr(atomic, "domain", ""),
|
||||
@@ -4104,6 +4103,8 @@ def _format_citation(citation) -> str:
|
||||
parts.append(c["article"])
|
||||
if c.get("paragraph"):
|
||||
parts.append(c["paragraph"])
|
||||
if c.get("page") is not None:
|
||||
parts.append(f"S. {c['page']}")
|
||||
return " ".join(parts) if parts else citation
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
return citation
|
||||
|
||||
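The page handling added to the citation formatter relies on an `is not None` test rather than truthiness. A minimal sketch of that behavior, assuming the citation arrives as a JSON string with optional `source`, `article`, `paragraph` and `page` keys; the function below is a simplified illustration, not the project's `_format_citation`.

```python
import json


def format_citation(citation: str) -> str:
    """Render a JSON citation as one line; pass through anything else unchanged."""
    try:
        c = json.loads(citation)
    except (json.JSONDecodeError, TypeError):
        return citation
    if not isinstance(c, dict):
        return citation

    parts = []
    for key in ("source", "article", "paragraph"):
        if c.get(key):
            parts.append(str(c[key]))
    # `is not None` keeps a legitimate page 0, which a truthiness test would drop
    if c.get("page") is not None:
        parts.append(f"S. {c['page']}")
    return " ".join(parts) if parts else citation
```

The `S.` prefix follows the German abbreviation for "Seite" (page) used in the diff.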
@@ -34,6 +34,7 @@ class RAGSearchResult:
|
||||
source_url: str
|
||||
score: float
|
||||
collection: str = ""
|
||||
page: Optional[int] = None
|
||||
|
||||
|
||||
class ComplianceRAGClient:
|
||||
|
||||
@@ -0,0 +1,220 @@
"""
DB-backed Regulation Registry with in-memory cache.

Replaces hardcoded REGULATION_LICENSE_MAP and SOURCE_REGULATION_CLASSIFICATION
with a single PostgreSQL table (compliance.regulation_registry).

Cache TTL: 5 minutes. Thread-safe via simple timestamp check.
Falls back to hardcoded dicts if DB is unavailable (graceful degradation).
"""

import logging
import time
from typing import Optional

from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError

from db.session import SessionLocal

logger = logging.getLogger(__name__)

_CACHE_TTL_SECONDS = 300  # 5 minutes

# Prefix-based fallback rules (unchanged from original logic)
_RULE2_PREFIXES = ("enisa_",)
_RULE3_PREFIXES = ("bsi_", "iso_", "etsi_")

# Fallback for unknown regulations
_UNKNOWN_REGULATION = {
    "license": "UNKNOWN",
    "rule": 3,
    "source_type": "restricted",
    "name": "INTERNAL_ONLY",
    "attribution": None,
}


class RegulationRegistry:
    """In-memory cache of the regulation_registry table.

    Provides two lookup modes:
    1. by_code(regulation_id) — replaces REGULATION_LICENSE_MAP[code]
    2. source_type_by_name(name) — replaces SOURCE_REGULATION_CLASSIFICATION[name]
    """

    def __init__(self):
        self._by_code: dict[str, dict] = {}
        self._by_name: dict[str, str] = {}
        self._loaded_at: float = 0.0

    def _is_stale(self) -> bool:
        return (time.monotonic() - self._loaded_at) > _CACHE_TTL_SECONDS

    def _load(self) -> bool:
        """Load all rows from regulation_registry into memory."""
        try:
            db = SessionLocal()
            try:
                rows = db.execute(
                    text("""
                        SELECT regulation_id, regulation_name_de, license_rule,
                               license_type, attribution, source_type, jurisdiction,
                               status
                        FROM regulation_registry
                        WHERE status != 'deprecated'
                    """)
                ).fetchall()
            finally:
                db.close()

            by_code: dict[str, dict] = {}
            by_name: dict[str, str] = {}

            for row in rows:
                entry = {
                    "license": row[3] or "",  # license_type
                    "rule": row[2],  # license_rule
                    "source_type": row[5] or "law",  # source_type
                    "name": row[1] or row[0],  # regulation_name_de or regulation_id
                    "attribution": row[4],  # attribution
                    "jurisdiction": row[6],  # jurisdiction
                }
                by_code[row[0].lower()] = entry

                # Also index by name for source_type lookups
                if row[1]:
                    by_name[row[1]] = row[5] or "law"

            self._by_code = by_code
            self._by_name = by_name
            self._loaded_at = time.monotonic()
            logger.info(
                "Regulation registry loaded: %d entries by code, %d by name",
                len(by_code), len(by_name),
            )
            return True

        except SQLAlchemyError:
            logger.warning(
                "Failed to load regulation_registry from DB — using stale cache",
                exc_info=True,
            )
            return False

    def _ensure_loaded(self) -> None:
        """Reload cache if stale."""
        if self._is_stale():
            self._load()

    def classify_regulation(self, regulation_code: str) -> dict:
        """Look up license info for a regulation_code.

        Returns dict with keys: license, rule, name, source_type, attribution.
        Equivalent to the old _classify_regulation() function.
        """
        self._ensure_loaded()
        code = regulation_code.lower().strip()

        # Exact match from DB
        if code in self._by_code:
            return self._by_code[code]

        # Prefix match for Rule 2 (ENISA = standard)
        for prefix in _RULE2_PREFIXES:
            if code.startswith(prefix):
                return {
                    "license": "CC-BY-4.0",
                    "rule": 2,
                    "source_type": "standard",
                    "name": "ENISA",
                    "attribution": "ENISA, CC BY 4.0",
                }

        # Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
        for prefix in _RULE3_PREFIXES:
            if code.startswith(prefix):
                return {
                    "license": f"{prefix.rstrip('_').upper()}_RESTRICTED",
                    "rule": 3,
                    "source_type": "restricted",
                    "name": "INTERNAL_ONLY",
                    "attribution": None,
                }

        # Unknown → restricted (safe default)
        logger.warning(
            "Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code
        )
        return dict(_UNKNOWN_REGULATION)

    def source_type_by_name(self, source_regulation: str) -> str:
        """Look up source_type by regulation display name.

        Equivalent to old classify_source_regulation().
        Falls back to heuristic for unknown names.
        """
        self._ensure_loaded()

        if not source_regulation:
            return "framework"

        # Exact match from DB
        if source_regulation in self._by_name:
            return self._by_name[source_regulation]

        # Heuristic fallback for unknown sources
        lower = source_regulation.lower()

        law_indicators = [
            "verordnung", "richtlinie", "gesetz", "directive", "regulation",
            "(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
        ]
        if any(ind in lower for ind in law_indicators):
            return "law"

        guideline_indicators = [
            "edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
        ]
        if any(ind in lower for ind in guideline_indicators):
            return "guideline"

        framework_indicators = [
            "enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
        ]
        if any(ind in lower for ind in framework_indicators):
            return "framework"

        return "framework"

    def get_all(self) -> dict[str, dict]:
        """Return all cached entries (by regulation_code)."""
        self._ensure_loaded()
        return dict(self._by_code)

    def is_open_source(self, regulation_code: str) -> bool:
        """Check if regulation is Rule 1 or 2 (safe to reference)."""
        info = self.classify_regulation(regulation_code)
        return info["rule"] in (1, 2)


# Module-level singleton
_registry: Optional[RegulationRegistry] = None


def get_registry() -> RegulationRegistry:
    """Get or create the singleton RegulationRegistry instance."""
    global _registry
    if _registry is None:
        _registry = RegulationRegistry()
    return _registry


def classify_regulation(regulation_code: str) -> dict:
    """Convenience: look up license info for a regulation_code."""
    return get_registry().classify_regulation(regulation_code)


def classify_source_regulation(source_regulation: str) -> str:
    """Convenience: look up source_type by regulation display name."""
    return get_registry().source_type_by_name(source_regulation)
|
||||
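The caching behavior of the registry (reload after a TTL, keep serving the previous snapshot when the load fails) can be sketched without a database. The class below is a generic illustration, not the registry itself: `TTLCache`, its `loader` callable and the injectable `clock` are hypothetical names. One deliberate difference is that it marks "never loaded" with `None` instead of `0.0`, because the reference point of `time.monotonic()` is undefined and a small clock value shortly after startup could make a `0.0` timestamp look fresh.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Reload a dict from `loader` when it is older than `ttl` seconds."""

    def __init__(self, loader: Callable[[], dict], ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._data: dict = {}
        # None means "never loaded"; see the note on time.monotonic() above
        self._loaded_at: Optional[float] = None

    def _is_stale(self) -> bool:
        if self._loaded_at is None:
            return True
        return (self._clock() - self._loaded_at) > self._ttl

    def get(self, key: str, default=None):
        if self._is_stale():
            try:
                self._data = self._loader()
                self._loaded_at = self._clock()
            except Exception:
                # Sketch only: keep serving the previous snapshot on any failure
                pass
        return self._data.get(key, default)
```

Passing the clock as a parameter keeps the TTL logic testable without sleeping. Like the registry, this sketch retries the loader on every call while the source is down, since a failed load does not update the timestamp.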
@@ -0,0 +1,318 @@
|
||||
# Adversarial test suite: 30 tricky cases that challenge the controls and the agent
|
||||
version: "1.0"
|
||||
purpose: "Test whether controls and agent decide correctly on borderline wording"
|
||||
|
||||
tests:
|
||||
|
||||
# A. Wrong legal basis (plausible-sounding): 8 cases
|
||||
|
||||
- id: ADV-LIT-001
|
||||
category: wrong_legal_basis
|
||||
input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
|
||||
context: "DSE-Abschnitt ueber Google Analytics"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
|
||||
reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-LIT-002
|
||||
category: wrong_legal_basis
|
||||
input: "Der Versand unseres Newsletters erfolgt auf Grundlage des Vertrages (Art. 6 Abs. 1 lit. b DSGVO)."
|
||||
context: "DSE-Abschnitt ueber Marketing"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
|
||||
reason: "Newsletter ist kein Vertragsbestandteil, erfordert separate Einwilligung"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-LIT-003
|
||||
category: wrong_legal_basis
|
||||
input: "Die Ueberwachung der Arbeitsleistung unserer Mitarbeiter erfolgt auf Grundlage unseres berechtigten Interesses."
|
||||
context: "Interne Datenschutzrichtlinie"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Betriebsvereinbarung + Art. 88 DSGVO i.V.m. § 26 BDSG"
|
||||
reason: "Mitarbeiterueberwachung erfordert Betriebsvereinbarung (BAG Keylogger-Urteil)"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-LIT-004
|
||||
category: wrong_legal_basis
|
||||
input: "Biometrische Zutrittskontrolle auf Basis von Art. 6 Abs. 1 lit. f DSGVO."
|
||||
context: "Sicherheitskonzept"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 9 Abs. 2 DSGVO (ausdrueckliche Einwilligung oder Arbeitsrecht)"
|
||||
reason: "Biometrische Daten = besondere Kategorie nach Art. 9, lit. f reicht nicht"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-LIT-005
|
||||
category: wrong_legal_basis
|
||||
input: "Wir erstellen automatisierte Kreditentscheidungen auf Grundlage berechtigter Interessen."
|
||||
context: "DSE einer Bank"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 22 DSGVO (ausdrueckliche Einwilligung oder gesetzliche Erlaubnis)"
|
||||
reason: "Automatisierte Einzelentscheidungen erfordern Art. 22 Schutz (EuGH SCHUFA C-634/21)"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-LIT-006
|
||||
category: wrong_legal_basis
|
||||
input: "Social Login ueber Google wird als Vertragsdurchfuehrung (lit. b) verarbeitet."
|
||||
context: "DSE mit Social Login"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
|
||||
reason: "Social Login ist keine Vertragspflicht, Nutzer kann sich auch ohne Google anmelden"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-LIT-007
|
||||
category: wrong_legal_basis
|
||||
input: "Personalisierte Werbung basiert auf unserem berechtigten Interesse an Direktmarketing."
|
||||
context: "DSE eines marktbeherrschenden Unternehmens"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
|
||||
reason: "Marktbeherrschende Unternehmen koennen sich nicht auf lit. f fuer Werbung berufen (EuGH Meta C-252/21)"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-LIT-008
|
||||
category: wrong_legal_basis
|
||||
input: "Die Einbindung von Facebook Pixel erfolgt zur Vertragserfuellung (Art. 6 Abs. 1 lit. b DSGVO)."
|
||||
context: "DSE eines Online-Shops"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: wrong_legal_basis
|
||||
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
|
||||
reason: "Facebook Pixel dient Tracking/Marketing, nicht der Vertragserfuellung"
|
||||
difficulty: easy
|
||||
|
||||
# B. Dark patterns (subtle): 6 cases
|
||||
|
||||
- id: ADV-DP-001
|
||||
category: dark_pattern
|
||||
input:
|
||||
accept_button: {text: "Alle akzeptieren", size: "16px", color: "#ffffff", background: "#0066cc", prominent: true}
|
||||
reject_button: {text: "Ablehnen", size: "10px", color: "#cccccc", background: "transparent", prominent: false}
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: dark_pattern_visual_bias
|
||||
reason: "Ablehnen-Button ist kleiner, weniger sichtbar (OLG Koeln 6 U 58/21)"
|
||||
difficulty: easy
|
||||
|
||||
- id: ADV-DP-002
|
||||
category: dark_pattern
|
||||
input:
|
||||
accept_button: {text: "Alle akzeptieren", clicks_to_complete: 1}
|
||||
reject_option: {text: "Einstellungen verwalten", clicks_to_complete: 3, label: "Einstellungen"}
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: dark_pattern_friction_asymmetry
|
||||
reason: "Ablehnen erfordert 3 Klicks, Akzeptieren nur 1 (CNIL Cookie-Banner)"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-DP-003
|
||||
category: dark_pattern
|
||||
input:
|
||||
type: "cookie_wall"
|
||||
description: "Inhalt erst nach Cookie-Zustimmung sichtbar"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: dark_pattern_cookie_wall
|
||||
reason: "Cookie-Wall = keine freiwillige Einwilligung (EDPB Guidelines 05/2020)"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-DP-004
|
||||
category: dark_pattern
|
||||
input:
|
||||
type: "prechecked_boxes"
|
||||
description: "Checkboxen fuer Marketing und Analytics sind vorausgefuellt"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: dark_pattern_prechecked
|
||||
reason: "Vorausgefuellte Checkboxen sind keine wirksame Einwilligung (BGH Planet49)"
|
||||
difficulty: easy
|
||||
|
||||
- id: ADV-DP-005
|
||||
category: dark_pattern
|
||||
input:
|
||||
type: "confirm_shaming"
|
||||
accept_text: "Ja, ich moechte sicher surfen"
|
||||
reject_text: "Nein, ich verzichte auf Sicherheit"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: dark_pattern_confirm_shaming
|
||||
reason: "Manipulative Formulierung beeinflusst Entscheidung"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-DP-006
|
||||
category: dark_pattern
|
||||
input:
|
||||
type: "hidden_reject"
|
||||
description: "Ablehnen-Link ist 3px gross, Farbe #f0f0f0 auf weissem Hintergrund"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: dark_pattern_hidden_option
|
||||
reason: "Ablehnen-Option praktisch unsichtbar (OLG Koeln)"
|
||||
difficulty: easy
|
||||
|
||||
# C. Nearly complete documents: 6 cases
|
||||
|
||||
- id: ADV-DOC-001
|
||||
category: incomplete_document
|
||||
input: "Impressum: Max Mustermann GmbH, Musterstr. 1, 10115 Berlin, info@example.com, HRB 12345"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: missing_field
|
||||
missing: "USt-ID"
|
||||
reason: "§ 5 Abs. 1 Nr. 6 DDG: USt-IdNr. oder Wirtschafts-ID Pflicht"
|
||||
difficulty: easy
|
||||
|
||||
- id: ADV-DOC-002
|
||||
category: incomplete_document
|
||||
input: "Datenschutzerklaerung mit Zwecken, Rechtsgrundlagen, Empfaengern, Betroffenenrechten — aber ohne Speicherdauer"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: missing_field
|
||||
missing: "Speicherdauer"
|
||||
reason: "Art. 13 Abs. 2 lit. a DSGVO: Dauer der Speicherung oder Kriterien"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-DOC-003
|
||||
category: incomplete_document
|
||||
input: "DSE ohne Kontaktdaten des Datenschutzbeauftragten"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: missing_field
|
||||
missing: "DSB-Kontakt"
|
||||
reason: "Art. 13 Abs. 1 lit. b DSGVO: Kontaktdaten des DSB"
|
||||
difficulty: easy
|
||||
|
||||
- id: ADV-DOC-004
|
||||
category: incomplete_document
|
||||
input: "Widerrufsbelehrung mit 14-Tage-Frist, Muster-Formular, aber Fristbeginn fehlt"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: missing_field
|
||||
missing: "Fristbeginn"
|
||||
reason: "Anlage 1 zu Art. 246a § 1 EGBGB: Fristbeginn muss angegeben werden"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-DOC-005
|
||||
category: incomplete_document
|
||||
input: "AGB eines Online-Shops ohne Angabe des Gerichtsstands"
|
||||
expected:
|
||||
finding: false
|
||||
reason: "Gerichtsstand in AGB ist bei B2C nicht erforderlich (sogar oft unzulaessig)"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-DOC-006
|
||||
category: incomplete_document
|
||||
input: "Cookie-Policy listet Google Analytics und Facebook Pixel auf, aber nicht das CMP-Cookie selbst"
|
||||
expected:
|
||||
finding: true
|
||||
finding_type: missing_field
|
||||
missing: "CMP-eigene Cookies"
|
||||
reason: "Auch technisch notwendige Cookies muessen in der Cookie-Policy stehen"
|
||||
difficulty: hard
|
||||
|
||||
# D. Semantically similar but different: 5 cases
|
||||
|
||||
- id: ADV-SEM-001
|
||||
category: similar_but_different
|
||||
control_a: "MFA fuer privilegierte Admin-Accounts aktivieren"
|
||||
control_b: "MFA fuer alle Endnutzer-Accounts aktivieren"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Scopes (Admin vs. Endnutzer) = verschiedene Controls"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-SEM-002
|
||||
category: similar_but_different
|
||||
control_a: "Daten nach Vertragsende loeschen"
|
||||
control_b: "Daten nach Ablauf der gesetzlichen Aufbewahrungsfrist loeschen"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Trigger (Vertragsende vs. Aufbewahrungsfrist)"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-SEM-003
|
||||
category: similar_but_different
|
||||
control_a: "Rate Limiting fuer oeffentliche API-Endpunkte"
|
||||
control_b: "Rate Limiting fuer Login-Endpunkte"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Asset-Scopes (API vs. Login)"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-SEM-004
|
||||
category: similar_but_different
|
||||
control_a: "Verschluesselung personenbezogener Daten at rest"
|
||||
control_b: "Verschluesselung personenbezogener Daten in transit"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Phasen (Speicherung vs. Uebertragung)"
|
||||
difficulty: easy
|
||||
|
||||
- id: ADV-SEM-005
|
||||
category: similar_but_different
|
||||
control_a: "Incident Response Plan erstellen"
|
||||
control_b: "Business Continuity Plan erstellen"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "IRP = Sicherheitsvorfaelle, BCP = Geschaeftskontinuitaet (verschiedene Ziele)"
|
||||
difficulty: medium
|
||||
|
||||
# E. Semantically different but similar-sounding: 5 cases
|
||||
|
||||
- id: ADV-HOM-001
|
||||
category: homonym_different
|
||||
control_a: "Einwilligung des Nutzers fuer Datenverarbeitung einholen (DSGVO)"
|
||||
control_b: "Einwilligung des Nutzers fuer Werbeanrufe einholen (UWG)"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Rechtsgrundlagen (DSGVO vs. UWG) und verschiedene Rechtsfolgen"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-HOM-002
|
||||
category: homonym_different
|
||||
control_a: "Risikobewertung fuer Datenschutz-Folgenabschaetzung (DSFA)"
|
||||
control_b: "Risikobewertung fuer finanzielle Risiken (MaRisk)"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Risikokategorien und verschiedene regulatorische Grundlagen"
|
||||
difficulty: hard
|
||||
|
||||
- id: ADV-HOM-003
|
||||
category: homonym_different
|
||||
control_a: "Audit der Datenschutz-Compliance (Art. 5 Abs. 2 DSGVO)"
|
||||
control_b: "Audit der Jahresabschlusspruefung (HGB)"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Audit-Typen mit verschiedenen Pruefungsstandards"
|
||||
difficulty: medium
|
||||
|
||||
- id: ADV-HOM-004
|
||||
category: homonym_different
|
||||
control_a: "Zertifizierung nach ISO 27001 (Informationssicherheit)"
|
||||
control_b: "Zertifizierung nach CE-Konformitaet (Produktsicherheit)"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Verschiedene Zertifizierungsrahmen, verschiedene Pruefer, verschiedene Ziele"
|
||||
difficulty: easy
|
||||
|
||||
- id: ADV-HOM-005
|
||||
category: homonym_different
|
||||
control_a: "Verarbeitung personenbezogener Daten dokumentieren (DSGVO VVT)"
|
||||
control_b: "Verarbeitung von Lebensmitteln dokumentieren (HACCP)"
|
||||
expected:
|
||||
is_duplicate: false
|
||||
reason: "Komplett verschiedene Domaenen trotz gleicher Woerter"
|
||||
difficulty: easy
|
||||
@@ -0,0 +1,36 @@
"""Shared test fixtures for the control pipeline test suite."""

import os
import sys
import pytest

# Ensure control-pipeline is in path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests; skipped if DATABASE_URL is not set."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set, skipping DB tests")
    from db.session import SessionLocal
    db = SessionLocal()
    yield db
    db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load 100 random draft controls for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' as assertion,
               generation_metadata->>'check_type' as check_type,
               generation_metadata->>'merge_group_hint' as merge_key
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
@@ -0,0 +1,190 @@
"""
Adversarial Test Suite: 30 tricky cases that challenge the control ontology
and dedup engine with edge cases.

Test categories:
A. Wrong legal basis (plausible but incorrect): 8 cases
B. Dark patterns (subtle UI manipulation): 6 cases
C. Almost-complete documents (missing 1 field): 6 cases
D. Semantically similar but different controls: 5 cases
E. Homonyms (different meaning, same words): 5 cases
"""

import os
import sys
import yaml
import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from services.control_ontology import classify_obligation, classify_action

ADVERSARIAL_PATH = os.path.join(os.path.dirname(__file__), "adversarial_cases.yaml")

# Read as UTF-8 explicitly so the fixture loads the same way on every platform.
with open(ADVERSARIAL_PATH, encoding="utf-8") as f:
    _ADV = yaml.safe_load(f)

TESTS = _ADV["tests"]


def _tests_by_category(cat: str) -> list:
    return [t for t in TESTS if t["category"] == cat]


# ============================================================================
# D. Semantically similar but different: must NOT be deduped
# ============================================================================

class TestSimilarButDifferent:
    """Controls that sound alike but are different; dedup must keep both."""

    @pytest.mark.parametrize("case", _tests_by_category("similar_but_different"),
                             ids=lambda c: c["id"])
    def test_not_duplicate(self, case):
        assert case["expected"]["is_duplicate"] is False, (
            f"{case['id']}: These controls MUST NOT be marked as duplicates"
        )

    def test_admin_vs_user_mfa(self):
        """ADV-SEM-001: Admin-MFA and User-MFA are different controls."""
        case = next(t for t in TESTS if t["id"] == "ADV-SEM-001")
        a = classify_obligation(case["control_a"], "")
        b = classify_obligation(case["control_b"], "")
        # Both should be atomic (not filtered out)
        assert a["routing"] == "atomic"
        assert b["routing"] == "atomic"

    def test_encryption_at_rest_vs_in_transit(self):
        """ADV-SEM-004: at rest vs in transit are different controls."""
        a_action = classify_action("Verschluesselung at rest implementieren")
        b_action = classify_action("Verschluesselung in transit implementieren")
        # Both should classify as "encrypt" or "implement"
        assert a_action in ("encrypt", "implement")
        assert b_action in ("encrypt", "implement")


# ============================================================================
# E. Homonyms: same words, different domains
# ============================================================================

class TestHomonymDifferent:
    """Controls using same words but from different domains; must NOT merge."""

    @pytest.mark.parametrize("case", _tests_by_category("homonym_different"),
                             ids=lambda c: c["id"])
    def test_not_duplicate(self, case):
        assert case["expected"]["is_duplicate"] is False, (
            f"{case['id']}: Homonyms must NOT be treated as duplicates"
        )

    def test_dsgvo_audit_vs_hgb_audit(self):
        """ADV-HOM-003: Data protection audit vs financial audit."""
        a = classify_obligation("Audit der Datenschutz-Compliance durchfuehren", "")
        b = classify_obligation("Audit der Jahresabschlusspruefung durchfuehren", "")
        assert a["routing"] == "atomic"
        assert b["routing"] == "atomic"
        # "durchfuehren" maps to "implement"; key point is both are atomic, not filtered


# ============================================================================
# A. Wrong legal basis: structural tests
# ============================================================================

class TestWrongLegalBasis:
    """Verify that wrong legal basis cases have correct expected metadata."""

    @pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
                             ids=lambda c: c["id"])
    def test_finding_expected(self, case):
        """All wrong_legal_basis cases must expect a finding."""
        assert case["expected"]["finding"] is True

    @pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
                             ids=lambda c: c["id"])
    def test_has_correct_basis(self, case):
        """All cases must specify what the correct basis should be."""
        assert "correct_basis" in case["expected"]
        assert len(case["expected"]["correct_basis"]) > 0

    def test_analytics_requires_consent(self):
        """ADV-LIT-001: Analytics on lit. f is always wrong."""
        case = next(t for t in TESTS if t["id"] == "ADV-LIT-001")
        assert "lit. a" in case["expected"]["correct_basis"]
        assert "Planet49" in case["expected"]["reason"]


# ============================================================================
# B. Dark Patterns: structural tests
# ============================================================================

class TestDarkPatterns:
    """Verify dark pattern test case structure."""

    @pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
                             ids=lambda c: c["id"])
    def test_finding_expected(self, case):
        """All dark pattern cases must expect a finding."""
        assert case["expected"]["finding"] is True

    @pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
                             ids=lambda c: c["id"])
    def test_has_finding_type(self, case):
        """All cases must specify the dark pattern type."""
        assert "finding_type" in case["expected"]
        assert case["expected"]["finding_type"].startswith("dark_pattern_")


# ============================================================================
# C. Incomplete documents: structural tests
# ============================================================================

class TestIncompleteDocuments:
    """Verify incomplete document test case structure."""

    @pytest.mark.parametrize("case", _tests_by_category("incomplete_document"),
                             ids=lambda c: c["id"])
    def test_has_reason(self, case):
        """All cases must have a reason."""
        assert "reason" in case["expected"]
        assert len(case["expected"]["reason"]) > 0

    def test_agb_gerichtsstand_no_finding(self):
        """ADV-DOC-005: Missing Gerichtsstand in B2C AGB is NOT a finding."""
        case = next(t for t in TESTS if t["id"] == "ADV-DOC-005")
        assert case["expected"]["finding"] is False


# ============================================================================
# Meta tests: validate test suite integrity
# ============================================================================

class TestSuiteIntegrity:
    """Verify the adversarial test suite itself is complete and consistent."""

    def test_total_count(self):
        assert len(TESTS) == 30

    def test_unique_ids(self):
        ids = [t["id"] for t in TESTS]
        assert len(ids) == len(set(ids)), "Duplicate test IDs found"

    def test_all_categories_present(self):
        categories = {t["category"] for t in TESTS}
        expected = {"wrong_legal_basis", "dark_pattern", "incomplete_document",
                    "similar_but_different", "homonym_different"}
        assert categories == expected

    def test_category_counts(self):
        counts = {}
        for t in TESTS:
            counts[t["category"]] = counts.get(t["category"], 0) + 1
        assert counts["wrong_legal_basis"] == 8
        assert counts["dark_pattern"] == 6
        assert counts["incomplete_document"] == 6
        assert counts["similar_but_different"] == 5
        assert counts["homonym_different"] == 5

    def test_all_have_difficulty(self):
        for t in TESTS:
            assert "difficulty" in t, f"{t['id']} missing difficulty"
            assert t["difficulty"] in ("easy", "medium", "hard")
@@ -0,0 +1,166 @@
"""Tests for D3: Structural metadata flow (section priority, page in citation)."""

import json
from typing import Optional

from services.rag_client import RAGSearchResult


def _make_chunk(
    article: str = "",
    paragraph: str = "",
    page: Optional[int] = None,
) -> RAGSearchResult:
    return RAGSearchResult(
        text="Test chunk text",
        regulation_code="DSGVO",
        regulation_name="Datenschutz-Grundverordnung",
        regulation_short="DSGVO",
        category="data_protection",
        article=article,
        paragraph=paragraph,
        source_url="https://example.com",
        score=0.95,
        collection="bp_compliance_de",
        page=page,
    )


class TestRAGSearchResultPage:
    """RAGSearchResult now carries a page field."""

    def test_page_default_none(self):
        chunk = _make_chunk()
        assert chunk.page is None

    def test_page_set(self):
        chunk = _make_chunk(page=42)
        assert chunk.page == 42

    def test_page_zero(self):
        chunk = _make_chunk(page=0)
        assert chunk.page == 0


class TestQdrantPayloadPriority:
    """section (D2) should take priority over article (legacy)."""

    def test_section_preferred_over_article(self):
        payload = {"section": "§ 312k", "article": "Art. 312", "section_title": "Kuendigungsbutton"}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == "§ 312k"

    def test_article_fallback_when_no_section(self):
        payload = {"section": "", "article": "Art. 35", "section_title": ""}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == "Art. 35"

    def test_section_title_last_resort(self):
        payload = {"section": "", "article": "", "section_title": "Informationspflichten"}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == "Informationspflichten"

    def test_all_empty(self):
        payload = {"section": "", "article": "", "section_title": ""}
        article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
        assert article == ""

    def test_page_from_payload(self):
        payload = {"page": 847}
        assert payload.get("page") == 847

    def test_page_none_from_payload(self):
        payload = {}
        assert payload.get("page") is None
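
The fallback chain repeated in the payload tests above could be pulled into one helper. This is only a minimal sketch: `resolve_article_label` is a hypothetical name that does not appear in the diff, and it assumes the payload values are strings or missing.

```python
from typing import Any, Mapping


def resolve_article_label(payload: Mapping[str, Any]) -> str:
    """Return the first non-empty value among section, article, section_title.

    Hypothetical helper (not from the repository): it mirrors the `or` chain
    in the tests, where `section` (D2) wins over the legacy `article` field.
    """
    for key in ("section", "article", "section_title"):
        value = payload.get(key) or ""
        if value:
            return value
    return ""
```

Keeping the priority order in a single tuple would make it harder for the tests and the production lookup to drift apart.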


class TestSourceCitationPage:
    """source_citation dict should include page when available."""

    def _build_citation(self, chunk: RAGSearchResult) -> dict:
        """Mirrors the citation-building logic from control_generator.py."""
        return {
            "source": chunk.regulation_name,
            "article": chunk.article,
            "paragraph": chunk.paragraph,
            "page": chunk.page,
            "license": "free_use",
            "source_type": "law",
            "url": chunk.source_url or "",
        }

    def test_citation_with_page(self):
        chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1", page=847)
        citation = self._build_citation(chunk)
        assert citation["page"] == 847

    def test_citation_without_page(self):
        chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1")
        citation = self._build_citation(chunk)
        assert citation["page"] is None

    def test_citation_serializable(self):
        chunk = _make_chunk(article="Art. 35", page=12)
        citation = self._build_citation(chunk)
        serialized = json.dumps(citation)
        restored = json.loads(serialized)
        assert restored["page"] == 12


class TestFormatCitation:
    """_format_citation should include page number."""

    def _format_citation(self, citation) -> str:
        """Mirrors _format_citation from decomposition_pass.py."""
        if not citation:
            return ""
        if isinstance(citation, str):
            try:
                c = json.loads(citation)
                if isinstance(c, dict):
                    parts = []
                    if c.get("source"):
                        parts.append(c["source"])
                    if c.get("article"):
                        parts.append(c["article"])
                    if c.get("paragraph"):
                        parts.append(c["paragraph"])
                    if c.get("page") is not None:
                        parts.append(f"S. {c['page']}")
                    return " ".join(parts) if parts else citation
            except (json.JSONDecodeError, TypeError):
                return citation
        return str(citation)

    def test_format_with_page(self):
        citation = json.dumps({
            "source": "DSGVO",
            "article": "Art. 35",
            "paragraph": "Abs. 1",
            "page": 42,
        })
        result = self._format_citation(citation)
        assert result == "DSGVO Art. 35 Abs. 1 S. 42"

    def test_format_without_page(self):
        citation = json.dumps({
            "source": "BGB",
            "article": "§ 312k",
            "paragraph": "",
        })
        result = self._format_citation(citation)
        assert result == "BGB § 312k"

    def test_format_page_zero(self):
        citation = json.dumps({
            "source": "BGB",
            "article": "§ 1",
            "paragraph": "",
            "page": 0,
        })
        result = self._format_citation(citation)
        assert result == "BGB § 1 S. 0"

    def test_format_empty_citation(self):
        assert self._format_citation("") == ""
        assert self._format_citation(None) == ""
@@ -0,0 +1,196 @@
"""
Regression Tests: verify pipeline updates don't break existing controls.

Requires: DATABASE_URL environment variable for DB tests.
Structural checks that need no DB always run.
"""

import os
import sys
import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


# ============================================================================
# Structural tests (no DB needed)
# ============================================================================

class TestOntologyStability:
    """Verify ontology constants haven't accidentally changed."""

    def test_action_types_count(self):
        from services.control_ontology import ACTION_TYPES
        assert len(ACTION_TYPES) >= 26, f"ACTION_TYPES shrank to {len(ACTION_TYPES)}"

    def test_phase_order_count(self):
        from services.control_ontology import PHASE_ORDER
        assert len(PHASE_ORDER) >= 15, f"PHASE_ORDER shrank to {len(PHASE_ORDER)}"

    def test_key_action_types_exist(self):
        from services.control_ontology import ACTION_TYPES
        required = ["define", "implement", "monitor", "test", "prevent", "exclude", "train"]
        for action in required:
            assert action in ACTION_TYPES, f"Missing action_type: {action}"

    def test_classify_action_deterministic(self):
        """Same input must always produce same output."""
        from services.control_ontology import classify_action
        for _ in range(10):
            assert classify_action("implementieren") == "implement"
            assert classify_action("überwachen") == "monitor"
            assert classify_action("verhindern") == "prevent"


class TestDependencyEngineStability:
    """Verify dependency engine core functions haven't changed behavior."""

    def test_evaluate_condition_empty(self):
        from services.dependency_engine import evaluate_condition
        assert evaluate_condition({}, {}) is True

    def test_evaluate_condition_simple(self):
        from services.dependency_engine import evaluate_condition
        cond = {"field": "source.status", "op": "==", "value": "pass"}
        assert evaluate_condition(cond, {"source": {"status": "pass"}}) is True
        assert evaluate_condition(cond, {"source": {"status": "fail"}}) is False

    def test_apply_effect_not_applicable(self):
        from services.dependency_engine import apply_effect
        assert apply_effect({"set_status": "not_applicable"}, "fail") == "not_applicable"

    def test_default_priorities_unchanged(self):
        from services.dependency_engine import DEFAULT_PRIORITIES
        assert DEFAULT_PRIORITIES["supersedes"] == 10
        assert DEFAULT_PRIORITIES["scope_exclusion"] == 20
        assert DEFAULT_PRIORITIES["prerequisite"] == 50
        assert DEFAULT_PRIORITIES["compensating_control"] == 80
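
The condition shape used in the tests above (`field` as a dotted path, `op`, `value`) suggests roughly the following evaluation. This is a hedged sketch, not the code in `services.dependency_engine`: `lookup`, `evaluate_condition_sketch`, and the operator table are hypothetical, and the real engine may support more operators or treat missing fields differently.

```python
import operator
from collections.abc import Mapping
from typing import Any

# Assumed operator set; only "==" is confirmed by the tests above.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
}


def lookup(context: Mapping, dotted_path: str) -> Any:
    """Walk a dotted path such as "source.status" through nested mappings."""
    current: Any = context
    for part in dotted_path.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return None  # assumption: a missing field resolves to None
        current = current[part]
    return current


def evaluate_condition_sketch(cond: Mapping, context: Mapping) -> bool:
    """Hypothetical stand-in for evaluate_condition; an empty condition passes."""
    if not cond:
        return True
    compare = _OPS[cond["op"]]
    return bool(compare(lookup(context, cond["field"]), cond["value"]))
```

Treating an empty condition as true matches `test_evaluate_condition_empty`; how the real function handles an unknown `op` is not visible in this diff.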


class TestDocumentComplianceStability:
    """Verify document compliance rules haven't changed."""

    def test_basic_website_requires_impressum(self):
        from services.document_scope_resolver import resolve_required_documents
        result = resolve_required_documents({"has_website": True})
        docs = result.get("required_documents", [])
        doc_types = [d["document_type"] if isinstance(d, dict) else d.document_type for d in docs]
        assert "impressum" in doc_types
        assert "privacy_policy" in doc_types


# ============================================================================
# DB tests (require DATABASE_URL)
# ============================================================================

@pytest.mark.skipif(
    not os.getenv("DATABASE_URL"),
    reason="DATABASE_URL not set"
)
class TestControlCountStability:
    """Draft count must stay within expected range."""

    def test_draft_count_minimum(self, db_session):
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count} (expected >140k)"

    def test_draft_count_maximum(self, db_session):
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count < 200000, f"Draft count too high: {count} (expected <200k)"

    def test_no_null_titles(self, db_session):
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (title IS NULL OR title = '')"
        )).scalar()
        assert null_count == 0, f"{null_count} controls without title"

    def test_assertion_coverage(self, db_session):
        from sqlalchemy import text
        no_assertion = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL "
            "     OR generation_metadata->>'assertion' = '')"
        )).scalar()
        total = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        coverage = (total - no_assertion) / max(total, 1) * 100
        assert coverage > 99, f"Assertion coverage only {coverage:.1f}% (expected >99%)"


@pytest.mark.skipif(
    not os.getenv("DATABASE_URL"),
    reason="DATABASE_URL not set"
)
class TestDependencyGraphStability:
    """Dependency graph must be valid and within expected size."""

    def test_dependency_count_minimum(self, db_session):
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert count > 10000, f"Too few dependencies: {count} (expected >10k)"

    def test_no_self_dependencies(self, db_session):
        from sqlalchemy import text
        self_deps = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies "
            "WHERE source_control_id = target_control_id AND is_active = true"
        )).scalar()
        assert self_deps == 0, f"{self_deps} self-referencing dependencies"

    def test_no_orphan_dependencies(self, db_session):
        from sqlalchemy import text
        orphans = db_session.execute(text("""
            SELECT COUNT(*) FROM compliance.control_dependencies d
            WHERE d.is_active = true
            AND NOT EXISTS (
                SELECT 1 FROM compliance.canonical_controls c
                WHERE c.id = d.source_control_id AND c.release_state = 'draft'
            )
        """)).scalar()
        # Some orphans OK (pointing to deprecated/duplicate controls)
        assert orphans < 1000, f"Too many orphan dependencies: {orphans}"


@pytest.mark.skipif(
    not os.getenv("DATABASE_URL"),
    reason="DATABASE_URL not set"
)
class TestQualityMetrics:
    """Quality metrics must stay within target ranges."""

    def test_duplicate_rate(self, db_session):
        from sqlalchemy import text
        total = db_session.execute(text(
            "SELECT COUNT(DISTINCT generation_metadata->>'merge_group_hint') "
            "FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND generation_metadata->>'merge_group_hint' IS NOT NULL"
        )).scalar()
        dups = db_session.execute(text("""
            SELECT COUNT(*) FROM (
                SELECT generation_metadata->>'merge_group_hint', COUNT(*)
                FROM compliance.canonical_controls
                WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
                AND generation_metadata->>'merge_group_hint' IS NOT NULL
                GROUP BY generation_metadata->>'merge_group_hint'
                HAVING COUNT(*) > 1
            ) sub
        """)).scalar()
        rate = dups / max(total, 1) * 100
        assert rate < 5, f"Duplicate merge_key rate {rate:.1f}% exceeds 5% threshold"
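
The duplicate-rate metric computed by the two SQL queries above can be restated in plain Python, which may help when checking the threshold on a small sample. This is a sketch under the assumption that the input is the list of `merge_group_hint` values of the draft controls; `duplicate_merge_key_rate` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Optional


def duplicate_merge_key_rate(merge_keys: Iterable[Optional[str]]) -> float:
    """Percentage of distinct merge keys that are shared by more than one control.

    Mirrors the SQL: NULL keys are ignored, the denominator is the number of
    distinct keys, and the numerator counts keys that occur at least twice.
    """
    counts = Counter(key for key in merge_keys if key is not None)
    distinct = len(counts)
    duplicated = sum(1 for n in counts.values() if n > 1)
    return duplicated / max(distinct, 1) * 100
```

Note that the rate is relative to distinct keys, not to the number of controls, so one key shared by many controls still counts once.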
@@ -0,0 +1,285 @@
"""Tests for RegulationRegistry: DB-backed lookup with cache and fallback."""

import time
from unittest.mock import patch, MagicMock

import pytest

from services.regulation_registry import (
    RegulationRegistry,
    _CACHE_TTL_SECONDS,
)


# ── Test data: simulates DB rows ──────────────────────────────────────────

_MOCK_DB_ROWS = [
    # (regulation_id, regulation_name_de, license_rule, license_type,
    #  attribution, source_type, jurisdiction, status)
    ("eu_2016_679", "DSGVO (EU) 2016/679", 1, "EU_LAW",
     None, "law", "EU", "active"),
    ("nist_sp_800_53", "NIST SP 800-53 Rev. 5", 1, "NIST_PUBLIC_DOMAIN",
     None, "standard", "US", "active"),
    ("owasp_asvs", "OWASP ASVS 4.0", 2, "CC-BY-SA-4.0",
     "OWASP Foundation, CC BY-SA 4.0", "standard", "INT", "active"),
    ("bdsg", "Bundesdatenschutzgesetz (BDSG)", 1, "DE_LAW",
     None, "law", "DE", "active"),
    ("at_dsg", "Österreichisches Datenschutzgesetz (DSG)", 1, "AT_LAW",
     None, "law", "AT", "active"),
]


def _mock_db_execute(query):
    """Mock that returns our test rows."""
    mock_result = MagicMock()
    mock_result.fetchall.return_value = _MOCK_DB_ROWS
    return mock_result


@pytest.fixture
def registry():
    """Create a registry with mocked DB."""
    reg = RegulationRegistry()
    with patch("services.regulation_registry.SessionLocal") as mock_session_cls:
        mock_session = MagicMock()
        mock_session.execute = _mock_db_execute
        mock_session_cls.return_value = mock_session
        reg._load()
    return reg


# ── classify_regulation tests ─────────────────────────────────────────────


class TestClassifyRegulation:
    def test_exact_match_eu_law(self, registry):
        result = registry.classify_regulation("eu_2016_679")
        assert result["rule"] == 1
        assert result["license"] == "EU_LAW"
        assert result["source_type"] == "law"
        assert result["name"] == "DSGVO (EU) 2016/679"

    def test_exact_match_case_insensitive(self, registry):
        result = registry.classify_regulation("EU_2016_679")
        assert result["rule"] == 1
        assert result["name"] == "DSGVO (EU) 2016/679"

    def test_exact_match_with_whitespace(self, registry):
        result = registry.classify_regulation(" eu_2016_679 ")
        assert result["rule"] == 1

    def test_nist_standard(self, registry):
        result = registry.classify_regulation("nist_sp_800_53")
        assert result["rule"] == 1
        assert result["source_type"] == "standard"

    def test_owasp_rule2(self, registry):
        result = registry.classify_regulation("owasp_asvs")
        assert result["rule"] == 2
        assert result["attribution"] == "OWASP Foundation, CC BY-SA 4.0"

    def test_german_law(self, registry):
        result = registry.classify_regulation("bdsg")
        assert result["rule"] == 1
        assert result["source_type"] == "law"
        assert result["jurisdiction"] == "DE"

    def test_austrian_law(self, registry):
        result = registry.classify_regulation("at_dsg")
        assert result["rule"] == 1
        assert result["jurisdiction"] == "AT"

    def test_prefix_enisa_rule2(self, registry):
        result = registry.classify_regulation("enisa_supply_chain_2024")
        assert result["rule"] == 2
        assert result["source_type"] == "standard"
        assert "ENISA" in result["attribution"]

    def test_prefix_bsi_rule3(self, registry):
        result = registry.classify_regulation("bsi_tr_03161")
        assert result["rule"] == 3
        assert result["source_type"] == "restricted"
        assert result["name"] == "INTERNAL_ONLY"

    def test_prefix_iso_rule3(self, registry):
        result = registry.classify_regulation("iso_27001")
        assert result["rule"] == 3
        assert result["source_type"] == "restricted"

    def test_prefix_etsi_rule3(self, registry):
        result = registry.classify_regulation("etsi_en_303_645")
        assert result["rule"] == 3

    def test_unknown_defaults_to_restricted(self, registry):
        result = registry.classify_regulation("some_unknown_regulation")
        assert result["rule"] == 3
        assert result["source_type"] == "restricted"
        assert result["license"] == "UNKNOWN"
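
Taken together, the tests above imply a three-step lookup: normalized exact match, then a prefix rule, then a restricted default. The sketch below illustrates that order only. `EXACT`, `PREFIX_RULES`, `DEFAULT`, and `classify_sketch` are hypothetical names, the tables contain just the fields the tests assert on, and the real `RegulationRegistry` (which reads from the DB) may be structured differently.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for the DB rows and prefix rules.
EXACT: Dict[str, dict] = {
    "eu_2016_679": {"rule": 1, "license": "EU_LAW", "source_type": "law",
                    "name": "DSGVO (EU) 2016/679"},
}
PREFIX_RULES: List[Tuple[str, dict]] = [
    ("enisa_", {"rule": 2, "source_type": "standard", "attribution": "ENISA"}),
    ("bsi_", {"rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}),
    ("iso_", {"rule": 3, "source_type": "restricted"}),
]
DEFAULT = {"rule": 3, "source_type": "restricted", "license": "UNKNOWN"}


def classify_sketch(regulation_id: str) -> dict:
    """Exact match on the normalized id, then prefix rules, then the default."""
    key = regulation_id.strip().lower()
    if key in EXACT:
        return dict(EXACT[key])
    for prefix, info in PREFIX_RULES:
        if key.startswith(prefix):
            return dict(info)
    # Unknown sources fall back to the most restrictive rule.
    return dict(DEFAULT)
```

Defaulting unknown ids to rule 3 is the conservative choice the tests check for: nothing is treated as freely usable unless it is explicitly registered.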


# ── source_type_by_name tests ────────────────────────────────────────────


class TestSourceTypeByName:
    def test_exact_match_law(self, registry):
        result = registry.source_type_by_name("DSGVO (EU) 2016/679")
        assert result == "law"

    def test_exact_match_standard(self, registry):
        result = registry.source_type_by_name("NIST SP 800-53 Rev. 5")
        assert result == "standard"

    def test_empty_returns_framework(self, registry):
        assert registry.source_type_by_name("") == "framework"
        assert registry.source_type_by_name(None) == "framework"

    def test_heuristic_law(self, registry):
        assert registry.source_type_by_name("Verordnung XYZ") == "law"
        assert registry.source_type_by_name("Some EU Directive") == "law"

    def test_heuristic_guideline(self, registry):
        assert registry.source_type_by_name("EDPB Leitlinie 99/2025") == "guideline"
        assert registry.source_type_by_name("BSI Standard 200-1") == "guideline"

    def test_heuristic_framework(self, registry):
        # Uses "Report" on purpose: "ENISA Cloud Guidelines" would match
        # "guideline" before "enisa" in the heuristic order.
        assert registry.source_type_by_name("ENISA Cloud Report") == "framework"
        assert registry.source_type_by_name("OWASP Testing Guide") == "framework"

    def test_unknown_returns_framework(self, registry):
        assert registry.source_type_by_name("Completely Unknown Document") == "framework"


# ── is_open_source tests ─────────────────────────────────────────────────


class TestIsOpenSource:
    def test_rule1_is_open(self, registry):
        assert registry.is_open_source("eu_2016_679") is True

    def test_rule2_is_open(self, registry):
        assert registry.is_open_source("owasp_asvs") is True

    def test_rule3_is_not_open(self, registry):
        assert registry.is_open_source("bsi_tr_03161") is False

    def test_unknown_is_not_open(self, registry):
        assert registry.is_open_source("unknown_thing") is False


# ── Cache behavior tests ─────────────────────────────────────────────────


class TestCacheBehavior:
    def test_fresh_cache_not_stale(self, registry):
        assert registry._is_stale() is False

    def test_old_cache_is_stale(self, registry):
        registry._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
        assert registry._is_stale() is True

    def test_ensure_loaded_reloads_when_stale(self):
        reg = RegulationRegistry()
        reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 100  # force stale

        load_called = False

        def tracking_load():
            nonlocal load_called
            load_called = True

        reg._load = tracking_load
        reg._ensure_loaded()
        assert load_called, "_load should have been called when cache is stale"

    def test_ensure_loaded_skips_when_fresh(self, registry):
        with patch.object(registry, "_load") as mock_load:
            registry._ensure_loaded()
            mock_load.assert_not_called()
|
||||
|
||||
# ── Graceful degradation tests ───────────────────────────────────────────
|
||||
|
||||
|
||||
class TestGracefulDegradation:
|
||||
def test_db_failure_uses_stale_cache(self):
|
||||
"""If DB fails, stale cache entries are still usable."""
|
||||
reg = RegulationRegistry()
|
||||
|
||||
# First load succeeds
|
||||
with patch("services.regulation_registry.SessionLocal") as mock_cls:
|
||||
mock_session = MagicMock()
|
||||
mock_session.execute = _mock_db_execute
|
||||
mock_cls.return_value = mock_session
|
||||
reg._load()
|
||||
|
||||
# Force stale
|
||||
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
|
||||
|
||||
# Second load fails — DB error
|
||||
from sqlalchemy.exc import OperationalError
|
||||
with patch("services.regulation_registry.SessionLocal") as mock_cls:
|
||||
mock_cls.side_effect = OperationalError("connection refused", None, None)
|
||||
reg._ensure_loaded()
|
||||
|
||||
# Should still have cached data
|
||||
result = reg.classify_regulation("eu_2016_679")
|
||||
assert result["rule"] == 1
|
||||
|
||||
def test_empty_registry_returns_unknown(self):
|
||||
"""Unloaded registry returns safe defaults."""
|
||||
reg = RegulationRegistry()
|
||||
reg._loaded_at = time.monotonic() # pretend fresh but empty
|
||||
|
||||
result = reg.classify_regulation("eu_2016_679")
|
||||
assert result["rule"] == 3 # safe default
|
||||
assert result["license"] == "UNKNOWN"
|
||||
|
||||
|
||||
# ── Migration data consistency tests ─────────────────────────────────────
|
||||
|
||||
|
||||
class TestMigrationDataConsistency:
|
||||
"""Verify that the migration script produces valid data."""
|
||||
|
||||
def test_build_rows_produces_data(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
assert len(rows) > 100  # more than 100 entries
|
||||
|
||||
def test_all_rows_have_required_fields(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
for row in rows:
|
||||
assert row["regulation_id"], f"Missing regulation_id: {row}"
|
||||
assert row["regulation_name_de"], f"Missing name: {row}"
|
||||
assert row["license_rule"] in (1, 2, 3), f"Bad rule: {row}"
|
||||
assert row["source_type"] in (
|
||||
"law", "guideline", "standard", "framework", "restricted"
|
||||
), f"Bad source_type: {row}"
|
||||
assert row["jurisdiction"], f"Missing jurisdiction: {row}"
|
||||
assert row["status"] in ("active", "needs_review", "deprecated")
|
||||
|
||||
def test_no_duplicate_regulation_ids(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
ids = [r["regulation_id"] for r in rows]
|
||||
assert len(ids) == len(set(ids)), f"Duplicates: {[x for x in ids if ids.count(x) > 1]}"
|
||||
|
||||
def test_known_regulations_present(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
ids = {r["regulation_id"] for r in rows}
|
||||
assert "eu_2016_679" in ids # DSGVO
|
||||
assert "bdsg" in ids # BDSG
|
||||
assert "nist_sp_800_53" in ids # NIST
|
||||
assert "owasp_asvs" in ids # OWASP
|
||||
|
||||
def test_owasp_has_attribution(self):
|
||||
from scripts.f1_migrate_regulation_registry import build_rows
|
||||
rows = build_rows()
|
||||
owasp = [r for r in rows if r["regulation_id"] == "owasp_asvs"][0]
|
||||
assert owasp["attribution"] is not None
|
||||
assert "OWASP" in owasp["attribution"]
|
||||
assert owasp["license_rule"] == 2
|
||||
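The cache and degradation tests above exercise a TTL cache that keeps serving its last snapshot when a reload fails. The following is a minimal sketch of that pattern, not the actual `RegulationRegistry` code: the class name `TtlCache`, the `loader` callable, and the TTL value are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional

CACHE_TTL_SECONDS = 300.0  # assumed value; the real TTL is not shown in the diff


class TtlCache:
    """Hypothetical stand-in for a registry that caches rows loaded from a DB."""

    def __init__(self, loader: Callable[[], Dict[str, dict]]) -> None:
        self._loader = loader
        self._data: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None  # None means "never loaded"

    def _is_stale(self) -> bool:
        if self._loaded_at is None:
            return True
        return time.monotonic() - self._loaded_at > CACHE_TTL_SECONDS

    def _ensure_loaded(self) -> None:
        if not self._is_stale():
            return
        try:
            self._data = self._loader()
            self._loaded_at = time.monotonic()
        except Exception:
            # Reload failed: keep the previous snapshot instead of raising.
            pass

    def classify(self, regulation_id: str) -> dict:
        self._ensure_loaded()
        # Unknown IDs fall back to the most restrictive rule.
        return self._data.get(regulation_id, {"rule": 3, "license": "UNKNOWN"})
```

Using `time.monotonic()` rather than wall-clock time means a system clock adjustment cannot make a fresh cache look stale or the reverse. Because a failed reload leaves `_loaded_at` untouched, every later call retries the loader until it succeeds.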
+7
-3
@@ -413,8 +413,12 @@ services:
|
||||
embedding-service:
|
||||
condition: service_healthy
|
||||
healthcheck:
|
||||
disable: true
|
||||
restart: "no"
|
||||
test: ["CMD", "curl", "-f", "http://127.0.0.1:8098/health"]
|
||||
interval: 60s
|
||||
timeout: 30s
|
||||
retries: 10
|
||||
start_period: 30s
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- breakpilot-network
|
||||
|
||||
@@ -430,7 +434,7 @@ services:
|
||||
EMBEDDING_BACKEND: ${EMBEDDING_BACKEND:-local}
|
||||
LOCAL_EMBEDDING_MODEL: ${LOCAL_EMBEDDING_MODEL:-BAAI/bge-m3}
|
||||
LOCAL_RERANKER_MODEL: ${LOCAL_RERANKER_MODEL:-cross-encoder/ms-marco-MiniLM-L-6-v2}
|
||||
PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-pymupdf}
|
||||
PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-auto}
|
||||
OPENAI_API_KEY: ${OPENAI_API_KEY:-}
|
||||
COHERE_API_KEY: ${COHERE_API_KEY:-}
|
||||
LOG_LEVEL: ${LOG_LEVEL:-INFO}
|
||||
|
||||
+239
-18
@@ -10,8 +10,9 @@ Provides REST endpoints for:
|
||||
This service handles all ML-heavy operations, keeping the main klausur-service lightweight.
|
||||
"""
|
||||
|
||||
import os
|
||||
import logging
|
||||
import re
|
||||
import unicodedata
|
||||
from typing import List, Optional
|
||||
from contextlib import asynccontextmanager
|
||||
|
||||
@@ -106,8 +107,19 @@ class ChunkRequest(BaseModel):
|
||||
strategy: str = Field(default="semantic", description="Chunking strategy: semantic or recursive")
|
||||
|
||||
|
||||
class ChunkMetadata(BaseModel):
|
||||
text: str
|
||||
section: str = ""
|
||||
section_title: str = ""
|
||||
paragraph: str = ""
|
||||
paragraph_num: Optional[int] = None
|
||||
page: Optional[int] = None
|
||||
index: int = 0
|
||||
|
||||
|
||||
class ChunkResponse(BaseModel):
|
||||
chunks: List[str]
|
||||
chunks_with_metadata: Optional[List[dict]] = None
|
||||
count: int
|
||||
strategy: str
|
||||
|
||||
@@ -270,9 +282,7 @@ ENGLISH_ABBREVIATIONS = {
|
||||
# Combined abbreviations for both languages
|
||||
ALL_ABBREVIATIONS = GERMAN_ABBREVIATIONS | ENGLISH_ABBREVIATIONS
|
||||
|
||||
# Regex pattern for legal section headers (§, Art., Article, Section, etc.)
|
||||
import re
|
||||
|
||||
# Regex pattern for legal/standard section headers
|
||||
_LEGAL_SECTION_RE = re.compile(
|
||||
r'^(?:'
|
||||
r'§\s*\d+' # § 25, § 5a
|
||||
@@ -287,6 +297,15 @@ _LEGAL_SECTION_RE = re.compile(
|
||||
r'|Part\s+[IVXLC\d]+' # Part III
|
||||
r'|Recital\s+\d+' # Recital 42
|
||||
r'|Erwaegungsgrund\s+\d+' # Erwaegungsgrund 26
|
||||
# NIST/ENISA/standard numbering
|
||||
r'|\d+\.\d+(?:\.\d+)*\s+[A-ZÄÖÜ]' # 1.1 Title, 2.3.1 Subtitle
|
||||
r'|[A-Z]{2,4}[-\.]\d+(?:\.\d+)*\b' # AC-1, AU-2, PO.1, PW.1.1
|
||||
r'|[A-Z]{2}\.[A-Z]{2}-\d{2}\b' # GV.OC-01 (NIST CSF 2.0)
|
||||
r'|[A-Z]{2,4}-\d+\(\d+\)' # AC-1(1) (NIST enhancements)
|
||||
r'|A\d{2}(?::\d{4})?\b' # A01:2021 (OWASP Top 10)
|
||||
r'|Table\s+\d+' # Table 1, Table A-1
|
||||
r'|Figure\s+\d+' # Figure 1
|
||||
r'|Appendix\s+[A-Z\d]' # Appendix A, Appendix 1
|
||||
r')',
|
||||
re.IGNORECASE | re.MULTILINE
|
||||
)
|
||||
@@ -300,6 +319,10 @@ _HEADING_RE = re.compile(
|
||||
re.MULTILINE
|
||||
)
|
||||
|
||||
# Case-sensitive: single-number + ALL-CAPS title (e.g., "1. INTRODUCTION")
|
||||
# Separate regex because _LEGAL_SECTION_RE uses re.IGNORECASE
|
||||
_SINGLE_NUM_ALLCAPS_RE = re.compile(r'^\d+\.\s+[A-Z][A-Z\s]{4,}')
|
||||
|
||||
|
||||
def _detect_language(text: str) -> str:
|
||||
"""Simple heuristic: count German vs English marker words."""
|
||||
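The comment on `_SINGLE_NUM_ALLCAPS_RE` says it must stay separate because `_LEGAL_SECTION_RE` is compiled with `re.IGNORECASE`. The sketch below shows why, using the same pattern compiled both ways; the two sample strings are invented for illustration.

```python
import re

PATTERN = r'^\d+\.\s+[A-Z][A-Z\s]{4,}'

# Same pattern, compiled with and without IGNORECASE.
numbered_ignorecase = re.compile(PATTERN, re.IGNORECASE)
numbered_case_sensitive = re.compile(PATTERN)

heading = "1. INTRODUCTION"
list_item = "1. the trader must inform the consumer"  # ordinary numbered list item

# Under IGNORECASE, [A-Z] also matches lowercase letters, so the list item
# looks like a heading; the case-sensitive variant rejects it.
ignorecase_hits = [bool(numbered_ignorecase.match(s)) for s in (heading, list_item)]
case_sensitive_hits = [bool(numbered_case_sensitive.match(s)) for s in (heading, list_item)]
```

Here `ignorecase_hits` is `[True, True]` and `case_sensitive_hits` is `[True, False]`, which is the false-positive the separate regex avoids.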
@@ -349,17 +372,103 @@ def _split_sentences(text: str) -> List[str]:
|
||||
return sentences
|
||||
|
||||
|
||||
# Regex for paragraph/subsection references within text
|
||||
_PARAGRAPH_RE = re.compile(
|
||||
r'(?:'
|
||||
r'Abs(?:atz|\.)\s*(\d+)' # Abs. 1, Absatz 2
|
||||
r'|Nr\.\s*(\d+)' # Nr. 3
|
||||
r'|Satz\s+(\d+)' # Satz 1
|
||||
r'|lit\.\s*([a-z])' # lit. a
|
||||
r'|\((\d+)\)' # (1), (2)
|
||||
r')',
|
||||
re.IGNORECASE
|
||||
)
|
||||
|
||||
# Regex to extract section number from header
|
||||
_SECTION_NUMBER_RE = re.compile(
|
||||
r'(?:'
|
||||
r'§\s*(\d+[a-z]*)' # § 25, § 312k
|
||||
r'|Art(?:ikel|icle|\.)\s*(\d+)' # Artikel 5, Art. 3
|
||||
r'|Section\s+(\d[\d.]*)' # Section 4.2
|
||||
r'|Kapitel\s+(\d+)' # Kapitel 2
|
||||
r'|Anhang\s+([IVXLC\d]+)' # Anhang III
|
||||
r'|Annex\s+([IVXLC\d]+)' # Annex XII
|
||||
# NIST/ENISA/standard identifiers
|
||||
r'|([A-Z]{2}\.[A-Z]{2}-\d{2})' # GV.OC-01 (NIST CSF 2.0)
|
||||
r'|([A-Z]{2,4}-\d+(?:\(\d+\))?)' # AC-1, AC-1(1) (NIST controls)
|
||||
r'|(\d+\.\d+(?:\.\d+)*)' # 3.1, 2.3.1 (numbered sections)
|
||||
r'|(\d+)(?=\.\s+[A-Z]{5,})' # 1 (from "1. INTRODUCTION", case-sensitive below)
|
||||
r'|(A\d{2}(?::\d{4})?)' # A01:2021 (OWASP)
|
||||
r')',
|
||||
re.IGNORECASE
|
||||
)
|
||||
|
||||
|
||||
def _extract_section_header(line: str) -> Optional[str]:
|
||||
"""Extract a legal section header from a line, or None."""
|
||||
m = _LEGAL_SECTION_RE.match(line.strip())
|
||||
stripped = line.strip()
|
||||
m = _LEGAL_SECTION_RE.match(stripped)
|
||||
if m:
|
||||
return line.strip()
|
||||
m = _HEADING_RE.match(line.strip())
|
||||
return stripped
|
||||
# Case-sensitive check for "1. INTRODUCTION" style (ENISA/BSI docs)
|
||||
if _SINGLE_NUM_ALLCAPS_RE.match(stripped):
|
||||
return stripped
|
||||
m = _HEADING_RE.match(stripped)
|
||||
if m:
|
||||
return line.strip()
|
||||
return stripped
|
||||
return None
|
||||
|
||||
|
||||
def _parse_section_metadata(header: str) -> dict:
|
||||
"""Parse a section header into structured metadata.
|
||||
|
||||
Returns: {"section": "§ 312k", "section_title": "Kuendigungsbutton"}
|
||||
"""
|
||||
if not header:
|
||||
return {"section": "", "section_title": ""}
|
||||
|
||||
m = _SECTION_NUMBER_RE.search(header)
|
||||
section = ""
|
||||
if m:
|
||||
# Find which group matched
|
||||
for i, g in enumerate(m.groups(), 1):
|
||||
if g:
|
||||
section = header[m.start():m.end()].strip()
|
||||
break
|
||||
|
||||
# Title = everything after the section number
|
||||
title = header
|
||||
if section:
|
||||
idx = header.find(section)
|
||||
if idx >= 0:
|
||||
title = header[idx + len(section):].strip()
|
||||
# Remove leading punctuation/whitespace
|
||||
title = title.lstrip(' .-–—:')
|
||||
|
||||
return {"section": section, "section_title": title.strip()}
|
||||
|
||||
|
||||
def _extract_paragraph_ref(text: str) -> dict:
|
||||
"""Extract paragraph/subsection reference from chunk text.
|
||||
|
||||
Returns: {"paragraph": "Abs. 1", "paragraph_num": 1}
|
||||
"""
|
||||
m = _PARAGRAPH_RE.search(text[:200]) # Only search first 200 chars
|
||||
if not m:
|
||||
return {"paragraph": "", "paragraph_num": None}
|
||||
|
||||
for i, g in enumerate(m.groups(), 1):
|
||||
if g:
|
||||
ref = text[m.start():m.end()].strip()
|
||||
try:
|
||||
num = int(g)
|
||||
except ValueError:
|
||||
num = ord(g.lower()) - ord('a') + 1 # lit. a = 1, b = 2
|
||||
return {"paragraph": ref, "paragraph_num": num}
|
||||
|
||||
return {"paragraph": "", "paragraph_num": None}
|
||||
|
||||
|
||||
def chunk_text_legal(text: str, chunk_size: int, overlap: int) -> List[str]:
|
||||
"""
|
||||
Legal-document-aware chunking.
|
||||
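The paragraph-reference extraction added in this hunk can be tried in isolation. The following is a reduced sketch, assuming a shortened pattern (the `Satz` alternative is omitted) and a hypothetical helper name `paragraph_ref`. It mirrors the idea of `_extract_paragraph_ref` but is not the function from the diff.

```python
import re
from typing import Optional, Tuple

# Reduced version of the paragraph pattern, for illustration only.
PARAGRAPH_RE = re.compile(
    r'(?:Abs(?:atz|\.)\s*(\d+)|Nr\.\s*(\d+)|lit\.\s*([a-z])|\((\d+)\))',
    re.IGNORECASE,
)


def paragraph_ref(text: str) -> Tuple[str, Optional[int]]:
    """Return (reference text, numeric value) for the first reference found."""
    m = PARAGRAPH_RE.search(text[:200])  # only the start of the chunk is searched
    if not m:
        return "", None
    value = next(g for g in m.groups() if g)
    # Digits map directly; letters map to their position (a = 1, b = 2, ...).
    num = int(value) if value.isdigit() else ord(value.lower()) - ord('a') + 1
    return m.group(0).strip(), num
```

For example, `paragraph_ref("Abs. 2 regelt die Frist")` yields `("Abs. 2", 2)` and `paragraph_ref("lit. c")` yields `("lit. c", 3)`. Only the first match is reported, so a chunk that mentions several references is tagged with whichever appears first.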
@@ -488,12 +597,51 @@ def chunk_text_legal(text: str, chunk_size: int, overlap: int) -> List[str]:
|
||||
if space_idx > 0:
|
||||
overlap_text = overlap_text[space_idx + 1:]
|
||||
if overlap_text:
|
||||
chunk = overlap_text + ' ' + chunk
|
||||
# Insert overlap AFTER the [§ ...] prefix to preserve it
|
||||
# for structured metadata extraction
|
||||
prefix_match = re.match(r'\[.+?\]\s*', chunk)
|
||||
if prefix_match:
|
||||
pos = prefix_match.end()
|
||||
chunk = chunk[:pos] + overlap_text + ' ' + chunk[pos:]
|
||||
else:
|
||||
chunk = overlap_text + ' ' + chunk
|
||||
final_chunks.append(chunk.strip())
|
||||
|
||||
return [c for c in final_chunks if c]
|
||||
|
||||
|
||||
def chunk_text_legal_structured(text: str, chunk_size: int, overlap: int) -> List[dict]:
|
||||
"""Legal-aware chunking that returns structured metadata per chunk.
|
||||
|
||||
Returns list of dicts with: text, section, section_title, paragraph, paragraph_num, index.
|
||||
Uses the same splitting logic as chunk_text_legal but extracts metadata.
|
||||
"""
|
||||
plain_chunks = chunk_text_legal(text, chunk_size, overlap)
|
||||
|
||||
# Track which section each chunk belongs to by re-parsing the prefix
|
||||
structured = []
|
||||
for i, chunk_text in enumerate(plain_chunks):
|
||||
meta = {"text": chunk_text, "section": "", "section_title": "",
|
||||
"paragraph": "", "paragraph_num": None, "page": None, "index": i}
|
||||
|
||||
# Extract section from the [§ 25 Title] prefix that chunk_text_legal adds
|
||||
prefix_match = re.match(r'^\[(.+?)\]\s*', chunk_text)
|
||||
if prefix_match:
|
||||
header = prefix_match.group(1)
|
||||
section_meta = _parse_section_metadata(header)
|
||||
meta["section"] = section_meta["section"]
|
||||
meta["section_title"] = section_meta["section_title"]
|
||||
|
||||
# Extract paragraph reference from chunk content
|
||||
para_meta = _extract_paragraph_ref(chunk_text)
|
||||
meta["paragraph"] = para_meta["paragraph"]
|
||||
meta["paragraph_num"] = para_meta["paragraph_num"]
|
||||
|
||||
structured.append(meta)
|
||||
|
||||
return structured
|
||||
|
||||
|
||||
def chunk_text_recursive(text: str, chunk_size: int, overlap: int) -> List[str]:
|
||||
"""Recursive character-based chunking (legacy, use legal_recursive for legal docs)."""
|
||||
if not text or len(text) <= chunk_size:
|
||||
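The overlap change in this hunk inserts the carried-over text after the `[§ ...]` prefix instead of in front of it, so the prefix stays at the start of the chunk. A small sketch of that step follows, with a hypothetical helper name `insert_overlap` and invented sample strings.

```python
import re


def insert_overlap(chunk: str, overlap_text: str) -> str:
    """Place overlap text after a leading "[...]" prefix, if the chunk has one."""
    if not overlap_text:
        return chunk
    prefix = re.match(r'\[.+?\]\s*', chunk)
    if prefix:
        pos = prefix.end()
        # Keep the prefix first so a later re.match(r'^\[(.+?)\]') still finds it.
        return chunk[:pos] + overlap_text + ' ' + chunk[pos:]
    return overlap_text + ' ' + chunk


with_prefix = insert_overlap("[§ 312k Kuendigung] (1) Text", "vorheriger Satz.")
without_prefix = insert_overlap("(1) Text", "vorheriger Satz.")
```

With the old behavior the first result would have started with the overlap text, and the anchored prefix match used for metadata extraction would have failed on every chunk after the first.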
@@ -621,13 +769,19 @@ def detect_pdf_backends() -> List[str]:
|
||||
available = []
|
||||
|
||||
try:
|
||||
from unstructured.partition.pdf import partition_pdf
|
||||
from unstructured.partition.pdf import partition_pdf # noqa: F401
|
||||
available.append("unstructured")
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
try:
|
||||
from pypdf import PdfReader
|
||||
import pdfplumber # noqa: F401
|
||||
available.append("pdfplumber")
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
try:
|
||||
from pypdf import PdfReader # noqa: F401
|
||||
available.append("pypdf")
|
||||
except ImportError:
|
||||
pass
|
||||
@@ -687,12 +841,64 @@ def extract_pdf_unstructured(pdf_content: bytes) -> ExtractPDFResponse:
|
||||
import os as os_module
|
||||
try:
|
||||
os_module.unlink(tmp_path)
|
||||
except:
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
|
||||
def _normalize_pdf_text(text: str) -> str:
|
||||
"""Fix broken spacing from multi-column PDF extraction.
|
||||
|
||||
pdfplumber/pypdf often break section numbers in multi-column NIST/BSI/ENISA
|
||||
PDFs: "1 . 1" instead of "1.1", "AC - 1" instead of "AC-1".
|
||||
"""
|
||||
# Unicode NFKC: expand ligatures (U+FB01 → "fi") before other fixes
|
||||
text = unicodedata.normalize('NFKC', text)
|
||||
# Remove soft hyphens and zero-width spaces
|
||||
text = text.replace('\u00ad', '').replace('\u200b', '')
|
||||
# "1 . 1" → "1.1" (broken section numbers, apply repeatedly for nested)
|
||||
prev = None
|
||||
while prev != text:
|
||||
prev = text
|
||||
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
|
||||
# "AC - 1" → "AC-1" (broken NIST control IDs, 2-4 uppercase letters)
|
||||
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
|
||||
# "GV . OC - 01" → "GV.OC-01" (NIST CSF 2.0 compound IDs)
|
||||
text = re.sub(
|
||||
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
|
||||
)
|
||||
# "AC - 1 ( 1 )" → "AC-1(1)" (NIST enhancements with spaced parens)
|
||||
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
|
||||
# Collapse multiple horizontal spaces (keep newlines)
|
||||
text = re.sub(r'[^\S\n]{2,}', ' ', text)
|
||||
return text
|
||||
|
||||
|
||||
def extract_pdf_pdfplumber(pdf_content: bytes) -> ExtractPDFResponse:
|
||||
"""Extract PDF using pdfplumber (best for multi-column EU regulation PDFs)."""
|
||||
import io
|
||||
import pdfplumber
|
||||
|
||||
pdf_file = io.BytesIO(pdf_content)
|
||||
text_parts = []
|
||||
page_count = 0
|
||||
|
||||
with pdfplumber.open(pdf_file) as pdf:
|
||||
page_count = len(pdf.pages)
|
||||
for page in pdf.pages:
|
||||
text = page.extract_text(x_tolerance=3, y_tolerance=4)
|
||||
if text:
|
||||
text_parts.append(text)
|
||||
|
||||
return ExtractPDFResponse(
|
||||
text=_normalize_pdf_text("\n\n".join(text_parts)),
|
||||
backend_used="pdfplumber",
|
||||
pages=page_count,
|
||||
table_count=0,
|
||||
)
|
||||
|
||||
|
||||
def extract_pdf_pypdf(pdf_content: bytes) -> ExtractPDFResponse:
|
||||
"""Extract PDF using pypdf."""
|
||||
"""Extract PDF using pypdf (fallback)."""
|
||||
import io
|
||||
from pypdf import PdfReader
|
||||
|
||||
@@ -706,7 +912,7 @@ def extract_pdf_pypdf(pdf_content: bytes) -> ExtractPDFResponse:
|
||||
text_parts.append(text)
|
||||
|
||||
return ExtractPDFResponse(
|
||||
text="\n\n".join(text_parts),
|
||||
text=_normalize_pdf_text("\n\n".join(text_parts)),
|
||||
backend_used="pypdf",
|
||||
pages=len(reader.pages),
|
||||
table_count=0
|
||||
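`_normalize_pdf_text` applies the broken-section-number substitution in a loop until the text stops changing. The sketch below isolates that one rule to show why a single pass is not enough for nested numbers; the helper names are hypothetical.

```python
import re

BROKEN_NUMBER_RE = re.compile(r'(\d+)\s+\.\s+(\d+)')


def join_once(text: str) -> str:
    """One substitution pass: re.sub only replaces non-overlapping matches."""
    return BROKEN_NUMBER_RE.sub(r'\1.\2', text)


def join_until_stable(text: str) -> str:
    """Repeat the pass until a fixed point is reached."""
    prev = None
    while prev != text:
        prev = text
        text = join_once(text)
    return text


single_pass = join_once("2 . 3 . 1 Subtitle")
fixed_point = join_until_stable("2 . 3 . 1 Subtitle")
```

In the first pass the match `2 . 3` consumes the digit `3`, so the remaining ` . 1` has no leading digit to pair with and `single_pass` is `"2.3 . 1 Subtitle"`. The second pass joins `3 . 1`, giving `"2.3.1 Subtitle"`.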
@@ -879,15 +1085,22 @@ async def chunk_text(request: ChunkRequest):
|
||||
if request.strategy == "semantic":
|
||||
overlap_sentences = max(1, request.overlap // 100)
|
||||
chunks = chunk_text_semantic(request.text, request.chunk_size, overlap_sentences)
|
||||
return ChunkResponse(
|
||||
chunks=chunks,
|
||||
count=len(chunks),
|
||||
strategy=request.strategy,
|
||||
)
|
||||
else:
|
||||
# All strategies (recursive, legal_recursive, etc.) use the legal-aware chunker.
|
||||
# The old plain recursive chunker is no longer exposed via the API.
|
||||
# All strategies use the legal-aware chunker
|
||||
chunks = chunk_text_legal(request.text, request.chunk_size, request.overlap)
|
||||
# Also generate structured metadata
|
||||
structured = chunk_text_legal_structured(request.text, request.chunk_size, request.overlap)
|
||||
|
||||
return ChunkResponse(
|
||||
chunks=chunks,
|
||||
chunks_with_metadata=structured,
|
||||
count=len(chunks),
|
||||
strategy=request.strategy
|
||||
strategy=request.strategy,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Chunking error: {e}")
|
||||
@@ -908,11 +1121,19 @@ async def extract_pdf(file: UploadFile = File(...)):
|
||||
|
||||
backend = config.PDF_EXTRACTION_BACKEND
|
||||
if backend == "auto":
|
||||
backend = "unstructured" if "unstructured" in available else "pypdf"
|
||||
# Prefer: unstructured > pdfplumber > pypdf
|
||||
if "unstructured" in available:
|
||||
backend = "unstructured"
|
||||
elif "pdfplumber" in available:
|
||||
backend = "pdfplumber"
|
||||
else:
|
||||
backend = "pypdf"
|
||||
|
||||
try:
|
||||
if backend == "unstructured" and "unstructured" in available:
|
||||
return extract_pdf_unstructured(pdf_content)
|
||||
elif backend == "pdfplumber" and "pdfplumber" in available:
|
||||
return extract_pdf_pdfplumber(pdf_content)
|
||||
elif "pypdf" in available:
|
||||
return extract_pdf_pypdf(pdf_content)
|
||||
else:
|
||||
|
||||
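The backend selection shown in this hunk can be read as a small pure function. This is a sketch under assumptions: the function name `resolve_backend` is hypothetical, and because the final `else` branch is cut off in the diff, the sketch returns `None` there instead of guessing what the service does.

```python
from typing import List, Optional

PREFERENCE = ("unstructured", "pdfplumber", "pypdf")


def resolve_backend(configured: str, available: List[str]) -> Optional[str]:
    """Mirror the visible branch order: explicit choice, then pypdf as fallback."""
    backend = configured
    if backend == "auto":
        # Prefer: unstructured > pdfplumber > pypdf
        backend = next((name for name in PREFERENCE if name in available), "pypdf")
    if backend in ("unstructured", "pdfplumber") and backend in available:
        return backend
    if "pypdf" in available:
        return "pypdf"
    return None  # the diff does not show what happens when nothing is installed
```

One consequence of the visible branches: an explicitly configured backend that is not installed, or a value that matches no branch such as the previous compose default `pymupdf`, ends up on `pypdf` whenever `pypdf` is available.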
@@ -14,6 +14,7 @@ sentence-transformers>=2.2.0
|
||||
# PDF Extraction
|
||||
unstructured>=0.12.0
|
||||
pypdf>=4.0.0
|
||||
pdfplumber>=0.11.0
|
||||
python-magic>=0.4.27
|
||||
|
||||
# HTTP Client (for OpenAI/Cohere API calls)
|
||||
|
||||
@@ -11,7 +11,6 @@ Covers:
|
||||
- Long sentence force-splitting
|
||||
"""
|
||||
|
||||
import pytest
|
||||
from main import (
|
||||
chunk_text_legal,
|
||||
chunk_text_recursive,
|
||||
|
||||
@@ -0,0 +1,217 @@
|
||||
"""
|
||||
D4 Validation: BGB § 312k structural chunking test.
|
||||
|
||||
Tests that real German legal text is correctly chunked with structural
|
||||
metadata (section, section_title, paragraph, paragraph_num).
|
||||
This is the gate test before re-ingesting all 297 legal sources.
|
||||
"""
|
||||
|
||||
import os
|
||||
import pytest
|
||||
|
||||
from main import chunk_text_legal, chunk_text_legal_structured
|
||||
|
||||
FIXTURE_PATH = os.path.join(
|
||||
os.path.dirname(__file__), "tests", "fixtures", "bgb_312_excerpt.txt"
|
||||
)
|
||||
|
||||
# Reasonable defaults for legal text
|
||||
CHUNK_SIZE = 1500
|
||||
OVERLAP = 100
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def bgb_text():
|
||||
with open(FIXTURE_PATH, encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def plain_chunks(bgb_text):
|
||||
return chunk_text_legal(bgb_text, CHUNK_SIZE, OVERLAP)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def structured_chunks(bgb_text):
|
||||
return chunk_text_legal_structured(bgb_text, CHUNK_SIZE, OVERLAP)
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Basic sanity
|
||||
# =========================================================================
|
||||
|
||||
class TestChunkingSanity:
|
||||
|
||||
def test_fixture_loads(self, bgb_text):
|
||||
assert len(bgb_text) > 2000, "BGB excerpt should be substantial"
|
||||
assert "§ 312k" in bgb_text
|
||||
assert "§ 312 " in bgb_text
|
||||
|
||||
def test_chunk_count_reasonable(self, plain_chunks):
|
||||
assert 4 <= len(plain_chunks) <= 30, (
|
||||
f"Expected 4-30 chunks, got {len(plain_chunks)}"
|
||||
)
|
||||
|
||||
def test_structured_same_count(self, plain_chunks, structured_chunks):
|
||||
assert len(plain_chunks) == len(structured_chunks)
|
||||
|
||||
def test_no_empty_chunks(self, plain_chunks):
|
||||
for i, chunk in enumerate(plain_chunks):
|
||||
assert chunk.strip(), f"Chunk {i} is empty"
|
||||
|
||||
def test_chunk_sizes_reasonable(self, plain_chunks):
|
||||
for i, chunk in enumerate(plain_chunks):
|
||||
assert len(chunk) < 3000, f"Chunk {i} too large: {len(chunk)} chars"
|
||||
assert len(chunk) > 30, f"Chunk {i} too small: {len(chunk)} chars"
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Section detection
|
||||
# =========================================================================
|
||||
|
||||
class TestSectionDetection:
|
||||
|
||||
def test_all_four_sections_detected(self, structured_chunks):
|
||||
"""All 4 BGB sections should appear as section metadata."""
|
||||
found_sections = set()
|
||||
for meta in structured_chunks:
|
||||
if meta["section"]:
|
||||
found_sections.add(meta["section"])
|
||||
|
||||
assert "§ 312" in found_sections or any(
|
||||
s.startswith("§ 312") and s != "§ 312a" and s != "§ 312g" and s != "§ 312k"
|
||||
for s in found_sections
|
||||
), f"§ 312 not found. Sections: {found_sections}"
|
||||
assert "§ 312a" in found_sections, f"§ 312a not found. Sections: {found_sections}"
|
||||
assert "§ 312g" in found_sections, f"§ 312g not found. Sections: {found_sections}"
|
||||
assert "§ 312k" in found_sections, f"§ 312k not found. Sections: {found_sections}"
|
||||
|
||||
def test_section_prefix_in_chunks(self, plain_chunks):
|
||||
"""Most chunks should have [§ ...] prefix."""
|
||||
prefixed = sum(1 for c in plain_chunks if c.startswith("[§"))
|
||||
ratio = prefixed / len(plain_chunks)
|
||||
assert ratio >= 0.8, (
|
||||
f"Only {ratio:.0%} chunks have section prefix (expected >= 80%)"
|
||||
)
|
||||
|
||||
def test_312k_has_own_chunk(self, plain_chunks):
|
||||
"""§ 312k must appear as a chunk section header, not merged into another §."""
|
||||
chunks_with_312k = [c for c in plain_chunks if "[§ 312k" in c]
|
||||
assert len(chunks_with_312k) >= 1, (
|
||||
"§ 312k should have at least 1 dedicated chunk"
|
||||
)
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# § 312k specific metadata
|
||||
# =========================================================================
|
||||
|
||||
class TestSection312k:
|
||||
|
||||
def _312k_chunks(self, structured_chunks):
|
||||
return [m for m in structured_chunks if m["section"] == "§ 312k"]
|
||||
|
||||
def test_312k_section_metadata(self, structured_chunks):
|
||||
"""§ 312k chunks should have section='§ 312k' with a title."""
|
||||
chunks = self._312k_chunks(structured_chunks)
|
||||
assert len(chunks) >= 1, "No chunks with section='§ 312k'"
|
||||
for meta in chunks:
|
||||
assert meta["section"] == "§ 312k"
|
||||
# Title should contain key words
|
||||
title = meta["section_title"].lower()
|
||||
assert "kuendigung" in title or "verbrauchervertrae" in title, (
|
||||
f"Unexpected section_title: {meta['section_title']}"
|
||||
)
|
||||
|
||||
def test_312k_paragraph_extraction(self, structured_chunks):
|
||||
"""At least some § 312k chunks should have paragraph references."""
|
||||
chunks = self._312k_chunks(structured_chunks)
|
||||
paragraphs_found = [m["paragraph"] for m in chunks if m["paragraph"]]
|
||||
# § 312k has (1) through (6), at least some should be detected
|
||||
assert len(paragraphs_found) >= 1, (
|
||||
"No paragraph references found in § 312k chunks"
|
||||
)
|
||||
|
||||
def test_312k_content_present(self, structured_chunks):
|
||||
"""§ 312k chunk text should contain key legal terms."""
|
||||
chunks = self._312k_chunks(structured_chunks)
|
||||
all_text = " ".join(m["text"] for m in chunks)
|
||||
assert "Kuendigungsschaltflaeche" in all_text or "kuendigen" in all_text.lower()
|
||||
assert "Webseite" in all_text or "elektronischen" in all_text
|
||||
|
||||
def test_312k_not_merged_with_312g(self, structured_chunks):
|
||||
"""§ 312k and § 312g should be separate sections, not merged."""
|
||||
sections_312g = [m for m in structured_chunks if m["section"] == "§ 312g"]
|
||||
sections_312k = self._312k_chunks(structured_chunks)
|
||||
assert len(sections_312g) >= 1, "§ 312g missing"
|
||||
assert len(sections_312k) >= 1, "§ 312k missing"
|
||||
# Verify they are different chunks (no overlap in indices)
|
||||
g_indices = {m["index"] for m in sections_312g}
|
||||
k_indices = {m["index"] for m in sections_312k}
|
||||
assert g_indices.isdisjoint(k_indices), (
|
||||
f"§ 312g and § 312k share chunk indices: {g_indices & k_indices}"
|
||||
)
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Metadata quality across all sections
|
||||
# =========================================================================
|
||||
|
||||
class TestMetadataQuality:
|
||||
|
||||
def test_most_chunks_have_section(self, structured_chunks):
|
||||
"""At least 90% of chunks should have a section reference."""
|
||||
with_section = sum(1 for m in structured_chunks if m["section"])
|
||||
ratio = with_section / len(structured_chunks)
|
||||
assert ratio >= 0.9, (
|
||||
f"Only {ratio:.0%} chunks have section metadata (expected >= 90%)"
|
||||
)
|
||||
|
||||
def test_section_titles_not_empty(self, structured_chunks):
|
||||
"""Chunks with a section should also have a section_title."""
|
||||
for meta in structured_chunks:
|
||||
if meta["section"]:
|
||||
assert meta["section_title"], (
|
||||
f"Chunk {meta['index']} has section={meta['section']} but no title"
|
||||
)
|
||||
|
||||
def test_paragraph_nums_are_integers(self, structured_chunks):
|
||||
"""paragraph_num should be int or None, never str."""
|
||||
for meta in structured_chunks:
|
||||
pn = meta["paragraph_num"]
|
||||
assert pn is None or isinstance(pn, int), (
|
||||
f"Chunk {meta['index']}: paragraph_num={pn!r} (type={type(pn).__name__})"
|
||||
)
|
||||
|
||||
def test_indices_sequential(self, structured_chunks):
|
||||
"""Chunk indices should be 0, 1, 2, ... in order."""
|
||||
for i, meta in enumerate(structured_chunks):
|
||||
assert meta["index"] == i, (
|
||||
f"Expected index {i}, got {meta['index']}"
|
||||
)
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Edge cases
|
||||
# =========================================================================
|
||||
|
||||
class TestEdgeCases:
|
||||
|
||||
def test_numbered_list_not_false_section(self, structured_chunks):
|
||||
"""Numbered items (1., 2., 3.) inside a § should NOT create new sections."""
|
||||
for meta in structured_chunks:
|
||||
section = meta["section"]
|
||||
# Section should always start with § or be empty
|
||||
if section:
|
||||
assert section.startswith("§"), (
|
||||
f"Unexpected section format: {section!r}"
|
||||
)
|
||||
|
||||
def test_subsection_letters_preserved(self, plain_chunks):
|
||||
"""Lettered subsections (a, b, c, d, e) in § 312k(2) should be in the text."""
|
||||
all_text = " ".join(plain_chunks)
|
||||
# § 312k Abs 2 Nr 1 has a) through e)
|
||||
for letter in ["a)", "b)", "c)", "d)", "e)"]:
|
||||
assert letter in all_text, (
|
||||
f"Subsection letter {letter} from § 312k(2) missing"
|
||||
)
|
||||
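The section tests above depend on the `[§ ...]` prefix being parsed back into `section` and `section_title`. The following reduced sketch covers only `§` headers and uses a hypothetical helper name `section_from_chunk`. It follows the same prefix-then-number approach as `chunk_text_legal_structured` but is not the code from the diff.

```python
import re
from typing import Dict

# Only the "§" alternative of the section-number pattern, for illustration.
SECTION_RE = re.compile(r'§\s*(\d+[a-z]*)', re.IGNORECASE)


def section_from_chunk(chunk: str) -> Dict[str, str]:
    """Split a leading "[§ 312k Title]" prefix into section and title."""
    prefix = re.match(r'^\[(.+?)\]\s*', chunk)
    if not prefix:
        return {"section": "", "section_title": ""}
    header = prefix.group(1)
    m = SECTION_RE.search(header)
    if not m:
        # No recognizable number: the whole header is treated as the title.
        return {"section": "", "section_title": header.strip()}
    # Title is everything after the number, minus leading punctuation.
    title = header[m.end():].lstrip(' .-\u2013\u2014:').strip()
    return {"section": m.group(0).strip(), "section_title": title}
```

For `"[§ 312k Kuendigung von Verbrauchervertraegen] (1) Text"` this returns section `"§ 312k"` with the rest of the header as the title. Because `[a-z]*` stops at the first space, `§ 312` and `§ 312a` are kept apart, which is what `test_312k_not_merged_with_312g` relies on for the lettered sections.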
@@ -0,0 +1,248 @@
|
||||
"""
|
||||
Tests for NIST/BSI/ENISA PDF text normalization and section detection.
|
||||
|
||||
Covers:
|
||||
- _normalize_pdf_text() fixing broken multi-column PDF artifacts
|
||||
- Section detection after normalization
|
||||
- NIST CSF 2.0 compound IDs (GV.OC-01)
|
||||
- NIST SP 800-53 control IDs (AC-1, AC-1(1))
|
||||
- OWASP Top 10 IDs (A01:2021)
|
||||
- Unicode normalization (ligatures, soft hyphens)
|
||||
"""
|
||||
|
||||
from main import (
|
||||
_normalize_pdf_text,
|
||||
_extract_section_header,
|
||||
_parse_section_metadata,
|
||||
chunk_text_legal,
|
||||
chunk_text_legal_structured,
|
||||
)
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# _normalize_pdf_text — broken spacing fixes
|
||||
# =========================================================================
|
||||
|
||||
class TestNormalizePdfText:
|
||||
|
||||
def test_broken_section_number(self):
|
||||
assert _normalize_pdf_text("1 . 1 Risk Framing") == "1.1 Risk Framing"
|
||||
|
||||
def test_nested_section_number(self):
|
||||
assert _normalize_pdf_text("2 . 3 . 1 Subtitle") == "2.3.1 Subtitle"
|
||||
|
||||
def test_broken_nist_control_id(self):
|
||||
assert _normalize_pdf_text("AC - 1 Account Management") == "AC-1 Account Management"
|
||||
|
||||
def test_broken_nist_control_au(self):
|
||||
assert _normalize_pdf_text("AU - 2 Audit Events") == "AU-2 Audit Events"
|
||||
|
||||
def test_broken_csf_compound_id(self):
|
||||
assert _normalize_pdf_text("GV . OC - 01 Context") == "GV.OC-01 Context"
|
||||
|
||||
def test_broken_enhancement_parens(self):
|
||||
assert _normalize_pdf_text("AC-1( 1 ) Enhancement") == "AC-1(1) Enhancement"
|
||||
|
||||
def test_soft_hyphen_removed(self):
|
||||
assert _normalize_pdf_text("infor\u00admation") == "information"
|
||||
|
||||
def test_zero_width_space_removed(self):
|
||||
assert _normalize_pdf_text("data\u200bprotection") == "dataprotection"
|
||||
|
||||
def test_ligature_fi_normalized(self):
|
||||
# U+FB01 = fi ligature
|
||||
assert _normalize_pdf_text("con\ufb01dential") == "confidential"
|
||||
|
||||
def test_ligature_fl_normalized(self):
|
||||
# U+FB02 = fl ligature
|
||||
assert _normalize_pdf_text("over\ufb02ow") == "overflow"
|
||||
|
||||
def test_multiple_spaces_collapsed(self):
|
||||
assert _normalize_pdf_text("too   many    spaces") == "too many spaces"
|
||||
|
||||
def test_newlines_preserved(self):
|
||||
result = _normalize_pdf_text("line one\nline two\n\nline three")
|
||||
assert "\n" in result
|
||||
assert "line one" in result
|
||||
assert "line three" in result
|
||||
|
||||
def test_normal_text_unchanged(self):
|
||||
text = "AC-1 Account Management requires proper controls."
|
||||
assert _normalize_pdf_text(text) == text
|
||||
|
||||
def test_combined_artifacts(self):
|
||||
"""Multiple broken artifacts in one text block."""
|
||||
broken = "1 . 1 Overview\nAC - 1 Account Management\nGV . OC - 01 Context"
|
||||
fixed = _normalize_pdf_text(broken)
|
||||
assert "1.1 Overview" in fixed
|
||||
assert "AC-1 Account Management" in fixed
|
||||
assert "GV.OC-01 Context" in fixed
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Section detection after normalization
|
||||
# =========================================================================
|
||||
|
||||
class TestNistSectionDetection:
|
||||
|
||||
def test_nist_control_ac1(self):
|
||||
assert _extract_section_header("AC-1 Account Management") is not None
|
||||
|
||||
def test_nist_control_au2(self):
|
||||
assert _extract_section_header("AU-2 Audit Events") is not None
|
||||
|
||||
def test_nist_csf_compound(self):
|
||||
assert _extract_section_header("GV.OC-01 Organizational Context") is not None
|
||||
|
||||
def test_nist_enhancement(self):
|
||||
assert _extract_section_header("AC-1(1) Policy and Procedures") is not None
|
||||
|
||||
def test_owasp_top10(self):
|
||||
assert _extract_section_header("A01:2021 Broken Access Control") is not None
|
||||
|
||||
def test_owasp_without_year(self):
|
||||
assert _extract_section_header("A03 Injection") is not None
|
||||
|
||||
def test_numbered_section(self):
|
||||
assert _extract_section_header("2.1 Risk Framing") is not None
|
||||
|
||||
def test_deep_numbered_section(self):
|
||||
assert _extract_section_header("3.2.1 Assessment Methodology") is not None
|
||||
|
||||
def test_broken_then_normalized_detects(self):
|
||||
"""After normalization, broken NIST IDs should be detected as sections."""
|
||||
broken = "AC - 1 Account Management"
|
||||
normalized = _normalize_pdf_text(broken)
|
||||
assert _extract_section_header(normalized) is not None
|
||||
|
||||
def test_broken_csf_then_normalized_detects(self):
|
||||
broken = "GV . OC - 01 Organizational Context"
|
||||
normalized = _normalize_pdf_text(broken)
|
||||
assert _extract_section_header(normalized) is not None
|
||||
|
||||
def test_broken_section_num_then_normalized(self):
|
||||
broken = "2 . 1 Risk Framing"
|
||||
normalized = _normalize_pdf_text(broken)
|
||||
assert _extract_section_header(normalized) is not None
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Section metadata extraction (_parse_section_metadata)
|
||||
# =========================================================================
|
||||
|
||||
class TestNistSectionMetadata:
|
||||
|
||||
def test_nist_control_ac1_section(self):
|
||||
meta = _parse_section_metadata("AC-1 POLICY AND PROCEDURES")
|
||||
assert meta["section"] == "AC-1"
|
||||
|
||||
def test_nist_control_au2_section(self):
|
||||
meta = _parse_section_metadata("AU-2 Audit Events")
|
||||
assert meta["section"] == "AU-2"
|
||||
|
||||
def test_nist_enhancement_section(self):
|
||||
meta = _parse_section_metadata("AC-1(1) Policy and Procedures")
|
||||
assert meta["section"] == "AC-1(1)"
|
||||
|
||||
def test_nist_csf_compound_section(self):
|
||||
meta = _parse_section_metadata("GV.OC-01 Organizational Context")
|
||||
assert meta["section"] == "GV.OC-01"
|
||||
|
||||
def test_numbered_section(self):
|
||||
meta = _parse_section_metadata("3.1 ACCESS CONTROL")
|
||||
assert meta["section"] == "3.1"
|
||||
|
||||
def test_deep_numbered_section(self):
|
||||
meta = _parse_section_metadata("2.3.1 Subtitle")
|
||||
assert meta["section"] == "2.3.1"
|
||||
|
||||
def test_owasp_section(self):
|
||||
meta = _parse_section_metadata("A01:2021 Broken Access Control")
|
||||
assert meta["section"] == "A01:2021"
|
||||
|
||||
def test_section_title_extracted(self):
|
||||
meta = _parse_section_metadata("AC-1 POLICY AND PROCEDURES")
|
||||
assert meta["section_title"] == "POLICY AND PROCEDURES"
|
||||
|
||||
def test_numbered_section_title(self):
|
||||
meta = _parse_section_metadata("3.1 ACCESS CONTROL")
|
||||
assert meta["section_title"] == "ACCESS CONTROL"
|
||||
|
||||
def test_single_number_allcaps_section(self):
|
||||
"""ENISA-style: '1. INTRODUCTION'"""
|
||||
assert _extract_section_header("1. INTRODUCTION") is not None
|
||||
|
||||
def test_single_number_section_metadata(self):
|
||||
meta = _parse_section_metadata("1. INTRODUCTION")
|
||||
assert meta["section"] == "1"
|
||||
assert meta["section_title"] == "INTRODUCTION"
|
||||
|
||||
def test_single_number_lowercase_not_matched(self):
|
||||
"""'1. First item' should NOT be a section (lowercase title)."""
|
||||
assert _extract_section_header("1. First item in a list") is None
|
||||
|
||||
def test_structured_chunks_have_section(self):
|
||||
text = (
|
||||
"3.1 ACCESS CONTROL\n"
|
||||
"Overview of access control family.\n\n"
|
||||
"AC-1 POLICY AND PROCEDURES\n"
|
||||
"The organization develops, documents, and disseminates an access "
|
||||
"control policy that addresses purpose, scope, roles, responsibilities, "
|
||||
"management commitment, coordination among entities.\n\n"
|
||||
"AC-2 ACCOUNT MANAGEMENT\n"
|
||||
"The information system enforces approved authorizations for logical "
|
||||
"access to information and system resources.\n"
|
||||
)
|
||||
result = chunk_text_legal_structured(text, chunk_size=300, overlap=50)
|
||||
sections = [r.get("section", "") for r in result]
|
||||
assert any(s == "AC-1" for s in sections)
|
||||
assert any(s == "AC-2" for s in sections)
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Chunking with NIST-style text
|
||||
# =========================================================================
|
||||
|
||||
class TestNistChunking:
|
||||
|
||||
NIST_SAMPLE = (
|
||||
"AC-1 Account Management\n"
|
||||
"The organization develops, documents, and disseminates an access "
|
||||
"control policy that addresses purpose, scope, roles, responsibilities, "
|
||||
"management commitment, coordination among organizational entities, "
|
||||
"and compliance.\n\n"
|
||||
"AC-2 Access Enforcement\n"
|
||||
"The information system enforces approved authorizations for logical "
|
||||
"access to information and system resources in accordance with "
|
||||
"applicable access control policies.\n\n"
|
||||
"AC-3 Information Flow Enforcement\n"
|
||||
"The system enforces approved authorizations for controlling the flow "
|
||||
"of information within the system and between interconnected systems.\n"
|
||||
)
|
||||
|
||||
def test_chunks_have_section_prefix(self):
|
||||
chunks = chunk_text_legal(self.NIST_SAMPLE, chunk_size=300, overlap=50)
|
||||
assert any("[AC-1" in c for c in chunks)
|
||||
assert any("[AC-2" in c for c in chunks)
|
||||
|
||||
def test_sections_detected(self):
|
||||
chunks = chunk_text_legal(self.NIST_SAMPLE, chunk_size=500, overlap=50)
|
||||
assert len(chunks) >= 2
|
||||
|
||||
def test_normalized_broken_text_chunks_correctly(self):
|
||||
"""Broken PDF text should chunk correctly after normalization."""
|
||||
broken = (
|
||||
"AC - 1 Account Management\n"
|
||||
"The organization develops, documents, and disseminates an access "
|
||||
"control policy that addresses purpose, scope, roles, responsibilities, "
|
||||
"management commitment, coordination among organizational entities, "
|
||||
"and compliance with applicable regulations and standards.\n\n"
|
||||
"AC - 2 Access Enforcement\n"
|
||||
"The information system enforces approved authorizations for logical "
|
||||
"access to information and system resources in accordance with "
|
||||
"applicable access control policies and procedures.\n"
|
||||
)
|
||||
normalized = _normalize_pdf_text(broken)
|
||||
chunks = chunk_text_legal(normalized, chunk_size=300, overlap=50)
|
||||
assert any("[AC-1" in c for c in chunks)
|
||||
assert any("[AC-2" in c for c in chunks)
|
||||
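The tests above pin down the expected behaviour of `_normalize_pdf_text` and `_parse_section_metadata`, but neither implementation appears in this diff. A minimal regex-based sketch that would satisfy the assertions shown might look like the following. The names `normalize_pdf_text_sketch` and `parse_header_sketch`, and every pattern in them, are assumptions for illustration, not the repository's code.

```python
import re
from typing import Optional

# Hypothetical stand-ins: the real _normalize_pdf_text and
# _parse_section_metadata are not shown in this diff.
_CHAR_FIXES = str.maketrans({
    "\u00ad": None,  # soft hyphen: delete
    "\u200b": None,  # zero-width space: delete
    "\ufb01": "fi",  # fi ligature
    "\ufb02": "fl",  # fl ligature
})

_REPAIRS = [
    (re.compile(r"(?<=\d) \. (?=\d)"), "."),                  # "1 . 1"   -> "1.1"
    (re.compile(r"\b([A-Z]{2}) \. ([A-Z]{2})\b"), r"\1.\2"),  # "GV . OC" -> "GV.OC"
    (re.compile(r"\b([A-Z]{2}) - (\d+)"), r"\1-\2"),          # "AC - 1"  -> "AC-1"
    (re.compile(r"\( (\d+) \)"), r"(\1)"),                    # "( 1 )"   -> "(1)"
    (re.compile(r"[ \t]{2,}"), " "),                          # collapse blanks, keep newlines
]

_HEADER_PATTERNS = [
    re.compile(r"^(?P<id>[A-Z]{2}\.[A-Z]{2}-\d{2})\s+(?P<title>\S.*)$"),   # NIST CSF: GV.OC-01
    re.compile(r"^(?P<id>[A-Z]{2}-\d+(?:\(\d+\))?)\s+(?P<title>\S.*)$"),   # NIST control: AC-1, AC-1(1)
    re.compile(r"^(?P<id>A\d{2}(?::\d{4})?)\s+(?P<title>\S.*)$"),          # OWASP: A01:2021, A03
    re.compile(r"^(?P<id>\d+(?:\.\d+)+)\s+(?P<title>\S.*)$"),              # numbered: 2.1, 3.2.1
    re.compile(r"^(?P<id>\d+)\.\s+(?P<title>[A-Z][A-Z -]+)$"),             # "1. INTRODUCTION" (all caps only)
]


def normalize_pdf_text_sketch(text: str) -> str:
    """Remove invisible characters and ligatures, then repair split identifiers."""
    text = text.translate(_CHAR_FIXES)
    for pattern, replacement in _REPAIRS:
        text = pattern.sub(replacement, text)
    return text


def parse_header_sketch(line: str) -> Optional[dict]:
    """Return section id and title if the line looks like a header, else None."""
    for pattern in _HEADER_PATTERNS:
        match = pattern.match(line.strip())
        if match:
            return {"section": match["id"], "section_title": match["title"]}
    return None


header = parse_header_sketch(normalize_pdf_text_sketch("GV . OC - 01 Organizational Context"))
```

Running normalization before header detection is what lets a broken identifier such as `GV . OC - 01` be recognised as a section. The two-letter prefix patterns are deliberately narrow and could still misfire on ordinary prose like `US - 5`, so a real implementation would likely need tighter rules.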
@@ -0,0 +1,62 @@
|
||||
§ 312 Anwendungsbereich
|
||||
|
||||
(1) Die Vorschriften der Kapitel 1 und 2 dieses Untertitels sind auf Verbrauchervertraege anzuwenden, bei denen sich der Verbraucher zu der Zahlung eines Preises verpflichtet.
|
||||
|
||||
(1a) Die Vorschriften der Kapitel 1 und 2 dieses Untertitels sind auch auf Verbrauchervertraege anzuwenden, bei denen der Verbraucher dem Unternehmer personenbezogene Daten bereitstellt oder sich hierzu verpflichtet. Dies gilt nicht, wenn der Unternehmer die vom Verbraucher bereitgestellten personenbezogenen Daten ausschliesslich verarbeitet, um seine Leistungspflicht oder an ihn gestellte rechtliche Anforderungen zu erfuellen, und sie zu keinem anderen Zweck verarbeitet.
|
||||
|
||||
(2) Von den Vorschriften der Kapitel 1 und 2 dieses Untertitels ist nur § 312a Absatz 1, 3, 4 und 6 auf folgende Vertraege anzuwenden:
|
||||
1. notariell beurkundete Vertraege
|
||||
2. Vertraege ueber die Begruendung, den Erwerb oder die Uebertragung von Eigentum oder anderen Rechten an Grundstuecken
|
||||
3. Vertraege ueber den Bau von neuen Gebaeuden oder erhebliche Umbaumassnahmen an bestehenden Gebaeuden
|
||||
4. Vertraege ueber Reiseleistungen nach § 651a
|
||||
5. Vertraege ueber die Befoerderung von Personen
|
||||
6. Vertraege, die unter Einsatz von Warenautomaten oder automatisierten Geschaeftsraeumen geschlossen werden
|
||||
|
||||
§ 312a Allgemeine Pflichten und Grundsaetze bei Verbrauchervertraegen
|
||||
|
||||
(1) Ruft der Unternehmer oder eine Person, die in seinem Namen oder Auftrag handelt, den Verbraucher an, um mit diesem einen Vertrag zu schliessen, hat der Anrufer zu Beginn des Gespraechs seine Identitaet und gegebenenfalls die Identitaet der Person, fuer die er anruft, sowie den geschaeftlichen Zweck des Anrufs offenzulegen.
|
||||
|
||||
(2) Der Unternehmer ist verpflichtet, den Verbraucher nach Massgabe des Artikels 246 des Einfuehrungsgesetzes zum Buergerlichen Gesetzbuche zu informieren. Der Unternehmer kann von dem Verbraucher Fracht-, Liefer- oder Versandkosten und sonstige Kosten nur verlangen, soweit er den Verbraucher ueber diese Kosten entsprechend den Anforderungen aus Artikel 246 Absatz 1 Nummer 3 des Einfuehrungsgesetzes zum Buergerlichen Gesetzbuche informiert hat. Die Saetze 1 und 2 sind weder auf ausserhalb von Geschaeftsraeumen geschlossene Vertraege noch auf Fernabsatzvertraege noch auf Vertraege ueber Finanzdienstleistungen anzuwenden.
|
||||
|
||||
(3) Eine Vereinbarung, die auf eine ueber das vereinbarte Entgelt fuer die Hauptleistung hinausgehende Zahlung des Verbrauchers gerichtet ist, kann ein Unternehmer mit einem Verbraucher nur ausdruecklich treffen. Schliesst der Unternehmer und der Verbraucher einen Vertrag im elektronischen Geschaeftsverkehr, wird eine solche Vereinbarung nur Vertragsbestandteil, wenn der Unternehmer die Vereinbarung nicht durch eine Voreinstellung herbeifuehrt.
|
||||
|
||||
(4) Eine Vereinbarung, durch die ein Verbraucher verpflichtet wird, ein Entgelt dafuer zu zahlen, dass der Verbraucher fuer die Erfuellung seiner vertraglichen Pflichten ein bestimmtes Zahlungsmittel nutzt, ist unwirksam, wenn fuer den Verbraucher keine zumutbare und gaengige unentgeltliche Zahlungsmoeglichkeit besteht oder das vereinbarte Entgelt ueber die Kosten hinausgeht, die dem Unternehmer durch die Nutzung des Zahlungsmittels entstehen.
|
||||
|
||||
(5) Eine Vereinbarung, durch die ein Verbraucher verpflichtet wird, ein Entgelt dafuer zu zahlen, dass der Verbraucher den Unternehmer wegen Fragen oder Erklaerungen zu einem zwischen ihnen geschlossenen Vertrag ueber eine Rufnummer anruft, die der Unternehmer fuer solche Zwecke bereithaelt, ist unwirksam, wenn das vereinbarte Entgelt das Entgelt fuer die blosse Nutzung des Telekommunikationsdienstes uebersteigt.
|
||||
|
||||
(6) Ist eine Vereinbarung nach den Absaetzen 3 bis 5 nicht Vertragsbestandteil geworden oder ist sie unwirksam, bleibt der Vertrag im Uebrigen wirksam.
|
||||
|
||||
§ 312g Widerrufsrecht
|
||||
|
||||
(1) Dem Verbraucher steht bei ausserhalb von Geschaeftsraeumen geschlossenen Vertraegen und bei Fernabsatzvertraegen ein Widerrufsrecht gemaess § 355 zu.
|
||||
|
||||
(2) Das Widerrufsrecht besteht, soweit die Parteien nichts anderes vereinbart haben, nicht bei folgenden Vertraegen:
|
||||
1. Vertraege zur Lieferung von Waren, die nicht vorgefertigt sind und fuer deren Herstellung eine individuelle Auswahl oder Bestimmung durch den Verbraucher massgeblich ist oder die eindeutig auf die persoenlichen Beduerfnisse des Verbrauchers zugeschnitten sind,
|
||||
2. Vertraege zur Lieferung von Waren, die schnell verderben koennen oder deren Verfallsdatum schnell ueberschritten wuerde,
|
||||
3. Vertraege zur Lieferung versiegelter Waren, die aus Gruenden des Gesundheitsschutzes oder der Hygiene nicht zur Rueckgabe geeignet sind, wenn ihre Versiegelung nach der Lieferung entfernt wurde.
|
||||
|
||||
(3) Das Widerrufsrecht besteht ferner nicht bei Vertraegen, bei denen dem Verbraucher bereits auf Grund der §§ 495, 506 bis 513 ein Widerrufsrecht zusteht.
|
||||
|
||||
§ 312k Kuendigung von Verbrauchervertraegen im elektronischen Geschaeftsverkehr
|
||||
|
||||
(1) Wird Verbrauchern ueber eine Webseite ermoeglicht, einen Vertrag im elektronischen Geschaeftsverkehr zu schliessen, der auf die Begruendung eines Dauerschuldverhaeltnisses gerichtet ist, das einen Unternehmer zu einer entgeltlichen Leistung verpflichtet, so treffen den Unternehmer die Pflichten nach dieser Vorschrift. Dies gilt nicht
|
||||
1. fuer Vertraege, fuer deren Kuendigung gesetzlich ausschliesslich eine strengere Form als die Textform vorgesehen ist, und
|
||||
2. in Bezug auf Webseiten, die Finanzdienstleistungen betreffen, oder fuer Vertraege ueber Finanzdienstleistungen.
|
||||
|
||||
(2) Der Unternehmer hat sicherzustellen, dass der Verbraucher auf der Webseite eine Erklaerung zur ordentlichen oder ausserordentlichen Kuendigung eines auf der Webseite abschliessbaren Vertrags nach Absatz 1 Satz 1 ueber eine Kuendigungsschaltflaeche abgeben kann. Die Kuendigungsschaltflaeche muss gut lesbar mit nichts anderem als den Woertern "Vertraege hier kuendigen" oder mit einer entsprechenden eindeutigen Formulierung beschriftet sein. Sie muss den Verbraucher unmittelbar zu einer Bestaetigungsseite fuehren, die
|
||||
1. den Verbraucher auffordert und ihm ermoeglicht Angaben zu machen
|
||||
a) zur Art der Kuendigung sowie im Falle der ausserordentlichen Kuendigung zum Kuendigungsgrund,
|
||||
b) zu seiner eindeutigen Identifizierbarkeit,
|
||||
c) zur eindeutigen Bezeichnung des Vertrags,
|
||||
d) zum Zeitpunkt, zu dem die Kuendigung das Vertragsverhaeltnis beenden soll,
|
||||
e) zur schnellen elektronischen Uebermittlung der Kuendigungsbestaetigung an ihn und
|
||||
2. eine Bestaetigungsschaltflaeche enthaelt, ueber deren Betaetigung der Verbraucher die Kuendigungserklaerung abgeben kann und die gut lesbar mit nichts anderem als den Woertern "jetzt kuendigen" oder mit einer entsprechenden eindeutigen Formulierung beschriftet ist.
|
||||
Die Schaltflaechen und die Bestaetigungsseite muessen staendig verfuegbar sowie unmittelbar und leicht zugaenglich sein.
|
||||
|
||||
(3) Der Verbraucher muss seine durch das Betaetigen der Bestaetigungsschaltflaeche abgegebene Kuendigungserklaerung mit dem Datum und der Uhrzeit der Abgabe auf einem dauerhaften Datentraeger so speichern koennen, dass erkennbar ist, dass die Kuendigungserklaerung durch das Betaetigen der Bestaetigungsschaltflaeche abgegeben wurde.
|
||||
|
||||
(4) Der Unternehmer hat dem Verbraucher den Inhalt sowie Datum und Uhrzeit des Zugangs der Kuendigungserklaerung sowie den Zeitpunkt, zu dem das Vertragsverhaeltnis durch die Kuendigung beendet werden soll, sofort auf elektronischem Wege in Textform zu bestaetigen. Es wird vermutet, dass eine durch das Betaetigen der Bestaetigungsschaltflaeche abgegebene Kuendigungserklaerung dem Unternehmer unmittelbar nach ihrer Abgabe zugegangen ist.
|
||||
|
||||
(5) Wenn der Verbraucher bei der Abgabe der Kuendigungserklaerung keinen Zeitpunkt angibt, zu dem die Kuendigung das Vertragsverhaeltnis beenden soll, wirkt die Kuendigung im Zweifel zum fruehestmoeglichen Zeitpunkt.
|
||||
|
||||
(6) Werden die Schaltflaechen und die Bestaetigungsseite nicht entsprechend den Absaetzen 1 und 2 zur Verfuegung gestellt, kann ein Verbraucher einen Vertrag, fuer dessen Kuendigung die Schaltflaechen und die Bestaetigungsseite zur Verfuegung zu stellen sind, jederzeit und ohne Einhaltung einer Kuendigungsfrist kuendigen. Die Moeglichkeit des Verbrauchers zur ausserordentlichen Kuendigung bleibt hiervon unberuehrt.
|
||||
@@ -318,6 +318,8 @@ server {
|
||||
set $upstream_admin_compliance bp-compliance-admin:3000;
|
||||
proxy_pass http://$upstream_admin_compliance;
|
||||
proxy_http_version 1.1;
|
||||
proxy_read_timeout 300s;
|
||||
proxy_send_timeout 300s;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_set_header Host $host;
|
||||
|
||||
@@ -7,6 +7,7 @@ from pydantic import BaseModel
|
||||
|
||||
from api.auth import optional_jwt_auth
|
||||
from embedding_client import embedding_client
|
||||
from html_utils import decode_html_bytes, looks_like_html, strip_html
|
||||
from minio_client_wrapper import minio_wrapper
|
||||
from qdrant_client_wrapper import qdrant_wrapper
|
||||
|
||||
@@ -14,6 +15,9 @@ logger = logging.getLogger("rag-service.api.documents")
|
||||
|
||||
router = APIRouter(prefix="/api/v1/documents")
|
||||
|
||||
# Structural metadata fields from embedding-service chunks_with_metadata (D2)
|
||||
_STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
|
||||
|
||||
|
||||
# ---- Request / Response models --------------------------------------------
|
||||
|
||||
@@ -98,9 +102,16 @@ async def upload_document(
|
||||
try:
|
||||
if content_type == "application/pdf" or filename.lower().endswith(".pdf"):
|
||||
text = await embedding_client.extract_pdf(file_bytes)
|
||||
elif filename.lower().endswith((".html", ".htm")):
|
||||
text = decode_html_bytes(file_bytes)
|
||||
text = strip_html(text)
|
||||
logger.info("Decoded + stripped HTML from %s", filename)
|
||||
else:
|
||||
# Try to decode as text
|
||||
text = file_bytes.decode("utf-8", errors="replace")
|
||||
# Strip HTML if content looks like HTML despite extension
|
||||
if looks_like_html(text):
|
||||
text = strip_html(text)
|
||||
logger.info("Stripped HTML tags from %s", filename)
|
||||
except Exception as exc:
|
||||
logger.error("Text extraction failed: %s", exc)
|
||||
raise HTTPException(status_code=500, detail=f"Text extraction failed: {exc}")
|
||||
@@ -110,7 +121,7 @@ async def upload_document(
|
||||
|
||||
# --- Chunk ---
|
||||
try:
|
||||
- chunks = await embedding_client.chunk_text(
|
||||
+ chunk_result = await embedding_client.chunk_text(
|
||||
text=text,
|
||||
strategy=chunk_strategy,
|
||||
chunk_size=chunk_size,
|
||||
@@ -120,6 +131,9 @@ async def upload_document(
|
||||
logger.error("Chunking failed: %s", exc)
|
||||
raise HTTPException(status_code=500, detail=f"Chunking failed: {exc}")
|
||||
|
||||
chunks = chunk_result.chunks
|
||||
chunks_meta = chunk_result.chunks_with_metadata
|
||||
|
||||
if not chunks:
|
||||
raise HTTPException(status_code=400, detail="Chunking produced zero chunks")
|
||||
|
||||
@@ -154,6 +168,13 @@ async def upload_document(
|
||||
"year": year,
|
||||
**extra_metadata,
|
||||
}
|
||||
# Merge structural metadata from embedding service (D2)
|
||||
if i < len(chunks_meta):
|
||||
meta = chunks_meta[i]
|
||||
for field in _STRUCT_FIELDS:
|
||||
value = meta.get(field)
|
||||
if value is not None and value != "":
|
||||
payload[field] = value
|
||||
payloads.append(payload)
|
||||
|
||||
# --- Index in Qdrant ---
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
import logging
|
||||
import os
|
||||
- from typing import Optional
|
||||
+ from dataclasses import dataclass
|
||||
|
||||
import httpx
|
||||
|
||||
@@ -19,6 +19,14 @@ _OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "bge-m3")
|
||||
_EMBED_BATCH_SIZE = int(os.getenv("EMBED_BATCH_SIZE", "32"))
|
||||
|
||||
|
||||
@dataclass
|
||||
class ChunkResult:
|
||||
"""Result from the embedding service /chunk endpoint."""
|
||||
|
||||
chunks: list[str]
|
||||
chunks_with_metadata: list[dict]
|
||||
|
||||
|
||||
class EmbeddingClient:
|
||||
"""
|
||||
Hybrid client:
|
||||
@@ -120,10 +128,10 @@ class EmbeddingClient:
|
||||
strategy: str = "recursive",
|
||||
chunk_size: int = 512,
|
||||
overlap: int = 50,
|
||||
- ) -> list[str]:
|
||||
+ ) -> ChunkResult:
|
||||
"""
|
||||
Ask the embedding service to chunk a long text.
|
||||
- Returns a list of chunk strings.
|
||||
+ Returns ChunkResult with plain chunks and structural metadata.
|
||||
"""
|
||||
async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
|
||||
response = await client.post(
|
||||
@@ -137,7 +145,10 @@ class EmbeddingClient:
|
||||
)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
- return data.get("chunks", [])
|
||||
+ return ChunkResult(
|
||||
+ chunks=data.get("chunks", []),
|
||||
+ chunks_with_metadata=data.get("chunks_with_metadata") or [],
|
||||
+ )
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# PDF extraction (via embedding-service)
|
||||
|
||||
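The new return statement uses `data.get("chunks_with_metadata") or []` rather than a `.get()` default. The sketch below illustrates why that matters when the embedding service sends an explicit JSON `null`. `parse_chunk_response` is a hypothetical helper that only mirrors the return expression; it is not part of the client.

```python
from dataclasses import dataclass


@dataclass
class ChunkResult:
    """Same shape as the dataclass added in embedding_client.py."""

    chunks: list[str]
    chunks_with_metadata: list[dict]


def parse_chunk_response(data: dict) -> ChunkResult:
    """Hypothetical helper mirroring the return statement in chunk_text()."""
    return ChunkResult(
        chunks=data.get("chunks", []),
        # `or []` also covers an explicit null; .get(key, []) would pass None through.
        chunks_with_metadata=data.get("chunks_with_metadata") or [],
    )


null_response = {"chunks": ["chunk1"], "chunks_with_metadata": None}
result = parse_chunk_response(null_response)

# For contrast: a plain default only applies when the key is missing entirely.
naive = null_response.get("chunks_with_metadata", [])
```

With the `or []` form, callers such as `upload_document` can evaluate `len(chunks_meta)` without first checking for `None`.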
@@ -0,0 +1,66 @@
|
||||
"""HTML detection and stripping for legal document ingestion."""
|
||||
|
||||
import re
|
||||
from html import unescape
|
||||
|
||||
_HTML_TAG_RE = re.compile(r'<(html|head|body|div|p|span|table)\b', re.IGNORECASE)
|
||||
_CHARSET_RE = re.compile(
|
||||
r'<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_-]+)', re.IGNORECASE,
|
||||
)
|
||||
|
||||
|
||||
def looks_like_html(text: str) -> bool:
|
||||
"""Check if text contains HTML tags."""
|
||||
return bool(_HTML_TAG_RE.search(text[:500]))
|
||||
|
||||
|
||||
def decode_html_bytes(raw: bytes) -> str:
|
||||
"""Decode HTML bytes with charset detection from meta tags.
|
||||
|
||||
Tries UTF-8 first, falls back to charset from HTML meta tag, then latin-1.
|
||||
"""
|
||||
try:
|
||||
text = raw.decode("utf-8")
|
||||
# Check if UTF-8 decode produced replacement characters
|
||||
if "\ufffd" not in text:
|
||||
return text
|
||||
except UnicodeDecodeError:
|
||||
pass
|
||||
|
||||
# Peek at ASCII-safe portion to find charset
|
||||
ascii_head = raw[:2000].decode("ascii", errors="ignore")
|
||||
m = _CHARSET_RE.search(ascii_head)
|
||||
if m:
|
||||
charset = m.group(1).lower().replace("_", "-")
|
||||
try:
|
||||
return raw.decode(charset)
|
||||
except (UnicodeDecodeError, LookupError):
|
||||
pass
|
||||
|
||||
# Last resort: iso-8859-1 (covers all byte values)
|
||||
return raw.decode("iso-8859-1")
|
||||
|
||||
|
||||
def strip_html(html_text: str) -> str:
|
||||
"""Convert HTML to plain text preserving legal document structure."""
|
||||
text = html_text
|
||||
# Remove script/style blocks
|
||||
text = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', text, flags=re.DOTALL | re.IGNORECASE)
|
||||
# Block elements → newline (preserves § paragraph structure)
|
||||
# Opening block tags also get newline (e.g., <h3> before § signs)
|
||||
text = re.sub(
|
||||
r'<(div|p|h[1-6]|li|tr|dt|dd|section|article|blockquote)\b[^>]*>',
|
||||
'\n', text, flags=re.IGNORECASE,
|
||||
)
|
||||
text = re.sub(
|
||||
r'</(div|p|h[1-6]|li|tr|dt|dd|section|article|blockquote)>',
|
||||
'\n', text, flags=re.IGNORECASE,
|
||||
)
|
||||
text = re.sub(r'<br\s*/?>', '\n', text, flags=re.IGNORECASE)
|
||||
# Strip remaining tags
|
||||
text = re.sub(r'<[^>]+>', '', text)
|
||||
# Decode HTML entities (&ouml; → ö, &sect; → §)
|
||||
text = unescape(text)
|
||||
# Clean up excessive whitespace
|
||||
text = re.sub(r'\n{3,}', '\n\n', text)
|
||||
return text.strip()
|
||||
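To make the fallback order in `decode_html_bytes` concrete, the sketch below runs the charset-sniffing step on a small ISO-8859-1 page. The regular expression is copied from `_CHARSET_RE` above. The helper name `sniff_charset` and the sample page are assumptions made for this illustration.

```python
import re
from typing import Optional

# Same pattern as _CHARSET_RE in html_utils.py.
CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_-]+)', re.IGNORECASE,
)


def sniff_charset(raw: bytes) -> Optional[str]:
    """Hypothetical helper: return the charset declared in a <meta> tag, if any."""
    # Only the ASCII-safe part of the head is inspected, as in decode_html_bytes.
    head = raw[:2000].decode("ascii", errors="ignore")
    match = CHARSET_RE.search(head)
    return match.group(1).lower().replace("_", "-") if match else None


page = (
    '<html><head><meta charset="ISO-8859-1"></head>'
    "<body>§ 312k Kündigung</body></html>"
)
raw = page.encode("iso-8859-1")

# Step 1: strict UTF-8 fails because the bytes for § and ü are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Step 2: fall back to the declared charset, then to iso-8859-1 as a last resort.
declared = sniff_charset(raw)
decoded = raw.decode(declared or "iso-8859-1")
```

ISO-8859-1 works as the final fallback because it maps every byte value to a code point, so the decode cannot raise, although the result may be wrong for pages in another encoding.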
@@ -0,0 +1,7 @@
|
||||
"""Shared test fixtures for rag-service tests."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
# Ensure rag-service root is on sys.path so imports resolve
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
@@ -0,0 +1,172 @@
|
||||
"""Tests for document upload payload building — structural metadata (D2)."""
|
||||
|
||||
# Mirror the constant from api/documents.py to avoid heavy import chain
|
||||
# (api → jose, qdrant_client, minio, etc.)
|
||||
_STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
|
||||
|
||||
|
||||
def _build_payload(
|
||||
chunk: str,
|
||||
index: int,
|
||||
chunks_meta: list[dict],
|
||||
extra_metadata: "dict | None" = None,
|
||||
) -> dict:
|
||||
"""Replicate the payload-building logic from documents.py for unit testing."""
|
||||
payload = {
|
||||
"document_id": "test-doc-id",
|
||||
"object_name": "test/path.pdf",
|
||||
"filename": "path.pdf",
|
||||
"chunk_index": index,
|
||||
"chunk_text": chunk,
|
||||
"data_type": "law",
|
||||
"bundesland": "bund",
|
||||
"use_case": "compliance",
|
||||
"year": "2026",
|
||||
**(extra_metadata or {}),
|
||||
}
|
||||
if index < len(chunks_meta):
|
||||
meta = chunks_meta[index]
|
||||
for field in _STRUCT_FIELDS:
|
||||
value = meta.get(field)
|
||||
if value is not None and value != "":
|
||||
payload[field] = value
|
||||
return payload
|
||||
|
||||
|
||||
class TestPayloadStructuralMetadata:
|
||||
"""Tests for structural metadata merging into Qdrant payloads."""
|
||||
|
||||
def test_payload_contains_structural_metadata(self):
|
||||
"""Metadata fields from chunks_with_metadata land in the payload."""
|
||||
meta = [
|
||||
{
|
||||
"text": "chunk text",
|
||||
"section": "§ 312k",
|
||||
"section_title": "Kuendigungsbutton",
|
||||
"paragraph": "Abs. 1",
|
||||
"paragraph_num": 1,
|
||||
"page": 847,
|
||||
"index": 0,
|
||||
}
|
||||
]
|
||||
|
||||
payload = _build_payload("chunk text", 0, meta)
|
||||
|
||||
assert payload["section"] == "§ 312k"
|
||||
assert payload["section_title"] == "Kuendigungsbutton"
|
||||
assert payload["paragraph"] == "Abs. 1"
|
||||
assert payload["paragraph_num"] == 1
|
||||
assert payload["page"] == 847
|
||||
|
||||
def test_payload_without_metadata_backwards_compat(self):
|
||||
"""Empty metadata list → payload has no structural fields."""
|
||||
payload = _build_payload("chunk text", 0, [])
|
||||
|
||||
for field in _STRUCT_FIELDS:
|
||||
assert field not in payload
|
||||
|
||||
def test_payload_skips_empty_values(self):
|
||||
"""Empty string and None values are NOT added to payload."""
|
||||
meta = [
|
||||
{
|
||||
"text": "chunk text",
|
||||
"section": "",
|
||||
"section_title": "",
|
||||
"paragraph": "",
|
||||
"paragraph_num": None,
|
||||
"page": None,
|
||||
"index": 0,
|
||||
}
|
||||
]
|
||||
|
||||
payload = _build_payload("chunk text", 0, meta)
|
||||
|
||||
for field in _STRUCT_FIELDS:
|
||||
assert field not in payload
|
||||
|
||||
def test_metadata_overrides_extra_metadata(self):
|
||||
"""Auto-extracted metadata takes precedence over manual extra_metadata."""
|
||||
meta = [
|
||||
{
|
||||
"text": "chunk text",
|
||||
"section": "§ 25",
|
||||
"section_title": "",
|
||||
"paragraph": "",
|
||||
"paragraph_num": None,
|
||||
"page": None,
|
||||
"index": 0,
|
||||
}
|
||||
]
|
||||
extra = {"section": "manual-value"}
|
||||
|
||||
payload = _build_payload("chunk text", 0, meta, extra_metadata=extra)
|
||||
|
||||
assert payload["section"] == "§ 25"
|
||||
|
||||
def test_partial_metadata_alignment(self):
|
||||
"""3 chunks but only 2 metadata entries → third payload has no structural fields."""
|
||||
meta = [
|
||||
{
|
||||
"text": "c1",
|
||||
"section": "§ 1",
|
||||
"section_title": "",
|
||||
"paragraph": "",
|
||||
"paragraph_num": None,
|
||||
"page": None,
|
||||
"index": 0,
|
||||
},
|
||||
{
|
||||
"text": "c2",
|
||||
"section": "§ 2",
|
||||
"section_title": "",
|
||||
"paragraph": "",
|
||||
"paragraph_num": None,
|
||||
"page": None,
|
||||
"index": 1,
|
||||
},
|
||||
]
|
||||
|
||||
p0 = _build_payload("c1", 0, meta)
|
||||
p1 = _build_payload("c2", 1, meta)
|
||||
p2 = _build_payload("c3", 2, meta)
|
||||
|
||||
assert p0["section"] == "§ 1"
|
||||
assert p1["section"] == "§ 2"
|
||||
assert "section" not in p2
|
||||
|
||||
def test_zero_paragraph_num_is_kept(self):
|
||||
"""paragraph_num=0 is a valid value and should be stored."""
|
||||
meta = [
|
||||
{
|
||||
"text": "chunk",
|
||||
"section": "",
|
||||
"section_title": "",
|
||||
"paragraph": "",
|
||||
"paragraph_num": 0,
|
||||
"page": None,
|
||||
"index": 0,
|
||||
}
|
||||
]
|
||||
|
||||
payload = _build_payload("chunk", 0, meta)
|
||||
|
||||
# 0 is not None and not "" → should be stored
|
||||
assert payload["paragraph_num"] == 0
|
||||
|
||||
def test_page_zero_is_kept(self):
|
||||
"""page=0 is a valid value (first page) and should be stored."""
|
||||
meta = [
|
||||
{
|
||||
"text": "chunk",
|
||||
"section": "",
|
||||
"section_title": "",
|
||||
"paragraph": "",
|
||||
"paragraph_num": None,
|
||||
"page": 0,
|
||||
"index": 0,
|
||||
}
|
||||
]
|
||||
|
||||
payload = _build_payload("chunk", 0, meta)
|
||||
|
||||
assert payload["page"] == 0
|
||||
@@ -0,0 +1,135 @@
|
||||
"""Tests for EmbeddingClient.chunk_text() — ChunkResult with metadata (D2)."""
|
||||
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from embedding_client import ChunkResult, EmbeddingClient
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def client():
|
||||
with patch("embedding_client.settings") as mock_settings:
|
||||
mock_settings.EMBEDDING_SERVICE_URL = "http://localhost:8087"
|
||||
return EmbeddingClient()
|
||||
|
||||
|
||||
def _mock_response(json_data: dict, status_code: int = 200):
|
||||
"""Create a mock httpx response (sync methods like .json() and .raise_for_status())."""
|
||||
resp = MagicMock()
|
||||
resp.status_code = status_code
|
||||
resp.json.return_value = json_data
|
||||
return resp
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_chunk_text_returns_chunk_result(client):
|
||||
"""chunk_text returns ChunkResult with both chunks and metadata."""
|
||||
mock_json = {
|
||||
"chunks": ["chunk1 text", "chunk2 text"],
|
||||
"chunks_with_metadata": [
|
||||
{
|
||||
"text": "chunk1 text",
|
||||
"section": "§ 25",
|
||||
"section_title": "Informationspflichten",
                "paragraph": "Abs. 1",
                "paragraph_num": 1,
                "page": None,
                "index": 0,
            },
            {
                "text": "chunk2 text",
                "section": "§ 25",
                "section_title": "Informationspflichten",
                "paragraph": "Abs. 2",
                "paragraph_num": 2,
                "page": None,
                "index": 1,
            },
        ],
        "count": 2,
        "strategy": "recursive",
    }

    with patch("httpx.AsyncClient") as mock_client_cls:
        mock_client = AsyncMock()
        mock_client.post.return_value = _mock_response(mock_json)
        mock_client.__aenter__ = AsyncMock(return_value=mock_client)
        mock_client.__aexit__ = AsyncMock(return_value=False)
        mock_client_cls.return_value = mock_client

        result = await client.chunk_text("some legal text")

    assert isinstance(result, ChunkResult)
    assert result.chunks == ["chunk1 text", "chunk2 text"]
    assert len(result.chunks_with_metadata) == 2
    assert result.chunks_with_metadata[0]["section"] == "§ 25"
    assert result.chunks_with_metadata[1]["paragraph"] == "Abs. 2"


@pytest.mark.asyncio
async def test_chunk_text_without_metadata_field(client):
    """Embedding service response without chunks_with_metadata → empty list."""
    mock_json = {
        "chunks": ["chunk1"],
        "count": 1,
        "strategy": "semantic",
    }

    with patch("httpx.AsyncClient") as mock_client_cls:
        mock_client = AsyncMock()
        mock_client.post.return_value = _mock_response(mock_json)
        mock_client.__aenter__ = AsyncMock(return_value=mock_client)
        mock_client.__aexit__ = AsyncMock(return_value=False)
        mock_client_cls.return_value = mock_client

        result = await client.chunk_text("text", strategy="semantic")

    assert isinstance(result, ChunkResult)
    assert result.chunks == ["chunk1"]
    assert result.chunks_with_metadata == []


@pytest.mark.asyncio
async def test_chunk_text_with_null_metadata(client):
    """chunks_with_metadata: null in response → empty list."""
    mock_json = {
        "chunks": ["chunk1"],
        "chunks_with_metadata": None,
        "count": 1,
        "strategy": "recursive",
    }

    with patch("httpx.AsyncClient") as mock_client_cls:
        mock_client = AsyncMock()
        mock_client.post.return_value = _mock_response(mock_json)
        mock_client.__aenter__ = AsyncMock(return_value=mock_client)
        mock_client.__aexit__ = AsyncMock(return_value=False)
        mock_client_cls.return_value = mock_client

        result = await client.chunk_text("text")

    assert result.chunks_with_metadata == []


@pytest.mark.asyncio
async def test_chunk_text_empty(client):
    """Empty text → empty chunks and metadata."""
    mock_json = {
        "chunks": [],
        "chunks_with_metadata": [],
        "count": 0,
        "strategy": "recursive",
    }

    with patch("httpx.AsyncClient") as mock_client_cls:
        mock_client = AsyncMock()
        mock_client.post.return_value = _mock_response(mock_json)
        mock_client.__aenter__ = AsyncMock(return_value=mock_client)
        mock_client.__aexit__ = AsyncMock(return_value=False)
        mock_client_cls.return_value = mock_client

        result = await client.chunk_text("")

    assert result.chunks == []
    assert result.chunks_with_metadata == []
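The tests above pin down one behavior: a response whose `chunks_with_metadata` key is missing or `null` must still yield an empty list on the result. The following is a minimal sketch of how a client could normalize the payload. It assumes a simplified `ChunkResult` dataclass and a hypothetical helper `parse_chunk_response`; neither is taken from the code under test, which is not shown in this diff.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkResult:
    """Hypothetical stand-in for the ChunkResult type used by the tests."""

    chunks: list[str] = field(default_factory=list)
    chunks_with_metadata: list[dict[str, Any]] = field(default_factory=list)


def parse_chunk_response(payload: dict[str, Any]) -> ChunkResult:
    """Build a ChunkResult from the decoded JSON body (hypothetical helper).

    `payload.get(key) or []` covers both a missing key and an explicit
    null, so callers always receive a list.
    """
    return ChunkResult(
        chunks=list(payload.get("chunks") or []),
        chunks_with_metadata=list(payload.get("chunks_with_metadata") or []),
    )
```

Normalizing once at the parsing boundary means downstream code can iterate over `chunks_with_metadata` without a separate `None` check.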
@@ -0,0 +1,161 @@
"""Tests for HTML detection and stripping in document upload."""

from html_utils import (
    decode_html_bytes,
    looks_like_html as _looks_like_html,
    strip_html as _strip_html,
)


class TestLooksLikeHtml:

    def test_html_document(self):
        assert _looks_like_html("<html><body><p>Text</p></body></html>")

    def test_html_div(self):
        assert _looks_like_html('<div class="jurAbsatz">§ 312</div>')

    def test_html_with_doctype(self):
        assert _looks_like_html("<!DOCTYPE html><html><head></head><body>")

    def test_plain_text(self):
        assert not _looks_like_html("§ 312 Anwendungsbereich\n\n(1) Die Vorschriften...")

    def test_legal_text_with_angle_brackets(self):
        # Legal text might use < or > but not as HTML tags
        assert not _looks_like_html("Wert < 100 EUR und > 50 EUR ist zulaessig.")

    def test_markdown(self):
        assert not _looks_like_html("# § 312 Anwendungsbereich\n\n(1) Die Vorschriften...")


class TestStripHtml:

    def test_basic_div_tags(self):
        html = "<div>§ 312 Anwendungsbereich</div>"
        result = _strip_html(html)
        assert result.startswith("§ 312 Anwendungsbereich")

    def test_paragraph_tags_become_newlines(self):
        html = "<p>Absatz 1</p><p>Absatz 2</p>"
        result = _strip_html(html)
        assert "Absatz 1" in result
        assert "Absatz 2" in result
        # Paragraphs should be on separate lines
        lines = [ln.strip() for ln in result.split("\n") if ln.strip()]
        assert len(lines) >= 2

    def test_preserves_section_headers(self):
        """§ signs must be at line starts after stripping."""
        html = '<div class="jurAbsatz">§ 312 Anwendungsbereich</div>'
        result = _strip_html(html)
        # § should be at the start of a line
        for line in result.split("\n"):
            if "§ 312" in line:
                assert line.strip().startswith("§ 312")
                break
        else:
            raise AssertionError("§ 312 not found in stripped text")

    def test_decodes_html_entities(self):
        html = "Gelöscht und geändert und § 312"
        result = _strip_html(html)
        assert "Gelöscht" in result
        assert "geändert" in result
        assert "§ 312" in result

    def test_decodes_named_entities(self):
        html = "§ 312 & § 313"
        result = _strip_html(html)
        assert "§ 312" in result
        assert "§ 313" in result

    def test_removes_script_style(self):
        html = '<style>body{color:red}</style><script>alert("x")</script><p>§ 1 Text</p>'
        result = _strip_html(html)
        assert "color" not in result
        assert "alert" not in result
        assert "§ 1 Text" in result

    def test_br_becomes_newline(self):
        html = "Zeile 1<br/>Zeile 2<br>Zeile 3"
        result = _strip_html(html)
        assert "Zeile 1" in result
        assert "Zeile 2" in result

    def test_no_excessive_whitespace(self):
        html = "<div></div><div></div><div></div><div>Text</div>"
        result = _strip_html(html)
        assert "\n\n\n" not in result

    def test_gesetze_im_internet_format(self):
        """Realistic HTML from gesetze-im-internet.de."""
        html = """<div class="jnhtml">
<div>
<div class="jurAbsatz">
§ 312k Kündigung von Verbraucherverträgen im elektronischen Geschäftsverkehr
</div>
<div class="jurAbsatz">
(1) Wird Verbrauchern über eine Webseite ermöglicht, einen Vertrag im elektronischen Geschäftsverkehr zu schließen, der auf die Begründung eines Dauerschuldverhältnisses gerichtet ist, das einen Unternehmer zu einer entgeltlichen Leistung verpflichtet, so treffen den Unternehmer die Pflichten nach dieser Vorschrift.
</div>
<div class="jurAbsatz">
(2) Der Unternehmer hat sicherzustellen, dass der Verbraucher auf der Webseite eine Erklärung zur ordentlichen oder außerordentlichen Kündigung abgeben kann.
</div>
</div></div>"""
        result = _strip_html(html)

        # § 312k should be at start of a line
        found_312k = False
        for line in result.split("\n"):
            stripped = line.strip()
            if stripped.startswith("§ 312k"):
                found_312k = True
                break
        assert found_312k, f"§ 312k not at line start. Text:\n{result[:500]}"

        # Content should be present without tags
        assert "Dauerschuldverhältnisses" in result
        assert "<div>" not in result
        assert "class=" not in result

    def test_plain_text_passthrough(self):
        """Non-HTML text should pass through unchanged."""
        text = "§ 312 Anwendungsbereich\n\n(1) Die Vorschriften..."
        result = _strip_html(text)
        assert "§ 312 Anwendungsbereich" in result
        assert "(1) Die Vorschriften" in result

    def test_opening_h3_creates_newline(self):
        """Opening <h3> must create newline so § is at line start."""
        html = '<a href="#">Inhaltsverzeichnis</a><h3><span>§ 1</span> Titel</h3>'
        result = _strip_html(html)
        found = any(line.strip().startswith("§ 1") for line in result.split("\n"))
        assert found, f"§ 1 not at line start: {result!r}"


class TestDecodeHtmlBytes:

    def test_utf8_file(self):
        raw = "<div>§ 312 Anwendungsbereich</div>".encode("utf-8")
        text = decode_html_bytes(raw)
        assert "§ 312" in text

    def test_iso_8859_1_with_meta(self):
        html = '<html><head><meta charset="iso-8859-1"></head><body>§ 1 Test</body></html>'
        raw = html.encode("iso-8859-1")
        text = decode_html_bytes(raw)
        assert "§ 1 Test" in text

    def test_iso_8859_1_without_meta(self):
        """Even without meta tag, iso-8859-1 is fallback."""
        raw = "§ 312 Anwendungsbereich".encode("iso-8859-1")
        text = decode_html_bytes(raw)
        assert "§ 312" in text

    def test_gesetze_im_internet_encoding(self):
        """gesetze-im-internet.de uses iso-8859-1 with § entities."""
        html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
        html += '<div>Kündigungsschutzgesetz</div>'
        raw = html.encode("iso-8859-1")
        text = decode_html_bytes(raw)
        assert "Kündigungsschutzgesetz" in text
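The tests above describe a contract: detect HTML by real tags rather than stray `<` or `>`, turn block-level tags into line breaks so that `§` headings start a line, drop `script` and `style` content, decode entities, and fall back to iso-8859-1 when the bytes are not valid UTF-8. The sketch below is one possible way to meet that contract using only the standard library. It is not the project's `html_utils` module, which is not shown in this diff; the regular expressions and tag lists are illustrative assumptions.

```python
import html
import re

# Tag names treated as evidence of HTML (illustrative, not exhaustive).
_HTML_HINT_RE = re.compile(
    r"<(?:!doctype|html|head|body|div|p|span|table|a|h[1-6]|br)\b",
    re.IGNORECASE,
)
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Block-level tags (opening and closing) and <br> become newlines.
_BLOCK_TAG_RE = re.compile(
    r"</?(?:div|p|h[1-6]|li|tr|table|ul|ol)\b[^>]*>|<br\s*/?>", re.IGNORECASE
)
# Requires a letter or "!" right after "<", so "Wert < 100" is left alone.
_ANY_TAG_RE = re.compile(r"</?[A-Za-z!][^>]*>")
_META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def looks_like_html(text: str) -> bool:
    """Return True if the text contains a recognizable HTML tag."""
    return bool(_HTML_HINT_RE.search(text))


def strip_html(text: str) -> str:
    """Remove markup while keeping each block on its own line."""
    text = _SCRIPT_STYLE_RE.sub("", text)
    text = _BLOCK_TAG_RE.sub("\n", text)
    text = _ANY_TAG_RE.sub("", text)
    # Unescape after tag removal so "&lt;div&gt;" is not mistaken for a tag.
    text = html.unescape(text)
    text = "\n".join(line.strip() for line in text.splitlines())
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def decode_html_bytes(raw: bytes) -> str:
    """Decode using a declared meta charset, then UTF-8, then iso-8859-1."""
    match = _META_CHARSET_RE.search(raw[:4096])
    if match:
        try:
            return raw.decode(match.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # Unknown or wrong declaration: fall through to the defaults.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")
```

Replacing opening as well as closing block tags with a newline is what keeps a heading such as `<h3><span>§ 1</span> Titel</h3>` from being glued to the text of a preceding inline element.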
Executable
+65
@@ -0,0 +1,65 @@
#!/bin/bash
# Qdrant Snapshot: creates snapshots of all collections
#
# Usage:
#   bash scripts/qdrant-snapshot.sh                    # Create snapshots
#   bash scripts/qdrant-snapshot.sh --list             # List existing snapshots
#   bash scripts/qdrant-snapshot.sh --restore <file>   # Restore (interactive); not implemented in this script
#
# Snapshots are stored in the Qdrant volume under /qdrant/storage/snapshots/.
# They are additionally copied to ./backups/qdrant/.

set -euo pipefail

QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
BACKUP_DIR="${BACKUP_DIR:-./backups/qdrant}"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# --- List existing snapshots ---
if [[ "${1:-}" == "--list" ]]; then
    echo "=== Qdrant Snapshots ==="
    for coll in $(curl -sf "$QDRANT_URL/collections" | python3 -c "import sys,json; [print(c['name']) for c in json.load(sys.stdin)['result']['collections']]"); do
        echo ""
        echo "Collection: $coll"
        curl -sf "$QDRANT_URL/collections/$coll/snapshots" | python3 -c "
import sys, json
snaps = json.load(sys.stdin).get('result', [])
if not snaps:
    print(' (no snapshots)')
else:
    for s in snaps:
        print(f\" {s['name']} size={s.get('size',0)/(1024*1024):.1f}MB\")
"
    done
    exit 0
fi

# --- Create snapshots ---
echo "=== Creating Qdrant Snapshots ($TIMESTAMP) ==="
mkdir -p "$BACKUP_DIR"

COLLECTIONS=$(curl -sf "$QDRANT_URL/collections" | python3 -c "import sys,json; [print(c['name']) for c in json.load(sys.stdin)['result']['collections']]")

for coll in $COLLECTIONS; do
    echo ""
    echo "[$coll] Creating snapshot..."

    # Without "|| SNAP=...", "set -e" would abort the script on a failed request
    # before the empty-name check below could report the error and continue.
    SNAP=$(curl -sf -X POST "$QDRANT_URL/collections/$coll/snapshots" | python3 -c "import sys,json; print(json.load(sys.stdin)['result']['name'])") || SNAP=""

    if [[ -z "$SNAP" ]]; then
        echo "[$coll] ERROR: snapshot creation failed"
        continue
    fi

    echo "[$coll] Snapshot: $SNAP"

    # Download snapshot to backup dir
    OUTFILE="$BACKUP_DIR/${coll}_${TIMESTAMP}.snapshot"
    curl -sf "$QDRANT_URL/collections/$coll/snapshots/$SNAP" -o "$OUTFILE"
    SIZE=$(du -h "$OUTFILE" | cut -f1)
    echo "[$coll] Saved: $OUTFILE ($SIZE)"
done

echo ""
echo "=== Done ==="
ls -lh "$BACKUP_DIR"/*_${TIMESTAMP}.snapshot 2>/dev/null || echo "No snapshots created"
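The script parses the Qdrant responses with inline `python3 -c` one-liners, which are hard to test on their own. As a rough sketch, the same parsing could live in two small functions that take the response body as a string. The function names `collection_names` and `format_snapshots` are hypothetical; the JSON shapes are the ones the script itself relies on (`result.collections[].name` for collections, a `result` list with `name` and `size` for snapshots).

```python
import json


def collection_names(collections_json: str) -> list[str]:
    """Extract collection names from a GET /collections response body."""
    payload = json.loads(collections_json)
    return [c["name"] for c in payload["result"]["collections"]]


def format_snapshots(snapshots_json: str) -> list[str]:
    """Render one line per snapshot, mirroring the --list output."""
    snaps = json.loads(snapshots_json).get("result", [])
    if not snaps:
        return ["  (no snapshots)"]
    return [
        # Size is reported in bytes; convert to MiB with one decimal place.
        f"  {s['name']}  size={s.get('size', 0) / (1024 * 1024):.1f}MB"
        for s in snaps
    ]
```

Keeping the parsing in plain functions lets it be exercised with canned JSON, without a running Qdrant instance.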