25 Commits

Author SHA1 Message Date
Benjamin Admin aab8eeb335 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
2026-05-03 23:14:34 +02:00
Benjamin Admin 9437e029d0 feat(pipeline): F1 regulation registry — DB-backed license/source-type lookup
Migrates REGULATION_LICENSE_MAP (135 entries) and SOURCE_REGULATION_CLASSIFICATION
(58 entries) from hardcoded Python dicts to compliance.regulation_registry table.

- SQL migration: 002_regulation_registry.sql (table + indexes + trigger)
- Migration script: f1_migrate_regulation_registry.py (162 rows, --dry-run)
- RegulationRegistry cache: 5min TTL, prefix fallback, graceful degradation
- control_generator._classify_regulation() delegates to DB with dict fallback
- source_type_classification.classify_source_regulation() delegates to DB
- 34 new tests (lookup, cache, degradation, migration data consistency)
- 421 total tests pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:14:06 +02:00
Benjamin Admin 4fd2bfefcd docs: session handover updated for Block F start
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:51:23 +02:00
Benjamin Admin fac9280716 feat(pipeline): Block D5+-E complete session — 20k+ new chunks
Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks,
  Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41,
  Schadenersatz 29, BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:31:57 +02:00
Benjamin Admin 118be3540d feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts
- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap),
  archives old citations, updated 3,651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG,
  MiLoG, GmbHG, AktG, InsO, BUrlG; 1,629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due
  to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency
  manually ingested (1,057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 13:19:27 +02:00
Benjamin Admin a9671a572b fix(embedding): single-number ALL-CAPS section detection for ENISA/BSI
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). Cannot use _LEGAL_SECTION_RE for this because
it uses re.IGNORECASE which would false-positive on "1. Erstens" etc.

Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:02 +02:00
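The case-sensitivity point in commit a9671a572b can be illustrated with a small sketch. The pattern below is a hypothetical stand-in for `_SINGLE_NUM_ALLCAPS_RE` (the real pattern is not shown on this page); it only demonstrates why dropping `re.IGNORECASE` keeps "1. Erstens" from being treated as a header.

```python
import re

# Hypothetical stand-in for _SINGLE_NUM_ALLCAPS_RE; the real pattern is not shown here.
# No re.IGNORECASE: the title must really be written in upper case.
SINGLE_NUM_ALLCAPS_RE = re.compile(r"^(\d{1,2})\.\s+([A-Z][A-Z0-9 ,&/()\-]{2,})$")


def match_allcaps_header(line: str):
    """Return (number, title) for headers like '1. INTRODUCTION', else None."""
    m = SINGLE_NUM_ALLCAPS_RE.match(line.strip())
    return (m.group(1), m.group(2)) if m else None
```

Compiling the same pattern with `re.IGNORECASE` would also accept "1. Erstens", which is the false positive the commit message describes.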
Benjamin Admin 2f4a3f2ea2 fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 07:42:06 +02:00
Benjamin Admin 0b0eed27b0 feat(embedding): NIST PDF text normalization + safe re-ingest script
Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents:
- _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1),
  control IDs (AC - 1 → AC-1), ligatures, soft hyphens
- pdfplumber tolerances increased (x=3,y=4) for better column handling
- 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10
- reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs
- reingest_d5.py: safety fix — upload first, verify, then delete old

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:42:46 +02:00
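The normalization steps listed in commit 0b0eed27b0 could look roughly like the sketch below. It is not the repository's `_normalize_pdf_text()`; the function name, the ligature table, and both regular expressions are assumptions chosen to reproduce the examples in the commit message (`1 . 1` → `1.1`, `AC - 1` → `AC-1`).

```python
import re

# Common typographic ligatures that PDF extraction leaves as single code points.
_LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl"}


def normalize_pdf_text(text: str) -> str:
    """Illustrative normalizer; not the repository's _normalize_pdf_text()."""
    text = text.replace("\u00ad", "")  # drop soft hyphens
    for ligature, plain in _LIGATURES.items():
        text = text.replace(ligature, plain)
    # "1 . 1" -> "1.1"; the lookahead lets "2 . 3 . 1" collapse in a single pass
    text = re.sub(r"(\d) \. (?=\d)", r"\1.", text)
    # "AC - 1" -> "AC-1" for two-letter control families
    text = re.sub(r"\b([A-Z]{2}) - (\d+)\b", r"\1-\2", text)
    return text
```

The digit rule requires a space on both sides of the dot, so ordinary sentences such as "see page 1. 2 items" are left alone.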
Benjamin Admin 97a7f6f264 docs: comprehensive session handover with full roadmap (Blocks A-G)
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:30:50 +02:00
Benjamin Admin ff21bc258a docs: session handover — D2-D5 complete, quality report, NIST plan
Major session achievements:
- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill,
D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:55 +02:00
Benjamin Admin 3009f3d13a feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:24:10 +02:00
Benjamin Admin 5a6e588641 docs: update session handover — D2-D5 complete, EU PDF issue documented
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.

Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:34:34 +02:00
Benjamin Admin 41183ff93d fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:30:33 +02:00
Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00
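A minimal sketch of the auto-mode priority described in commits 75dda9ac92 and 41183ff93d, assuming a selection helper that takes an availability check as a parameter. `select_pdf_backend` and `_is_installed` are hypothetical names; only the order unstructured > pdfplumber > pypdf comes from the commit messages.

```python
import importlib.util
from typing import Callable, Optional

# Priority order taken from the commit message (auto mode).
BACKEND_PRIORITY = ("unstructured", "pdfplumber", "pypdf")


def _is_installed(module_name: str) -> bool:
    """True if the module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def select_pdf_backend(
    configured: str = "auto",
    is_available: Callable[[str], bool] = _is_installed,
) -> Optional[str]:
    """Pick a backend: honor a known, installed explicit choice, else use priority."""
    if configured in BACKEND_PRIORITY and is_available(configured):
        return configured
    # "auto" or an unknown name such as "pymupdf": first available backend wins
    for name in BACKEND_PRIORITY:
        if is_available(name):
            return name
    return None
```

In this sketch an unknown value such as `pymupdf` behaves like `auto` rather than silently falling through to the last backend.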
Benjamin Admin a459636bc4 fix(rag): HTML charset detection + opening block tag newlines
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
   closing tags. Fixes: gesetze-im-internet.de puts § inside <h3> which
   followed inline <a> text — § ended up mid-line, not at line start.

2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
   gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
   was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
   fallback ISO-8859-1.

32 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:35:47 +02:00
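The decode order from commit a459636bc4 (try UTF-8, then the meta charset, then ISO-8859-1) can be sketched as follows. `decode_html` and the meta-tag regex are illustrative assumptions, not the rag-service implementation.

```python
import re

# Finds charset=... inside a <meta> tag, e.g. content="text/html; charset=iso-8859-1".
_META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def decode_html(raw: bytes) -> str:
    """Decode HTML bytes: UTF-8 first, then declared meta charset, then ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    m = _META_CHARSET_RE.search(raw[:4096])  # the declaration sits near the top
    if m:
        try:
            return raw.decode(m.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset label: use the final fallback
    return raw.decode("iso-8859-1")
```

A lone `0xA7` byte is not valid UTF-8, so an ISO-8859-1 page containing § fails the first step and is decoded by the second or third.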
Benjamin Admin ddad58f607 fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.

Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.

Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.

27 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:18:25 +02:00
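To make the HTML fix in commit ddad58f607 concrete, here is a sketch built on the standard library's `html.parser`. The names `strip_html` and `_TextExtractor` and the block-tag set are assumptions; the actual rag-service code may work differently. Both opening and closing block tags emit a newline, which also covers the follow-up fix listed above for `<h3>` after inline `<a>` text.

```python
from html.parser import HTMLParser

# Tags treated as block-level for line-breaking purposes (illustrative selection).
_BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &sect;, &uuml;, ...
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text: str) -> str:
    """Return plain text with one non-empty line per block element."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

After stripping, a § inside `<h3>` starts its own line, which is what a line-anchored section regex needs.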
Benjamin Admin 93099b2770 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.

D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.

D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.

Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).

456 tests passing (58 embedding + 387 pipeline + 11 rag-service).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 20:34:00 +02:00
Benjamin Admin da21339e76 docs: add session handover instructions for next session
Covers: completed blocks A-D1, remaining D2-G, critical files,
DB state, memory files, test commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:33:05 +02:00
Benjamin Admin 6ab10415d8 feat(embedding): add structural metadata to legal chunking (Block D1)
chunk_text_legal_structured() returns metadata per chunk:
- section: "§ 312k", "Art. 5"
- section_title: "Kündigungsbutton"
- paragraph: "Abs. 1", "Nr. 3"
- paragraph_num: 1, 3
- page: (prepared for PDF integration)
- index: sequential position

/chunk endpoint now returns chunks_with_metadata alongside plain chunks.
Backward compatible — existing consumers use chunks field unchanged.

New regex: _PARAGRAPH_RE (Abs/Nr/Satz/lit), _SECTION_NUMBER_RE
New functions: _parse_section_metadata(), _extract_paragraph_ref()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:25:23 +02:00
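The metadata fields listed in commit 6ab10415d8 can be illustrated with two small parsers. This is a sketch only: `SECTION_RE`, `PARAGRAPH_RE`, `parse_section_header`, and `extract_paragraph_ref` are hypothetical and simpler than whatever `_SECTION_NUMBER_RE` and `_PARAGRAPH_RE` actually contain.

```python
import re
from typing import Optional

# Hypothetical patterns; they cover only "§ 312k" / "Art. 5" and Abs./Nr./Satz/lit. references.
SECTION_RE = re.compile(r"^(§\s*\d+[a-z]?|Art\.\s*\d+[a-z]?)\s*(.*)$")
PARAGRAPH_RE = re.compile(r"\b(Abs\.|Nr\.|Satz|lit\.)\s*(\d+|[a-z])\b")


def parse_section_header(line: str) -> Optional[dict]:
    """Split a header line into section number and title."""
    m = SECTION_RE.match(line.strip())
    if not m:
        return None
    section = re.sub(r"\s+", " ", m.group(1))
    return {"section": section, "section_title": m.group(2).strip() or None}


def extract_paragraph_ref(text: str) -> Optional[dict]:
    """Return the first paragraph reference found in the text."""
    m = PARAGRAPH_RE.search(text)
    if not m:
        return None
    label, value = m.groups()
    return {
        "paragraph": f"{label} {value}",
        "paragraph_num": int(value) if value.isdigit() else None,
    }
```

For a lettered reference such as "lit. f" the sketch leaves `paragraph_num` as `None`, since there is no number to store.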
Benjamin Admin d9c16fb914 feat(pipeline): add adversarial tests (30 cases) + regression harness
Block C implementation:
- adversarial_cases.yaml: 30 tricky cases in 5 categories
  (wrong legal basis, dark patterns, incomplete docs, similar-but-different, homonyms)
- test_adversarial.py: 63 tests validating adversarial cases
- test_regression.py: ontology stability, dependency engine, quality metrics
- conftest.py: shared fixtures (DB session, sample controls)

Total: 371 tests passing (221 existing + 150 new).
Real-world benchmarks (C1) need manual ground truth creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 13:02:29 +02:00
Benjamin Admin 6f58fdbaa5 docs: add test strategy instruction for dedicated session (Block C)
3 test levels: Real-World Benchmarks (10 DE websites), Adversarial Tests
(30 tricky cases), Regression Harness (CI/CD quality gate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 12:28:58 +02:00
Benjamin Admin b8ff4e9290 feat(pipeline): add review-verify endpoint — LLM decides DUPLIKAT/VERSCHIEDEN
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 09:36:30 +02:00
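One way to read the per-pair decisions back from a model reply is to key them by `pair_id`, as in the sketch below. Note that this is a swapped-in alternative for illustration: the endpoint code further down this page matches results to pairs by position. `parse_decisions` is a hypothetical helper; the field names `pair_id`, `decision`, and `reason` come from the system prompt in the diff.

```python
import json
from typing import Dict

VALID_DECISIONS = {"DUPLIKAT", "VERSCHIEDEN"}


def parse_decisions(reply_text: str) -> Dict[str, dict]:
    """Extract the JSON array from a reply and index valid decisions by pair_id."""
    start, end = reply_text.find("["), reply_text.rfind("]")
    if start == -1 or end <= start:
        return {}
    try:
        items = json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    decisions = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        pair_id = str(item.get("pair_id", ""))
        decision = str(item.get("decision", "")).upper()
        if pair_id and decision in VALID_DECISIONS:
            decisions[pair_id] = {"decision": decision, "reason": str(item.get("reason", ""))}
    return decisions
```

Keying by `pair_id` means a skipped or reordered entry in the reply cannot shift a decision onto the wrong review row.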
Benjamin Admin f2104768a0 fix(docker): re-enable healthcheck after dedup completion
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 08:39:57 +02:00
Benjamin Admin e8df15c0f8 fix: add proxy_read_timeout 300s to admin-compliance location block
Scan endpoint needs up to 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 11:23:02 +02:00
Benjamin Admin 7c5592b50e feat(pipeline): add checkpoint to dedup Phase 2 — survives container restart
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 09:12:23 +02:00
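The checkpoint behaviour described in commit 7c5592b50e can be sketched with an in-memory store. Here a plain dict stands in for the `canonical_generation_jobs` row, and `run_with_checkpoint` is a hypothetical name; the sketch also assumes control IDs are sorted and comparable.

```python
from typing import Callable, List, Sequence


def run_with_checkpoint(
    control_ids: Sequence[int],
    page_size: int,
    process_page: Callable[[List[int]], None],
    checkpoint: dict,
) -> None:
    """Process sorted IDs page by page, resuming after checkpoint['last_control_id']."""
    last = checkpoint.get("last_control_id")
    remaining = [cid for cid in control_ids if last is None or cid > last]
    for start in range(0, len(remaining), page_size):
        page = list(remaining[start:start + page_size])
        process_page(page)
        checkpoint["last_control_id"] = page[-1]  # stored after each page
    checkpoint.pop("last_control_id", None)  # deleted on completion
```

If the process dies mid-run, the checkpoint still holds the last fully processed ID, so the next call skips everything up to and including it.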
46 changed files with 7399 additions and 74 deletions
@@ -0,0 +1,115 @@
# Session Instructions: Block F (Hardcoded Knowledge Migration)
**Date:** 2026-05-03
**For:** Next Claude session
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)
---
## NEXT STEP: Block F1 (Regulation Registry)
### What to do
1. **DB table:** create `compliance.regulation_registry` (migration script)
2. **Migrate data** from `control_generator.py` (135 entries) + `source_type_classification.py` (58)
3. **Auto-create** in the RAG service on document upload (status='needs_review')
4. **Backend API** in the breakpilot-compliance backend (GET/POST/PUT /v1/regulations)
5. **Frontend** in the breakpilot-compliance admin at `/sdk/regulation-registry` (between roadmap and isms)
6. **Sync check** script (weekly: Qdrant regulation_ids vs. DB)
7. **Switch code** in control_generator.py (dict → DB query with cache)
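A minimal sketch of step 7, assuming a cache object that wraps a loader function. `RegistryCache`, its parameters, and the longest-prefix rule are assumptions for illustration; the F1 commit only states a 5-minute TTL, a prefix fallback, and graceful degradation to the old dict.

```python
import time
from typing import Callable, Dict, Optional


class RegistryCache:
    """TTL cache over a DB loader, falling back to a static dict if loading fails."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, str]],
        fallback: Dict[str, str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._loader = loader
        self._fallback = fallback
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Optional[Dict[str, str]] = None
        self._loaded_at = 0.0

    def _current(self) -> Dict[str, str]:
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            try:
                self._data = dict(self._loader())
                self._loaded_at = now
            except Exception:
                # Graceful degradation: keep stale data, or use the static dict
                if self._data is None:
                    return self._fallback
        return self._data

    def lookup(self, regulation_id: str) -> Optional[str]:
        data = self._current()
        if regulation_id in data:
            return data[regulation_id]
        # Assumed prefix rule: the longest registered key that prefixes the id wins
        candidates = [key for key in data if regulation_id.startswith(key)]
        return data[max(candidates, key=len)] if candidates else None
```

Injecting `clock` keeps the TTL testable without sleeping; in production the default `time.monotonic` would be used.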
### Frontend requirements (breakpilot-compliance admin, port 3007)
- Nav position: between `/sdk/roadmap` and `/sdk/isms`
- Table of all regulations (sortable, filterable)
- Status badge: "Needs Review" (yellow), "Active" (green), "Deprecated" (gray)
- Counter in the nav for unreviewed entries
- Inline edit: license_rule, jurisdiction, source_type, names
- "Approve" button → status='active'
- Discrepancy view: regulation_ids that exist in Qdrant but not in the DB
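The discrepancy check behind the last bullet is essentially a set difference. A short sketch, assuming both ID lists have already been fetched (`find_discrepancies` and the result keys are hypothetical names):

```python
from typing import Dict, Iterable, List


def find_discrepancies(qdrant_ids: Iterable[str], db_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare regulation_ids seen in Qdrant payloads with the registry table."""
    qdrant_set, db_set = set(qdrant_ids), set(db_ids)
    return {
        "missing_in_db": sorted(qdrant_set - db_set),     # shown in the discrepancy view
        "unused_in_qdrant": sorted(db_set - qdrant_set),  # registered but never ingested
    }
```

Only `missing_in_db` is required by the frontend list; the reverse direction is included because it costs nothing and may help the weekly sync check.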
### Critical files
| Repo | File | Action |
|------|------|--------|
| core | `control-pipeline/services/control_generator.py` lines 75-236 | EDIT: dict → DB |
| core | `control-pipeline/data/source_type_classification.py` | DELETE (after migration) |
| core | `rag-service/api/documents.py` | EDIT: auto-create on upload |
| compliance | `backend-compliance/compliance/api/regulations.py` | NEW: API endpoints |
| compliance | `admin-compliance/app/sdk/regulation-registry/` | NEW: frontend page |
### DB schema
```sql
CREATE TABLE compliance.regulation_registry (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    regulation_id VARCHAR(100) UNIQUE NOT NULL,
    regulation_name_de TEXT,
    regulation_name_en TEXT,
    regulation_short VARCHAR(50),
    license_rule INTEGER NOT NULL DEFAULT 1 CHECK (license_rule IN (1, 2, 3)),
    license_type VARCHAR(50),
    source_type VARCHAR(20) NOT NULL DEFAULT 'law',
    jurisdiction VARCHAR(10),
    category VARCHAR(50),
    celex VARCHAR(20),
    url TEXT,
    status VARCHAR(20) NOT NULL DEFAULT 'needs_review',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_reg_registry_status ON compliance.regulation_registry(status);
CREATE INDEX idx_reg_registry_jurisdiction ON compliance.regulation_registry(jurisdiction);
```
---
## OVERALL PLAN Block F (4 days)
| Phase | What | Effort | Status |
|-------|------|--------|--------|
| F1 | Regulation Registry (DB + API + Frontend + Auto-Create) | 1 day | 🔥 NEXT |
| F2 | Action Types + Synonyms → DB | 1 day | Pending |
| F3 | Object Synonyms → DB | 0.5 day | Pending |
| F4 | LLM Synonym Enrichment | 1 day | Pending |
| F5 | Validation + Cleanup | 0.5 day | Pending |
---
## SESSION 02-03.05.2026 DONE
- Block D5+: NIST/ENISA PDF quality (0%→45%)
- Block D6: Citation backfill (3,651 controls)
- Block E2: 8 DE laws (1,629 chunks)
- Block E3: 5 EU regulations (1,057 chunks)
- Block E4: GoBD, BAIT, VAIT (144 chunks)
- Block E6: 3 CH + 4 AT laws (3,881 chunks)
- Block E7: 9 court judgments as full text (709 chunks total)
  - Schrems II: 154, BVerfG Datenanalyse: 161, DSK OH Telemedien: 119
  - Meta: 101, BAG Zeiterfassung: 48, Planet49: 42, SCHUFA: 41
  - Schadenersatz: 29, Google Fonts: 14
- Infra: Qdrant snapshot, upload-before-delete, 99 tests
**Total new chunks this session: ~25,000+**
---
## TESTS
```bash
# Embedding-Service (99 Tests)
cd embedding-service && python3 -m pytest test_chunking.py test_d4_bgb.py test_nist_normalization.py -v
# Control-Pipeline (387 Tests)
PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v
# Qdrant-Snapshot
ssh macmini "cd ~/Projekte/breakpilot-core && bash scripts/qdrant-snapshot.sh"
```
---
## PLAN FILE
Block F detailed plan: `/Users/benjaminadmin/.claude/plans/humming-nibbling-sonnet.md`
@@ -0,0 +1,335 @@
# Instructions: Test Strategy Block C
**Repo:** `/Users/benjaminadmin/Projekte/breakpilot-core/`
**Directory:** `control-pipeline/tests/`
**Created:** 2026-05-01
**Estimated effort:** 2-3 days
## Starting point
- 221 existing tests in 7 files (do NOT change!)
- 40 golden test cases (golden_controls.yaml)
- 24 demo cases (demo_cases.yaml)
- All tests are pure Python, no DB required
- Pipeline v1 complete: 151,675 unique controls, 15,291 dependencies
## Task 1: Real-World Benchmarks (C1)
### What to do
Manually review 10 real German e-commerce websites and create ground truth YAML.
### Directory
```
control-pipeline/tests/benchmarks/
├── amazon_de.yaml
├── zalando_de.yaml
├── otto_de.yaml
├── lidl_de.yaml
├── check24_de.yaml
├── booking_de.yaml
├── thomann_de.yaml
├── aboutyou_de.yaml
├── mytheresa_com.yaml
└── kleiner_shop.yaml
```
### Format per website
```yaml
website: amazon.de
url: https://www.amazon.de
checked_at: "2026-05-XX"
checked_by: "Name"
ground_truth:
  impressum:
    present: true/false
    complete: true/false  # name, address, email, commercial register number, VAT ID
    within_2_clicks: true/false
    missing_fields: []  # e.g. ["USt-ID", "Handelsregister"]
  datenschutzerklaerung:
    present: true/false
    art13_complete: true/false
    missing_art13_fields: []  # e.g. ["Speicherdauer", "Empfaenger"]
    rechtsgrundlagen_korrekt: true/false
    wrong_legal_bases: []  # e.g. ["Analytics auf lit. f statt lit. a"]
  cookie_banner:
    present: true/false
    reject_equally_easy: true/false  # CNIL: reject must be equally prominent
    cookies_before_consent: true/false  # Planet49: cookies set BEFORE consent?
    dark_patterns: []  # e.g. ["Ablehnen-Button kleiner", "Ablehnen hinter Einstellungen"]
  widerrufsbelehrung:
    present: true/false
    matches_legal_template: true/false  # statutory template
  agb:
    present: true/false
    checkout_button_text: "..."  # e.g. "Jetzt kaufen" (correct) vs "Weiter" (wrong)
  google_fonts_external: true/false
  google_analytics: true/false
  third_party_services:
    - name: "Google Analytics"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
    - name: "Facebook Pixel"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
expected_findings:
  - "Cookie-Banner: Ablehnen nicht gleichwertig"
  - "Google Analytics ohne vorherige Einwilligung"
  - "DSE: Rechtsgrundlage fuer Analytics falsch"
expected_no_findings:
  - "Impressum fehlt"  # Is present, must not be flagged
```
### Test-Runner
```python
# control-pipeline/tests/test_benchmarks.py
"""
Real-world benchmark tests: compare agent findings against manual ground truth.
Requires: the Compliance Agent must be running (https://macmini:3007/sdk/agent)
"""
import yaml
import pytest
import os

BENCHMARK_DIR = os.path.join(os.path.dirname(__file__), "benchmarks")


def load_benchmarks():
    """Load every ground truth YAML file from the benchmarks directory."""
    cases = []
    for f in sorted(os.listdir(BENCHMARK_DIR)):
        if f.endswith(".yaml"):
            with open(os.path.join(BENCHMARK_DIR, f)) as fh:
                cases.append(yaml.safe_load(fh))
    return cases


class TestBenchmarks:
    """Measure precision/recall against ground truth."""

    @pytest.mark.parametrize("case", load_benchmarks(), ids=lambda c: c["website"])
    def test_benchmark(self, case):
        # TODO: run the agent against the website
        # TODO: compare findings with expected_findings
        # TODO: compute precision + recall
        pass
```
### How the ground truth is created
1. Open the website in a browser
2. Check the Impressum (all mandatory fields under § 5 DDG)
3. Read the privacy policy (Art. 13 GDPR checklist)
4. Test the cookie banner (is rejecting equally easy? cookies before consent?)
5. Check the withdrawal notice (Widerrufsbelehrung) against the statutory template
6. Browser DevTools: Network tab → external requests before consent?
7. Document everything in YAML
**Target metrics:**
- Precision > 80% (few false positives)
- Recall > 70% (finds most real problems)
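The two target metrics can be computed from a benchmark file's `expected_findings` and the agent's output roughly as follows. This is a sketch under a strong assumption: findings are compared as exact strings, whereas a real comparison would probably need fuzzy or ID-based matching. `precision_recall` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def precision_recall(found: Iterable[str], expected: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of agent findings against ground truth (exact match)."""
    found_set, expected_set = set(found), set(expected)
    true_positives = len(found_set & expected_set)
    # An empty side is treated as a perfect score to avoid dividing by zero
    precision = true_positives / len(found_set) if found_set else 1.0
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return precision, recall
```

With 3 of 4 reported findings correct and 3 of 4 expected findings found, both values are 0.75, so this run would pass the recall target but miss the precision target.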
---
## Task 2: Adversarial Tests (C2)
### What to do
Create 30 tricky test cases that challenge the agent/controls.
### File
`control-pipeline/tests/adversarial_cases.yaml`
### Categories
**A. Wrong legal basis (8 cases):**
- Analytics based on lit. f instead of lit. a
- Marketing emails based on lit. b instead of lit. a
- Employee tracking based on lit. f instead of a works agreement (Betriebsvereinbarung)
- Biometric data based on lit. f instead of Art. 9
- Profiling based on lit. f instead of Art. 22
- Newsletter based on lit. b instead of lit. a
- Social login based on lit. b instead of lit. a
- Credit scoring based on lit. f instead of lit. a + Art. 22
**B. Dark patterns (6 cases):**
- Reject button exists but is 3px in size and gray
- "Accept all" is prominent, "Settings" is offered instead of "Reject"
- Cookie wall: content only visible after consent
- Pre-ticked checkboxes (Planet49)
- Confirm shaming: "No, I don't want a secure connection"
- Rejecting takes 3 clicks, accepting only 1
**C. Almost-complete documents (6 cases):**
- Impressum complete except for the VAT ID (USt-ID)
- Privacy policy without retention period
- Privacy policy without DPO contact
- Withdrawal notice with the wrong start of the withdrawal period
- Terms and conditions (AGB) without place of jurisdiction
- Cookie policy without a list of all cookies
**D. Semantically similar but different (5 cases):**
- "Admin-MFA" vs "User-MFA" (different scopes!)
- "Delete data after contract termination" vs "Delete data after retention period"
- "Rate Limiting API" vs "Rate Limiting Login"
- "Encryption at rest" vs "Encryption in transit"
- "Incident Response Plan" vs "Business Continuity Plan"
**E. Semantically different but identical-sounding (5 cases):**
- "Einwilligung" (consent, GDPR) vs "Einwilligung" (consent, advertising)
- "Verarbeitung" (processing, data) vs "Verarbeitung" (processing, food)
- "Risikobewertung" (risk assessment, GDPR DPIA) vs "Risikobewertung" (risk assessment, financial risk)
- "Audit" (data protection) vs "Audit" (finance)
- "Zertifizierung" (certification, ISO 27001) vs "Zertifizierung" (certification, CE marking)
### Format
```yaml
- id: ADV-LIT-001
  category: wrong_legal_basis
  input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
  context: "DSE-Abschnitt ueber Google Analytics"
  expected:
    finding: true
    finding_type: "wrong_legal_basis"
    correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
    reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
  difficulty: medium  # easy / medium / hard
```
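Before running adversarial cases it can help to validate each entry's shape. The sketch below works on already-parsed dicts, so it needs no YAML library. `validate_case` is a hypothetical helper, and the set of required keys assumes that `difficulty` sits at the case level and `finding` inside `expected`, which is one reading of the format above.

```python
from typing import List

REQUIRED_KEYS = {"id", "category", "input", "expected", "difficulty"}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_case(case: dict) -> List[str]:
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "difficulty" in case and case["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {case['difficulty']!r}")
    expected = case.get("expected")
    if not isinstance(expected, dict) or not isinstance(expected.get("finding"), bool):
        problems.append("expected.finding must be a boolean")
    return problems
```

A test could call this for every entry loaded from `adversarial_cases.yaml` and fail with the collected messages, which gives clearer errors than a `KeyError` deep inside a test.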
---
## Task 3: Regression Harness (C3)
### What to do
1. `conftest.py` with shared fixtures
2. `test_regression.py` with snapshot tests
3. CI/CD quality gate
### conftest.py
```python
# control-pipeline/tests/conftest.py
import os
import pytest


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests; skip if no DATABASE_URL."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set")
    from db.session import SessionLocal
    db = SessionLocal()
    yield db
    db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load 100 random draft controls for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' as assertion
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
```
### test_regression.py
```python
# control-pipeline/tests/test_regression.py
"""
Regression tests: check whether pipeline updates change existing controls.
Requires: DATABASE_URL environment variable
"""


class TestControlStability:
    def test_draft_count_stable(self, db_session):
        """Draft count must not deviate by more than 5%."""
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count}"
        assert count < 200000, f"Draft count too high: {count}"

    def test_no_null_assertions(self, db_session):
        """All draft controls must have an assertion."""
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL OR generation_metadata->>'assertion' = '')"
        )).scalar()
        assert null_count < 1000, f"Too many controls without assertion: {null_count}"

    def test_dependency_graph_valid(self, db_session):
        """Sanity check on the dependency graph (counts active edges; no cycle detection yet)."""
        from sqlalchemy import text
        dependency_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert dependency_count > 10000, f"Too few dependencies: {dependency_count}"


class TestQualityGates:
    def test_duplicate_rate(self, db_session):
        pass  # TODO: implement duplicate_rate < 5%

    def test_evidence_leak_rate(self, db_session):
        pass  # TODO: implement evidence_leak < 2%
```
### CI/CD Quality Gate
```yaml
# .gitea/workflows/quality-gate.yml
name: Control Pipeline Quality Gate
on:
push:
paths:
- 'control-pipeline/**'
jobs:
quality-gate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Tests
run: |
cd control-pipeline
pip install -r requirements.txt pytest pyyaml
PYTHONPATH=. pytest tests/ -v --tb=short -x
- name: Quality Metrics
run: |
# Nur wenn Container laeuft
curl -sf http://127.0.0.1:8098/v1/canonical/generate/quality-metrics || echo "Pipeline not running, skip metrics"
```
---
## IMPORTANT
- Do NOT change the existing 221 tests
- Do NOT deploy (do not restart containers)
- All new tests must run without a DB (except test_regression.py with its skip marker)
- Create the ground truth YAML manually (no LLM for the reference data!)
- If in doubt: read the memory under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`
@@ -2715,3 +2715,199 @@ async def get_quality_metrics(
}
finally:
db.close()
# =============================================================================
# REVIEW CANDIDATE VERIFICATION (Block B — LLM decides DUPLIKAT/VERSCHIEDEN)
# =============================================================================
_REVIEW_VERIFY_SYSTEM = """Du vergleichst Paare von Compliance Controls und entscheidest ob sie Duplikate sind.
Antworte NUR mit einem JSON-Array. Fuer jedes Paar ein Objekt:
{"pair_id": "...", "decision": "DUPLIKAT" oder "VERSCHIEDEN", "reason": "kurze Begruendung"}
DUPLIKAT = gleiche Anforderung, nur anders formuliert.
VERSCHIEDEN = unterschiedliche Anforderungen, auch wenn aehnliche Woerter vorkommen."""
class ReviewVerifyRequest(BaseModel):
    limit: int = 0
    batch_size: int = 10
    dry_run: bool = True
_review_verify_status: dict = {}
async def _run_review_verify(req: ReviewVerifyRequest, job_id: str):
    from services.decomposition_pass import (
        create_anthropic_batch, fetch_batch_results, check_batch_status,
    )
    import asyncio as aio
    db = SessionLocal()
    try:
        _review_verify_status[job_id] = {"status": "loading"}
        query = """
            SELECT r.id::text, r.candidate_control_id, r.candidate_title,
                   r.matched_control_id, c2.title as matched_title,
                   r.similarity_score
            FROM control_dedup_reviews r
            LEFT JOIN canonical_controls c2 ON c2.id = r.matched_control_uuid
            WHERE r.review_status = 'pending'
            ORDER BY r.similarity_score DESC
        """
        if req.limit > 0:
            query += f" LIMIT {req.limit}"
        rows = db.execute(text(query)).fetchall()
        total = len(rows)
        _review_verify_status[job_id] = {"status": "preparing", "total": total}
        if total == 0:
            _review_verify_status[job_id] = {
                "status": "completed", "total": 0, "message": "No pending reviews",
            }
            return
        if req.dry_run:
            _review_verify_status[job_id] = {
                "status": "dry_run", "total": total,
                "estimated_requests": (total + req.batch_size - 1) // req.batch_size,
            }
            return
        # Build batch requests
        api_requests = []
        pair_map = {}
        for i in range(0, total, req.batch_size):
            batch = rows[i:i + req.batch_size]
            prompt = "Vergleiche diese Control-Paare:\n\n"
            batch_pairs = []
            for r in batch:
                pair_id = r[0][:8]
                prompt += (
                    f"Paar {pair_id}:\n"
                    f"  A: {r[1]}{r[2]}\n"
                    f"  B: {r[3]}{r[4]}\n"
                    f"  Similarity: {r[5]:.3f}\n\n"
                )
                batch_pairs.append({"review_id": r[0], "candidate_id": r[1]})
            batch_idx = i // req.batch_size
            custom_id = f"rv_b{batch_idx:05d}"
            pair_map[custom_id] = batch_pairs
            api_requests.append({
                "custom_id": custom_id,
                "params": {
                    "model": "claude-haiku-4-5-20251001",
                    "max_tokens": max(1024, len(batch) * 150),
                    "system": [{
                        "type": "text",
                        "text": _REVIEW_VERIFY_SYSTEM,
                        "cache_control": {"type": "ephemeral"},
                    }],
                    "messages": [{"role": "user", "content": prompt}],
                },
            })
        _review_verify_status[job_id] = {
            "status": "submitting", "total": total, "requests": len(api_requests),
        }
        batch_result = await create_anthropic_batch(api_requests)
        batch_id = batch_result.get("id", "")
        _review_verify_status[job_id] = {
            "status": "batch_submitted", "batch_id": batch_id,
            "total": total, "requests": len(api_requests),
        }
        # Poll for completion
        for _ in range(720):
            await aio.sleep(10)
            status = await check_batch_status(batch_id)
            if status.get("processing_status") == "ended":
                break
        # Process results
        results = await fetch_batch_results(batch_id)
        duplicates = 0
        different = 0
        errors = 0
        for result in results:
            custom_id = result.get("custom_id", "")
            result_data = result.get("result", {})
            if result_data.get("type") != "succeeded":
                errors += 1
                continue
            content = result_data.get("message", {}).get("content", [])
            text_content = content[0].get("text", "") if content else ""
            try:
                import json as jmod
                import re
                json_matches = re.findall(r'\{[^}]+\}', text_content)
                pairs = pair_map.get(custom_id, [])
                for j, match_str in enumerate(json_matches):
                    try:
                        parsed = jmod.loads(match_str)
                    except Exception:
                        continue
                    decision = parsed.get("decision", "").upper()
                    if j < len(pairs):
                        review_id = pairs[j]["review_id"]
                        if "DUPLIKAT" in decision:
                            db.execute(text("""
                                UPDATE control_dedup_reviews
                                SET review_status = 'duplicate', review_notes = :notes
                                WHERE id = CAST(:rid AS uuid)
                            """), {"rid": review_id, "notes": parsed.get("reason", "")})
                            duplicates += 1
                        else:
                            db.execute(text("""
                                UPDATE control_dedup_reviews
                                SET review_status = 'different', review_notes = :notes
                                WHERE id = CAST(:rid AS uuid)
                            """), {"rid": review_id, "notes": parsed.get("reason", "")})
                            different += 1
                db.commit()
            except Exception as e:
                logger.error("Review verify parse error: %s", e)
                errors += 1
                try:
                    db.rollback()
                except Exception:
                    pass
        _review_verify_status[job_id] = {
            "status": "completed", "batch_id": batch_id, "total": total,
            "duplicates": duplicates, "different": different, "errors": errors,
        }
    except Exception as e:
        logger.error("Review verify %s failed: %s", job_id, e)
        _review_verify_status[job_id] = {"status": "failed", "error": str(e)}
    finally:
        db.close()
@router.post("/generate/review-verify")
async def start_review_verify(req: ReviewVerifyRequest):
"""LLM-verify review candidates (DUPLIKAT/VERSCHIEDEN) via Haiku Batch."""
import uuid as uuid_mod
job_id = str(uuid_mod.uuid4())[:8]
_review_verify_status[job_id] = {"status": "starting"}
asyncio.create_task(_run_review_verify(req, job_id))
return {
"status": "running", "job_id": job_id,
"message": f"Poll /generate/review-verify-status/{job_id}",
}
@router.get("/generate/review-verify-status/{job_id}")
async def get_review_verify_status(job_id: str):
status = _review_verify_status.get(job_id)
if not status:
raise HTTPException(status_code=404, detail="Review verify job not found")
return status
@@ -165,21 +165,29 @@ def classify_source_regulation(source_regulation: str) -> str:
"""
Classifies a source_regulation as law, guideline, or framework.
Uses exact matching against the map. For unknown sources, the type
is guessed from keywords; the fallback is 'framework'
(the most conservative result).
Delegates to DB-backed RegulationRegistry (with 5min cache).
Falls back to SOURCE_REGULATION_CLASSIFICATION dict + heuristic
if DB is unavailable.
"""
if not source_regulation:
return SOURCE_TYPE_FRAMEWORK
# Exact match
# Try DB-backed registry first
try:
from services.regulation_registry import classify_source_regulation as _db_classify
result = _db_classify(source_regulation)
if result:
return result
except Exception:
pass
# Fallback: local dict
if source_regulation in SOURCE_REGULATION_CLASSIFICATION:
return SOURCE_REGULATION_CLASSIFICATION[source_regulation]
# Heuristic for unknown sources
lower = source_regulation.lower()
# Detect laws
law_indicators = [
"verordnung", "richtlinie", "gesetz", "directive", "regulation",
"(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
@@ -187,19 +195,16 @@ def classify_source_regulation(source_regulation: str) -> str:
if any(ind in lower for ind in law_indicators):
return SOURCE_TYPE_LAW
# Detect guidelines
guideline_indicators = [
"edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
]
if any(ind in lower for ind in guideline_indicators):
return SOURCE_TYPE_GUIDELINE
# Detect frameworks
framework_indicators = [
"enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
]
if any(ind in lower for ind in framework_indicators):
return SOURCE_TYPE_FRAMEWORK
# Conservative: unknown = framework (least binding)
return SOURCE_TYPE_FRAMEWORK
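The keyword fallback above checks law indicators first, then guidelines, then frameworks, and defaults to framework. The sketch below reproduces that order in isolation. The constant values and the function name `classify_by_keywords` are assumptions, and the law list contains only the entries visible in the diff, since the hunk boundary cuts the list off.

```python
SOURCE_TYPE_LAW = "law"              # assumed constant values
SOURCE_TYPE_GUIDELINE = "guideline"
SOURCE_TYPE_FRAMEWORK = "framework"

# Only the indicators visible in the diff; the original law list continues
# past the hunk boundary.
LAW_INDICATORS = [
    "verordnung", "richtlinie", "gesetz", "directive", "regulation",
    "(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
]
GUIDELINE_INDICATORS = ["edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung"]
FRAMEWORK_INDICATORS = ["enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso"]


def classify_by_keywords(source_regulation: str) -> str:
    """Substring heuristic: first matching category wins, default is framework."""
    if not source_regulation:
        return SOURCE_TYPE_FRAMEWORK
    lower = source_regulation.lower()
    if any(ind in lower for ind in LAW_INDICATORS):
        return SOURCE_TYPE_LAW
    if any(ind in lower for ind in GUIDELINE_INDICATORS):
        return SOURCE_TYPE_GUIDELINE
    if any(ind in lower for ind in FRAMEWORK_INDICATORS):
        return SOURCE_TYPE_FRAMEWORK
    return SOURCE_TYPE_FRAMEWORK


print(classify_by_keywords("Verordnung (EU) 2016/679"))  # law
print(classify_by_keywords("EDPB Guidelines 05/2020"))   # guideline
print(classify_by_keywords("OWASP Top 10"))              # framework
```

Because matching is by substring, a short indicator such as "act" also matches inside longer words: "ENISA Good Practices" is classified as law in this sketch, which is one reason an exact registry lookup is tried first.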
@@ -0,0 +1,72 @@
-- Migration 002: Regulation Registry (Block F1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/002_regulation_registry.sql
SET search_path TO compliance, public;
-- ========================================
-- regulation_registry
-- ========================================
-- Central registry for all regulations, laws, guidelines, and frameworks
-- referenced by the control pipeline. Replaces hardcoded Python dicts
-- (REGULATION_LICENSE_MAP, SOURCE_REGULATION_CLASSIFICATION).
CREATE TABLE IF NOT EXISTS regulation_registry (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- regulation_id: machine key (e.g. "eu_2016_679", "nist_sp_800_53")
regulation_id VARCHAR(100) UNIQUE NOT NULL,
-- Display names
regulation_name_de TEXT,
regulation_name_en TEXT,
regulation_short VARCHAR(50),
-- License classification (3-rule system)
license_rule INTEGER NOT NULL DEFAULT 1
CHECK (license_rule IN (1, 2, 3)),
license_type VARCHAR(50), -- EU_LAW, DE_LAW, CC-BY-SA-4.0, etc.
attribution TEXT, -- Required for Rule 2 (CC-BY)
-- Source classification
source_type VARCHAR(20) NOT NULL DEFAULT 'law'
CHECK (source_type IN ('law', 'guideline', 'standard', 'framework', 'restricted')),
-- Metadata
jurisdiction VARCHAR(10), -- DE, EU, AT, CH, US, FR, ES, NL, IT, HU, INT
category VARCHAR(50),
celex VARCHAR(30), -- EU CELEX number if applicable
url TEXT,
-- Lifecycle
status VARCHAR(20) NOT NULL DEFAULT 'active'
CHECK (status IN ('active', 'needs_review', 'deprecated')),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes
CREATE INDEX IF NOT EXISTS idx_reg_registry_status
ON regulation_registry(status);
CREATE INDEX IF NOT EXISTS idx_reg_registry_jurisdiction
ON regulation_registry(jurisdiction);
CREATE INDEX IF NOT EXISTS idx_reg_registry_source_type
ON regulation_registry(source_type);
CREATE INDEX IF NOT EXISTS idx_reg_registry_license_rule
ON regulation_registry(license_rule);
-- Updated-at trigger
CREATE OR REPLACE FUNCTION update_regulation_registry_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS trg_regulation_registry_updated_at ON regulation_registry;
CREATE TRIGGER trg_regulation_registry_updated_at
BEFORE UPDATE ON regulation_registry
FOR EACH ROW
EXECUTE FUNCTION update_regulation_registry_updated_at();
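The table enforces its value ranges through CHECK constraints on `license_rule`, `source_type`, and `status`. As a rough sketch, a client could mirror those constraints before inserting, so that a bad row is reported with a readable message rather than a database error. The function name `validate_registry_row` is hypothetical; the allowed values are copied from the migration.

```python
ALLOWED_LICENSE_RULES = {1, 2, 3}
ALLOWED_SOURCE_TYPES = {"law", "guideline", "standard", "framework", "restricted"}
ALLOWED_STATUSES = {"active", "needs_review", "deprecated"}


def validate_registry_row(row: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means the row is valid."""
    errors = []
    reg_id = row.get("regulation_id") or ""
    if not reg_id:
        errors.append("regulation_id is required")
    elif len(reg_id) > 100:
        errors.append("regulation_id exceeds VARCHAR(100)")
    # Missing columns fall back to the same defaults as the table definition.
    if row.get("license_rule", 1) not in ALLOWED_LICENSE_RULES:
        errors.append(f"license_rule must be one of {sorted(ALLOWED_LICENSE_RULES)}")
    if row.get("source_type", "law") not in ALLOWED_SOURCE_TYPES:
        errors.append(f"source_type must be one of {sorted(ALLOWED_SOURCE_TYPES)}")
    if row.get("status", "active") not in ALLOWED_STATUSES:
        errors.append(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    return errors
```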
@@ -0,0 +1,498 @@
#!/usr/bin/env python3
"""D6 Citation Backfill — update ~291k controls with section metadata from Qdrant chunks.
Archives old source_citation in generation_metadata.old_citation.
Updates source_citation.article, .paragraph, .page from matched Qdrant chunks.
3-tier matching:
Tier 1: sha256(source_original_text) → exact chunk text match
Tier 2: Parse [section] prefix from source_original_text
Tier 3: Best text overlap within same regulation_id
Usage:
python3 control-pipeline/scripts/d6_citation_backfill.py --dry-run --limit 100
python3 control-pipeline/scripts/d6_citation_backfill.py --batch-size 1000
"""
import argparse
import hashlib
import json
import logging
import os
import re
import time
from dataclasses import dataclass
from typing import Optional
import httpx
import psycopg2
import psycopg2.extras
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("d6-backfill")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
COLLECTIONS = [
"bp_compliance_ce",
"bp_compliance_gesetze",
"bp_compliance_datenschutz",
"bp_dsfa_corpus",
"bp_legal_templates",
]
# Parse [§ 312k Title] or [AC-1 POLICY] prefix from chunk text
_SECTION_PREFIX_RE = re.compile(r'^\[([^\]]+)\]\s*')
@dataclass
class ChunkMeta:
section: str
section_title: str
paragraph: str
page: Optional[int]
regulation_id: str
@dataclass
class Stats:
total: int = 0
already_correct: int = 0
matched_hash: int = 0
matched_prefix: int = 0
matched_overlap: int = 0
unmatched: int = 0
updated: int = 0
errors: int = 0
# -------------------------------------------------------------------
# Phase 1: Build Qdrant index
# -------------------------------------------------------------------
def build_qdrant_index(qdrant_url: str) -> tuple[dict, dict]:
"""Build hash index and regulation index from all Qdrant collections.
Returns:
hash_index: {sha256(chunk_text) → ChunkMeta}
reg_index: {regulation_id → [(chunk_text[:500], ChunkMeta)]}
"""
hash_index: dict[str, ChunkMeta] = {}
reg_index: dict[str, list[tuple[str, ChunkMeta]]] = {}
total_chunks = 0
for coll in COLLECTIONS:
offset = None
coll_count = 0
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"limit": 250,
"with_payload": [
"chunk_text", "section", "section_title",
"paragraph", "page", "regulation_id",
],
"with_vector": False,
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{qdrant_url}/collections/{coll}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
p = pt.get("payload", {})
chunk_text = p.get("chunk_text", "")
if not chunk_text or len(chunk_text.strip()) < 30:
continue
meta = ChunkMeta(
section=p.get("section", "") or "",
section_title=p.get("section_title", "") or "",
paragraph=p.get("paragraph", "") or "",
page=p.get("page"),
regulation_id=p.get("regulation_id", "") or "",
)
# Hash index
h = hashlib.sha256(chunk_text.encode()).hexdigest()
if meta.section: # only index chunks WITH section data
hash_index[h] = meta
# Regulation index (for text overlap matching)
if meta.regulation_id and meta.section:
reg_index.setdefault(meta.regulation_id, []).append(
(chunk_text[:500], meta)
)
coll_count += 1
offset = data.get("next_page_offset")
if offset is None:
break
total_chunks += coll_count
logger.info(" [%s] %d chunks indexed", coll, coll_count)
logger.info("Qdrant index: %d total chunks, %d with section (hash), %d regulations",
total_chunks, len(hash_index), len(reg_index))
return hash_index, reg_index
# -------------------------------------------------------------------
# Phase 2: Load controls
# -------------------------------------------------------------------
def load_controls(db_url: str, limit: int = 0) -> list[dict]:
"""Load all controls needing citation update."""
conn = psycopg2.connect(db_url)
conn.set_session(autocommit=False)
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute("SET search_path TO compliance, core, public")
query = """
SELECT id, control_id, source_citation, source_original_text,
generation_metadata, license_rule
FROM canonical_controls
WHERE license_rule IN (1, 2)
AND source_citation IS NOT NULL
ORDER BY control_id
"""
if limit > 0:
query += f" LIMIT {limit}"
cur.execute(query)
rows = cur.fetchall()
conn.close()
controls = []
for row in rows:
ctrl = dict(row)
ctrl["id"] = str(ctrl["id"])
for jf in ("source_citation", "generation_metadata"):
val = ctrl.get(jf)
if isinstance(val, str):
try:
ctrl[jf] = json.loads(val)
except (json.JSONDecodeError, TypeError):
ctrl[jf] = {}
elif val is None:
ctrl[jf] = {}
controls.append(ctrl)
return controls
# -------------------------------------------------------------------
# Phase 3: Matching
# -------------------------------------------------------------------
def match_control(
ctrl: dict,
hash_index: dict[str, ChunkMeta],
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
) -> tuple[Optional[ChunkMeta], str]:
"""Match a control to a Qdrant chunk. Returns (meta, method) or (None, '')."""
source_text = ctrl.get("source_original_text", "") or ""
# Tier 1: Hash match
if source_text:
h = hashlib.sha256(source_text.encode()).hexdigest()
meta = hash_index.get(h)
if meta and meta.section:
return meta, "hash"
# Tier 2: Parse [section] prefix from source_original_text
if source_text:
m = _SECTION_PREFIX_RE.match(source_text)
if m:
prefix = m.group(1).strip()
parsed = _parse_section_from_prefix(prefix)
if parsed:
return parsed, "prefix"
# Tier 3: Text overlap within same regulation
gen_meta = ctrl.get("generation_metadata") or {}
reg_id = gen_meta.get("source_regulation", "")
if reg_id and source_text and reg_id in reg_index:
best = _find_best_overlap(source_text, reg_index[reg_id])
if best:
return best, "overlap"
return None, ""
def _parse_section_from_prefix(prefix: str) -> Optional[ChunkMeta]:
"""Parse a section prefix like '§ 312k Kuendigungsbutton' or 'AC-1 POLICY'."""
if not prefix:
return None
# § pattern
m = re.match(r'(§\s*\d+[a-z]*)\s*(.*)', prefix)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# Art./Artikel pattern
m = re.match(r'(Art(?:ikel|\.)\s*\d+)\s*(.*)', prefix, re.IGNORECASE)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# NIST control pattern (AC-1, AU-2, etc.)
m = re.match(r'([A-Z]{2,4}-\d+(?:\(\d+\))?)\s*(.*)', prefix)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# Numbered section (3.1 Title)
m = re.match(r'(\d+(?:\.\d+)+)\s*(.*)', prefix)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# ALL-CAPS heading (fallback — use as section_title)
if prefix == prefix.upper() and len(prefix) > 3:
return ChunkMeta(
section="", section_title=prefix,
paragraph="", page=None, regulation_id="",
)
return None
def _find_best_overlap(source_text: str, chunks: list[tuple[str, ChunkMeta]]) -> Optional[ChunkMeta]:
"""Find chunk with best text overlap (simple word-set Jaccard)."""
source_words = set(source_text.lower().split())
if len(source_words) < 5:
return None
best_score = 0.0
best_meta = None
for chunk_text, meta in chunks:
chunk_words = set(chunk_text.lower().split())
if not chunk_words:
continue
intersection = len(source_words & chunk_words)
union = len(source_words | chunk_words)
jaccard = intersection / union if union > 0 else 0
if jaccard > best_score and jaccard > 0.3: # 30% threshold
best_score = jaccard
best_meta = meta
return best_meta
# -------------------------------------------------------------------
# Phase 4: Update controls
# -------------------------------------------------------------------
def update_controls(
db_url: str,
controls: list[dict],
hash_index: dict[str, ChunkMeta],
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
dry_run: bool = True,
batch_size: int = 1000,
) -> Stats:
"""Match and update all controls."""
stats = Stats(total=len(controls))
conn = psycopg2.connect(db_url)
conn.set_session(autocommit=False)
cur = conn.cursor()
cur.execute("SET search_path TO compliance, core, public")
updates = []
for i, ctrl in enumerate(controls):
if i > 0 and i % 5000 == 0:
logger.info("Progress: %d/%d (hash=%d prefix=%d overlap=%d unmatched=%d)",
i, stats.total, stats.matched_hash, stats.matched_prefix,
stats.matched_overlap, stats.unmatched)
citation = ctrl.get("source_citation") or {}
old_article = citation.get("article", "")
gen_meta = ctrl.get("generation_metadata") or {}
# Match
meta, method = match_control(ctrl, hash_index, reg_index)
if not meta or not meta.section:
# No match — check if existing article is already good
if old_article:
stats.already_correct += 1
else:
stats.unmatched += 1
continue
# Check if update is needed
if old_article == meta.section:
stats.already_correct += 1
continue
# Track method
if method == "hash":
stats.matched_hash += 1
elif method == "prefix":
stats.matched_prefix += 1
elif method == "overlap":
stats.matched_overlap += 1
# Archive old citation
if old_article or citation.get("paragraph"):
gen_meta["old_citation"] = {
"article": old_article,
"paragraph": citation.get("paragraph", ""),
"page": citation.get("page"),
"archived_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
# Update citation
citation["article"] = meta.section
if meta.paragraph:
citation["paragraph"] = meta.paragraph
if meta.page is not None:
citation["page"] = meta.page
# Update generation_metadata
gen_meta["source_article"] = meta.section
if meta.paragraph:
gen_meta["source_paragraph"] = meta.paragraph
if meta.page is not None:
gen_meta["source_page"] = meta.page
gen_meta["backfill_method"] = method
gen_meta["backfill_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
updates.append((
json.dumps(citation, ensure_ascii=False),
json.dumps(gen_meta, ensure_ascii=False, default=str),
ctrl["id"],
))
# Batch commit
if len(updates) >= batch_size and not dry_run:
_execute_batch(cur, updates)
conn.commit()
stats.updated += len(updates)
logger.info("Committed batch: %d updates (total %d)", len(updates), stats.updated)
updates = []
# Final batch
if updates and not dry_run:
_execute_batch(cur, updates)
conn.commit()
stats.updated += len(updates)
logger.info("Committed final batch: %d updates (total %d)", len(updates), stats.updated)
elif updates and dry_run:
stats.updated = len(updates) # would-be updates
conn.close()
return stats
def _execute_batch(cur, updates: list[tuple]):
"""Execute batch UPDATE statements."""
for citation_json, meta_json, ctrl_id in updates:
cur.execute(
"""UPDATE canonical_controls
SET source_citation = %s::jsonb,
generation_metadata = %s::jsonb,
updated_at = NOW()
WHERE id = %s::uuid""",
(citation_json, meta_json, ctrl_id),
)
# -------------------------------------------------------------------
# Main
# -------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="D6 Citation Backfill")
parser.add_argument("--dry-run", action="store_true", help="Don't write to DB")
parser.add_argument("--limit", type=int, default=0, help="Limit controls (0=all)")
parser.add_argument("--batch-size", type=int, default=1000)
parser.add_argument("--db-url", default=DB_URL)
parser.add_argument("--qdrant-url", default=QDRANT_URL)
args = parser.parse_args()
logger.info("=" * 60)
logger.info("D6 Citation Backfill")
logger.info(" DB: %s", args.db_url.split("@")[-1])
logger.info(" Qdrant: %s", args.qdrant_url)
logger.info(" Dry run: %s", args.dry_run)
logger.info(" Limit: %s", args.limit or "ALL")
logger.info("=" * 60)
# Phase 1: Build Qdrant index
logger.info("\nPhase 1: Building Qdrant index...")
t0 = time.time()
hash_index, reg_index = build_qdrant_index(args.qdrant_url)
logger.info("Index built in %.1fs", time.time() - t0)
# Phase 2: Load controls
logger.info("\nPhase 2: Loading controls...")
controls = load_controls(args.db_url, args.limit)
logger.info("Loaded %d controls", len(controls))
if not controls:
logger.info("No controls to process")
return
# Phase 3+4: Match and update
logger.info("\nPhase 3+4: Matching and updating...")
t0 = time.time()
stats = update_controls(
args.db_url, controls, hash_index, reg_index,
dry_run=args.dry_run, batch_size=args.batch_size,
)
elapsed = time.time() - t0
# Summary
logger.info("\n" + "=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
logger.info(" Total controls: %d", stats.total)
logger.info(" Already correct: %d (%.1f%%)", stats.already_correct,
stats.already_correct / max(stats.total, 1) * 100)
logger.info(" Matched (hash): %d (%.1f%%)", stats.matched_hash,
stats.matched_hash / max(stats.total, 1) * 100)
logger.info(" Matched (prefix): %d (%.1f%%)", stats.matched_prefix,
stats.matched_prefix / max(stats.total, 1) * 100)
logger.info(" Matched (overlap): %d (%.1f%%)", stats.matched_overlap,
stats.matched_overlap / max(stats.total, 1) * 100)
logger.info(" Unmatched: %d (%.1f%%)", stats.unmatched,
stats.unmatched / max(stats.total, 1) * 100)
logger.info(" Updated: %d", stats.updated)
logger.info(" Errors: %d", stats.errors)
logger.info(" Time: %.1fs (%.0f controls/sec)", elapsed,
stats.total / max(elapsed, 1))
if args.dry_run:
logger.info("\nDRY RUN — no changes written. Run without --dry-run to apply.")
if __name__ == "__main__":
main()
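Tier 3 of the matching picks the chunk whose word set has the highest Jaccard similarity to the control's source text, provided the score exceeds 0.3 and the source has at least five distinct words. A self-contained sketch of that rule follows. The names `jaccard` and `best_overlap` are illustrative, and candidates carry a plain string label here instead of a `ChunkMeta`.

```python
from typing import Optional


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B| on lowercased tokens."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def best_overlap(
    source_text: str,
    candidates: list[tuple[str, str]],  # (chunk_text, label)
    threshold: float = 0.3,
    min_words: int = 5,
) -> Optional[str]:
    """Return the label of the best candidate strictly above the threshold."""
    if len(set(source_text.lower().split())) < min_words:
        return None
    best_score, best_label = threshold, None
    for chunk_text, label in candidates:
        score = jaccard(source_text, chunk_text)
        if score > best_score:
            best_score, best_label = score, label
    return best_label


source = "the controller shall implement appropriate technical measures"
chunks = [
    ("the controller shall implement appropriate organisational measures", "Art. 24"),
    ("access control policy and procedures", "AC-1"),
]
print(best_overlap(source, chunks))  # Art. 24 (6 shared words of 8 in the union)
```

Word-set overlap ignores word order and repetition, so it is cheap to compute but can match unrelated passages that share boilerplate vocabulary; the threshold limits that risk rather than removing it.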
@@ -0,0 +1,280 @@
#!/usr/bin/env python3
"""Extract large NIST PDFs locally, then upload as .txt to RAG service.
Workaround for embedding-service container crashing on large PDFs (>5 MB).
Runs pdfplumber + normalization locally, uploads extracted text as .txt.
Usage (on Mac Mini):
python3 control-pipeline/scripts/extract_and_upload_nist.py
"""
import json
import os
import re
import sys
import tempfile
import unicodedata
import httpx
import pdfplumber
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
DOCS = [
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_53r5.txt",
"extra_metadata": {
"regulation_id": "nist_sp800_53r5",
"source_id": "nist",
"doc_type": "controls_catalog",
"guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_82r3.txt",
"extra_metadata": {
"regulation_id": "nist_sp_800_82r3",
"regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_short": "NIST SP 800-82",
"category": "ot_security",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_160v1r1.txt",
"extra_metadata": {
"regulation_id": "nist_sp_800_160v1r1",
"regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_short": "NIST SP 800-160",
"category": "security_engineering",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_207.txt",
"extra_metadata": {
"regulation_id": "nist_sp800_207",
"source_id": "nist",
"doc_type": "architecture",
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
]
def normalize_pdf_text(text: str) -> str:
"""Fix broken spacing from multi-column PDF extraction."""
text = unicodedata.normalize('NFKC', text)
text = text.replace('\u00ad', '').replace('\u200b', '')
prev = None
while prev != text:
prev = text
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
text = re.sub(
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
)
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
text = re.sub(r'[^\S\n]{2,}', ' ', text)
return text
def extract_pdf_locally(pdf_bytes: bytes) -> str:
"""Extract text from PDF using pdfplumber with normalization."""
import io
text_parts = []
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
print(f" Pages: {len(pdf.pages)}")
for i, page in enumerate(pdf.pages):
text = page.extract_text(x_tolerance=3, y_tolerance=4)
if text:
text_parts.append(text)
if (i + 1) % 50 == 0:
print(f" Extracted {i + 1}/{len(pdf.pages)} pages...")
raw = "\n\n".join(text_parts)
return normalize_pdf_text(raw)
def download_from_minio(object_name: str) -> bytes:
"""Download file from MinIO via RAG service."""
with httpx.Client(timeout=60.0, verify=False) as c:
resp = c.get(f"{RAG_URL}/api/v1/documents/download/{object_name}")
resp.raise_for_status()
url = resp.json()["url"]
with httpx.Client(timeout=300.0, verify=False) as c:
resp = c.get(url)
resp.raise_for_status()
return resp.content
def upload_text(
text: str, filename: str, collection: str, extra_metadata: dict,
) -> dict:
"""Upload extracted text to RAG service as .txt."""
form_data = {
"collection": collection,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "recursive",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
text_bytes = text.encode("utf-8")
with httpx.Client(timeout=1800.0, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, text_bytes, "text/plain")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_chunks(collection: str, regulation_id: str) -> int:
"""Count chunks for a regulation in Qdrant."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/count",
json={
"filter": {
"must": [{
"key": "regulation_id",
"match": {"value": regulation_id},
}]
},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def check_section_rate(collection: str, regulation_id: str) -> tuple:
"""Returns (total_chunks, chunks_with_section)."""
total = 0
with_sec = 0
offset = None
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"filter": {
"must": [{
"key": "regulation_id",
"match": {"value": regulation_id},
}]
},
"limit": 100,
"with_payload": ["section"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
total += 1
s = pt.get("payload", {}).get("section", "")
if s and s.strip():
with_sec += 1
offset = data.get("next_page_offset")
if offset is None:
break
return total, with_sec
def main():
print("=" * 60)
print("NIST PDF Local Extraction + Upload")
print("=" * 60)
results = []
for i, doc in enumerate(DOCS, 1):
reg_id = doc["extra_metadata"]["regulation_id"]
print(f"\n[{i}/{len(DOCS)}] {doc['filename']} -> {doc['collection']}")
# 1. Check current state
existing = count_chunks(doc["collection"], reg_id)
print(f" Existing chunks: {existing}")
# 2. Download PDF from MinIO
print(f" Downloading from MinIO...")
pdf_bytes = download_from_minio(doc["object_name"])
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
# 3. Extract text locally with pdfplumber
print(f" Extracting text locally...")
text = extract_pdf_locally(pdf_bytes)
print(f" Extracted {len(text):,} chars, {text.count(chr(10)):,} lines")
# 4. Save extracted text temporarily (for debugging)
tmp_path = f"/tmp/nist_{reg_id}.txt"
with open(tmp_path, "w", encoding="utf-8") as f:
f.write(text)
print(f" Saved to {tmp_path}")
# 5. Upload as .txt
print(f" Uploading as .txt to RAG service...")
result = upload_text(text, doc["filename"], doc["collection"],
doc["extra_metadata"])
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
print(f" Uploaded: {new_chunks} chunks (doc_id={new_doc_id})")
# 6. Check section rate
if new_chunks > 0:
total, with_sec = check_section_rate(doc["collection"], reg_id)
pct = (with_sec / total * 100) if total > 0 else 0
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
else:
pct = 0
print(" WARNING: 0 chunks created!")
results.append({
"file": doc["filename"],
"old": existing,
"new": new_chunks,
"section_rate": round(pct, 1),
})
# Summary
print("\n" + "=" * 60)
print("RESULTS")
print("=" * 60)
for r in results:
print(f" {r['file']:<40} old={r['old']} new={r['new']} sect={r['section_rate']}%")
total_new = sum(r["new"] for r in results)
print(f"\nTotal new chunks: {total_new}")
if any(r["new"] == 0 for r in results):
print("\nWARNING: Some documents produced 0 chunks!")
sys.exit(1)
if __name__ == "__main__":
main()
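To make the effect of `normalize_pdf_text` concrete, the sketch below applies the same substitutions to a few strings of the kind multi-column extraction produces. The regular expressions are copied from the script. Which statements sit inside the `while` loop is an assumption, because the diff view has lost its indentation.

```python
import re
import unicodedata


def normalize_pdf_text(text: str) -> str:
    """Repair spacing artifacts in identifiers and section numbers."""
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens and zero-width spaces.
    text = text.replace("\u00ad", "").replace("\u200b", "")
    prev = None
    while prev != text:  # repeat until no substitution changes the text
        prev = text
        text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)             # 3 . 1 -> 3.1
        text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)   # AC - 1 -> AC-1
        text = re.sub(
            r"\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b", r"\1.\2-\3", text
        )                                                                 # PR . AC-01 -> PR.AC-01
        text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)                  # ( 2 ) -> (2)
    # Collapse runs of spaces and tabs, but keep newlines.
    return re.sub(r"[^\S\n]{2,}", " ", text)


for sample in ["AC - 1  POLICY AND PROCEDURES", "see Section 3 . 1 ( 2 )", "PR . AC - 01"]:
    print(repr(normalize_pdf_text(sample)))
```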
@@ -0,0 +1,247 @@
#!/usr/bin/env python3
"""
F1 Migration: Populate regulation_registry from hardcoded Python dicts.
Sources:
- REGULATION_LICENSE_MAP (control_generator.py) — 135 entries keyed by regulation_id
- SOURCE_REGULATION_CLASSIFICATION (source_type_classification.py) — 58 entries keyed by name
Usage:
# Dry run (prints SQL, no DB write):
python3 scripts/f1_migrate_regulation_registry.py --dry-run
# Against Mac Mini:
python3 scripts/f1_migrate_regulation_registry.py --db-host macmini
# Against local Docker:
python3 scripts/f1_migrate_regulation_registry.py --db-host localhost
"""
import argparse
import sys
from pathlib import Path
# Add parent so we can import from services/data
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from services.control_generator import REGULATION_LICENSE_MAP, _RULE2_PREFIXES, _RULE3_PREFIXES # noqa: E402
from data.source_type_classification import SOURCE_REGULATION_CLASSIFICATION # noqa: E402
# Derive jurisdiction from license_type
_LICENSE_TO_JURISDICTION = {
"EU_LAW": "EU",
"EU_PUBLIC": "EU",
"DE_LAW": "DE",
"DE_PUBLIC": "DE",
"AT_LAW": "AT",
"CH_LAW": "CH",
"FR_LAW": "FR",
"ES_LAW": "ES",
"NL_LAW": "NL",
"IT_LAW": "IT",
"HU_LAW": "HU",
"NIST_PUBLIC_DOMAIN": "US",
"US_GOV_PUBLIC": "US",
"CC-BY-SA-4.0": "INT",
"CC-BY-4.0": "INT",
"OECD_PUBLIC": "INT",
}
def _derive_jurisdiction(license_type: str) -> str:
"""Map license_type to jurisdiction code."""
return _LICENSE_TO_JURISDICTION.get(license_type, "INT")
def build_rows() -> list[dict]:
"""Merge REGULATION_LICENSE_MAP + SOURCE_REGULATION_CLASSIFICATION into rows."""
rows = []
# Track names we've seen (for dedup against SOURCE_REGULATION_CLASSIFICATION)
seen_names: set[str] = set()
# 1) Primary source: REGULATION_LICENSE_MAP (has regulation_id as key)
for reg_id, info in REGULATION_LICENSE_MAP.items():
name = info.get("name", reg_id)
seen_names.add(name)
rows.append({
"regulation_id": reg_id.lower().strip(),
"regulation_name_de": name,
"license_rule": info["rule"],
"license_type": info.get("license", ""),
"attribution": info.get("attribution"),
"source_type": info.get("source_type", "law"),
"jurisdiction": _derive_jurisdiction(info.get("license", "")),
"status": "active",
})
# 2) Secondary: SOURCE_REGULATION_CLASSIFICATION entries not already covered
# These are keyed by name, not by regulation_id. We create synthetic IDs.
for name, source_type in SOURCE_REGULATION_CLASSIFICATION.items():
if name in seen_names:
continue
# Generate a regulation_id from the name
synthetic_id = (
name.lower()
.replace(" ", "_")
.replace("(", "")
.replace(")", "")
.replace("/", "_")
.replace("-", "_")
.replace(".", "")
.replace(",", "")
.replace("ä", "ae")
.replace("ö", "oe")
.replace("ü", "ue")
.replace("á", "a")
.replace("é", "e")
.replace("ó", "o")
.strip("_")
)[:100]
# Guess jurisdiction from name content
jurisdiction = "INT"
name_lower = name.lower()
if any(x in name_lower for x in ["edpb", "edps", "(eu)", "eu ", "wp2"]):
jurisdiction = "EU"
elif any(x in name_lower for x in ["bsi", "bdsg", "bundes", "gwg"]):
jurisdiction = "DE"
elif "nist" in name_lower or "cisa" in name_lower:
jurisdiction = "US"
elif "österreich" in name_lower:
jurisdiction = "AT"
elif "schweiz" in name_lower:
jurisdiction = "CH"
elif "spanien" in name_lower:
jurisdiction = "ES"
elif "frankreich" in name_lower:
jurisdiction = "FR"
elif "ungarn" in name_lower:
jurisdiction = "HU"
# Map source_type_classification's "framework" to our "standard"
# (source_type_classification uses law/guideline/framework)
mapped_source_type = source_type
if source_type == "framework":
mapped_source_type = "standard"
rows.append({
"regulation_id": synthetic_id,
"regulation_name_de": name,
"license_rule": 1, # default: conservative
"license_type": "",
"attribution": None,
"source_type": mapped_source_type,
"jurisdiction": jurisdiction,
"status": "needs_review", # needs manual review since we guessed
})
return rows
def generate_sql(rows: list[dict]) -> str:
"""Generate INSERT SQL for all rows."""
lines = [
"SET search_path TO compliance, public;",
"",
"-- Auto-generated by f1_migrate_regulation_registry.py",
f"-- {len(rows)} rows total",
"",
]
for row in rows:
attr = f"'{_escape_sql(row['attribution'])}'" if row["attribution"] else "NULL"
lines.append(
f"INSERT INTO regulation_registry "
f"(regulation_id, regulation_name_de, license_rule, license_type, "
f"attribution, source_type, jurisdiction, status) "
f"VALUES ("
f"'{row['regulation_id']}', "
f"'{_escape_sql(row['regulation_name_de'])}', "
f"{row['license_rule']}, "
f"'{row['license_type']}', "
f"{attr}, "
f"'{row['source_type']}', "
f"'{row['jurisdiction']}', "
f"'{row['status']}'"
f") ON CONFLICT (regulation_id) DO UPDATE SET "
f"regulation_name_de = EXCLUDED.regulation_name_de, "
f"license_rule = EXCLUDED.license_rule, "
f"license_type = EXCLUDED.license_type, "
f"attribution = EXCLUDED.attribution, "
f"source_type = EXCLUDED.source_type, "
f"jurisdiction = EXCLUDED.jurisdiction;"
)
return "\n".join(lines)
def _escape_sql(val: str) -> str:
"""Escape single quotes for SQL."""
return val.replace("'", "''")
def insert_via_sqlalchemy(rows: list[dict], db_host: str) -> int:
"""Insert rows using SQLAlchemy (same pattern as control-pipeline)."""
from sqlalchemy import create_engine, text
url = f"postgresql://breakpilot:breakpilot123@{db_host}:5432/breakpilot_db"
engine = create_engine(url)
inserted = 0
with engine.connect() as conn:
conn.execute(text("SET search_path TO compliance, public"))
for row in rows:
conn.execute(
text("""
INSERT INTO regulation_registry
(regulation_id, regulation_name_de, license_rule, license_type,
attribution, source_type, jurisdiction, status)
VALUES
(:regulation_id, :regulation_name_de, :license_rule, :license_type,
:attribution, :source_type, :jurisdiction, :status)
ON CONFLICT (regulation_id) DO UPDATE SET
regulation_name_de = EXCLUDED.regulation_name_de,
license_rule = EXCLUDED.license_rule,
license_type = EXCLUDED.license_type,
attribution = EXCLUDED.attribution,
source_type = EXCLUDED.source_type,
jurisdiction = EXCLUDED.jurisdiction
"""),
row,
)
inserted += 1
conn.commit()
return inserted
def main():
parser = argparse.ArgumentParser(description="Migrate regulation registry data")
parser.add_argument("--dry-run", action="store_true", help="Print SQL only")
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
args = parser.parse_args()
rows = build_rows()
print(f"Built {len(rows)} rows from hardcoded dicts")
# Stats
by_rule = {}
by_status = {}
for r in rows:
by_rule[r["license_rule"]] = by_rule.get(r["license_rule"], 0) + 1
by_status[r["status"]] = by_status.get(r["status"], 0) + 1
print(f" By license_rule: {by_rule}")
print(f" By status: {by_status}")
if args.dry_run:
print("\n--- DRY RUN (SQL output) ---\n")
print(generate_sql(rows))
return
inserted = insert_via_sqlalchemy(rows, args.db_host)
print(f"Inserted/updated {inserted} rows into regulation_registry")
if __name__ == "__main__":
main()
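
The dry-run path in `generate_sql` builds SQL literals by string formatting. As a minimal sketch of how that quoting could be kept in one place, assuming it is only used for printed dry-run output: `sql_literal` and `render_insert` are hypothetical helpers, not part of the migration script, and the real insert path should keep using bound parameters as `insert_via_sqlalchemy` does.

```python
from typing import Optional


def sql_literal(value: Optional[object]) -> str:
    """Render a Python value as a SQL literal (hypothetical helper, dry-run only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Double single quotes, the same rule _escape_sql applies.
    return "'" + str(value).replace("'", "''") + "'"


def render_insert(row: dict) -> str:
    """Build one INSERT statement from a row dict, in column order."""
    columns = list(row)
    values = ", ".join(sql_literal(row[c]) for c in columns)
    return f"INSERT INTO regulation_registry ({', '.join(columns)}) VALUES ({values});"
```

With a helper like this, every column goes through the same escaping rule, so a free-text field such as `attribution` cannot be forgotten.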
@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""Ingest missing German laws from gesetze-im-internet.de.
Downloads full HTML, strips to text, uploads with legal chunking strategy.
Handles ISO-8859-1 charset typical for gesetze-im-internet.de.
Usage (on Mac Mini):
python3 control-pipeline/scripts/ingest_de_laws.py --dry-run
python3 control-pipeline/scripts/ingest_de_laws.py
"""
import argparse
import json
import logging
import time
from typing import Optional
import httpx
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-laws")
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_gesetze"
# ---- Laws to ingest ----
# Each entry is a dict with: url (full-text HTML page), regulation_id,
# name (display name) and short (abbreviation).
# URL pattern: https://www.gesetze-im-internet.de/{slug}/BJNR*.html (full text)
LAWS = [
{
"url": "https://www.gesetze-im-internet.de/arbzg/BJNR117100994.html",
"regulation_id": "de_arbzg",
"name": "Arbeitszeitgesetz (ArbZG)",
"short": "ArbZG",
},
{
"url": "https://www.gesetze-im-internet.de/muschg_2018/BJNR122810017.html",
"regulation_id": "de_muschg",
"name": "Mutterschutzgesetz (MuSchG)",
"short": "MuSchG",
},
{
"url": "https://www.gesetze-im-internet.de/nachwg/BJNR094610995.html",
"regulation_id": "de_nachwg",
"name": "Nachweisgesetz (NachwG)",
"short": "NachwG",
},
{
"url": "https://www.gesetze-im-internet.de/milog/BJNR134810014.html",
"regulation_id": "de_milog",
"name": "Mindestlohngesetz (MiLoG)",
"short": "MiLoG",
},
{
"url": "https://www.gesetze-im-internet.de/gmbhg/BJNR004770892.html",
"regulation_id": "de_gmbhg",
"name": "GmbH-Gesetz (GmbHG)",
"short": "GmbHG",
},
{
"url": "https://www.gesetze-im-internet.de/aktg/BJNR010890965.html",
"regulation_id": "de_aktg",
"name": "Aktiengesetz (AktG)",
"short": "AktG",
},
{
"url": "https://www.gesetze-im-internet.de/inso/BJNR286600994.html",
"regulation_id": "de_inso",
"name": "Insolvenzordnung (InsO)",
"short": "InsO",
},
# BEG IV is an amending act; there is no standalone text for it on gesetze-im-internet.de
{
"url": "https://www.gesetze-im-internet.de/verpflg/BJNR009690974.html",
"regulation_id": "de_verpflichtungsgesetz",
"name": "Verpflichtungsgesetz",
"short": "VerpflG",
},
{
"url": "https://www.gesetze-im-internet.de/burlg/BJNR000020963.html",
"regulation_id": "de_burlg",
"name": "Bundesurlaubsgesetz (BUrlG)",
"short": "BUrlG",
},
{
"url": "https://www.gesetze-im-internet.de/entgfg/BJNR118010994.html",
"regulation_id": "de_entgfg",
"name": "Entgeltfortzahlungsgesetz (EntgFG)",
"short": "EntgFG",
},
]
def download_law(url: str) -> Optional[str]:
"""Download law HTML from gesetze-im-internet.de, handle charset."""
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
resp = c.get(url)
if resp.status_code != 200:
logger.error(" HTTP %d for %s", resp.status_code, url)
return None
# gesetze-im-internet.de uses ISO-8859-1
content_type = resp.headers.get("content-type", "")
if "charset" in content_type:
# Use declared charset
html = resp.text
else:
# Try UTF-8 first, fall back to ISO-8859-1
try:
html = resp.content.decode("utf-8")
if "\ufffd" in html:
raise UnicodeDecodeError("utf-8", b"", 0, 1, "replacement chars")
except (UnicodeDecodeError, ValueError):
html = resp.content.decode("iso-8859-1")
return html
def upload_html(
html: str,
filename: str,
regulation_id: str,
name: str,
short: str,
dry_run: bool = False,
) -> Optional[dict]:
"""Upload HTML to RAG service with legal chunking."""
if dry_run:
logger.info(" DRY RUN — would upload %d chars", len(html))
return {"chunks_count": 0, "document_id": "dry-run"}
meta = {
"regulation_id": regulation_id,
"regulation_name_de": name,
"regulation_short": short,
"source": "gesetze-im-internet.de",
"license": "public_domain_de_law",
"jurisdiction": "DE",
"source_type": "law",
}
form_data = {
"collection": COLLECTION,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(meta, ensure_ascii=False),
}
with httpx.Client(timeout=600.0, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, html.encode("utf-8"), "text/html")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_existing(regulation_id: str) -> int:
"""Return the number of chunks stored in Qdrant for regulation_id (0 if none)."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
json={
"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def main():
parser = argparse.ArgumentParser(description="Ingest DE laws from gesetze-im-internet.de")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
logger.info("=" * 60)
logger.info("Ingest German Laws")
logger.info(" Laws: %d", len(LAWS))
logger.info(" Collection: %s", COLLECTION)
logger.info(" Dry run: %s", args.dry_run)
logger.info("=" * 60)
results = []
for i, law in enumerate(LAWS, 1):
logger.info("\n[%d/%d] %s (%s)", i, len(LAWS), law["name"], law["regulation_id"])
# Check if already exists
existing = count_existing(law["regulation_id"])
if existing > 0:
logger.info(" Already exists: %d chunks — SKIPPING", existing)
results.append({"law": law["short"], "status": "exists", "chunks": existing})
continue
# Download
logger.info(" Downloading: %s", law["url"])
html = download_law(law["url"])
if not html:
results.append({"law": law["short"], "status": "download_failed", "chunks": 0})
continue
logger.info(" Downloaded: %d chars", len(html))
# Upload
filename = f"{law['regulation_id']}.html"
try:
result = upload_html(
html, filename, law["regulation_id"],
law["name"], law["short"], args.dry_run,
)
chunks = result.get("chunks_count", 0) if result else 0
logger.info(" Uploaded: %d chunks", chunks)
results.append({"law": law["short"], "status": "ok", "chunks": chunks})
except Exception as e:
logger.error(" Upload FAILED: %s", e)
results.append({"law": law["short"], "status": "error", "chunks": 0})
if i < len(LAWS):
time.sleep(1)
# Summary
logger.info("\n" + "=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
for r in results:
logger.info(" %-10s %s chunks=%d", r["law"], r["status"].upper(), r["chunks"])
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
logger.info("\nTotal new chunks: %d", total_new)
if __name__ == "__main__":
main()
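
The charset handling in `download_law` (declared charset first, then UTF-8, then ISO-8859-1) can be isolated into a pure function that is easy to test without network access. This is a sketch under that assumption; `decode_html` is a hypothetical name and does not exist in the script.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode HTML bytes: declared charset, then UTF-8, then ISO-8859-1."""
    if declared:
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrong charset label: fall through to the guesses below.
            pass
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")
```

Because ISO-8859-1 accepts any byte sequence, it has to come last; trying it first would silently turn valid UTF-8 into mojibake.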
@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""Ingest missing EU regulations from EUR-Lex (HTML).
Downloads German HTML from EUR-Lex via CELEX number, uploads with legal chunking.
Usage (on Mac Mini):
python3 control-pipeline/scripts/ingest_eu_regulations.py --dry-run
python3 control-pipeline/scripts/ingest_eu_regulations.py
"""
import argparse
import json
import logging
import time
import httpx
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-eu")
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_ce"
EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
# ---- EU Regulations to ingest ----
REGULATIONS = [
{
"celex": "32022L2464",
"regulation_id": "csrd_2022",
"name": "Corporate Sustainability Reporting Directive (CSRD)",
"short": "CSRD",
"category": "sustainability",
},
{
"celex": "32024L1760",
"regulation_id": "csddd_2024",
"name": "Corporate Sustainability Due Diligence Directive (CSDDD)",
"short": "CSDDD",
"category": "sustainability",
},
{
"celex": "32020R0852",
"regulation_id": "eu_taxonomy_2020",
"name": "EU-Taxonomie-Verordnung",
"short": "EU Taxonomy",
"category": "sustainability",
},
{
"celex": "32024R1183",
"regulation_id": "eidas_2_0_2024",
"name": "eIDAS 2.0 Verordnung (EU Digital Identity)",
"short": "eIDAS 2.0",
"category": "digital_identity",
},
{
"celex": "32023L0970",
"regulation_id": "pay_transparency_2023",
"name": "Entgelttransparenz-Richtlinie",
"short": "Pay Transparency",
"category": "employment",
},
{
"celex": "32022R2065",
"regulation_id": "dsa_2022_updated",
"name": "Digital Services Act (DSA) — aktualisiert",
"short": "DSA",
"category": "digital_services",
"skip_if_exists": "dsa_2022", # already exists under different ID
},
]
def download_eurlex(celex: str) -> str:
"""Download EU regulation HTML from EUR-Lex."""
url = EURLEX_URL.format(celex=celex)
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
resp = c.get(url)
resp.raise_for_status()
return resp.text
def upload_html(html: str, filename: str, reg: dict, dry_run: bool = False):
"""Upload HTML to RAG service."""
if dry_run:
logger.info(" DRY RUN — would upload %d chars", len(html))
return {"chunks_count": 0}
meta = {
"regulation_id": reg["regulation_id"],
"regulation_name_de": reg["name"],
"regulation_short": reg["short"],
"celex": reg["celex"],
"category": reg["category"],
"source": "EUR-Lex",
"license": "EU_law",
"jurisdiction": "EU",
"source_type": "law",
}
form_data = {
"collection": COLLECTION,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(meta, ensure_ascii=False),
}
with httpx.Client(timeout=600.0, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, html.encode("utf-8"), "text/html")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_existing(regulation_id: str) -> int:
"""Return the number of Qdrant points stored for regulation_id."""
with httpx.Client(timeout=60.0) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
json={"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]}, "exact": True},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
logger.info("=" * 60)
logger.info("Ingest EU Regulations from EUR-Lex")
logger.info(" Regulations: %d", len(REGULATIONS))
logger.info(" Dry run: %s", args.dry_run)
logger.info("=" * 60)
results = []
for i, reg in enumerate(REGULATIONS, 1):
logger.info("\n[%d/%d] %s (CELEX: %s)", i, len(REGULATIONS), reg["name"], reg["celex"])
# Skip if variant already exists
skip_id = reg.get("skip_if_exists")
if skip_id:
existing = count_existing(skip_id)
if existing > 0:
logger.info(" Already exists as '%s' (%d chunks) — SKIPPING", skip_id, existing)
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
continue
# Check if this exact ID exists
existing = count_existing(reg["regulation_id"])
if existing > 0:
logger.info(" Already exists: %d chunks — SKIPPING", existing)
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
continue
# Download from EUR-Lex
logger.info(" Downloading from EUR-Lex...")
try:
html = download_eurlex(reg["celex"])
logger.info(" Downloaded: %d chars", len(html))
except Exception as e:
logger.error(" Download FAILED: %s", e)
results.append({"reg": reg["short"], "status": "download_failed", "chunks": 0})
continue
# Upload
filename = f"{reg['regulation_id']}.html"
try:
result = upload_html(html, filename, reg, args.dry_run)
chunks = result.get("chunks_count", 0)
logger.info(" Uploaded: %d chunks", chunks)
results.append({"reg": reg["short"], "status": "ok", "chunks": chunks})
except Exception as e:
logger.error(" Upload FAILED: %s", e)
results.append({"reg": reg["short"], "status": "error", "chunks": 0})
if i < len(REGULATIONS):
time.sleep(2)
# Summary
logger.info("\n" + "=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
for r in results:
logger.info(" %-20s %s chunks=%d", r["reg"], r["status"].upper(), r["chunks"])
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
logger.info("\nTotal new chunks: %d", total_new)
if __name__ == "__main__":
main()
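
The skip logic in `main()` checks an optional `skip_if_exists` alias first and then the regulation's own ID. A minimal sketch of that decision as a pure function, with the Qdrant lookup injected as a callable: `count_request` and `existing_id` are hypothetical names, and the request body mirrors the one `count_existing` posts.

```python
from typing import Callable, Optional


def count_request(regulation_id: str) -> dict:
    """Body for POST /collections/{collection}/points/count, as used above."""
    return {
        "filter": {"must": [
            {"key": "regulation_id", "match": {"value": regulation_id}}
        ]},
        "exact": True,
    }


def existing_id(reg: dict, count: Callable[[str], int]) -> Optional[str]:
    """Return the ID under which the regulation is already stored, or None."""
    for candidate in (reg.get("skip_if_exists"), reg["regulation_id"]):
        if candidate and count(candidate) > 0:
            return candidate
    return None
```

Injecting `count` keeps the decision testable with a plain dict instead of a running Qdrant instance.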
@@ -0,0 +1,303 @@
#!/usr/bin/env python3
"""
E2E Quality Report: Verify controls have correct source citations.
Loads N random controls from PostgreSQL, cross-references with Qdrant chunks,
and reports mismatches between source_citation and actual chunk metadata.
Usage:
# Against Mac Mini
python3 scripts/quality_report.py --db-host macmini --qdrant-url http://macmini:6333
# Smaller sample
python3 scripts/quality_report.py --db-host macmini --sample 100
"""
import argparse
import json
import logging
import sys
import httpx
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("quality-report")
COLLECTIONS = [
"bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
"bp_dsfa_corpus", "bp_legal_templates",
]
def load_controls(db_url: str, sample_size: int) -> list[dict]:
"""Load random controls with source_citation from PostgreSQL."""
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
with Session() as db:
rows = db.execute(text("""
SELECT id::text, control_id, title,
source_citation::text, source_original_text,
generation_metadata::text, release_state
FROM compliance.canonical_controls
WHERE source_citation IS NOT NULL
AND source_original_text IS NOT NULL
AND release_state = 'draft'
ORDER BY RANDOM()
LIMIT :n
"""), {"n": sample_size}).fetchall()
controls = []
for row in rows:
citation = json.loads(row[3]) if row[3] else {}
metadata = json.loads(row[5]) if row[5] else {}
controls.append({
"id": row[0],
"control_id": row[1],
"title": row[2],
"citation": citation,
"source_text": row[4],
"metadata": metadata,
"release_state": row[6],
})
return controls
def build_qdrant_index(qdrant_url: str) -> dict:
"""Build regulation_id → list[chunk] index from Qdrant.
Controls were generated from OLD chunks (512 chars). Qdrant now has
NEW chunks (1500 chars). Hash matching won't work — use regulation +
section matching instead.
"""
logger.info("Building Qdrant chunk index by regulation_id...")
index = {} # regulation_id → [{"section": ..., "text_snippet": ..., ...}]
client = httpx.Client(timeout=60.0)
for coll in COLLECTIONS:
offset = None
for _ in range(600):
body = {"limit": 250, "with_payload": True, "with_vector": False}
if offset:
body["offset"] = offset
r = client.post(f"{qdrant_url}/collections/{coll}/points/scroll", json=body)
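if r.status_code != 200:
# Do not fail silently: a partial index produces false REGULATION_NOT_IN_QDRANT results.
logger.warning("Scroll failed for %s: HTTP %d", coll, r.status_code)
break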
data = r.json()["result"]
for pt in data["points"]:
reg_id = pt["payload"].get("regulation_id", "")
if not reg_id:
continue
chunk = {
"section": pt["payload"].get("section", ""),
"section_title": pt["payload"].get("section_title", ""),
"paragraph": pt["payload"].get("paragraph", ""),
"text_snippet": pt["payload"].get("chunk_text", "")[:200],
"filename": pt["payload"].get("filename", ""),
"collection": coll,
}
index.setdefault(reg_id, []).append(chunk)
offset = data.get("next_page_offset")
if not offset:
break
client.close()
total = sum(len(v) for v in index.values())
logger.info("Qdrant index: %d regulations, %d chunks", len(index), total)
return index
def check_control(ctrl: dict, qdrant_index: dict) -> dict:
"""Check a single control's source_citation against Qdrant chunks.
Strategy: Find chunks by regulation_id from generation_metadata,
then check if any chunk has a matching section/article.
"""
result = {
"control_id": ctrl["control_id"],
"title": (ctrl["title"] or "")[:60],
"citation_source": ctrl["citation"].get("source", ""),
"citation_article": ctrl["citation"].get("article", ""),
"citation_paragraph": ctrl["citation"].get("paragraph", ""),
"citation_page": ctrl["citation"].get("page"),
"issues": [],
}
# Get regulation_id from generation_metadata
reg_code = ctrl["metadata"].get("source_regulation", "")
citation_article = ctrl["citation"].get("article", "")
# Check 1: Does the control have a regulation reference?
if not reg_code:
result["issues"].append("NO_REGULATION_CODE")
return result
# Check 2: Does this regulation exist in Qdrant?
chunks = qdrant_index.get(reg_code, [])
if not chunks:
result["issues"].append(f"REGULATION_NOT_IN_QDRANT: {reg_code}")
result["reg_found"] = False
return result
result["reg_found"] = True
result["reg_chunks"] = len(chunks)
# Check 3: Does the control have an article citation?
if not citation_article:
result["issues"].append("NO_ARTICLE_IN_CITATION")
# Still check if chunks have section metadata at all
has_section = any(c["section"] for c in chunks)
if has_section:
result["issues"].append("CHUNKS_HAVE_SECTIONS_BUT_CONTROL_MISSING")
return result
# Check 4: Is the cited article found in any chunk's section?
norm_article = citation_article.strip().lower()
matching_chunks = [
c for c in chunks
if c["section"] and (
norm_article == c["section"].strip().lower()
or norm_article in c["section"].strip().lower()
or c["section"].strip().lower() in norm_article
)
]
if matching_chunks:
result["article_match"] = True
result["matched_section"] = matching_chunks[0]["section"]
else:
# Check if ANY chunk has sections (the article might just not match)
sections_in_regulation = sorted(set(c["section"] for c in chunks if c["section"]))
if sections_in_regulation:
result["issues"].append(
f"ARTICLE_NOT_FOUND_IN_CHUNKS: '{citation_article}' not in {sections_in_regulation[:5]}"
)
else:
result["issues"].append("NO_SECTIONS_IN_REGULATION_CHUNKS")
# Check 5: Does source_original_text contain the cited article?
source_text = ctrl["source_text"] or ""
if citation_article and source_text:
if citation_article.lower() not in source_text.lower():
if f"[{citation_article}" not in source_text:
result["issues"].append("ARTICLE_NOT_IN_SOURCE_TEXT")
if not result["issues"]:
result["issues"] = ["OK"]
return result
def generate_report(results: list[dict]):
    """Print the quality report."""
    total = len(results)

    def has_issue(r: dict, code: str) -> bool:
        # check_control() emits either "CODE" or "CODE: details".
        return any(i == code or i.startswith(code + ":") for i in r["issues"])

    def pct(n: int) -> int:
        return n * 100 // max(total, 1)

    ok = sum(1 for r in results if r["issues"] == ["OK"])
    reg_found = sum(1 for r in results if r.get("reg_found", False))
    no_reg = sum(1 for r in results if has_issue(r, "REGULATION_NOT_IN_QDRANT"))
    no_article = sum(1 for r in results if has_issue(r, "NO_ARTICLE_IN_CITATION"))
    no_section = sum(1 for r in results if has_issue(r, "NO_SECTIONS_IN_REGULATION_CHUNKS"))
    mismatch = sum(1 for r in results if has_issue(r, "ARTICLE_NOT_FOUND_IN_CHUNKS"))
    not_in_text = sum(1 for r in results if has_issue(r, "ARTICLE_NOT_IN_SOURCE_TEXT"))

    print("\n" + "=" * 100)
    print("QUALITY REPORT: CONTROL SOURCE CITATION VERIFICATION")
    print("=" * 100)
    print(f"\nSample: {total} controls")
    print(f"\n{'Metric':<45} {'Count':>8} {'Share':>8}")
    print("-" * 65)
    for label, n in [
        ("OK (no issues)", ok),
        ("Regulation found in Qdrant", reg_found),
        ("Regulation NOT in Qdrant", no_reg),
        ("No article in source_citation", no_article),
        ("No sections in regulation chunks", no_section),
        ("Cited article not found in chunks", mismatch),
        ("Article not in source text", not_in_text),
    ]:
        print(f"{label:<45} {n:>8} {pct(n):>7}%")

    # Show sample article mismatches
    mismatches = [r for r in results if has_issue(r, "ARTICLE_NOT_FOUND_IN_CHUNKS")]
    if mismatches:
        print("\n=== ARTICLE NOT FOUND IN CHUNKS (first 10) ===\n")
        for r in mismatches[:10]:
            issues = [i for i in r["issues"] if i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS")]
            print(f"  {r['control_id']:20s} {r['title'][:40]:40s}")
            for i in issues:
                print(f"    {i}")

    # Show sample regulations missing from Qdrant
    not_found = [r for r in results if has_issue(r, "REGULATION_NOT_IN_QDRANT")]
    if not_found:
        print("\n=== REGULATION NOT IN QDRANT (first 10) ===\n")
        for r in not_found[:10]:
            src = r.get("citation_source", "?")
            art = r.get("citation_article", "?")
            print(f"  {r['control_id']:20s} {src[:25]:25s} {art}")

    # Distribution by source
    print("\n=== BY SOURCE ===\n")
    source_stats = {}
    for r in results:
        src = r.get("citation_source", "?")[:30]
        if src not in source_stats:
            source_stats[src] = {"total": 0, "ok": 0, "no_reg": 0, "no_section": 0}
        source_stats[src]["total"] += 1
        if r["issues"] == ["OK"]:
            source_stats[src]["ok"] += 1
        if has_issue(r, "REGULATION_NOT_IN_QDRANT"):
            source_stats[src]["no_reg"] += 1
        if has_issue(r, "NO_SECTIONS_IN_REGULATION_CHUNKS"):
            source_stats[src]["no_section"] += 1
    print(f"  {'Source':<32} {'Total':>6} {'OK':>6} {'OK%':>6} {'NoReg':>8} {'NoSect':>8}")
    print(f"  {'-'*72}")
    for src in sorted(source_stats.keys(), key=lambda s: -source_stats[s]["total"]):
        s = source_stats[src]
        src_pct = s["ok"] * 100 // max(s["total"], 1)
        print(f"  {src:<32} {s['total']:>6} {s['ok']:>6} {src_pct:>5}% {s['no_reg']:>8} {s['no_section']:>8}")

    print(f"\n{'='*100}")
    verdict = "PASS" if pct(ok) >= 50 else "NEEDS IMPROVEMENT"
    print(f"RESULT: {verdict}: {ok}/{total} controls ({pct(ok)}%) fully correct")
    print(f"{'='*100}")
def main():
parser = argparse.ArgumentParser(description="Control Source Citation Quality Report")
parser.add_argument("--db-host", default="macmini")
parser.add_argument("--db-port", type=int, default=5432)
parser.add_argument("--db-name", default="breakpilot_db")
parser.add_argument("--db-user", default="breakpilot")
parser.add_argument("--db-pass", default="breakpilot123")
parser.add_argument("--qdrant-url", default="http://macmini:6333")
parser.add_argument("--sample", type=int, default=500)
args = parser.parse_args()
db_url = f"postgresql://{args.db_user}:{args.db_pass}@{args.db_host}:{args.db_port}/{args.db_name}"
# Load controls
logger.info("Loading %d random controls from DB...", args.sample)
controls = load_controls(db_url, args.sample)
logger.info("Loaded %d controls with source_citation", len(controls))
if not controls:
print("ERROR: No controls found with source_citation")
sys.exit(1)
# Build Qdrant index
qdrant_index = build_qdrant_index(args.qdrant_url)
# Check each control
logger.info("Checking %d controls against Qdrant...", len(controls))
results = []
for ctrl in controls:
result = check_control(ctrl, qdrant_index)
results.append(result)
# Report
generate_report(results)
if __name__ == "__main__":
main()
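
Check 4 in `check_control` accepts a match when either label is a substring of the other, so a citation of "Art. 1" would also match a chunk section "Art. 12". One possible stricter comparison is sketched below, assuming section labels follow the "Art. N" / "Artikel N" / "§ N" patterns; `section_key` and `same_section` are hypothetical helpers, and the regular expression is an assumption about the label format, not something the pipeline defines.

```python
import re
from typing import Optional, Tuple

# Matches "Art. 12", "Artikel 12", "Article 12", "§ 3a", "§3a" (assumed label formats).
_SECTION = re.compile(r"(\bart(?:ikel|icle)?\.?|§+)\s*(\d+[a-z]?)\b", re.IGNORECASE)


def section_key(label: str) -> Optional[Tuple[str, str]]:
    """Reduce a label to (kind, number), e.g. ("art", "12"), or None if unparseable."""
    m = _SECTION.search(label)
    if not m:
        return None
    kind = "§" if m.group(1).startswith("§") else "art"
    return kind, m.group(2).lower()


def same_section(cited: str, chunk_section: str) -> bool:
    """True only if both labels name the same article or paragraph number."""
    a, b = section_key(cited), section_key(chunk_section)
    return a is not None and a == b
```

Comparing the parsed number instead of raw substrings trades some recall (labels that do not parse never match) for fewer false positives.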
@@ -0,0 +1,486 @@
#!/usr/bin/env python3
"""
D5 Re-Ingestion: Re-chunk all ~297 legal sources with structural metadata.
Usage:
# Dry-run: build manifest, no changes
python3 scripts/reingest_d5.py --dry-run
# Re-ingest one collection (test)
python3 scripts/reingest_d5.py --collection bp_compliance_gesetze
# Re-ingest all collections (resume-capable)
python3 scripts/reingest_d5.py --resume
# Custom URLs
python3 scripts/reingest_d5.py --rag-url https://macmini:8097 --qdrant-url http://macmini:6333
"""
import argparse
import json
import logging
import random
import sys
import time
from datetime import datetime, timezone
import httpx
from reingest_d5_config import (
CHUNK_OVERLAP,
CHUNK_SIZE,
CHUNK_STRATEGY,
DEFAULT_QDRANT_URL,
DEFAULT_RAG_URL,
MANIFEST_FILE,
TARGET_COLLECTIONS,
content_type_from_filename,
doc_key,
extract_doc_metadata,
load_progress,
save_progress,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("d5-reingest")
UPLOAD_TIMEOUT = httpx.Timeout(timeout=3600.0, connect=30.0)
SCROLL_TIMEOUT = httpx.Timeout(timeout=60.0, connect=10.0)
# ---------------------------------------------------------------------------
# Phase 0: Preflight
# ---------------------------------------------------------------------------
def preflight_checks(rag_url: str, qdrant_url: str) -> dict:
"""Verify services are reachable and record baseline chunk counts."""
logger.info("Phase 0: Preflight checks...")
with httpx.Client(timeout=10.0, verify=False) as c:
r = c.get(f"{rag_url}/health")
r.raise_for_status()
logger.info(" RAG service: OK")
with httpx.Client(timeout=10.0) as c:
r = c.get(f"{qdrant_url}/collections")
r.raise_for_status()
logger.info(" Qdrant: OK")
before_counts = {}
with httpx.Client(timeout=10.0) as c:
for coll in TARGET_COLLECTIONS:
try:
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
json={"exact": True})
r.raise_for_status()
count = r.json()["result"]["count"]
except Exception:
count = 0
before_counts[coll] = count
logger.info(" %s: %d chunks", coll, count)
return before_counts
# ---------------------------------------------------------------------------
# Phase 1: Build manifest
# ---------------------------------------------------------------------------
def build_manifest(qdrant_url: str, collections: list[str]) -> list[dict]:
"""Scroll Qdrant and build a deduplicated document manifest."""
logger.info("Phase 1: Building document manifest...")
documents: dict[str, dict] = {} # keyed by doc_key(object_name, collection)
with httpx.Client(timeout=SCROLL_TIMEOUT) as client:
for coll in collections:
logger.info(" Scrolling %s...", coll)
offset = None
points_seen = 0
while True:
body: dict = {
"limit": 250,
"with_payload": True,
"with_vector": False,
}
if offset:
body["offset"] = offset
resp = client.post(
f"{qdrant_url}/collections/{coll}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
points = data["points"]
for pt in points:
payload = pt.get("payload", {})
obj_name = payload.get("object_name", "")
if not obj_name:
continue
key = doc_key(obj_name, coll)
if key not in documents:
meta = extract_doc_metadata(payload)
documents[key] = {
"object_name": obj_name,
"collection": coll,
"filename": payload.get("filename", obj_name.split("/")[-1]),
"form": meta["form"],
"extra_metadata": meta["extra"],
"old_chunk_count": 0,
}
documents[key]["old_chunk_count"] += 1
points_seen += len(points)
offset = data.get("next_page_offset")
if not offset:
break
logger.info(" %d points → %d unique docs",
points_seen,
sum(1 for d in documents.values() if d["collection"] == coll))
manifest = list(documents.values())
logger.info(" Total: %d unique documents across %d collections",
len(manifest), len(collections))
return manifest
# ---------------------------------------------------------------------------
# Phase 2: Per-document re-ingestion
# ---------------------------------------------------------------------------
def download_file(rag_url: str, object_name: str) -> bytes:
"""Download file bytes via MinIO presigned URL."""
with httpx.Client(timeout=60.0, verify=False) as c:
resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
resp.raise_for_status()
presigned_url = resp.json()["url"]
with httpx.Client(timeout=120.0, verify=False) as c:
resp = c.get(presigned_url)
resp.raise_for_status()
return resp.content
def delete_old_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
"""Delete all chunks for a document from Qdrant. Always returns 0 because the delete call does not report a count."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/delete",
json={
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
}
},
)
resp.raise_for_status()
return 0 # Qdrant delete doesn't return count
def _delete_old_chunks_safe(
qdrant_url: str, collection: str, object_name: str, keep_doc_id: str,
) -> None:
"""Delete old chunks for a document, keeping chunks with keep_doc_id."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/delete",
json={
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}],
"must_not": [{
"key": "document_id",
"match": {"value": keep_doc_id},
}],
}
},
)
resp.raise_for_status()
def reupload_document(
rag_url: str,
file_bytes: bytes,
filename: str,
collection: str,
form_fields: dict,
extra_metadata: dict,
) -> dict:
"""Upload document to RAG service with new chunking parameters."""
ct = content_type_from_filename(filename)
form_data = {
"collection": collection,
"data_type": form_fields.get("data_type", "compliance"),
"bundesland": form_fields.get("bundesland", "bund"),
"use_case": form_fields.get("use_case", "compliance"),
"year": form_fields.get("year", "2026"),
"chunk_strategy": CHUNK_STRATEGY,
"chunk_size": str(CHUNK_SIZE),
"chunk_overlap": str(CHUNK_OVERLAP),
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{rag_url}/api/v1/documents/upload",
files={"file": (filename, file_bytes, ct)},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def process_document(
    doc: dict,
    rag_url: str,
    qdrant_url: str,
    progress: dict,
    max_retries: int = 2,
) -> bool:
    """Process a single document: download → upload → check → delete old.

    Safe order: new chunks are created FIRST, old chunks are deleted only after
    the upload reported chunks and a document_id (upload-before-delete pattern).
    """
    key = doc_key(doc["object_name"], doc["collection"])
    # Skip if already done
    if progress.get("documents", {}).get(key, {}).get("status") == "done":
        return True
    for attempt in range(max_retries + 1):
        try:
            # 1. Download
            file_bytes = download_file(rag_url, doc["object_name"])
            if not file_bytes:
                logger.warning(" Empty file: %s — skipping", doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "skipped", "reason": "empty_file"}
                return False
            # 2. Upload FIRST (creates new chunks alongside old ones)
            result = reupload_document(
                rag_url, file_bytes, doc["filename"],
                doc["collection"], doc["form"], doc["extra_metadata"],
            )
            new_chunks = result.get("chunks_count", 0)
            new_doc_id = result.get("document_id", "")
            if new_chunks == 0:
                logger.error(" Upload produced 0 chunks — keeping old data: %s",
                             doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": "0 new chunks"}
                return False
            if not new_doc_id:
                # With an empty document_id the must_not clause protects nothing,
                # so the delete would also remove the freshly uploaded chunks.
                logger.error(" Upload returned no document_id, keeping old data: %s",
                             doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": "missing document_id"}
                return False
            # 3. Delete OLD chunks only (exclude the new document_id)
            _delete_old_chunks_safe(
                qdrant_url, doc["collection"],
                doc["object_name"], new_doc_id,
            )
            # 4. Record success
            progress.setdefault("documents", {})[key] = {
                "status": "done",
                "old_chunks": doc["old_chunk_count"],
                "new_chunks": new_chunks,
                "new_document_id": new_doc_id,
                "completed_at": datetime.now(timezone.utc).isoformat(),
            }
            return True
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 404:
                logger.warning(" File not in MinIO (404): %s — skipping", doc["object_name"])
                progress.setdefault("documents", {})[key] = {
                    "status": "skipped", "reason": "not_in_minio"}
                return False
            if attempt < max_retries:
                wait = 5 * (attempt + 1)
                logger.warning(" HTTP %d on attempt %d, retrying in %ds...",
                               e.response.status_code, attempt + 1, wait)
                time.sleep(wait)
            else:
                logger.error(" FAILED after %d retries: %s", max_retries, e)
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": str(e), "retries": max_retries}
                return False
        except Exception as e:
            if attempt < max_retries:
                wait = 10 * (attempt + 1)
                logger.warning(" Error on attempt %d: %s — retrying in %ds",
                               attempt + 1, e, wait)
                time.sleep(wait)
            else:
                logger.error(" FAILED after %d retries: %s", max_retries, e)
                progress.setdefault("documents", {})[key] = {
                    "status": "error", "error": str(e), "retries": max_retries}
                return False
    return False
# ---------------------------------------------------------------------------
# Phase 3: Verification
# ---------------------------------------------------------------------------
def verify_results(
qdrant_url: str,
before_counts: dict,
collections: list[str],
manifest: list[dict],
):
"""Compare before/after counts and spot-check metadata."""
logger.info("Phase 3: Verification...")
print("\n" + "=" * 65)
print("D5 RE-INGESTION VERIFICATION REPORT")
print("=" * 65)
after_counts = {}
with httpx.Client(timeout=10.0) as c:
for coll in collections:
try:
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
json={"exact": True})
r.raise_for_status()
after_counts[coll] = r.json()["result"]["count"]
except Exception:
after_counts[coll] = -1
print(f"\n{'Collection':<35} {'Before':>8} {'After':>8} {'Delta':>8}")
print("-" * 65)
for coll in collections:
before = before_counts.get(coll, 0)
after = after_counts.get(coll, -1)
delta = after - before if after >= 0 else "?"
print(f"{coll:<35} {before:>8} {after:>8} {str(delta):>8}")
# Spot-check: pick 3 random docs and verify metadata
print("\nSpot-check (3 random docs):")
sample = random.sample(manifest, min(3, len(manifest)))
with httpx.Client(timeout=30.0) as c:
for doc in sample:
resp = c.post(
f"{qdrant_url}/collections/{doc['collection']}/points/scroll",
json={
"limit": 3,
"with_payload": True,
"with_vector": False,
"filter": {
"must": [{
"key": "object_name",
"match": {"value": doc["object_name"]},
}]
},
},
)
if resp.status_code != 200:
print(f" {doc['object_name']}: QUERY FAILED")
continue
points = resp.json()["result"]["points"]
if not points:
print(f" {doc['object_name']}: NO CHUNKS FOUND")
continue
has_section = sum(1 for p in points if p["payload"].get("section"))
has_para = sum(1 for p in points if p["payload"].get("paragraph"))
print(f" {doc['filename'][:40]:<42} "
f"chunks={len(points):>3} "
f"with_section={has_section}/{len(points)} "
f"with_para={has_para}/{len(points)}")
print()
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="D5 Re-Ingestion Script")
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
parser.add_argument("--dry-run", action="store_true",
help="Build manifest only, no changes")
parser.add_argument("--collection", default=None,
help="Process only this collection")
parser.add_argument("--resume", action="store_true",
help="Resume from progress file")
args = parser.parse_args()
collections = [args.collection] if args.collection else TARGET_COLLECTIONS
# Phase 0
before_counts = preflight_checks(args.rag_url, args.qdrant_url)
# Phase 1
manifest = build_manifest(args.qdrant_url, collections)
# Save manifest for inspection
with open(MANIFEST_FILE, "w", encoding="utf-8") as f:
json.dump(manifest, f, indent=2, ensure_ascii=False)
logger.info("Manifest saved to %s", MANIFEST_FILE)
if args.dry_run:
print(f"\nDRY RUN: {len(manifest)} documents found. See {MANIFEST_FILE}")
for doc in manifest[:10]:
reg = doc["extra_metadata"].get("regulation_code", "?")
print(f" {reg:<30} {doc['collection']:<35} chunks={doc['old_chunk_count']}")
if len(manifest) > 10:
print(f" ... and {len(manifest) - 10} more")
sys.exit(0)
# Phase 2
progress = load_progress() if args.resume else {"documents": {}}
progress["started_at"] = datetime.now(timezone.utc).isoformat()
progress["before_counts"] = before_counts
done = 0
skipped = 0
failed = 0
for i, doc in enumerate(manifest, 1):
key = doc_key(doc["object_name"], doc["collection"])
reg = doc["extra_metadata"].get("regulation_code", "?")
if progress.get("documents", {}).get(key, {}).get("status") == "done":
done += 1
continue
logger.info("[%d/%d] %s (%s) — %d old chunks",
i, len(manifest), reg, doc["collection"], doc["old_chunk_count"])
ok = process_document(doc, args.rag_url, args.qdrant_url, progress)
if ok:
done += 1
new_chunks = progress["documents"][key].get("new_chunks", "?")
logger.info(" OK: %d old → %s new chunks", doc["old_chunk_count"], new_chunks)
elif progress["documents"][key].get("status") == "skipped":
skipped += 1
else:
failed += 1
save_progress(progress)
time.sleep(2)
logger.info("Phase 2 complete: %d done, %d skipped, %d failed", done, skipped, failed)
# Phase 3
verify_results(args.qdrant_url, before_counts, collections, manifest)
print(f"Summary: {done} done, {skipped} skipped, {failed} failed")
if failed:
print(f"Re-run with --resume to retry {failed} failed documents")
sys.exit(1)
if __name__ == "__main__":
main()
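Step 3 of `process_document` above passes `new_doc_id` to `_delete_old_chunks_safe`, whose body is not visible in this part of the diff. Below is a minimal sketch of the kind of Qdrant filter such a helper might send, assuming chunks carry `object_name` and `document_id` payload fields as the surrounding code suggests. `build_old_chunk_filter` is a hypothetical name, not the author's helper. The guard matters because `result.get("document_id", "")` can return an empty string: a `must_not` clause on an empty ID excludes nothing, so the delete would also remove the chunks that were just uploaded.

```python
def build_old_chunk_filter(object_name: str, new_document_id: str) -> dict:
    """Build a Qdrant filter selecting every chunk of `object_name`
    except those written by the upload identified by `new_document_id`.

    Hypothetical helper for illustration; not part of the scripts above.
    """
    if not new_document_id:
        # Without an ID to exclude, the filter would match the new chunks too.
        raise ValueError("new_document_id is empty; refusing to build delete filter")
    return {
        "must": [
            {"key": "object_name", "match": {"value": object_name}},
        ],
        "must_not": [
            {"key": "document_id", "match": {"value": new_document_id}},
        ],
    }


if __name__ == "__main__":
    import json

    # The object name and document ID here are made-up sample values.
    body = {"filter": build_old_chunk_filter("docs/example.pdf", "doc-new")}
    print(json.dumps(body, indent=2))
```

The resulting body has the same shape as the `points/delete` requests used elsewhere in these scripts, with one extra `must_not` clause.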
@@ -0,0 +1,92 @@
"""D5 Re-Ingestion: Constants, helpers, progress tracking."""
import json
import logging
import os
logger = logging.getLogger("d5-reingest")
# ---------------------------------------------------------------------------
# Defaults (overridable via CLI args)
# ---------------------------------------------------------------------------
DEFAULT_RAG_URL = "https://macmini:8097"
DEFAULT_QDRANT_URL = "http://macmini:6333"
TARGET_COLLECTIONS = [
"bp_compliance_ce",
"bp_compliance_gesetze",
"bp_compliance_datenschutz",
"bp_dsfa_corpus",
"bp_legal_templates",
"bp_compliance_schulrecht",
]
# New chunking parameters (D1-D4 validated)
CHUNK_STRATEGY = "recursive"
CHUNK_SIZE = 1500
CHUNK_OVERLAP = 100
PROGRESS_FILE = "d5_reingest_progress.json"
MANIFEST_FILE = "d5_manifest.json"
# Per-chunk fields (NOT carried as extra metadata during re-upload)
PER_CHUNK_FIELDS = frozenset({
"chunk_text", "chunk_index", "document_id", "object_name",
"filename", "data_type", "bundesland", "use_case", "year",
"section", "section_title", "paragraph", "paragraph_num", "page",
})
# Upload form fields that come from the payload (not metadata_json)
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})
# ---------------------------------------------------------------------------
# Progress tracking
# ---------------------------------------------------------------------------
def load_progress(path: str = PROGRESS_FILE) -> dict:
if os.path.exists(path):
with open(path, encoding="utf-8") as f:
return json.load(f)
return {"documents": {}}
def save_progress(data: dict, path: str = PROGRESS_FILE):
with open(path, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False, default=str)
# ---------------------------------------------------------------------------
# Metadata extraction
# ---------------------------------------------------------------------------
def extract_doc_metadata(payload: dict) -> dict:
    """Split Qdrant payload into form fields + extra metadata.

    Returns: {"form": {data_type, bundesland, ...}, "extra": {regulation_code, ...}}
    """
    form = {}
    extra = {}
    for k, v in payload.items():
        # FORM_FIELDS is a subset of PER_CHUNK_FIELDS, so it must be checked
        # first; otherwise every form field is skipped and "form" stays empty.
        if k in FORM_FIELDS:
            form[k] = v
        elif k in PER_CHUNK_FIELDS:
            continue
        else:
            extra[k] = v
    return {"form": form, "extra": extra}
def doc_key(object_name: str, collection: str) -> str:
"""Unique key for a document in the progress file."""
return f"{object_name}|{collection}"
def content_type_from_filename(filename: str) -> str:
"""Infer MIME type from file extension."""
ext = os.path.splitext(filename)[1].lower()
return {
".pdf": "application/pdf",
".html": "text/html",
".htm": "text/html",
".md": "text/markdown",
".txt": "text/plain",
}.get(ext, "application/octet-stream")
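`save_progress` above rewrites the progress file in place after every document, so an interruption in the middle of a write could leave a truncated JSON file that `--resume` can no longer load. One possible hardening, shown here only as a sketch, is to write to a temporary file and rename it over the target. `save_progress_atomic` is a hypothetical name and is not used by the scripts.

```python
import json
import os
import tempfile


def save_progress_atomic(data: dict, path: str) -> None:
    """Write JSON to a temp file next to `path`, then rename it into place.

    Hypothetical alternative to save_progress; same JSON options as above.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False, default=str)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on POSIX when both paths are on one filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Readers of the file then see either the previous complete state or the new one, never a partial write.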
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""Safe re-ingestion of NIST/BSI/ENISA PDFs from MinIO.
Uses upload-before-delete pattern: new chunks are created FIRST,
old chunks are only deleted after successful verification.
Usage:
python3 control-pipeline/scripts/reingest_nist.py [--dry-run]
python3 control-pipeline/scripts/reingest_nist.py --only-missing
"""
import argparse
import json
import logging
import sys
import time
import httpx
sys.path.insert(0, "control-pipeline/scripts")
from reingest_d5_config import ( # noqa: E402
CHUNK_OVERLAP,
CHUNK_SIZE,
CHUNK_STRATEGY,
DEFAULT_QDRANT_URL,
DEFAULT_RAG_URL,
content_type_from_filename,
)
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("reingest-nist")
UPLOAD_TIMEOUT = 1800.0 # 30 min for large PDFs
# -------------------------------------------------------------------
# Documents to re-ingest
# -------------------------------------------------------------------
# 4 documents with 0 chunks (deleted by D5, upload failed)
MISSING_DOCS = [
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_53r5.pdf",
"extra_metadata": {
"regulation_id": "nist_sp800_53r5",
"source_id": "nist",
"doc_type": "controls_catalog",
"guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_82r3.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_82r3",
"regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_short": "NIST SP 800-82",
"category": "ot_security",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_160v1r1.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_160v1r1",
"regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_short": "NIST SP 800-160",
"category": "security_engineering",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_207.pdf",
"extra_metadata": {
"regulation_id": "nist_sp800_207",
"source_id": "nist",
"doc_type": "architecture",
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
]
# Additional NIST/BSI/ENISA docs with <10% section rate (re-ingest for quality)
LOW_QUALITY_DOCS = [
{
"object_name": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "nist_csf_2_0.pdf",
"extra_metadata": {
"regulation_id": "nist_csf_2_0",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nistir_8259a.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "nistir_8259a.pdf",
"extra_metadata": {
"regulation_id": "nistir_8259a",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "nist_ai_rmf.pdf",
"extra_metadata": {
"regulation_id": "nist_ai_rmf",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_30r1.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_30r1",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
"collection": "bp_compliance_ce",
"filename": "enisa_supply_chain_good_practices.pdf",
"extra_metadata": {
"regulation_id": "enisa_supply_chain_good_practices",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"object_name": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
"collection": "bp_compliance_ce",
"filename": "enisa_ics_scada.pdf",
"extra_metadata": {
"regulation_id": "enisa_ics_scada_dependencies",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"object_name": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
"collection": "bp_compliance_ce",
"filename": "enisa_supply_chain_security.pdf",
"extra_metadata": {
"regulation_id": "enisa_threat_landscape_supply_chain",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"object_name": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
"collection": "bp_compliance_ce",
"filename": "cisa_secure_by_design.pdf",
"extra_metadata": {
"regulation_id": "cisa_secure_by_design",
"license": "public_domain_us",
"source": "cisa.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
"collection": "bp_compliance_ce",
"filename": "cvss_v4_0.pdf",
"extra_metadata": {
"regulation_id": "cvss_v4_0",
"license": "public_domain_us",
"source": "first.org",
},
},
]
# -------------------------------------------------------------------
# Qdrant helpers
# -------------------------------------------------------------------
def count_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
"""Count existing chunks for a document in Qdrant."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/count",
json={
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def get_old_document_ids(
qdrant_url: str, collection: str, object_name: str,
) -> set:
"""Get all document_ids for existing chunks of this document."""
doc_ids = set()
offset = None
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
},
"limit": 100,
"with_payload": ["document_id"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
did = pt.get("payload", {}).get("document_id")
if did:
doc_ids.add(did)
offset = data.get("next_page_offset")
if offset is None:
break
return doc_ids
def delete_by_document_ids(
    qdrant_url: str, collection: str, doc_ids: set,
) -> None:
    """Delete chunks matching specific document_ids."""
    # Reuse one client for all deletes instead of opening one per document_id.
    with httpx.Client(timeout=30.0) as c:
        for did in doc_ids:
            c.post(
                f"{qdrant_url}/collections/{collection}/points/delete",
                json={
                    "filter": {
                        "must": [{
                            "key": "document_id",
                            "match": {"value": did},
                        }]
                    }
                },
            ).raise_for_status()
def check_section_rate(
qdrant_url: str, collection: str, object_name: str,
) -> tuple:
"""Check section rate for a document's chunks. Returns (total, with_section)."""
total = 0
with_section = 0
offset = None
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
},
"limit": 100,
"with_payload": ["section"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
total += 1
sec = pt.get("payload", {}).get("section", "")
if sec and sec.strip():
with_section += 1
offset = data.get("next_page_offset")
if offset is None:
break
return total, with_section
# -------------------------------------------------------------------
# Upload
# -------------------------------------------------------------------
def download_from_minio(rag_url: str, object_name: str) -> bytes:
"""Download file from MinIO via RAG service presigned URL."""
with httpx.Client(timeout=60.0, verify=False) as c:
resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
resp.raise_for_status()
presigned_url = resp.json()["url"]
with httpx.Client(timeout=300.0, verify=False) as c:
resp = c.get(presigned_url)
resp.raise_for_status()
return resp.content
def upload_document(
rag_url: str,
file_bytes: bytes,
filename: str,
collection: str,
extra_metadata: dict,
) -> dict:
"""Upload document to RAG service."""
ct = content_type_from_filename(filename)
form_data = {
"collection": collection,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": CHUNK_STRATEGY,
"chunk_size": str(CHUNK_SIZE),
"chunk_overlap": str(CHUNK_OVERLAP),
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{rag_url}/api/v1/documents/upload",
files={"file": (filename, file_bytes, ct)},
data=form_data,
)
resp.raise_for_status()
return resp.json()
# -------------------------------------------------------------------
# Main processing
# -------------------------------------------------------------------
def process_document(
doc: dict,
rag_url: str,
qdrant_url: str,
dry_run: bool = False,
) -> dict:
"""Safe re-ingest: upload first, then delete old. Returns result dict."""
obj = doc["object_name"]
coll = doc["collection"]
fname = doc["filename"]
# 1. Check existing state
old_count = count_chunks(qdrant_url, coll, obj)
old_doc_ids = get_old_document_ids(qdrant_url, coll, obj) if old_count > 0 else set()
logger.info(" [%s] existing: %d chunks, %d document_ids",
fname, old_count, len(old_doc_ids))
if dry_run:
logger.info(" [%s] DRY RUN — would download + upload + delete old", fname)
return {"status": "dry_run", "old_chunks": old_count}
# 2. Download from MinIO
logger.info(" [%s] downloading from MinIO...", fname)
file_bytes = download_from_minio(rag_url, obj)
size_mb = len(file_bytes) / (1024 * 1024)
logger.info(" [%s] downloaded %.1f MB", fname, size_mb)
# 3. Upload FIRST (creates new chunks)
logger.info(" [%s] uploading to RAG service...", fname)
result = upload_document(rag_url, file_bytes, fname, coll, doc["extra_metadata"])
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
logger.info(" [%s] uploaded: %d new chunks (doc_id=%s)", fname, new_chunks, new_doc_id)
# 4. Verify new chunks exist
if new_chunks == 0:
logger.error(" [%s] UPLOAD PRODUCED 0 CHUNKS — keeping old data!", fname)
return {"status": "error", "error": "0 new chunks", "old_chunks": old_count}
# 5. Delete old chunks (only if there were any)
if old_doc_ids:
logger.info(" [%s] deleting %d old document_ids...", fname, len(old_doc_ids))
delete_by_document_ids(qdrant_url, coll, old_doc_ids)
logger.info(" [%s] old chunks deleted", fname)
# 6. Check section rate
total, with_sec = check_section_rate(qdrant_url, coll, obj)
pct = (with_sec / total * 100) if total > 0 else 0
logger.info(" [%s] section rate: %d/%d (%.0f%%)", fname, with_sec, total, pct)
return {
"status": "ok",
"old_chunks": old_count,
"new_chunks": new_chunks,
"new_document_id": new_doc_id,
"section_rate": round(pct, 1),
}
def main():
parser = argparse.ArgumentParser(description="Safe NIST/BSI/ENISA re-ingestion")
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
parser.add_argument("--only-missing", action="store_true",
help="Only re-ingest the 4 missing docs (skip low-quality)")
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
args = parser.parse_args()
docs = list(MISSING_DOCS)
if not args.only_missing:
docs.extend(LOW_QUALITY_DOCS)
logger.info("=" * 60)
logger.info("NIST/BSI/ENISA Safe Re-Ingestion")
logger.info(" Documents: %d (%d missing + %d low-quality)",
len(docs), len(MISSING_DOCS),
0 if args.only_missing else len(LOW_QUALITY_DOCS))
logger.info(" RAG: %s", args.rag_url)
logger.info(" Qdrant: %s", args.qdrant_url)
logger.info(" Dry run: %s", args.dry_run)
logger.info("=" * 60)
results = {}
ok = 0
errors = 0
for i, doc in enumerate(docs, 1):
logger.info("[%d/%d] %s → %s", i, len(docs), doc["filename"], doc["collection"])
try:
r = process_document(doc, args.rag_url, args.qdrant_url, args.dry_run)
results[doc["filename"]] = r
if r["status"] == "ok":
ok += 1
elif r["status"] == "error":
errors += 1
except Exception as e:
logger.error(" FAILED: %s", e)
results[doc["filename"]] = {"status": "error", "error": str(e)}
errors += 1
if i < len(docs):
time.sleep(2)
# Summary
logger.info("")
logger.info("=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
for fname, r in results.items():
status = r["status"].upper()
old = r.get("old_chunks", "?")
new = r.get("new_chunks", "?")
sec = r.get("section_rate", "?")
logger.info(" %-40s %s old=%s new=%s sect=%.0f%%",
fname, status, old, new, sec if isinstance(sec, float) else 0)
logger.info("")
logger.info("OK: %d, Errors: %d, Total: %d", ok, errors, len(docs))
if errors > 0:
sys.exit(1)
if __name__ == "__main__":
main()
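`get_old_document_ids` and `check_section_rate` above both repeat the same scroll loop: request a page, consume its points, follow `next_page_offset` until it is `None`. The sketch below factors that loop into a generator and computes the section rate on top of it. `iter_points`, `section_rate`, and the `fetch_page` callable are hypothetical names introduced for illustration. The page shape (`points`, `next_page_offset`) is taken from how the code above reads the scroll response.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_points(fetch_page: Callable[[Optional[object]], dict]) -> Iterator[dict]:
    """Yield every point, following next_page_offset until it is None.

    `fetch_page(offset)` must return the "result" object of one scroll call.
    """
    offset = None
    while True:
        result = fetch_page(offset)
        yield from result["points"]
        offset = result.get("next_page_offset")
        if offset is None:
            return


def section_rate(points: Iterable[dict]) -> Tuple[int, int]:
    """Return (total, with_section), counting non-blank `section` values."""
    total = 0
    with_section = 0
    for pt in points:
        total += 1
        section = (pt.get("payload") or {}).get("section") or ""
        if str(section).strip():
            with_section += 1
    return total, with_section


if __name__ == "__main__":
    # Two fake pages standing in for Qdrant scroll responses.
    pages = {
        None: {
            "points": [{"payload": {"section": "AC-2"}}, {"payload": {"section": " "}}],
            "next_page_offset": 2,
        },
        2: {"points": [{"payload": {}}], "next_page_offset": None},
    }
    print(section_rate(iter_points(lambda off: pages[off])))
```

In the scripts, `fetch_page` would wrap the POST to `/points/scroll` and return `resp.json()["result"]`. Keeping the network call behind a callable makes the counting logic testable without a running Qdrant.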
@@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
Replace EU regulation PDFs with clean HTML from EUR-Lex.
Downloads HTML versions of EU regulations (using CELEX numbers),
deletes old PDF chunks from Qdrant, uploads HTML via RAG service.
Usage:
python3 scripts/replace_eu_pdfs_with_html.py --dry-run
python3 scripts/replace_eu_pdfs_with_html.py
python3 scripts/replace_eu_pdfs_with_html.py --celex 32016R0679 # single doc
"""
import argparse
import json
import logging
import time
import httpx
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("eurlex-replace")
DEFAULT_RAG_URL = "https://macmini:8097"
DEFAULT_QDRANT_URL = "http://macmini:6333"
EURLEX_HTML_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
# EU regulations with CELEX numbers and their current collection + metadata
EU_REGULATIONS = [
{"celex": "32024R1689", "reg_id": "ai_act_2024", "name": "AI Act", "coll": "bp_compliance_ce"},
{"celex": "32024R2847", "reg_id": "cra_2024", "name": "Cyber Resilience Act", "coll": "bp_compliance_ce"},
{"celex": "32022L2555", "reg_id": "nis2_2022", "name": "NIS2-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32016R0679", "reg_id": "dsgvo_2016", "name": "DSGVO", "coll": "bp_compliance_ce"},
{"celex": "32024R1624", "reg_id": "amlr_2024", "name": "Anti-Geldwaesche-VO", "coll": "bp_compliance_ce"},
{"celex": "32017R0745", "reg_id": "eu_mdr_2017", "name": "Medical Device Regulation", "coll": "bp_compliance_ce"},
{"celex": "32022R2065", "reg_id": "dsa_2022", "name": "Digital Services Act", "coll": "bp_compliance_ce"},
{"celex": "32022R1925", "reg_id": "dma_2022", "name": "Digital Markets Act", "coll": "bp_compliance_ce"},
{"celex": "32022R2554", "reg_id": "dora_2022", "name": "DORA", "coll": "bp_compliance_ce"},
{"celex": "32022R0868", "reg_id": "dga_2022", "name": "Data Governance Act", "coll": "bp_compliance_ce"},
{"celex": "32023R2854", "reg_id": "dataact_2023", "name": "Data Act", "coll": "bp_compliance_ce"},
{"celex": "32023R0988", "reg_id": "gpsr_2023", "name": "General Product Safety Regulation", "coll": "bp_compliance_ce"},
{"celex": "32023R1230", "reg_id": "machinery_2023", "name": "Maschinenverordnung", "coll": "bp_compliance_ce"},
{"celex": "32023R1803", "reg_id": "ifrs_2023", "name": "IFRS Regulation", "coll": "bp_compliance_ce"},
{"celex": "32023D1795", "reg_id": "dpf_2023", "name": "Data Privacy Framework", "coll": "bp_compliance_ce"},
{"celex": "32019L2161", "reg_id": "omnibus_2019", "name": "Omnibus-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32019L0790", "reg_id": "dsm_2019", "name": "DSM-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32019L0770", "reg_id": "digital_content_2019", "name": "Digital Content Directive", "coll": "bp_compliance_ce"},
{"celex": "32002L0058", "reg_id": "eprivacy_2002", "name": "ePrivacy-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32000L0031", "reg_id": "ecommerce_2000", "name": "E-Commerce-Richtlinie", "coll": "bp_compliance_ce"},
]
def download_eurlex_html(celex: str) -> bytes:
"""Download HTML from EUR-Lex for a given CELEX number."""
url = EURLEX_HTML_URL.format(celex=celex)
with httpx.Client(timeout=60.0, follow_redirects=True) as c:
r = c.get(url)
r.raise_for_status()
return r.content
def delete_old_chunks(qdrant_url: str, collection: str, reg_id: str):
    """Delete all chunks whose regulation_id equals reg_id (exact match, not a prefix)."""
    with httpx.Client(timeout=30.0) as c:
        r = c.post(f"{qdrant_url}/collections/{collection}/points/delete", json={
            "filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]}
        })
        if r.status_code != 200:
            # Previously a failed delete was silent; surface it in the log.
            logger.warning(" Delete for %s returned HTTP %d", reg_id, r.status_code)
def find_old_chunks_by_filename(qdrant_url: str, collection: str, filename_pattern: str) -> int:
    """Count existing chunks whose regulation_id equals filename_pattern.

    Despite the name, this is an exact match on regulation_id, not a filename pattern.
    """
    with httpx.Client(timeout=30.0) as c:
        r = c.post(f"{qdrant_url}/collections/{collection}/points/count", json={
            "exact": True,
            "filter": {"must": [{"key": "regulation_id", "match": {"value": filename_pattern}}]}
        })
        if r.status_code == 200:
            return r.json()["result"]["count"]
    return 0
def upload_html(rag_url: str, html_bytes: bytes, reg: dict) -> dict:
"""Upload HTML to RAG service."""
filename = f"{reg['reg_id']}.html"
metadata = json.dumps({
"regulation_id": reg["reg_id"],
"regulation_name_de": reg["name"],
"celex": reg["celex"],
"source": "EUR-Lex",
"license": "EU_law",
"source_type": "law",
"category": "eu_regulation",
}, ensure_ascii=False)
with httpx.Client(timeout=3600.0, verify=False) as c:
r = c.post(f"{rag_url}/api/v1/documents/upload",
files={"file": (filename, html_bytes, "text/html")},
data={
"collection": reg["coll"],
"data_type": "compliance",
"bundesland": "eu",
"use_case": "regulation",
"year": reg["celex"][1:5],
"chunk_strategy": "recursive",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": metadata,
},
)
r.raise_for_status()
return r.json()
def check_section_rate(qdrant_url: str, collection: str, reg_id: str) -> tuple:
    """Check section rate for a regulation. Returns (total, with_section).

    Only the first scroll page (up to 100 chunks) is inspected, so for larger
    regulations the result is a sample, not a full count.
    """
    total = 0
    with_section = 0
    with httpx.Client(timeout=30.0) as c:
        r = c.post(f"{qdrant_url}/collections/{collection}/points/scroll", json={
            "limit": 100, "with_payload": True, "with_vector": False,
            "filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]}
        })
        if r.status_code == 200:
            pts = r.json()["result"]["points"]
            total = len(pts)
            with_section = sum(1 for p in pts if p["payload"].get("section"))
    return total, with_section
def main():
parser = argparse.ArgumentParser(description="Replace EU PDFs with EUR-Lex HTML")
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--celex", default=None, help="Process only this CELEX number")
args = parser.parse_args()
regs = EU_REGULATIONS
if args.celex:
regs = [r for r in regs if r["celex"] == args.celex]
if not regs:
print(f"CELEX {args.celex} not found in list")
return
results = []
for reg in regs:
logger.info("[%s] %s (%s)", reg["celex"], reg["name"], reg["reg_id"])
# Download HTML
try:
html_bytes = download_eurlex_html(reg["celex"])
logger.info(" Downloaded: %d bytes", len(html_bytes))
except Exception as e:
logger.error(" Download FAILED: %s", e)
results.append({"reg": reg, "status": "download_failed", "error": str(e)})
continue
if args.dry_run:
results.append({"reg": reg, "status": "dry_run", "html_size": len(html_bytes)})
continue
# Delete old chunks
old_count = find_old_chunks_by_filename(args.qdrant_url, reg["coll"], reg["reg_id"])
delete_old_chunks(args.qdrant_url, reg["coll"], reg["reg_id"])
logger.info(" Requested deletion of %d old chunks", old_count)
# Upload HTML
try:
result = upload_html(args.rag_url, html_bytes, reg)
new_chunks = result.get("chunks_count", 0)
logger.info(" Uploaded: %d new chunks", new_chunks)
except Exception as e:
logger.error(" Upload FAILED: %s", e)
results.append({"reg": reg, "status": "upload_failed", "error": str(e)})
time.sleep(2)
continue
# Check quality
time.sleep(2)
total, with_sec = check_section_rate(args.qdrant_url, reg["coll"], reg["reg_id"])
pct = with_sec * 100 // max(total, 1)
logger.info(" Section rate: %d/%d = %d%%", with_sec, total, pct)
results.append({
"reg": reg, "status": "ok",
"old_chunks": old_count, "new_chunks": new_chunks,
"section_rate": pct,
})
time.sleep(2)
# Report
print("\n" + "=" * 90)
print("EUR-LEX REPLACEMENT REPORT")
print("=" * 90)
print(f"{'CELEX':<15} {'Name':<30} {'Status':<10} {'Old':>5} {'New':>5} {'Sect%':>6}")
print("-" * 90)
for r in results:
reg = r["reg"]
status = r["status"]
old = r.get("old_chunks", "")
new = r.get("new_chunks", r.get("html_size", ""))
sect = f"{r.get('section_rate', '')}%" if "section_rate" in r else ""
print(f"{reg['celex']:<15} {reg['name'][:30]:<30} {status:<10} {str(old):>5} {str(new):>5} {sect:>6}")
if __name__ == "__main__":
main()
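`upload_html` above derives the `year` form field with `reg["celex"][1:5]`, which relies on the CELEX layout: one sector character, a four-digit year, a document-type code, then a four-digit number. The following sketch makes that layout explicit and fails loudly on anything else. `parse_celex` and `Celex` are hypothetical names, and the pattern only covers the plain sector-3 style identifiers used in `EU_REGULATIONS`, not corrigenda or other CELEX variants.

```python
import re
from typing import NamedTuple


class Celex(NamedTuple):
    sector: str    # "3" = legal acts
    year: int      # e.g. 2016
    doc_type: str  # "R" regulation, "L" directive, "D" decision
    number: str    # zero-padded sequence number, kept as a string


# Assumed layout: sector digit, 4-digit year, 1-2 letter type, 4-digit number.
_CELEX_RE = re.compile(r"^(\d)(\d{4})([A-Z]{1,2})(\d{4})$")


def parse_celex(celex: str) -> Celex:
    """Split a simple CELEX number into its parts or raise ValueError."""
    m = _CELEX_RE.match(celex)
    if m is None:
        raise ValueError(f"unsupported CELEX format: {celex!r}")
    sector, year, doc_type, number = m.groups()
    return Celex(sector, int(year), doc_type, number)


if __name__ == "__main__":
    print(parse_celex("32016R0679"))
```

Validating the identifier before slicing would turn a typo in the list into an immediate error rather than a wrong `year` value in the upload form.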
@@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""Re-upload NIST/BSI/ENISA docs with chunk_strategy='legal' for section metadata.
The docs were already uploaded with 'recursive' strategy (no section detection).
This script re-uploads with 'legal' strategy, then deletes old recursive chunks.
Usage (on Mac Mini):
python3 control-pipeline/scripts/reupload_legal_strategy.py
python3 control-pipeline/scripts/reupload_legal_strategy.py --dry-run
"""
import argparse
import io
import json
import re
import sys
import time
import unicodedata
import httpx
import pdfplumber
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
UPLOAD_TIMEOUT = 1800.0
# ---- Documents to process ----
DOCS = [
# 4 NIST docs: 3 already extracted to /tmp/nist_*.txt, 1 (SP 800-207) still needs extraction from MinIO
{
"regulation_id": "nist_sp800_53r5",
"collection": "bp_compliance_datenschutz",
"upload_filename": "NIST_SP_800_53r5.txt",
"local_txt": "/tmp/nist_nist_sp800_53r5.txt",
"minio_pdf": None, # already extracted
"extra_metadata": {
"regulation_id": "nist_sp800_53r5",
"source_id": "nist",
"doc_type": "controls_catalog",
"guideline_name": "NIST SP 800-53 Rev. 5",
"license": "public_domain_us_gov",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp_800_82r3",
"collection": "bp_compliance_ce",
"upload_filename": "nist_sp_800_82r3.txt",
"local_txt": "/tmp/nist_nist_sp_800_82r3.txt",
"minio_pdf": None,
"extra_metadata": {
"regulation_id": "nist_sp_800_82r3",
"regulation_short": "NIST SP 800-82",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp_800_160v1r1",
"collection": "bp_compliance_ce",
"upload_filename": "nist_sp_800_160v1r1.txt",
"local_txt": "/tmp/nist_160.txt",
"minio_pdf": None,
"extra_metadata": {
"regulation_id": "nist_sp_800_160v1r1",
"regulation_short": "NIST SP 800-160",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp800_207",
"collection": "bp_compliance_datenschutz",
"upload_filename": "NIST_SP_800_207.txt",
"local_txt": None, # needs extraction
"minio_pdf": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
"extra_metadata": {
"regulation_id": "nist_sp800_207",
"source_id": "nist",
"doc_type": "architecture",
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
"license": "public_domain_us_gov",
"source": "nist.gov",
},
},
# Additional low-quality docs (need extraction from MinIO)
{
"regulation_id": "nist_csf_2_0",
"collection": "bp_compliance_datenschutz",
"upload_filename": "nist_csf_2_0.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
"extra_metadata": {
"regulation_id": "nist_csf_2_0",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nistir_8259a",
"collection": "bp_compliance_datenschutz",
"upload_filename": "nistir_8259a.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nistir_8259a.pdf",
"extra_metadata": {
"regulation_id": "nistir_8259a",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_ai_rmf",
"collection": "bp_compliance_datenschutz",
"upload_filename": "nist_ai_rmf.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
"extra_metadata": {
"regulation_id": "nist_ai_rmf",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp_800_30r1",
"collection": "bp_compliance_ce",
"upload_filename": "nist_sp_800_30r1.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_30r1",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "cisa_secure_by_design",
"collection": "bp_compliance_ce",
"upload_filename": "cisa_secure_by_design.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
"extra_metadata": {
"regulation_id": "cisa_secure_by_design",
"license": "public_domain_us",
"source": "cisa.gov",
},
},
{
"regulation_id": "cvss_v4_0",
"collection": "bp_compliance_ce",
"upload_filename": "cvss_v4_0.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
"extra_metadata": {
"regulation_id": "cvss_v4_0",
"license": "public_domain_us",
"source": "first.org",
},
},
{
"regulation_id": "enisa_ics_scada_dependencies",
"collection": "bp_compliance_ce",
"upload_filename": "enisa_ics_scada.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
"extra_metadata": {
"regulation_id": "enisa_ics_scada_dependencies",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"regulation_id": "enisa_threat_landscape_supply_chain",
"collection": "bp_compliance_ce",
"upload_filename": "enisa_supply_chain_security.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
"extra_metadata": {
"regulation_id": "enisa_threat_landscape_supply_chain",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"regulation_id": "enisa_supply_chain_good_practices",
"collection": "bp_compliance_ce",
"upload_filename": "enisa_supply_chain_good_practices.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
"extra_metadata": {
"regulation_id": "enisa_supply_chain_good_practices",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
]
def normalize_pdf_text(text):
    text = unicodedata.normalize('NFKC', text)
    text = text.replace('\u00ad', '').replace('\u200b', '')
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
        text = re.sub(
            r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
        )
        text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
        text = re.sub(r'[^\S\n]{2,}', ' ', text)
    return text
def get_text(doc):
    """Get document text: from local file or extract from MinIO PDF."""
    if doc["local_txt"]:
        print(f" Reading local: {doc['local_txt']}")
        with open(doc["local_txt"], encoding="utf-8") as f:
            return f.read()
    print(f" Downloading from MinIO: {doc['minio_pdf']}")
    # Ask the RAG service for a download URL for the stored PDF
    with httpx.Client(timeout=60, verify=False) as c:
        resp = c.get(f"{RAG_URL}/api/v1/documents/download/{doc['minio_pdf']}")
        resp.raise_for_status()
        url = resp.json()["url"]
    with httpx.Client(timeout=300, verify=False) as c:
        pdf_resp = c.get(url)
        # Fail loudly instead of handing an HTTP error body to pdfplumber
        pdf_resp.raise_for_status()
        pdf_bytes = pdf_resp.content
    print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
    print(" Extracting with pdfplumber...")
    parts = []
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        for i, page in enumerate(pdf.pages):
            t = page.extract_text(x_tolerance=3, y_tolerance=4)
            if t:
                parts.append(t)
            if (i + 1) % 50 == 0:
                print(f" {i + 1}/{len(pdf.pages)} pages...")
    text = "\n\n".join(parts)
    text = normalize_pdf_text(text)
    print(f" Extracted {len(text):,} chars")
    return text
def get_old_doc_ids(collection, regulation_id):
"""Get all document_ids for existing chunks."""
doc_ids = set()
offset = None
with httpx.Client(timeout=60) as c:
while True:
body = {
"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]},
"limit": 100,
"with_payload": ["document_id"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
did = pt.get("payload", {}).get("document_id")
if did:
doc_ids.add(did)
offset = data.get("next_page_offset")
if offset is None:
break
return doc_ids
def upload_text_legal(text, filename, collection, extra_metadata):
"""Upload with chunk_strategy='legal'."""
form_data = {
"collection": collection,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, text.encode("utf-8"), "text/plain")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def delete_by_doc_ids(collection, doc_ids):
"""Delete chunks matching specific document_ids."""
with httpx.Client(timeout=30) as c:
for did in doc_ids:
c.post(
f"{QDRANT_URL}/collections/{collection}/points/delete",
json={"filter": {"must": [
{"key": "document_id", "match": {"value": did}}
]}},
).raise_for_status()
def count_chunks(collection, regulation_id):
with httpx.Client(timeout=30) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/count",
json={"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]}, "exact": True},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def check_section_rate(collection, regulation_id):
total = 0
with_sec = 0
offset = None
with httpx.Client(timeout=60) as c:
while True:
body = {
"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]},
"limit": 100,
"with_payload": ["section"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
total += 1
s = pt.get("payload", {}).get("section", "")
if s and s.strip():
with_sec += 1
offset = data.get("next_page_offset")
if offset is None:
break
return total, with_sec
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
print("=" * 60)
print("Re-upload with chunk_strategy='legal'")
print(f"Documents: {len(DOCS)}, Dry run: {args.dry_run}")
print("=" * 60)
results = []
for i, doc in enumerate(DOCS, 1):
reg_id = doc["regulation_id"]
coll = doc["collection"]
print(f"\n[{i}/{len(DOCS)}] {doc['upload_filename']} → {coll}")
# 1. Check existing
old_count = count_chunks(coll, reg_id)
old_doc_ids = get_old_doc_ids(coll, reg_id) if old_count > 0 else set()
print(f" Old: {old_count} chunks, {len(old_doc_ids)} doc_ids")
if args.dry_run:
print(" DRY RUN — skipping")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": "?", "sect": "?"})
continue
# 2. Get text
try:
text = get_text(doc)
except Exception as e:
print(f" ERROR extracting text: {e}")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": 0, "sect": 0})
continue
# 3. Upload with legal strategy
print(" Uploading with strategy='legal'...")
result = upload_text_legal(
text, doc["upload_filename"], coll, doc["extra_metadata"])
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
print(f" New: {new_chunks} chunks (doc_id={new_doc_id})")
if new_chunks == 0:
print(" ERROR: 0 chunks — keeping old!")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": 0, "sect": 0})
continue
# 4. Delete old chunks (safe: new ones already exist)
if old_doc_ids:
# Exclude the new document_id just in case
old_doc_ids.discard(new_doc_id)
if old_doc_ids:
print(f" Deleting {len(old_doc_ids)} old doc_ids...")
delete_by_doc_ids(coll, old_doc_ids)
# 5. Check section rate
total, with_sec = check_section_rate(coll, reg_id)
pct = (with_sec / total * 100) if total > 0 else 0
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": new_chunks, "sect": round(pct, 1)})
if i < len(DOCS):
time.sleep(2)
# Summary
print("\n" + "=" * 60)
print("RESULTS")
print("=" * 60)
for r in results:
print(f" {r['file']:<45} old={r['old']:<6} new={r['new']:<6} sect={r['sect']}%")
total_new = sum(r["new"] for r in results if isinstance(r["new"], int))
print(f"\nTotal new chunks: {total_new}")
if __name__ == "__main__":
main()
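To make the effect of `normalize_pdf_text` in the script above concrete, the following sketch restates the function with comments and runs it on one made-up line of PDF-extracted text. The sample string is an assumption chosen for illustration; it is not taken from any of the NIST or ENISA documents.

```python
import re
import unicodedata


def normalize_pdf_text(text: str) -> str:
    """Restatement of the helper above: repair spacing artifacts from PDF extraction."""
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens and zero-width spaces.
    text = text.replace("\u00ad", "").replace("\u200b", "")
    prev = None
    while prev != text:  # repeat until no rule changes the text any more
        prev = text
        # "3 . 1" -> "3.1"
        text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)
        # "AC - 2" -> "AC-2"
        text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)
        # "PR . AC-01" -> "PR.AC-01"
        text = re.sub(
            r"\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b", r"\1.\2-\3", text
        )
        # "( 4 )" -> "(4)"
        text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)
        # Collapse runs of spaces/tabs but keep newlines.
        text = re.sub(r"[^\S\n]{2,}", " ", text)
    return text


# Hypothetical sample line, for illustration only.
sample = "Section 3 . 1 requires AC - 2 and PR . AC - 01 ( 4 )  done"
cleaned = normalize_pdf_text(sample)
print(cleaned)
```

Running the rules to a fixed point matters because one rule can create input for another: here the control-ID rule first joins `AC - 01`, and only then can the category rule join `PR . AC-01`.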
@@ -0,0 +1,268 @@
#!/usr/bin/env python3
"""
D4 Integration Test: Upload BGB excerpt, then verify Qdrant payloads.
Usage:
# Dry-run (local chunking only, no services needed)
python3 scripts/test_d4_integration.py --dry-run
# Against Mac Mini
python3 scripts/test_d4_integration.py \
--rag-url https://macmini:8097 \
--qdrant-url http://macmini:6333
# Against production
python3 scripts/test_d4_integration.py \
--rag-url https://rag-prod:8097 \
--qdrant-url http://qdrant-prod:6333
"""
import argparse
import json
import os
import sys
import time
import httpx
FIXTURE_PATH = os.path.join(
os.path.dirname(__file__), "..", "..", "embedding-service",
"tests", "fixtures", "bgb_312_excerpt.txt",
)
COLLECTION = "bp_compliance_gesetze"
REG_CODE = "BGB_D4_TEST"
# Expected sections in the BGB excerpt
EXPECTED_SECTIONS = {"§ 312", "§ 312a", "§ 312g", "§ 312k"}
def load_fixture() -> str:
with open(FIXTURE_PATH, encoding="utf-8") as f:
return f.read()
def upload_document(rag_url: str, text: str) -> dict:
"""Upload BGB excerpt to RAG service."""
metadata = json.dumps({
"regulation_code": REG_CODE,
"regulation_name_de": "BGB (D4 Test)",
"source_type": "law",
})
with httpx.Client(timeout=60.0, verify=False) as client:
resp = client.post(
f"{rag_url}/api/v1/documents/upload",
files={"file": ("bgb_312_test.txt", text.encode(), "text/plain")},
data={
"collection": COLLECTION,
"data_type": "law",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "recursive",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": metadata,
},
)
resp.raise_for_status()
return resp.json()
def scroll_chunks(qdrant_url: str, document_id: str) -> list[dict]:
"""Scroll Qdrant for chunks matching this document_id."""
all_points = []
offset = None
with httpx.Client(timeout=30.0) as client:
while True:
body: dict = {
"limit": 100,
"with_payload": True,
"with_vector": False,
"filter": {
"must": [{
"key": "document_id",
"match": {"value": document_id},
}]
},
}
if offset:
body["offset"] = offset
resp = client.post(
f"{qdrant_url}/collections/{COLLECTION}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
all_points.extend(data["points"])
offset = data.get("next_page_offset")
if not offset:
break
return all_points
def delete_test_data(qdrant_url: str, document_id: str):
"""Clean up test chunks from Qdrant."""
with httpx.Client(timeout=30.0) as client:
resp = client.post(
f"{qdrant_url}/collections/{COLLECTION}/points/delete",
json={
"filter": {
"must": [{
"key": "document_id",
"match": {"value": document_id},
}]
}
},
)
resp.raise_for_status()
def verify_chunks(points: list[dict]) -> dict:
"""Analyze chunks and return a verification report."""
report = {
"total_chunks": len(points),
"sections_found": set(),
"chunks_with_section": 0,
"chunks_with_paragraph": 0,
"chunks_with_page": 0,
"section_details": [],
"issues": [],
}
for pt in points:
payload = pt.get("payload", {})
section = payload.get("section", "")
section_title = payload.get("section_title", "")
paragraph = payload.get("paragraph", "")
paragraph_num = payload.get("paragraph_num")
page = payload.get("page")
chunk_idx = payload.get("chunk_index", "?")
if section:
report["sections_found"].add(section)
report["chunks_with_section"] += 1
if paragraph:
report["chunks_with_paragraph"] += 1
if page is not None:
report["chunks_with_page"] += 1
report["section_details"].append({
"chunk_index": chunk_idx,
"section": section,
"section_title": section_title[:40],
"paragraph": paragraph,
"paragraph_num": paragraph_num,
"page": page,
"text_preview": payload.get("chunk_text", "")[:60],
})
# Checks
missing = EXPECTED_SECTIONS - report["sections_found"]
if missing:
report["issues"].append(f"Missing sections: {missing}")
if "§ 312k" not in report["sections_found"]:
report["issues"].append("CRITICAL: § 312k not found!")
section_ratio = report["chunks_with_section"] / max(report["total_chunks"], 1)
if section_ratio < 0.9:
report["issues"].append(
f"Only {section_ratio:.0%} chunks have section metadata (expected >= 90%)"
)
return report
def print_report(report: dict):
"""Print verification report."""
print("\n" + "=" * 60)
print("D4 VALIDATION REPORT")
print("=" * 60)
print(f"Total chunks: {report['total_chunks']}")
print(f"With section: {report['chunks_with_section']}")
print(f"With paragraph: {report['chunks_with_paragraph']}")
print(f"With page: {report['chunks_with_page']}")
print(f"Sections found: {sorted(report['sections_found'])}")
print("\nChunk details:")
for d in sorted(report["section_details"], key=lambda x: (isinstance(x["chunk_index"], str), x["chunk_index"])):
print(
f" [{d['chunk_index']:2}] "
f"section={d['section']!r:12s} "
f"title={d['section_title']!r:30s} "
f"para={d['paragraph']!r:8s}"
)
if report["issues"]:
print(f"\nISSUES ({len(report['issues'])}):")
for issue in report["issues"]:
print(f" - {issue}")
print("\nRESULT: FAIL")
else:
print("\nRESULT: PASS — all sections detected, metadata quality OK")
def main():
parser = argparse.ArgumentParser(description="D4 Integration Test")
parser.add_argument("--rag-url", default="https://macmini:8097")
parser.add_argument("--qdrant-url", default="http://macmini:6333")
parser.add_argument("--dry-run", action="store_true",
help="Only test local chunking, no upload")
parser.add_argument("--keep", action="store_true",
help="Don't delete test data after verification")
args = parser.parse_args()
text = load_fixture()
print(f"Loaded BGB excerpt: {len(text)} chars")
if args.dry_run:
# Import chunking directly
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "embedding-service"))
from main import chunk_text_legal_structured
chunks = chunk_text_legal_structured(text, 1500, 100)
# Build fake points for verification
points = [{"payload": {
"chunk_index": c["index"],
"chunk_text": c["text"],
"section": c["section"],
"section_title": c["section_title"],
"paragraph": c["paragraph"],
"paragraph_num": c["paragraph_num"],
"page": c["page"],
}} for c in chunks]
report = verify_chunks(points)
print_report(report)
sys.exit(1 if report["issues"] else 0)
# Full integration test
print(f"Uploading to {args.rag_url} → collection={COLLECTION}...")
result = upload_document(args.rag_url, text)
doc_id = result["document_id"]
print(f" document_id: {doc_id}")
print(f" chunks_count: {result['chunks_count']}")
print(f" vectors_indexed: {result['vectors_indexed']}")
print("Waiting 2s for indexing...")
time.sleep(2)
print(f"Scrolling Qdrant at {args.qdrant_url}...")
points = scroll_chunks(args.qdrant_url, doc_id)
print(f" Found {len(points)} points")
report = verify_chunks(points)
print_report(report)
if not args.keep:
print(f"\nCleaning up test data (document_id={doc_id})...")
delete_test_data(args.qdrant_url, doc_id)
print(" Deleted.")
sys.exit(1 if report["issues"] else 0)
if __name__ == "__main__":
main()
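The scroll loops in these scripts (`scroll_chunks`, `get_old_doc_ids`, `check_section_rate`) all follow the same pagination pattern. A minimal sketch of that pattern, with the HTTP call replaced by a hypothetical `fetch_page(offset)` callable and a fake two-page backend, might look like this. The names `scroll_all`, `fetch_page` and `_PAGES` are assumptions made for the example, not part of the commit.

```python
from typing import Callable, Iterator, Optional


def scroll_all(fetch_page: Callable[[Optional[object]], dict]) -> Iterator[dict]:
    """Yield every point from a paginated scroll endpoint.

    fetch_page(offset) is a hypothetical callable returning a dict shaped like
    the "result" object used above: {"points": [...], "next_page_offset": ...}.
    """
    offset = None
    while True:
        page = fetch_page(offset)
        yield from page["points"]
        offset = page.get("next_page_offset")
        # Explicit None check, so a falsy offset value would still be followed.
        if offset is None:
            break


# Fake two-page backend standing in for the HTTP call (illustration only).
_PAGES = {
    None: {"points": [{"id": 1}, {"id": 2}], "next_page_offset": 3},
    3: {"points": [{"id": 3}], "next_page_offset": None},
}

points = list(scroll_all(lambda offset: _PAGES[offset]))
print([p["id"] for p in points])
```

Keeping the loop in one generator would let the three callers differ only in the filter and payload fields they request.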
@@ -17,9 +17,6 @@ import httpx
from .control_generator import (
GeneratedControl,
REGULATION_LICENSE_MAP,
_RULE2_PREFIXES,
_RULE3_PREFIXES,
_classify_regulation,
)
@@ -346,13 +346,40 @@ class BatchDedupRunner:
self._progress_total = total
self._progress_count = 0
logger.info("BatchDedup Cross-group: %d masters to check", total)
cross_linked = 0
cross_review = 0
# Paginated processing — 100 rows per DB query
# Checkpoint: resume from last processed control_id
DB_PAGE = 100
last_control_id = ""
# Checkpoint: resume from last processed control_id (survives container restart)
checkpoint_row = self.db.execute(text("""
SELECT config FROM canonical_generation_jobs
WHERE status = 'dedup_phase2_checkpoint'
LIMIT 1
""")).fetchone()
last_control_id = checkpoint_row[0] if checkpoint_row else ""
if last_control_id:
skip_row = self.db.execute(text("""
SELECT COUNT(*) FROM canonical_controls
WHERE decomposition_method = 'pass0b'
AND release_state != 'duplicate'
AND release_state != 'deprecated'
AND control_id <= :last_id
"""), {"last_id": last_control_id}).fetchone()
skipped = skip_row[0] if skip_row else 0
self._progress_count = skipped
logger.info("BatchDedup Cross-group: RESUMING from %s (skipping %d already processed)",
last_control_id, skipped)
else:
self.db.execute(text("""
INSERT INTO canonical_generation_jobs (id, status, config)
VALUES (gen_random_uuid(), 'dedup_phase2_checkpoint', '')
"""))
self.db.commit()
logger.info("BatchDedup Cross-group: %d masters to check (starting from %s)",
total, last_control_id or "beginning")
while True:
rows = self.db.execute(text("""
@@ -461,11 +488,34 @@ class BatchDedupRunner:
self._progress_count += 1
# Log progress every page
# Save checkpoint + log progress every page
try:
self.db.execute(text("""
UPDATE canonical_generation_jobs
SET config = :cid
WHERE status = 'dedup_phase2_checkpoint'
"""), {"cid": last_control_id})
self.db.commit()
except Exception:
try:
self.db.rollback()
except Exception:
pass
processed = self._progress_count
if processed % 500 < DB_PAGE:
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review",
processed, len(rows), cross_linked, cross_review)
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review (checkpoint: %s)",
processed, total, cross_linked, cross_review, last_control_id)
# Clear checkpoint on completion
try:
self.db.execute(text("""
DELETE FROM canonical_generation_jobs
WHERE status = 'dedup_phase2_checkpoint'
"""))
self.db.commit()
except Exception:
pass
self.stats["cross_group_linked"] = cross_linked
self.stats["cross_group_review"] = cross_review
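The checkpoint logic in this hunk can be tried out in isolation. The sketch below uses an in-memory SQLite database instead of PostgreSQL and assumes a keyset query of the form `control_id > last_id ORDER BY control_id`, which is consistent with the `control_id <= :last_id` skip count above but is not shown in the diff. The table names, `PAGE_SIZE`, and the `fail_after_pages` switch are hypothetical.

```python
import sqlite3

PAGE_SIZE = 2  # small page size so the example spans several pages


def run(db: sqlite3.Connection, processed: list, fail_after_pages=None) -> None:
    """Process controls in control_id order, saving a checkpoint after each page."""
    row = db.execute("SELECT last_id FROM checkpoint").fetchone()
    if row is None:
        # First run: create the checkpoint row with an empty marker.
        db.execute("INSERT INTO checkpoint (last_id) VALUES ('')")
        db.commit()
        last_id = ""
    else:
        last_id = row[0]  # resume after the last processed control_id

    pages = 0
    while True:
        rows = db.execute(
            "SELECT control_id FROM controls WHERE control_id > ? "
            "ORDER BY control_id LIMIT ?",
            (last_id, PAGE_SIZE),
        ).fetchall()
        if not rows:
            break
        for (control_id,) in rows:
            processed.append(control_id)
            last_id = control_id
        # Persist progress once per page.
        db.execute("UPDATE checkpoint SET last_id = ?", (last_id,))
        db.commit()
        pages += 1
        if fail_after_pages is not None and pages >= fail_after_pages:
            raise RuntimeError("simulated container restart")

    db.execute("DELETE FROM checkpoint")  # clear checkpoint on completion
    db.commit()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE controls (control_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE checkpoint (last_id TEXT NOT NULL)")
db.executemany(
    "INSERT INTO controls VALUES (?)", [(f"C-{n:03d}",) for n in range(1, 6)]
)
db.commit()

processed = []
try:
    run(db, processed, fail_after_pages=1)  # first run stops after one page
except RuntimeError:
    pass
run(db, processed)  # second run resumes after the checkpoint
print(processed)
```

Because the checkpoint is written only after a full page, a restart in the middle of a page would reprocess that page, so the per-row work should be safe to repeat.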
@@ -25,8 +25,7 @@ import re
import uuid
from collections import defaultdict
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set
from typing import Dict, List, Optional
import httpx
from pydantic import BaseModel
@@ -34,7 +33,8 @@ from sqlalchemy import text
from sqlalchemy.orm import Session
from .rag_client import ComplianceRAGClient, RAGSearchResult, get_rag_client
from .similarity_detector import check_similarity, SimilarityReport
from .regulation_registry import get_registry as _get_regulation_registry
from .similarity_detector import check_similarity
logger = logging.getLogger(__name__)
@@ -246,28 +246,21 @@ def _classify_regulation(regulation_code: str) -> dict:
Returns dict with keys: license, rule, name, source_type.
source_type is one of: law, guideline, standard, restricted.
Delegates to DB-backed RegulationRegistry (with 5min cache).
Falls back to REGULATION_LICENSE_MAP if DB is unavailable.
"""
code = regulation_code.lower().strip()
registry = _get_regulation_registry()
result = registry.classify_regulation(regulation_code)
# Exact match first
if code in REGULATION_LICENSE_MAP:
return REGULATION_LICENSE_MAP[code]
# If registry returned the unknown fallback AND we have a local match,
# prefer the local dict (graceful degradation during migration)
if result.get("license") == "UNKNOWN":
code = regulation_code.lower().strip()
if code in REGULATION_LICENSE_MAP:
return REGULATION_LICENSE_MAP[code]
# Prefix match for Rule 2 (ENISA = standard)
for prefix in _RULE2_PREFIXES:
if code.startswith(prefix):
return {"license": "CC-BY-4.0", "rule": 2, "source_type": "standard",
"name": "ENISA", "attribution": "ENISA, CC BY 4.0"}
# Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
for prefix in _RULE3_PREFIXES:
if code.startswith(prefix):
return {"license": f"{prefix.rstrip('_').upper()}_RESTRICTED", "rule": 3,
"source_type": "restricted", "name": "INTERNAL_ONLY"}
# Unknown → treat as restricted (safe default)
logger.warning("Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code)
return {"license": "UNKNOWN", "rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}
return result
# ---------------------------------------------------------------------------
@@ -1019,11 +1012,12 @@ class ControlGeneratorPipeline:
regulation_name=reg_name,
regulation_short=reg_short,
category=payload.get("category", "") or payload.get("data_type", ""),
article=payload.get("article", "") or payload.get("section_title", "") or payload.get("section", ""),
article=payload.get("section", "") or payload.get("article", "") or payload.get("section_title", ""),
paragraph=payload.get("paragraph", ""),
source_url=payload.get("source_url", "") or payload.get("source", "") or payload.get("url", ""),
score=0.0,
collection=collection,
page=payload.get("page"),
)
all_results.append(chunk)
collection_new += 1
@@ -1127,6 +1121,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
"source": canonical_source,
"article": effective_article,
"paragraph": effective_paragraph,
"page": chunk.page,
"license": license_info.get("license", ""),
"source_type": license_info.get("source_type", "law"),
"url": chunk.source_url or "",
@@ -1141,6 +1136,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
"source_regulation": chunk.regulation_code,
"source_article": effective_article,
"source_paragraph": effective_paragraph,
"source_page": chunk.page,
}
return control
@@ -1194,6 +1190,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
"source": canonical_source,
"article": effective_article,
"paragraph": effective_paragraph,
"page": chunk.page,
"license": license_info.get("license", ""),
"license_notice": attribution,
"source_type": license_info.get("source_type", "standard"),
@@ -1209,6 +1206,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
"source_regulation": chunk.regulation_code,
"source_article": effective_article,
"source_paragraph": effective_paragraph,
"source_page": chunk.page,
}
return control
@@ -1368,6 +1366,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
"source": canonical_source,
"article": effective_article,
"paragraph": effective_paragraph,
"page": chunk.page,
"license": lic.get("license", ""),
"license_notice": lic.get("attribution", ""),
"source_type": lic.get("source_type", "law"),
@@ -1384,6 +1383,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
"source_regulation": chunk.regulation_code,
"source_article": effective_article,
"source_paragraph": effective_paragraph,
"source_page": chunk.page,
"batch_size": len(chunks),
"document_grouped": same_doc,
}
@@ -1479,14 +1479,14 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Aspekte ohne
) -> list[Optional[GeneratedControl]]:
"""Process a batch of (chunk, license_info) through stages 3-5."""
# Split by license rule: Rule 1+2 → structure, Rule 3 → reform
structure_items = [(c, l) for c, l in batch_items if l["rule"] in (1, 2)]
reform_items = [(c, l) for c, l in batch_items if l["rule"] == 3]
structure_items = [(c, lic) for c, lic in batch_items if lic["rule"] in (1, 2)]
reform_items = [(c, lic) for c, lic in batch_items if lic["rule"] == 3]
all_controls: dict[int, Optional[GeneratedControl]] = {}
if structure_items:
s_chunks = [c for c, _ in structure_items]
s_lics = [l for _, l in structure_items]
s_lics = [lic for _, lic in structure_items]
try:
s_controls = await self._structure_batch(s_chunks, s_lics)
except Exception as e:
@@ -24,7 +24,6 @@ import json
import logging
import os
import re
import uuid
from dataclasses import dataclass, field
from typing import Optional
@@ -56,7 +55,7 @@ ANTHROPIC_API_URL = "https://api.anthropic.com/v1"
# Patterns are defined in normative_patterns.py and imported here
# with local aliases for backward compatibility.
from .normative_patterns import (
from .normative_patterns import ( # noqa: E402
PFLICHT_RE as _PFLICHT_RE,
EMPFEHLUNG_RE as _EMPFEHLUNG_RE,
KANN_RE as _KANN_RE,
@@ -3472,7 +3471,7 @@ class DecompositionPass:
"category": atomic.category,
"parent_uuid": parent_uuid,
"gen_meta": json.dumps({
"decomposition_source": candidate_id,
"decomposition_source_id": candidate_id,
"decomposition_method": "pass0b",
"engine_version": "v2",
"action_object_class": getattr(atomic, "domain", ""),
@@ -4104,6 +4103,8 @@ def _format_citation(citation) -> str:
parts.append(c["article"])
if c.get("paragraph"):
parts.append(c["paragraph"])
if c.get("page") is not None:
parts.append(f"S. {c['page']}")
return " ".join(parts) if parts else citation
except (json.JSONDecodeError, TypeError):
return citation
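As a rough illustration of what the added `page` handling produces, here is a standalone sketch of a citation formatter. Only the `article`, `paragraph` and `page` handling mirrors the hunk above; the `source` key, the dict-or-string input handling, and the function name without the leading underscore are assumptions. "S." is the German abbreviation for "Seite" (page).

```python
import json


def format_citation(citation) -> str:
    """Render a citation (JSON string or dict) as a single line of text."""
    try:
        c = json.loads(citation) if isinstance(citation, str) else citation
        if not isinstance(c, dict):
            return str(citation)
        parts = []
        if c.get("source"):  # assumed key, not visible in the hunk
            parts.append(c["source"])
        if c.get("article"):
            parts.append(c["article"])
        if c.get("paragraph"):
            parts.append(c["paragraph"])
        # "is not None" rather than truthiness, so a page value of 0 is kept.
        if c.get("page") is not None:
            parts.append(f"S. {c['page']}")
        return " ".join(parts) if parts else str(citation)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON: return the raw citation unchanged.
        return str(citation)


with_page = format_citation(
    '{"source": "BGB", "article": "§ 312k", "paragraph": "Abs. 2", "page": 3}'
)
without_page = format_citation({"article": "Art. 6", "page": None})
raw = format_citation("not json")
print(with_page)
```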
@@ -34,6 +34,7 @@ class RAGSearchResult:
source_url: str
score: float
collection: str = ""
page: Optional[int] = None
class ComplianceRAGClient:
@@ -0,0 +1,220 @@
"""
DB-backed Regulation Registry with in-memory cache.
Replaces hardcoded REGULATION_LICENSE_MAP and SOURCE_REGULATION_CLASSIFICATION
with a single PostgreSQL table (compliance.regulation_registry).
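Cache TTL: 5 minutes. No locking: a reload builds fresh dicts and rebinds the
references, so readers see either the old or the new snapshot.
If the DB is unavailable, the stale cache and the prefix rules keep answering;
callers such as control_generator._classify_regulation() then fall back to the
hardcoded dicts (graceful degradation).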
"""
import logging
import time
from typing import Optional
from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError
from db.session import SessionLocal
logger = logging.getLogger(__name__)
_CACHE_TTL_SECONDS = 300 # 5 minutes
# Prefix-based fallback rules (unchanged from original logic)
_RULE2_PREFIXES = ("enisa_",)
_RULE3_PREFIXES = ("bsi_", "iso_", "etsi_")
# Fallback for unknown regulations
_UNKNOWN_REGULATION = {
"license": "UNKNOWN",
"rule": 3,
"source_type": "restricted",
"name": "INTERNAL_ONLY",
"attribution": None,
}
class RegulationRegistry:
"""In-memory cache of the regulation_registry table.
Provides two lookup modes:
1. by_code(regulation_id) → replaces REGULATION_LICENSE_MAP[code]
2. source_type_by_name(name) → replaces SOURCE_REGULATION_CLASSIFICATION[name]
"""
def __init__(self):
self._by_code: dict[str, dict] = {}
self._by_name: dict[str, str] = {}
self._loaded_at: float = float("-inf")  # never loaded: stale even if time.monotonic() is still below the TTL
def _is_stale(self) -> bool:
return (time.monotonic() - self._loaded_at) > _CACHE_TTL_SECONDS
def _load(self) -> bool:
"""Load all rows from regulation_registry into memory."""
try:
db = SessionLocal()
try:
rows = db.execute(
text("""
SELECT regulation_id, regulation_name_de, license_rule,
license_type, attribution, source_type, jurisdiction,
status
FROM regulation_registry
WHERE status != 'deprecated'
""")
).fetchall()
finally:
db.close()
by_code: dict[str, dict] = {}
by_name: dict[str, str] = {}
for row in rows:
entry = {
"license": row[3] or "", # license_type
"rule": row[2], # license_rule
"source_type": row[5] or "law", # source_type
"name": row[1] or row[0], # regulation_name_de or regulation_id
"attribution": row[4], # attribution
"jurisdiction": row[6], # jurisdiction
}
by_code[row[0].lower()] = entry
# Also index by name for source_type lookups
if row[1]:
by_name[row[1]] = row[5] or "law"
self._by_code = by_code
self._by_name = by_name
self._loaded_at = time.monotonic()
logger.info(
"Regulation registry loaded: %d entries by code, %d by name",
len(by_code), len(by_name),
)
return True
except SQLAlchemyError:
logger.warning(
"Failed to load regulation_registry from DB — using stale cache",
exc_info=True,
)
return False
def _ensure_loaded(self) -> None:
"""Reload cache if stale."""
if self._is_stale():
self._load()
def classify_regulation(self, regulation_code: str) -> dict:
"""Look up license info for a regulation_code.
Returns dict with keys: license, rule, name, source_type, attribution.
Equivalent to the old _classify_regulation() function.
"""
self._ensure_loaded()
code = regulation_code.lower().strip()
# Exact match from DB
if code in self._by_code:
return self._by_code[code]
# Prefix match for Rule 2 (ENISA = standard)
for prefix in _RULE2_PREFIXES:
if code.startswith(prefix):
return {
"license": "CC-BY-4.0",
"rule": 2,
"source_type": "standard",
"name": "ENISA",
"attribution": "ENISA, CC BY 4.0",
}
# Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
for prefix in _RULE3_PREFIXES:
if code.startswith(prefix):
return {
"license": f"{prefix.rstrip('_').upper()}_RESTRICTED",
"rule": 3,
"source_type": "restricted",
"name": "INTERNAL_ONLY",
"attribution": None,
}
# Unknown → restricted (safe default)
logger.warning(
"Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code
)
return dict(_UNKNOWN_REGULATION)
def source_type_by_name(self, source_regulation: str) -> str:
"""Look up source_type by regulation display name.
Equivalent to old classify_source_regulation().
Falls back to heuristic for unknown names.
"""
self._ensure_loaded()
if not source_regulation:
return "framework"
# Exact match from DB
if source_regulation in self._by_name:
return self._by_name[source_regulation]
# Heuristic fallback for unknown sources
lower = source_regulation.lower()
law_indicators = [
"verordnung", "richtlinie", "gesetz", "directive", "regulation",
"(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
]
if any(ind in lower for ind in law_indicators):
return "law"
guideline_indicators = [
"edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
]
if any(ind in lower for ind in guideline_indicators):
return "guideline"
framework_indicators = [
"enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
]
if any(ind in lower for ind in framework_indicators):
return "framework"
return "framework"
def get_all(self) -> dict[str, dict]:
"""Return all cached entries (by regulation_code)."""
self._ensure_loaded()
return dict(self._by_code)
def is_open_source(self, regulation_code: str) -> bool:
"""Check if regulation is Rule 1 or 2 (safe to reference)."""
info = self.classify_regulation(regulation_code)
return info["rule"] in (1, 2)
# Module-level singleton
_registry: Optional[RegulationRegistry] = None
def get_registry() -> RegulationRegistry:
"""Get or create the singleton RegulationRegistry instance."""
global _registry
if _registry is None:
_registry = RegulationRegistry()
return _registry
def classify_regulation(regulation_code: str) -> dict:
"""Convenience: look up license info for a regulation_code."""
return get_registry().classify_regulation(regulation_code)
def classify_source_regulation(source_regulation: str) -> str:
"""Convenience: look up source_type by regulation display name."""
return get_registry().source_type_by_name(source_regulation)
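The caching behaviour described in the module docstring (5-minute TTL, keep serving old data when a reload fails) can be sketched without a database. In the example below the loader and the clock are injected so the behaviour can be exercised deterministically. `TTLSnapshotCache`, `loader`, and the fake clock are hypothetical names for illustration and not part of the module above.

```python
import time
from typing import Callable, Dict


class TTLSnapshotCache:
    """Keep a dict snapshot and refresh it from loader() once it is older than ttl."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, dict]],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._data: Dict[str, dict] = {}
        self._loaded_at = None  # None means "never loaded"

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at <= self._ttl:
            return
        try:
            fresh = self._loader()
        except Exception:
            return  # keep serving the stale snapshot; retry on the next call
        self._data = fresh
        self._loaded_at = now

    def get(self, key: str, default=None):
        self._refresh_if_stale()
        return self._data.get(key.lower().strip(), default)


# Fake clock and a loader that fails on its second call (illustration only).
now = [0.0]
calls = []


def loader():
    calls.append(now[0])
    if len(calls) == 2:
        raise RuntimeError("DB down")
    return {"gdpr": {"rule": 1, "load": len(calls)}}


cache = TTLSnapshotCache(loader, ttl=300, clock=lambda: now[0])
first = cache.get("GDPR")    # initial load
now[0] = 100.0
second = cache.get("gdpr")   # within TTL: served from memory
now[0] = 400.0
third = cache.get("gdpr")    # stale, reload fails: old snapshot is served
fourth = cache.get("gdpr")   # still stale, retry succeeds
print(first, third, fourth, len(calls))
```

Using `None` as the "never loaded" marker avoids relying on the absolute value of the monotonic clock, which has no defined starting point.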
@@ -0,0 +1,318 @@
# Adversarial test suite: 30 tricky cases that challenge the controls/agent
version: "1.0"
purpose: "Test whether controls and agent decide correctly on borderline wording"
tests:
# A. Wrong legal basis (plausible-sounding): 8 cases
- id: ADV-LIT-001
category: wrong_legal_basis
input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
context: "DSE-Abschnitt ueber Google Analytics"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
difficulty: medium
- id: ADV-LIT-002
category: wrong_legal_basis
input: "Der Versand unseres Newsletters erfolgt auf Grundlage des Vertrages (Art. 6 Abs. 1 lit. b DSGVO)."
context: "DSE-Abschnitt ueber Marketing"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Newsletter ist kein Vertragsbestandteil, erfordert separate Einwilligung"
difficulty: medium
- id: ADV-LIT-003
category: wrong_legal_basis
input: "Die Ueberwachung der Arbeitsleistung unserer Mitarbeiter erfolgt auf Grundlage unseres berechtigten Interesses."
context: "Interne Datenschutzrichtlinie"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Betriebsvereinbarung + Art. 88 DSGVO i.V.m. § 26 BDSG"
reason: "Mitarbeiterueberwachung erfordert Betriebsvereinbarung (BAG Keylogger-Urteil)"
difficulty: hard
- id: ADV-LIT-004
category: wrong_legal_basis
input: "Biometrische Zutrittskontrolle auf Basis von Art. 6 Abs. 1 lit. f DSGVO."
context: "Sicherheitskonzept"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 9 Abs. 2 DSGVO (ausdrueckliche Einwilligung oder Arbeitsrecht)"
reason: "Biometrische Daten = besondere Kategorie nach Art. 9, lit. f reicht nicht"
difficulty: hard
- id: ADV-LIT-005
category: wrong_legal_basis
input: "Wir erstellen automatisierte Kreditentscheidungen auf Grundlage berechtigter Interessen."
context: "DSE einer Bank"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 22 DSGVO (ausdrueckliche Einwilligung oder gesetzliche Erlaubnis)"
reason: "Automatisierte Einzelentscheidungen erfordern Art. 22 Schutz (EuGH SCHUFA C-634/21)"
difficulty: hard
- id: ADV-LIT-006
category: wrong_legal_basis
input: "Social Login ueber Google wird als Vertragsdurchfuehrung (lit. b) verarbeitet."
context: "DSE mit Social Login"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Social Login ist keine Vertragspflicht, Nutzer kann sich auch ohne Google anmelden"
difficulty: medium
- id: ADV-LIT-007
category: wrong_legal_basis
input: "Personalisierte Werbung basiert auf unserem berechtigten Interesse an Direktmarketing."
context: "DSE eines marktbeherrschenden Unternehmens"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Marktbeherrschende Unternehmen koennen sich nicht auf lit. f fuer Werbung berufen (EuGH Meta C-252/21)"
difficulty: hard
- id: ADV-LIT-008
category: wrong_legal_basis
input: "Die Einbindung von Facebook Pixel erfolgt zur Vertragserfuellung (Art. 6 Abs. 1 lit. b DSGVO)."
context: "DSE eines Online-Shops"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Facebook Pixel dient Tracking/Marketing, nicht der Vertragserfuellung"
difficulty: easy
# B. Dark patterns (subtle): 6 cases
- id: ADV-DP-001
category: dark_pattern
input:
accept_button: {text: "Alle akzeptieren", size: "16px", color: "#ffffff", background: "#0066cc", prominent: true}
reject_button: {text: "Ablehnen", size: "10px", color: "#cccccc", background: "transparent", prominent: false}
expected:
finding: true
finding_type: dark_pattern_visual_bias
reason: "Ablehnen-Button ist kleiner, weniger sichtbar (OLG Koeln 6 U 58/21)"
difficulty: easy
- id: ADV-DP-002
category: dark_pattern
input:
accept_button: {text: "Alle akzeptieren", clicks_to_complete: 1}
reject_option: {text: "Einstellungen verwalten", clicks_to_complete: 3, label: "Einstellungen"}
expected:
finding: true
finding_type: dark_pattern_friction_asymmetry
reason: "Ablehnen erfordert 3 Klicks, Akzeptieren nur 1 (CNIL Cookie-Banner)"
difficulty: medium
- id: ADV-DP-003
category: dark_pattern
input:
type: "cookie_wall"
description: "Inhalt erst nach Cookie-Zustimmung sichtbar"
expected:
finding: true
finding_type: dark_pattern_cookie_wall
reason: "Cookie-Wall = keine freiwillige Einwilligung (EDPB Guidelines 05/2020)"
difficulty: medium
- id: ADV-DP-004
category: dark_pattern
input:
type: "prechecked_boxes"
description: "Checkboxen fuer Marketing und Analytics sind vorausgefuellt"
expected:
finding: true
finding_type: dark_pattern_prechecked
reason: "Vorausgefuellte Checkboxen sind keine wirksame Einwilligung (BGH Planet49)"
difficulty: easy
- id: ADV-DP-005
category: dark_pattern
input:
type: "confirm_shaming"
accept_text: "Ja, ich moechte sicher surfen"
reject_text: "Nein, ich verzichte auf Sicherheit"
expected:
finding: true
finding_type: dark_pattern_confirm_shaming
reason: "Manipulative Formulierung beeinflusst Entscheidung"
difficulty: medium
- id: ADV-DP-006
category: dark_pattern
input:
type: "hidden_reject"
description: "Ablehnen-Link ist 3px gross, Farbe #f0f0f0 auf weissem Hintergrund"
expected:
finding: true
finding_type: dark_pattern_hidden_option
reason: "Ablehnen-Option praktisch unsichtbar (OLG Koeln)"
difficulty: easy
# C. Almost-complete documents: 6 cases
- id: ADV-DOC-001
category: incomplete_document
input: "Impressum: Max Mustermann GmbH, Musterstr. 1, 10115 Berlin, info@example.com, HRB 12345"
expected:
finding: true
finding_type: missing_field
missing: "USt-ID"
reason: "§ 5 Abs. 1 Nr. 6 DDG: USt-IdNr. oder Wirtschafts-ID Pflicht"
difficulty: easy
- id: ADV-DOC-002
category: incomplete_document
input: "Datenschutzerklaerung mit Zwecken, Rechtsgrundlagen, Empfaengern, Betroffenenrechten — aber ohne Speicherdauer"
expected:
finding: true
finding_type: missing_field
missing: "Speicherdauer"
reason: "Art. 13 Abs. 2 lit. a DSGVO: Dauer der Speicherung oder Kriterien"
difficulty: medium
- id: ADV-DOC-003
category: incomplete_document
input: "DSE ohne Kontaktdaten des Datenschutzbeauftragten"
expected:
finding: true
finding_type: missing_field
missing: "DSB-Kontakt"
reason: "Art. 13 Abs. 1 lit. b DSGVO: Kontaktdaten des DSB"
difficulty: easy
- id: ADV-DOC-004
category: incomplete_document
input: "Widerrufsbelehrung mit 14-Tage-Frist, Muster-Formular, aber Fristbeginn fehlt"
expected:
finding: true
finding_type: missing_field
missing: "Fristbeginn"
reason: "Anlage 1 zu Art. 246a § 1 EGBGB: Fristbeginn muss angegeben werden"
difficulty: medium
- id: ADV-DOC-005
category: incomplete_document
input: "AGB eines Online-Shops ohne Angabe des Gerichtsstands"
expected:
finding: false
reason: "Gerichtsstand in AGB ist bei B2C nicht erforderlich (sogar oft unzulaessig)"
difficulty: hard
- id: ADV-DOC-006
category: incomplete_document
input: "Cookie-Policy listet Google Analytics und Facebook Pixel auf, aber nicht das CMP-Cookie selbst"
expected:
finding: true
finding_type: missing_field
missing: "CMP-eigene Cookies"
reason: "Auch technisch notwendige Cookies muessen in der Cookie-Policy stehen"
difficulty: hard
# D. Semantically similar but different: 5 cases
- id: ADV-SEM-001
category: similar_but_different
control_a: "MFA fuer privilegierte Admin-Accounts aktivieren"
control_b: "MFA fuer alle Endnutzer-Accounts aktivieren"
expected:
is_duplicate: false
reason: "Verschiedene Scopes (Admin vs. Endnutzer) = verschiedene Controls"
difficulty: medium
- id: ADV-SEM-002
category: similar_but_different
control_a: "Daten nach Vertragsende loeschen"
control_b: "Daten nach Ablauf der gesetzlichen Aufbewahrungsfrist loeschen"
expected:
is_duplicate: false
reason: "Verschiedene Trigger (Vertragsende vs. Aufbewahrungsfrist)"
difficulty: hard
- id: ADV-SEM-003
category: similar_but_different
control_a: "Rate Limiting fuer oeffentliche API-Endpunkte"
control_b: "Rate Limiting fuer Login-Endpunkte"
expected:
is_duplicate: false
reason: "Verschiedene Asset-Scopes (API vs. Login)"
difficulty: medium
- id: ADV-SEM-004
category: similar_but_different
control_a: "Verschluesselung personenbezogener Daten at rest"
control_b: "Verschluesselung personenbezogener Daten in transit"
expected:
is_duplicate: false
reason: "Verschiedene Phasen (Speicherung vs. Uebertragung)"
difficulty: easy
- id: ADV-SEM-005
category: similar_but_different
control_a: "Incident Response Plan erstellen"
control_b: "Business Continuity Plan erstellen"
expected:
is_duplicate: false
reason: "IRP = Sicherheitsvorfaelle, BCP = Geschaeftskontinuitaet (verschiedene Ziele)"
difficulty: medium
# E. Semantically different but similar-sounding: 5 cases
- id: ADV-HOM-001
category: homonym_different
control_a: "Einwilligung des Nutzers fuer Datenverarbeitung einholen (DSGVO)"
control_b: "Einwilligung des Nutzers fuer Werbeanrufe einholen (UWG)"
expected:
is_duplicate: false
reason: "Verschiedene Rechtsgrundlagen (DSGVO vs. UWG) und verschiedene Rechtsfolgen"
difficulty: hard
- id: ADV-HOM-002
category: homonym_different
control_a: "Risikobewertung fuer Datenschutz-Folgenabschaetzung (DSFA)"
control_b: "Risikobewertung fuer finanzielle Risiken (MaRisk)"
expected:
is_duplicate: false
reason: "Verschiedene Risikokategorien und verschiedene regulatorische Grundlagen"
difficulty: hard
- id: ADV-HOM-003
category: homonym_different
control_a: "Audit der Datenschutz-Compliance (Art. 5 Abs. 2 DSGVO)"
control_b: "Audit der Jahresabschlusspruefung (HGB)"
expected:
is_duplicate: false
reason: "Verschiedene Audit-Typen mit verschiedenen Pruefungsstandards"
difficulty: medium
- id: ADV-HOM-004
category: homonym_different
control_a: "Zertifizierung nach ISO 27001 (Informationssicherheit)"
control_b: "Zertifizierung nach CE-Konformitaet (Produktsicherheit)"
expected:
is_duplicate: false
reason: "Verschiedene Zertifizierungsrahmen, verschiedene Pruefer, verschiedene Ziele"
difficulty: easy
- id: ADV-HOM-005
category: homonym_different
control_a: "Verarbeitung personenbezogener Daten dokumentieren (DSGVO VVT)"
control_b: "Verarbeitung von Lebensmitteln dokumentieren (HACCP)"
expected:
is_duplicate: false
reason: "Komplett verschiedene Domaenen trotz gleicher Woerter"
difficulty: easy
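The two structured dark-pattern cases above (ADV-DP-001 and ADV-DP-002) carry enough data to be checked mechanically. As a rough sketch only, and assuming a hypothetical helper `detect_banner_findings` that is not part of this commit, the comparison could look like this:

```python
from typing import List


def _px(value: str) -> float:
    """Parse a CSS pixel size such as '16px' into a number."""
    return float(value[:-2]) if value.endswith("px") else float(value)


def detect_banner_findings(banner: dict) -> List[str]:
    """Return finding types for visual bias and click-count asymmetry."""
    findings: List[str] = []
    accept = banner.get("accept_button", {})
    # The YAML uses reject_button in one case and reject_option in the other.
    reject = banner.get("reject_button") or banner.get("reject_option") or {}

    # Visual bias: the reject control is rendered smaller than the accept control.
    if "size" in accept and "size" in reject:
        if _px(reject["size"]) < _px(accept["size"]):
            findings.append("dark_pattern_visual_bias")

    # Friction asymmetry: rejecting takes more clicks than accepting.
    if "clicks_to_complete" in accept and "clicks_to_complete" in reject:
        if reject["clicks_to_complete"] > accept["clicks_to_complete"]:
            findings.append("dark_pattern_friction_asymmetry")

    return findings
```

The returned strings reuse the `finding_type` values from the YAML, so a test could compare them directly against `expected.finding_type`.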
@@ -0,0 +1,36 @@
"""Shared test fixtures for the control pipeline test suite."""
import os
import sys
import pytest
# Ensure control-pipeline is in path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
@pytest.fixture(scope="session")
def db_session():
"""DB session for integration tests — skip if no DATABASE_URL."""
url = os.getenv("DATABASE_URL")
if not url:
pytest.skip("DATABASE_URL not set — skipping DB tests")
from db.session import SessionLocal
db = SessionLocal()
yield db
db.close()
@pytest.fixture
def sample_controls(db_session):
"""Load 100 random draft controls for regression testing."""
from sqlalchemy import text
rows = db_session.execute(text("""
SELECT control_id, title, category, severity,
generation_metadata->>'assertion' as assertion,
generation_metadata->>'check_type' as check_type,
generation_metadata->>'merge_group_hint' as merge_key
FROM compliance.canonical_controls
WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
ORDER BY random() LIMIT 100
""")).fetchall()
return [dict(r._mapping) for r in rows]
@@ -0,0 +1,190 @@
"""
Adversarial test suite: 30 tricky cases that challenge the control ontology
and dedup engine with edge cases.
Test categories:
A. Wrong legal basis (plausible but incorrect): 8 cases
B. Dark patterns (subtle UI manipulation): 6 cases
C. Almost-complete documents (missing 1 field): 6 cases
D. Semantically similar but different controls: 5 cases
E. Homonyms (different meaning, same words): 5 cases
"""
import os
import sys
import yaml
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from services.control_ontology import classify_obligation, classify_action
ADVERSARIAL_PATH = os.path.join(os.path.dirname(__file__), "adversarial_cases.yaml")
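# The YAML contains non-ASCII characters, so do not rely on the platform default encoding.
with open(ADVERSARIAL_PATH, encoding="utf-8") as f: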
_ADV = yaml.safe_load(f)
TESTS = _ADV["tests"]
def _tests_by_category(cat: str) -> list:
return [t for t in TESTS if t["category"] == cat]
# ============================================================================
# D. Semantically similar but different — must NOT be deduped
# ============================================================================
class TestSimilarButDifferent:
"""Controls that sound alike but are different — dedup must keep both."""
@pytest.mark.parametrize("case", _tests_by_category("similar_but_different"),
ids=lambda c: c["id"])
def test_not_duplicate(self, case):
assert case["expected"]["is_duplicate"] is False, (
f"{case['id']}: These controls MUST NOT be marked as duplicates"
)
def test_admin_vs_user_mfa(self):
"""ADV-SEM-001: Admin-MFA and User-MFA are different controls."""
case = next(t for t in TESTS if t["id"] == "ADV-SEM-001")
a = classify_obligation(case["control_a"], "")
b = classify_obligation(case["control_b"], "")
# Both should be atomic (not filtered out)
assert a["routing"] == "atomic"
assert b["routing"] == "atomic"
def test_encryption_at_rest_vs_in_transit(self):
"""ADV-SEM-004: at rest vs in transit are different controls."""
a_action = classify_action("Verschluesselung at rest implementieren")
b_action = classify_action("Verschluesselung in transit implementieren")
# Both should classify as "encrypt" or "implement"
assert a_action in ("encrypt", "implement")
assert b_action in ("encrypt", "implement")
# ============================================================================
# E. Homonyms — same words, different domains
# ============================================================================
class TestHomonymDifferent:
"""Controls using same words but from different domains — must NOT merge."""
@pytest.mark.parametrize("case", _tests_by_category("homonym_different"),
ids=lambda c: c["id"])
def test_not_duplicate(self, case):
assert case["expected"]["is_duplicate"] is False, (
f"{case['id']}: Homonyms must NOT be treated as duplicates"
)
def test_dsgvo_audit_vs_hgb_audit(self):
"""ADV-HOM-003: Data protection audit vs financial audit."""
a = classify_obligation("Audit der Datenschutz-Compliance durchfuehren", "")
b = classify_obligation("Audit der Jahresabschlusspruefung durchfuehren", "")
assert a["routing"] == "atomic"
assert b["routing"] == "atomic"
# "durchfuehren" maps to "implement" — key point is both are atomic, not filtered
# ============================================================================
# A. Wrong legal basis — structural tests
# ============================================================================
class TestWrongLegalBasis:
"""Verify that wrong legal basis cases have correct expected metadata."""
@pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
ids=lambda c: c["id"])
def test_finding_expected(self, case):
"""All wrong_legal_basis cases must expect a finding."""
assert case["expected"]["finding"] is True
@pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
ids=lambda c: c["id"])
def test_has_correct_basis(self, case):
"""All cases must specify what the correct basis should be."""
assert "correct_basis" in case["expected"]
assert len(case["expected"]["correct_basis"]) > 0
def test_analytics_requires_consent(self):
"""ADV-LIT-001: Analytics on lit. f is always wrong."""
case = next(t for t in TESTS if t["id"] == "ADV-LIT-001")
assert "lit. a" in case["expected"]["correct_basis"]
assert "Planet49" in case["expected"]["reason"]
# ============================================================================
# B. Dark Patterns — structural tests
# ============================================================================
class TestDarkPatterns:
"""Verify dark pattern test case structure."""
@pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
ids=lambda c: c["id"])
def test_finding_expected(self, case):
"""All dark pattern cases must expect a finding."""
assert case["expected"]["finding"] is True
@pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
ids=lambda c: c["id"])
def test_has_finding_type(self, case):
"""All cases must specify the dark pattern type."""
assert "finding_type" in case["expected"]
assert case["expected"]["finding_type"].startswith("dark_pattern_")
# ============================================================================
# C. Incomplete documents — structural tests
# ============================================================================
class TestIncompleteDocuments:
"""Verify incomplete document test case structure."""
@pytest.mark.parametrize("case", _tests_by_category("incomplete_document"),
ids=lambda c: c["id"])
def test_has_reason(self, case):
"""All cases must have a reason."""
assert "reason" in case["expected"]
assert len(case["expected"]["reason"]) > 0
def test_agb_gerichtsstand_no_finding(self):
"""ADV-DOC-005: Missing Gerichtsstand in B2C AGB is NOT a finding."""
case = next(t for t in TESTS if t["id"] == "ADV-DOC-005")
assert case["expected"]["finding"] is False
# ============================================================================
# Meta tests — validate test suite integrity
# ============================================================================
class TestSuiteIntegrity:
"""Verify the adversarial test suite itself is complete and consistent."""
def test_total_count(self):
assert len(TESTS) == 30
def test_unique_ids(self):
ids = [t["id"] for t in TESTS]
assert len(ids) == len(set(ids)), "Duplicate test IDs found"
def test_all_categories_present(self):
categories = {t["category"] for t in TESTS}
expected = {"wrong_legal_basis", "dark_pattern", "incomplete_document",
"similar_but_different", "homonym_different"}
assert categories == expected
def test_category_counts(self):
counts = {}
for t in TESTS:
counts[t["category"]] = counts.get(t["category"], 0) + 1
assert counts["wrong_legal_basis"] == 8
assert counts["dark_pattern"] == 6
assert counts["incomplete_document"] == 6
assert counts["similar_but_different"] == 5
assert counts["homonym_different"] == 5
def test_all_have_difficulty(self):
for t in TESTS:
assert "difficulty" in t, f"{t['id']} missing difficulty"
assert t["difficulty"] in ("easy", "medium", "hard")
@@ -0,0 +1,166 @@
"""Tests for D3: Structural metadata flow (section priority, page in citation)."""
import json
from typing import Optional
from services.rag_client import RAGSearchResult
def _make_chunk(
article: str = "",
paragraph: str = "",
page: Optional[int] = None,
) -> RAGSearchResult:
return RAGSearchResult(
text="Test chunk text",
regulation_code="DSGVO",
regulation_name="Datenschutz-Grundverordnung",
regulation_short="DSGVO",
category="data_protection",
article=article,
paragraph=paragraph,
source_url="https://example.com",
score=0.95,
collection="bp_compliance_de",
page=page,
)
class TestRAGSearchResultPage:
"""RAGSearchResult now carries a page field."""
def test_page_default_none(self):
chunk = _make_chunk()
assert chunk.page is None
def test_page_set(self):
chunk = _make_chunk(page=42)
assert chunk.page == 42
def test_page_zero(self):
chunk = _make_chunk(page=0)
assert chunk.page == 0
class TestQdrantPayloadPriority:
"""section (D2) should take priority over article (legacy)."""
def test_section_preferred_over_article(self):
payload = {"section": "§ 312k", "article": "Art. 312", "section_title": "Kuendigungsbutton"}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == "§ 312k"
def test_article_fallback_when_no_section(self):
payload = {"section": "", "article": "Art. 35", "section_title": ""}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == "Art. 35"
def test_section_title_last_resort(self):
payload = {"section": "", "article": "", "section_title": "Informationspflichten"}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == "Informationspflichten"
def test_all_empty(self):
payload = {"section": "", "article": "", "section_title": ""}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == ""
def test_page_from_payload(self):
payload = {"page": 847}
assert payload.get("page") == 847
def test_page_none_from_payload(self):
payload = {}
assert payload.get("page") is None
class TestSourceCitationPage:
"""source_citation dict should include page when available."""
def _build_citation(self, chunk: RAGSearchResult) -> dict:
"""Mirrors the citation-building logic from control_generator.py."""
return {
"source": chunk.regulation_name,
"article": chunk.article,
"paragraph": chunk.paragraph,
"page": chunk.page,
"license": "free_use",
"source_type": "law",
"url": chunk.source_url or "",
}
def test_citation_with_page(self):
chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1", page=847)
citation = self._build_citation(chunk)
assert citation["page"] == 847
def test_citation_without_page(self):
chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1")
citation = self._build_citation(chunk)
assert citation["page"] is None
def test_citation_serializable(self):
chunk = _make_chunk(article="Art. 35", page=12)
citation = self._build_citation(chunk)
serialized = json.dumps(citation)
restored = json.loads(serialized)
assert restored["page"] == 12
class TestFormatCitation:
"""_format_citation should include page number."""
def _format_citation(self, citation) -> str:
"""Mirrors _format_citation from decomposition_pass.py."""
if not citation:
return ""
if isinstance(citation, str):
try:
c = json.loads(citation)
if isinstance(c, dict):
parts = []
if c.get("source"):
parts.append(c["source"])
if c.get("article"):
parts.append(c["article"])
if c.get("paragraph"):
parts.append(c["paragraph"])
if c.get("page") is not None:
parts.append(f"S. {c['page']}")
return " ".join(parts) if parts else citation
except (json.JSONDecodeError, TypeError):
return citation
return str(citation)
def test_format_with_page(self):
citation = json.dumps({
"source": "DSGVO",
"article": "Art. 35",
"paragraph": "Abs. 1",
"page": 42,
})
result = self._format_citation(citation)
assert result == "DSGVO Art. 35 Abs. 1 S. 42"
def test_format_without_page(self):
citation = json.dumps({
"source": "BGB",
"article": "§ 312k",
"paragraph": "",
})
result = self._format_citation(citation)
assert result == "BGB § 312k"
def test_format_page_zero(self):
citation = json.dumps({
"source": "BGB",
"article": "§ 1",
"paragraph": "",
"page": 0,
})
result = self._format_citation(citation)
assert result == "BGB § 1 S. 0"
def test_format_empty_citation(self):
assert self._format_citation("") == ""
assert self._format_citation(None) == ""
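The tests above inline the `section` > `article` > `section_title` fallback and the page handling. As a small sketch, that logic could be pulled into helpers like the ones below; `resolve_location` and `format_location` are hypothetical names for illustration and are not taken from `control_generator.py` or `decomposition_pass.py`.

```python
from typing import Any, Mapping, Optional, Tuple


def resolve_location(payload: Mapping[str, Any]) -> Tuple[str, Optional[int]]:
    """Return (article, page) using section > article > section_title priority."""
    article = (
        payload.get("section")
        or payload.get("article")
        or payload.get("section_title")
        or ""
    )
    # Read page without `or`, so a legitimate page 0 is not turned into None.
    page = payload.get("page")
    return article, page


def format_location(source: str, payload: Mapping[str, Any]) -> str:
    """Join source, article, and page ("S. <n>") into one citation string."""
    article, page = resolve_location(payload)
    parts = [p for p in (source, article) if p]
    if page is not None:
        parts.append(f"S. {page}")
    return " ".join(parts)
```

The `or` chain treats empty strings like missing keys, which matches the fallback tests, while the explicit `is not None` check preserves page 0 as `test_format_page_zero` expects.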
@@ -0,0 +1,196 @@
"""
Regression tests: verify that pipeline updates don't break existing controls.
Requires: DATABASE_URL environment variable for DB tests.
Structural checks need no DB and always run.
"""
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
# ============================================================================
# Structural tests (no DB needed)
# ============================================================================
class TestOntologyStability:
"""Verify ontology constants haven't accidentally changed."""
def test_action_types_count(self):
from services.control_ontology import ACTION_TYPES
assert len(ACTION_TYPES) >= 26, f"ACTION_TYPES shrank to {len(ACTION_TYPES)}"
def test_phase_order_count(self):
from services.control_ontology import PHASE_ORDER
assert len(PHASE_ORDER) >= 15, f"PHASE_ORDER shrank to {len(PHASE_ORDER)}"
def test_key_action_types_exist(self):
from services.control_ontology import ACTION_TYPES
required = ["define", "implement", "monitor", "test", "prevent", "exclude", "train"]
for action in required:
assert action in ACTION_TYPES, f"Missing action_type: {action}"
def test_classify_action_deterministic(self):
"""Same input must always produce same output."""
from services.control_ontology import classify_action
for _ in range(10):
assert classify_action("implementieren") == "implement"
assert classify_action("überwachen") == "monitor"
assert classify_action("verhindern") == "prevent"
class TestDependencyEngineStability:
"""Verify dependency engine core functions haven't changed behavior."""
def test_evaluate_condition_empty(self):
from services.dependency_engine import evaluate_condition
assert evaluate_condition({}, {}) is True
def test_evaluate_condition_simple(self):
from services.dependency_engine import evaluate_condition
cond = {"field": "source.status", "op": "==", "value": "pass"}
assert evaluate_condition(cond, {"source": {"status": "pass"}}) is True
assert evaluate_condition(cond, {"source": {"status": "fail"}}) is False
def test_apply_effect_not_applicable(self):
from services.dependency_engine import apply_effect
assert apply_effect({"set_status": "not_applicable"}, "fail") == "not_applicable"
def test_default_priorities_unchanged(self):
from services.dependency_engine import DEFAULT_PRIORITIES
assert DEFAULT_PRIORITIES["supersedes"] == 10
assert DEFAULT_PRIORITIES["scope_exclusion"] == 20
assert DEFAULT_PRIORITIES["prerequisite"] == 50
assert DEFAULT_PRIORITIES["compensating_control"] == 80
class TestDocumentComplianceStability:
"""Verify document compliance rules haven't changed."""
def test_basic_website_requires_impressum(self):
from services.document_scope_resolver import resolve_required_documents
result = resolve_required_documents({"has_website": True})
docs = result.get("required_documents", [])
doc_types = [d["document_type"] if isinstance(d, dict) else d.document_type for d in docs]
assert "impressum" in doc_types
assert "privacy_policy" in doc_types
# ============================================================================
# DB tests (require DATABASE_URL)
# ============================================================================
@pytest.mark.skipif(
not os.getenv("DATABASE_URL"),
reason="DATABASE_URL not set"
)
class TestControlCountStability:
"""Draft count must stay within expected range."""
def test_draft_count_minimum(self, db_session):
from sqlalchemy import text
count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
)).scalar()
assert count > 140000, f"Draft count too low: {count} (expected >140k)"
def test_draft_count_maximum(self, db_session):
from sqlalchemy import text
count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
)).scalar()
assert count < 200000, f"Draft count too high: {count} (expected <200k)"
def test_no_null_titles(self, db_session):
from sqlalchemy import text
null_count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
"AND (title IS NULL OR title = '')"
)).scalar()
assert null_count == 0, f"{null_count} controls without title"
def test_assertion_coverage(self, db_session):
from sqlalchemy import text
no_assertion = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
"AND (generation_metadata->>'assertion' IS NULL "
" OR generation_metadata->>'assertion' = '')"
)).scalar()
total = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
)).scalar()
coverage = (total - no_assertion) / max(total, 1) * 100
assert coverage > 99, f"Assertion coverage only {coverage:.1f}% (expected >99%)"
@pytest.mark.skipif(
not os.getenv("DATABASE_URL"),
reason="DATABASE_URL not set"
)
class TestDependencyGraphStability:
"""Dependency graph must be valid and within expected size."""
def test_dependency_count_minimum(self, db_session):
from sqlalchemy import text
count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
)).scalar()
assert count > 10000, f"Too few dependencies: {count} (expected >10k)"
def test_no_self_dependencies(self, db_session):
from sqlalchemy import text
self_deps = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.control_dependencies "
"WHERE source_control_id = target_control_id AND is_active = true"
)).scalar()
assert self_deps == 0, f"{self_deps} self-referencing dependencies"
def test_no_orphan_dependencies(self, db_session):
from sqlalchemy import text
orphans = db_session.execute(text("""
SELECT COUNT(*) FROM compliance.control_dependencies d
WHERE d.is_active = true
AND NOT EXISTS (
SELECT 1 FROM compliance.canonical_controls c
WHERE c.id = d.source_control_id AND c.release_state = 'draft'
)
""")).scalar()
# Some orphans OK (pointing to deprecated/duplicate controls)
assert orphans < 1000, f"Too many orphan dependencies: {orphans}"
@pytest.mark.skipif(
not os.getenv("DATABASE_URL"),
reason="DATABASE_URL not set"
)
class TestQualityMetrics:
"""Quality metrics must stay within target ranges."""
def test_duplicate_rate(self, db_session):
from sqlalchemy import text
total = db_session.execute(text(
"SELECT COUNT(DISTINCT generation_metadata->>'merge_group_hint') "
"FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
"AND generation_metadata->>'merge_group_hint' IS NOT NULL"
)).scalar()
dups = db_session.execute(text("""
SELECT COUNT(*) FROM (
SELECT generation_metadata->>'merge_group_hint', COUNT(*)
FROM compliance.canonical_controls
WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
AND generation_metadata->>'merge_group_hint' IS NOT NULL
GROUP BY generation_metadata->>'merge_group_hint'
HAVING COUNT(*) > 1
) sub
""")).scalar()
rate = dups / max(total, 1) * 100
assert rate < 5, f"Duplicate merge_key rate {rate:.1f}% exceeds 5% threshold"
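The duplicate-rate check above divides the number of merge keys that occur more than once by the number of distinct non-null merge keys. A pure-Python sketch of the same arithmetic follows; the function name `duplicate_rate` is hypothetical, and this is an illustration of the calculation, not code from the pipeline.

```python
from collections import Counter
from typing import Iterable, Optional


def duplicate_rate(merge_keys: Iterable[Optional[str]]) -> float:
    """Percentage of distinct merge keys that are shared by more than one control."""
    # NULL keys are excluded, as in the SQL WHERE clause.
    counts = Counter(k for k in merge_keys if k is not None)
    # Equivalent of GROUP BY ... HAVING COUNT(*) > 1.
    duplicated = sum(1 for n in counts.values() if n > 1)
    # max(..., 1) avoids division by zero on an empty table.
    return duplicated / max(len(counts), 1) * 100
```

Note that the rate counts duplicated keys, not duplicated controls: a key shared by ten controls contributes one to the numerator, just like a key shared by two.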
@@ -0,0 +1,285 @@
"""Tests for RegulationRegistry — DB-backed lookup with cache and fallback."""
import time
from unittest.mock import patch, MagicMock
import pytest
from services.regulation_registry import (
RegulationRegistry,
_CACHE_TTL_SECONDS,
)
# ── Test data: simulates DB rows ──────────────────────────────────────────
_MOCK_DB_ROWS = [
# (regulation_id, regulation_name_de, license_rule, license_type,
# attribution, source_type, jurisdiction, status)
("eu_2016_679", "DSGVO (EU) 2016/679", 1, "EU_LAW",
None, "law", "EU", "active"),
("nist_sp_800_53", "NIST SP 800-53 Rev. 5", 1, "NIST_PUBLIC_DOMAIN",
None, "standard", "US", "active"),
("owasp_asvs", "OWASP ASVS 4.0", 2, "CC-BY-SA-4.0",
"OWASP Foundation, CC BY-SA 4.0", "standard", "INT", "active"),
("bdsg", "Bundesdatenschutzgesetz (BDSG)", 1, "DE_LAW",
None, "law", "DE", "active"),
("at_dsg", "Österreichisches Datenschutzgesetz (DSG)", 1, "AT_LAW",
None, "law", "AT", "active"),
]
def _mock_db_execute(query, *args, **kwargs):
"""Mock that returns our test rows; extra bind parameters are accepted and ignored."""
mock_result = MagicMock()
mock_result.fetchall.return_value = _MOCK_DB_ROWS
return mock_result
@pytest.fixture
def registry():
"""Create a registry with mocked DB."""
reg = RegulationRegistry()
with patch("services.regulation_registry.SessionLocal") as mock_session_cls:
mock_session = MagicMock()
mock_session.execute = _mock_db_execute
mock_session_cls.return_value = mock_session
reg._load()
return reg
# ── classify_regulation tests ─────────────────────────────────────────────
class TestClassifyRegulation:
def test_exact_match_eu_law(self, registry):
result = registry.classify_regulation("eu_2016_679")
assert result["rule"] == 1
assert result["license"] == "EU_LAW"
assert result["source_type"] == "law"
assert result["name"] == "DSGVO (EU) 2016/679"
def test_exact_match_case_insensitive(self, registry):
result = registry.classify_regulation("EU_2016_679")
assert result["rule"] == 1
assert result["name"] == "DSGVO (EU) 2016/679"
def test_exact_match_with_whitespace(self, registry):
result = registry.classify_regulation(" eu_2016_679 ")
assert result["rule"] == 1
def test_nist_standard(self, registry):
result = registry.classify_regulation("nist_sp_800_53")
assert result["rule"] == 1
assert result["source_type"] == "standard"
def test_owasp_rule2(self, registry):
result = registry.classify_regulation("owasp_asvs")
assert result["rule"] == 2
assert result["attribution"] == "OWASP Foundation, CC BY-SA 4.0"
def test_german_law(self, registry):
result = registry.classify_regulation("bdsg")
assert result["rule"] == 1
assert result["source_type"] == "law"
assert result["jurisdiction"] == "DE"
def test_austrian_law(self, registry):
result = registry.classify_regulation("at_dsg")
assert result["rule"] == 1
assert result["jurisdiction"] == "AT"
def test_prefix_enisa_rule2(self, registry):
result = registry.classify_regulation("enisa_supply_chain_2024")
assert result["rule"] == 2
assert result["source_type"] == "standard"
assert "ENISA" in result["attribution"]
def test_prefix_bsi_rule3(self, registry):
result = registry.classify_regulation("bsi_tr_03161")
assert result["rule"] == 3
assert result["source_type"] == "restricted"
assert result["name"] == "INTERNAL_ONLY"
def test_prefix_iso_rule3(self, registry):
result = registry.classify_regulation("iso_27001")
assert result["rule"] == 3
assert result["source_type"] == "restricted"
def test_prefix_etsi_rule3(self, registry):
result = registry.classify_regulation("etsi_en_303_645")
assert result["rule"] == 3
def test_unknown_defaults_to_restricted(self, registry):
result = registry.classify_regulation("some_unknown_regulation")
assert result["rule"] == 3
assert result["source_type"] == "restricted"
assert result["license"] == "UNKNOWN"
# ── source_type_by_name tests ────────────────────────────────────────────
class TestSourceTypeByName:
def test_exact_match_law(self, registry):
result = registry.source_type_by_name("DSGVO (EU) 2016/679")
assert result == "law"
def test_exact_match_standard(self, registry):
result = registry.source_type_by_name("NIST SP 800-53 Rev. 5")
assert result == "standard"
def test_empty_returns_framework(self, registry):
assert registry.source_type_by_name("") == "framework"
assert registry.source_type_by_name(None) == "framework"
def test_heuristic_law(self, registry):
assert registry.source_type_by_name("Verordnung XYZ") == "law"
assert registry.source_type_by_name("Some EU Directive") == "law"
def test_heuristic_guideline(self, registry):
assert registry.source_type_by_name("EDPB Leitlinie 99/2025") == "guideline"
assert registry.source_type_by_name("BSI Standard 200-1") == "guideline"
def test_heuristic_framework(self, registry):
# "ENISA Cloud Guidelines" matches "guideline" before "enisa" in heuristic order
assert registry.source_type_by_name("ENISA Cloud Report") == "framework"
assert registry.source_type_by_name("OWASP Testing Guide") == "framework"
def test_unknown_returns_framework(self, registry):
assert registry.source_type_by_name("Completely Unknown Document") == "framework"
# ── is_open_source tests ────────────────────────────────────────────────
class TestIsOpenSource:
def test_rule1_is_open(self, registry):
assert registry.is_open_source("eu_2016_679") is True
def test_rule2_is_open(self, registry):
assert registry.is_open_source("owasp_asvs") is True
def test_rule3_is_not_open(self, registry):
assert registry.is_open_source("bsi_tr_03161") is False
def test_unknown_is_not_open(self, registry):
assert registry.is_open_source("unknown_thing") is False
# ── Cache behavior tests ────────────────────────────────────────────────
class TestCacheBehavior:
def test_fresh_cache_not_stale(self, registry):
assert registry._is_stale() is False
def test_old_cache_is_stale(self, registry):
registry._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
assert registry._is_stale() is True
def test_ensure_loaded_reloads_when_stale(self):
reg = RegulationRegistry()
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 100 # force stale
load_called = False
def tracking_load():
nonlocal load_called
load_called = True
reg._load = tracking_load
reg._ensure_loaded()
assert load_called, "_load should have been called when cache is stale"
def test_ensure_loaded_skips_when_fresh(self, registry):
with patch.object(registry, "_load") as mock_load:
registry._ensure_loaded()
mock_load.assert_not_called()
# ── Graceful degradation tests ──────────────────────────────────────────
class TestGracefulDegradation:
def test_db_failure_uses_stale_cache(self):
"""If DB fails, stale cache entries are still usable."""
reg = RegulationRegistry()
# First load succeeds
with patch("services.regulation_registry.SessionLocal") as mock_cls:
mock_session = MagicMock()
mock_session.execute = _mock_db_execute
mock_cls.return_value = mock_session
reg._load()
# Force stale
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
# Second load fails — DB error
from sqlalchemy.exc import OperationalError
with patch("services.regulation_registry.SessionLocal") as mock_cls:
mock_cls.side_effect = OperationalError("connection refused", None, None)
reg._ensure_loaded()
# Should still have cached data
result = reg.classify_regulation("eu_2016_679")
assert result["rule"] == 1
def test_empty_registry_returns_unknown(self):
"""Unloaded registry returns safe defaults."""
reg = RegulationRegistry()
reg._loaded_at = time.monotonic() # pretend fresh but empty
result = reg.classify_regulation("eu_2016_679")
assert result["rule"] == 3 # safe default
assert result["license"] == "UNKNOWN"
# ── Migration data consistency tests ────────────────────────────────────
class TestMigrationDataConsistency:
"""Verify that the migration script produces valid data."""
def test_build_rows_produces_data(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
assert len(rows) > 100 # at least 100 entries
def test_all_rows_have_required_fields(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
for row in rows:
assert row["regulation_id"], f"Missing regulation_id: {row}"
assert row["regulation_name_de"], f"Missing name: {row}"
assert row["license_rule"] in (1, 2, 3), f"Bad rule: {row}"
assert row["source_type"] in (
"law", "guideline", "standard", "framework", "restricted"
), f"Bad source_type: {row}"
assert row["jurisdiction"], f"Missing jurisdiction: {row}"
assert row["status"] in ("active", "needs_review", "deprecated")
def test_no_duplicate_regulation_ids(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
ids = [r["regulation_id"] for r in rows]
assert len(ids) == len(set(ids)), f"Duplicates: {[x for x in ids if ids.count(x) > 1]}"
def test_known_regulations_present(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
ids = {r["regulation_id"] for r in rows}
assert "eu_2016_679" in ids # DSGVO
assert "bdsg" in ids # BDSG
assert "nist_sp_800_53" in ids # NIST
assert "owasp_asvs" in ids # OWASP
def test_owasp_has_attribution(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
owasp = [r for r in rows if r["regulation_id"] == "owasp_asvs"][0]
assert owasp["attribution"] is not None
assert "OWASP" in owasp["attribution"]
assert owasp["license_rule"] == 2
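The cache and degradation tests above describe a TTL cache that keeps serving old entries when a reload fails. A minimal sketch of that pattern follows, assuming a 5-minute TTL as stated in the commit message. `TTLCache`, its `loader` callback and the default dict are hypothetical names for illustration and are not the actual `RegulationRegistry` implementation.

```python
import time
from typing import Callable, Dict, Optional


class TTLCache:
    """Hypothetical TTL cache that keeps stale data when a reload fails."""

    def __init__(self, loader: Callable[[], Dict[str, dict]], ttl_seconds: float = 300.0):
        self._loader = loader          # assumed to return {regulation_id: row}
        self._ttl = ttl_seconds
        self._data: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def _is_stale(self) -> bool:
        if self._loaded_at is None:
            return True
        return time.monotonic() - self._loaded_at > self._ttl

    def _ensure_loaded(self) -> None:
        if not self._is_stale():
            return
        try:
            self._data = self._loader()
            self._loaded_at = time.monotonic()
        except Exception:
            # Reload failed: keep the previously cached entries untouched.
            pass

    def get(self, key: str) -> dict:
        self._ensure_loaded()
        # Unknown keys fall back to the most restrictive default.
        return self._data.get(key.strip().lower(), {"rule": 3, "license": "UNKNOWN"})
```

Because a failed reload leaves `_loaded_at` unchanged, the next lookup retries the loader instead of waiting for another full TTL.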
+7 -3
@@ -413,8 +413,12 @@ services:
embedding-service:
condition: service_healthy
healthcheck:
- disable: true
- restart: "no"
+ test: ["CMD", "curl", "-f", "http://127.0.0.1:8098/health"]
+ interval: 60s
+ timeout: 30s
+ retries: 10
+ start_period: 30s
+ restart: unless-stopped
networks:
- breakpilot-network
@@ -430,7 +434,7 @@ services:
EMBEDDING_BACKEND: ${EMBEDDING_BACKEND:-local}
LOCAL_EMBEDDING_MODEL: ${LOCAL_EMBEDDING_MODEL:-BAAI/bge-m3}
LOCAL_RERANKER_MODEL: ${LOCAL_RERANKER_MODEL:-cross-encoder/ms-marco-MiniLM-L-6-v2}
- PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-pymupdf}
+ PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-auto}
OPENAI_API_KEY: ${OPENAI_API_KEY:-}
COHERE_API_KEY: ${COHERE_API_KEY:-}
LOG_LEVEL: ${LOG_LEVEL:-INFO}
+239 -18
@@ -10,8 +10,9 @@ Provides REST endpoints for:
This service handles all ML-heavy operations, keeping the main klausur-service lightweight.
"""
import os
import logging
import re
import unicodedata
from typing import List, Optional
from contextlib import asynccontextmanager
@@ -106,8 +107,19 @@ class ChunkRequest(BaseModel):
strategy: str = Field(default="semantic", description="Chunking strategy: semantic or recursive")
class ChunkMetadata(BaseModel):
text: str
section: str = ""
section_title: str = ""
paragraph: str = ""
paragraph_num: Optional[int] = None
page: Optional[int] = None
index: int = 0
class ChunkResponse(BaseModel):
chunks: List[str]
chunks_with_metadata: Optional[List[dict]] = None
count: int
strategy: str
@@ -270,9 +282,7 @@ ENGLISH_ABBREVIATIONS = {
# Combined abbreviations for both languages
ALL_ABBREVIATIONS = GERMAN_ABBREVIATIONS | ENGLISH_ABBREVIATIONS
- # Regex pattern for legal section headers (§, Art., Article, Section, etc.)
- import re
+ # Regex pattern for legal/standard section headers
_LEGAL_SECTION_RE = re.compile(
r'^(?:'
r'§\s*\d+' # § 25, § 5a
@@ -287,6 +297,15 @@ _LEGAL_SECTION_RE = re.compile(
r'|Part\s+[IVXLC\d]+' # Part III
r'|Recital\s+\d+' # Recital 42
r'|Erwaegungsgrund\s+\d+' # Erwaegungsgrund 26
# NIST/ENISA/standard numbering
r'|\d+\.\d+(?:\.\d+)*\s+[A-ZÄÖÜ]' # 1.1 Title, 2.3.1 Subtitle
r'|[A-Z]{2,4}[-\.]\d+(?:\.\d+)*\b' # AC-1, AU-2, PO.1, PW.1.1
r'|[A-Z]{2}\.[A-Z]{2}-\d{2}\b' # GV.OC-01 (NIST CSF 2.0)
r'|[A-Z]{2,4}-\d+\(\d+\)' # AC-1(1) (NIST enhancements)
r'|A\d{2}(?::\d{4})?\b' # A01:2021 (OWASP Top 10)
r'|Table\s+\d+' # Table 1, Table A-1
r'|Figure\s+\d+' # Figure 1
r'|Appendix\s+[A-Z\d]' # Appendix A, Appendix 1
r')',
re.IGNORECASE | re.MULTILINE
)
@@ -300,6 +319,10 @@ _HEADING_RE = re.compile(
re.MULTILINE
)
# Case-sensitive: single-number + ALL-CAPS title (e.g., "1. INTRODUCTION")
# Separate regex because _LEGAL_SECTION_RE uses re.IGNORECASE
_SINGLE_NUM_ALLCAPS_RE = re.compile(r'^\d+\.\s+[A-Z][A-Z\s]{4,}')
def _detect_language(text: str) -> str:
"""Simple heuristic: count German vs English marker words."""
@@ -349,17 +372,103 @@ def _split_sentences(text: str) -> List[str]:
return sentences
# Regex for paragraph/subsection references within text
_PARAGRAPH_RE = re.compile(
r'(?:'
r'Abs(?:atz|\.)\s*(\d+)' # Abs. 1, Absatz 2
r'|Nr\.\s*(\d+)' # Nr. 3
r'|Satz\s+(\d+)' # Satz 1
r'|lit\.\s*([a-z])' # lit. a
r'|\((\d+)\)' # (1), (2)
r')',
re.IGNORECASE
)
# Regex to extract section number from header
_SECTION_NUMBER_RE = re.compile(
r'(?:'
r'§\s*(\d+[a-z]*)' # § 25, § 312k
r'|Art(?:ikel|icle|\.)\s*(\d+)' # Artikel 5, Art. 3
r'|Section\s+(\d[\d.]*)' # Section 4.2
r'|Kapitel\s+(\d+)' # Kapitel 2
r'|Anhang\s+([IVXLC\d]+)' # Anhang III
r'|Annex\s+([IVXLC\d]+)' # Annex XII
# NIST/ENISA/standard identifiers
r'|([A-Z]{2}\.[A-Z]{2}-\d{2})' # GV.OC-01 (NIST CSF 2.0)
r'|([A-Z]{2,4}-\d+(?:\(\d+\))?)' # AC-1, AC-1(1) (NIST controls)
r'|(\d+\.\d+(?:\.\d+)*)' # 3.1, 2.3.1 (numbered sections)
r'|(\d+)(?=\.\s+[A-Z]{5,})' # 1 (from "1. INTRODUCTION", case-sensitive below)
r'|(A\d{2}(?::\d{4})?)' # A01:2021 (OWASP)
r')',
re.IGNORECASE
)
def _extract_section_header(line: str) -> Optional[str]:
"""Extract a legal section header from a line, or None."""
- m = _LEGAL_SECTION_RE.match(line.strip())
+ stripped = line.strip()
+ m = _LEGAL_SECTION_RE.match(stripped)
if m:
- return line.strip()
- m = _HEADING_RE.match(line.strip())
+ return stripped
+ # Case-sensitive check for "1. INTRODUCTION" style (ENISA/BSI docs)
+ if _SINGLE_NUM_ALLCAPS_RE.match(stripped):
+ return stripped
+ m = _HEADING_RE.match(stripped)
if m:
- return line.strip()
+ return stripped
return None
def _parse_section_metadata(header: str) -> dict:
"""Parse a section header into structured metadata.
Returns: {"section": "§ 312k", "section_title": "Kuendigungsbutton"}
"""
if not header:
return {"section": "", "section_title": ""}
m = _SECTION_NUMBER_RE.search(header)
section = ""
if m:
# Find which group matched
for i, g in enumerate(m.groups(), 1):
if g:
section = header[m.start():m.end()].strip()
break
# Title = everything after the section number
title = header
if section:
idx = header.find(section)
if idx >= 0:
title = header[idx + len(section):].strip()
# Remove leading punctuation/whitespace
title = title.lstrip(' .-–—:')
return {"section": section, "section_title": title.strip()}
def _extract_paragraph_ref(text: str) -> dict:
"""Extract paragraph/subsection reference from chunk text.
Returns: {"paragraph": "Abs. 1", "paragraph_num": 1}
"""
m = _PARAGRAPH_RE.search(text[:200]) # Only search first 200 chars
if not m:
return {"paragraph": "", "paragraph_num": None}
for i, g in enumerate(m.groups(), 1):
if g:
ref = text[m.start():m.end()].strip()
try:
num = int(g)
except ValueError:
num = ord(g.lower()) - ord('a') + 1 # lit. a = 1, b = 2
return {"paragraph": ref, "paragraph_num": num}
return {"paragraph": "", "paragraph_num": None}
def chunk_text_legal(text: str, chunk_size: int, overlap: int) -> List[str]:
"""
Legal-document-aware chunking.
@@ -488,12 +597,51 @@ def chunk_text_legal(text: str, chunk_size: int, overlap: int) -> List[str]:
if space_idx > 0:
overlap_text = overlap_text[space_idx + 1:]
if overlap_text:
- chunk = overlap_text + ' ' + chunk
+ # Insert overlap AFTER the [§ ...] prefix to preserve it
+ # for structured metadata extraction
+ prefix_match = re.match(r'\[.+?\]\s*', chunk)
+ if prefix_match:
+ pos = prefix_match.end()
+ chunk = chunk[:pos] + overlap_text + ' ' + chunk[pos:]
+ else:
+ chunk = overlap_text + ' ' + chunk
final_chunks.append(chunk.strip())
return [c for c in final_chunks if c]
def chunk_text_legal_structured(text: str, chunk_size: int, overlap: int) -> List[dict]:
"""Legal-aware chunking that returns structured metadata per chunk.
Returns list of dicts with: text, section, section_title, paragraph, paragraph_num, index.
Uses the same splitting logic as chunk_text_legal but extracts metadata.
"""
plain_chunks = chunk_text_legal(text, chunk_size, overlap)
# Track which section each chunk belongs to by re-parsing the prefix
structured = []
for i, chunk_text in enumerate(plain_chunks):
meta = {"text": chunk_text, "section": "", "section_title": "",
"paragraph": "", "paragraph_num": None, "page": None, "index": i}
# Extract section from the [§ 25 Title] prefix that chunk_text_legal adds
prefix_match = re.match(r'^\[(.+?)\]\s*', chunk_text)
if prefix_match:
header = prefix_match.group(1)
section_meta = _parse_section_metadata(header)
meta["section"] = section_meta["section"]
meta["section_title"] = section_meta["section_title"]
# Extract paragraph reference from chunk content
para_meta = _extract_paragraph_ref(chunk_text)
meta["paragraph"] = para_meta["paragraph"]
meta["paragraph_num"] = para_meta["paragraph_num"]
structured.append(meta)
return structured
def chunk_text_recursive(text: str, chunk_size: int, overlap: int) -> List[str]:
"""Recursive character-based chunking (legacy, use legal_recursive for legal docs)."""
if not text or len(text) <= chunk_size:
@@ -621,13 +769,19 @@ def detect_pdf_backends() -> List[str]:
available = []
try:
- from unstructured.partition.pdf import partition_pdf
+ from unstructured.partition.pdf import partition_pdf # noqa: F401
available.append("unstructured")
except ImportError:
pass
try:
- from pypdf import PdfReader
+ import pdfplumber # noqa: F401
+ available.append("pdfplumber")
+ except ImportError:
+ pass
+ try:
+ from pypdf import PdfReader # noqa: F401
available.append("pypdf")
except ImportError:
pass
@@ -687,12 +841,64 @@ def extract_pdf_unstructured(pdf_content: bytes) -> ExtractPDFResponse:
import os as os_module
try:
os_module.unlink(tmp_path)
- except:
+ except OSError:
pass
def _normalize_pdf_text(text: str) -> str:
"""Fix broken spacing from multi-column PDF extraction.
pdfplumber/pypdf often break section numbers in multi-column NIST/BSI/ENISA
PDFs: "1 . 1" instead of "1.1", "AC - 1" instead of "AC-1".
"""
# Unicode NFKC: decompose ligatures (ﬁ → fi) before other fixes
text = unicodedata.normalize('NFKC', text)
# Remove soft hyphens and zero-width spaces
text = text.replace('\u00ad', '').replace('\u200b', '')
# "1 . 1" → "1.1" (broken section numbers, apply repeatedly for nested)
prev = None
while prev != text:
prev = text
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
# "AC - 1" → "AC-1" (broken NIST control IDs, 2-4 uppercase letters)
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
# "GV . OC - 01" → "GV.OC-01" (NIST CSF 2.0 compound IDs)
text = re.sub(
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
)
# "AC - 1 ( 1 )" → "AC-1(1)" (NIST enhancements with spaced parens)
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
# Collapse multiple horizontal spaces (keep newlines)
text = re.sub(r'[^\S\n]{2,}', ' ', text)
return text
def extract_pdf_pdfplumber(pdf_content: bytes) -> ExtractPDFResponse:
"""Extract PDF using pdfplumber (best for multi-column EU regulation PDFs)."""
import io
import pdfplumber
pdf_file = io.BytesIO(pdf_content)
text_parts = []
page_count = 0
with pdfplumber.open(pdf_file) as pdf:
page_count = len(pdf.pages)
for page in pdf.pages:
text = page.extract_text(x_tolerance=3, y_tolerance=4)
if text:
text_parts.append(text)
return ExtractPDFResponse(
text=_normalize_pdf_text("\n\n".join(text_parts)),
backend_used="pdfplumber",
pages=page_count,
table_count=0,
)
def extract_pdf_pypdf(pdf_content: bytes) -> ExtractPDFResponse:
- """Extract PDF using pypdf."""
+ """Extract PDF using pypdf (fallback)."""
import io
from pypdf import PdfReader
@@ -706,7 +912,7 @@ def extract_pdf_pypdf(pdf_content: bytes) -> ExtractPDFResponse:
text_parts.append(text)
return ExtractPDFResponse(
- text="\n\n".join(text_parts),
+ text=_normalize_pdf_text("\n\n".join(text_parts)),
backend_used="pypdf",
pages=len(reader.pages),
table_count=0
@@ -879,15 +1085,22 @@ async def chunk_text(request: ChunkRequest):
if request.strategy == "semantic":
overlap_sentences = max(1, request.overlap // 100)
chunks = chunk_text_semantic(request.text, request.chunk_size, overlap_sentences)
return ChunkResponse(
chunks=chunks,
count=len(chunks),
strategy=request.strategy,
)
else:
- # All strategies (recursive, legal_recursive, etc.) use the legal-aware chunker.
- # The old plain recursive chunker is no longer exposed via the API.
+ # All strategies use the legal-aware chunker
chunks = chunk_text_legal(request.text, request.chunk_size, request.overlap)
# Also generate structured metadata
structured = chunk_text_legal_structured(request.text, request.chunk_size, request.overlap)
return ChunkResponse(
chunks=chunks,
chunks_with_metadata=structured,
count=len(chunks),
- strategy=request.strategy
+ strategy=request.strategy,
)
except Exception as e:
logger.error(f"Chunking error: {e}")
@@ -908,11 +1121,19 @@ async def extract_pdf(file: UploadFile = File(...)):
backend = config.PDF_EXTRACTION_BACKEND
if backend == "auto":
- backend = "unstructured" if "unstructured" in available else "pypdf"
+ # Prefer: unstructured > pdfplumber > pypdf
+ if "unstructured" in available:
+ backend = "unstructured"
+ elif "pdfplumber" in available:
+ backend = "pdfplumber"
+ else:
+ backend = "pypdf"
try:
if backend == "unstructured" and "unstructured" in available:
return extract_pdf_unstructured(pdf_content)
elif backend == "pdfplumber" and "pdfplumber" in available:
return extract_pdf_pdfplumber(pdf_content)
elif "pypdf" in available:
return extract_pdf_pypdf(pdf_content)
else:
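The backend selection in the hunk above can be read as a small pure function. The sketch below is a hedged restatement of that preference order (unstructured, then pdfplumber, then pypdf). `resolve_backend` and `PREFERENCE` are hypothetical names, and returning `None` stands in for the final `else` branch, which is cut off in the diff.

```python
from typing import List, Optional

# Preference order taken from the "Prefer: unstructured > pdfplumber > pypdf" comment.
PREFERENCE = ["unstructured", "pdfplumber", "pypdf"]


def resolve_backend(configured: str, available: List[str]) -> Optional[str]:
    """Return the backend that would handle the request, or None if none is usable."""
    backend = configured
    if backend == "auto":
        # First installed backend in preference order; "pypdf" if nothing matched.
        backend = next((b for b in PREFERENCE if b in available), "pypdf")
    if backend in ("unstructured", "pdfplumber") and backend in available:
        return backend
    # Any other case falls through to pypdf when it is installed.
    if "pypdf" in available:
        return "pypdf"
    return None  # placeholder: the source's last branch is not visible in the diff
```

Note that an explicitly configured backend that is not installed silently falls back to pypdf under this reading, which matches the `elif "pypdf" in available` branch shown above.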
+1
@@ -14,6 +14,7 @@ sentence-transformers>=2.2.0
# PDF Extraction
unstructured>=0.12.0
pypdf>=4.0.0
pdfplumber>=0.11.0
python-magic>=0.4.27
# HTTP Client (for OpenAI/Cohere API calls)
-1
@@ -11,7 +11,6 @@ Covers:
- Long sentence force-splitting
"""
- import pytest
from main import (
chunk_text_legal,
chunk_text_recursive,
+217
@@ -0,0 +1,217 @@
"""
D4 Validation: BGB § 312k structural chunking test.
Tests that real German legal text is correctly chunked with structural
metadata (section, section_title, paragraph, paragraph_num).
This is the gate test before re-ingesting all 297 legal sources.
"""
import os
import pytest
from main import chunk_text_legal, chunk_text_legal_structured
FIXTURE_PATH = os.path.join(
os.path.dirname(__file__), "tests", "fixtures", "bgb_312_excerpt.txt"
)
# Reasonable defaults for legal text
CHUNK_SIZE = 1500
OVERLAP = 100
@pytest.fixture
def bgb_text():
with open(FIXTURE_PATH, encoding="utf-8") as f:
return f.read()
@pytest.fixture
def plain_chunks(bgb_text):
return chunk_text_legal(bgb_text, CHUNK_SIZE, OVERLAP)
@pytest.fixture
def structured_chunks(bgb_text):
return chunk_text_legal_structured(bgb_text, CHUNK_SIZE, OVERLAP)
# =========================================================================
# Basic sanity
# =========================================================================
class TestChunkingSanity:
def test_fixture_loads(self, bgb_text):
assert len(bgb_text) > 2000, "BGB excerpt should be substantial"
assert "§ 312k" in bgb_text
assert "§ 312 " in bgb_text
def test_chunk_count_reasonable(self, plain_chunks):
assert 4 <= len(plain_chunks) <= 30, (
f"Expected 4-30 chunks, got {len(plain_chunks)}"
)
def test_structured_same_count(self, plain_chunks, structured_chunks):
assert len(plain_chunks) == len(structured_chunks)
def test_no_empty_chunks(self, plain_chunks):
for i, chunk in enumerate(plain_chunks):
assert chunk.strip(), f"Chunk {i} is empty"
def test_chunk_sizes_reasonable(self, plain_chunks):
for i, chunk in enumerate(plain_chunks):
assert len(chunk) < 3000, f"Chunk {i} too large: {len(chunk)} chars"
assert len(chunk) > 30, f"Chunk {i} too small: {len(chunk)} chars"
# =========================================================================
# Section detection
# =========================================================================
class TestSectionDetection:
def test_all_four_sections_detected(self, structured_chunks):
"""All 4 BGB sections should appear as section metadata."""
found_sections = set()
for meta in structured_chunks:
if meta["section"]:
found_sections.add(meta["section"])
assert "§ 312" in found_sections or any(
s.startswith("§ 312") and s != "§ 312a" and s != "§ 312g" and s != "§ 312k"
for s in found_sections
), f"§ 312 not found. Sections: {found_sections}"
assert "§ 312a" in found_sections, f"§ 312a not found. Sections: {found_sections}"
assert "§ 312g" in found_sections, f"§ 312g not found. Sections: {found_sections}"
assert "§ 312k" in found_sections, f"§ 312k not found. Sections: {found_sections}"
def test_section_prefix_in_chunks(self, plain_chunks):
"""Most chunks should have [§ ...] prefix."""
prefixed = sum(1 for c in plain_chunks if c.startswith("["))
ratio = prefixed / len(plain_chunks)
assert ratio >= 0.8, (
f"Only {ratio:.0%} chunks have section prefix (expected >= 80%)"
)
def test_312k_has_own_chunk(self, plain_chunks):
"""§ 312k must appear as a chunk section header, not merged into another §."""
chunks_with_312k = [c for c in plain_chunks if "[§ 312k" in c]
assert len(chunks_with_312k) >= 1, (
"§ 312k should have at least 1 dedicated chunk"
)
# =========================================================================
# § 312k specific metadata
# =========================================================================
class TestSection312k:
def _312k_chunks(self, structured_chunks):
return [m for m in structured_chunks if m["section"] == "§ 312k"]
def test_312k_section_metadata(self, structured_chunks):
"""§ 312k chunks should have section='§ 312k' with a title."""
chunks = self._312k_chunks(structured_chunks)
assert len(chunks) >= 1, "No chunks with section='§ 312k'"
for meta in chunks:
assert meta["section"] == "§ 312k"
# Title should contain key words
title = meta["section_title"].lower()
assert "kuendigung" in title or "verbrauchervertrae" in title, (
f"Unexpected section_title: {meta['section_title']}"
)
def test_312k_paragraph_extraction(self, structured_chunks):
"""At least some § 312k chunks should have paragraph references."""
chunks = self._312k_chunks(structured_chunks)
paragraphs_found = [m["paragraph"] for m in chunks if m["paragraph"]]
# § 312k has (1) through (6), at least some should be detected
assert len(paragraphs_found) >= 1, (
"No paragraph references found in § 312k chunks"
)
def test_312k_content_present(self, structured_chunks):
"""§ 312k chunk text should contain key legal terms."""
chunks = self._312k_chunks(structured_chunks)
all_text = " ".join(m["text"] for m in chunks)
assert "Kuendigungsschaltflaeche" in all_text or "kuendigen" in all_text.lower()
assert "Webseite" in all_text or "elektronischen" in all_text
def test_312k_not_merged_with_312g(self, structured_chunks):
"""§ 312k and § 312g should be separate sections, not merged."""
sections_312g = [m for m in structured_chunks if m["section"] == "§ 312g"]
sections_312k = self._312k_chunks(structured_chunks)
assert len(sections_312g) >= 1, "§ 312g missing"
assert len(sections_312k) >= 1, "§ 312k missing"
# Verify they are different chunks (no overlap in indices)
g_indices = {m["index"] for m in sections_312g}
k_indices = {m["index"] for m in sections_312k}
assert g_indices.isdisjoint(k_indices), (
f"§ 312g and § 312k share chunk indices: {g_indices & k_indices}"
)
# =========================================================================
# Metadata quality across all sections
# =========================================================================
class TestMetadataQuality:
def test_most_chunks_have_section(self, structured_chunks):
"""At least 90% of chunks should have a section reference."""
with_section = sum(1 for m in structured_chunks if m["section"])
ratio = with_section / len(structured_chunks)
assert ratio >= 0.9, (
f"Only {ratio:.0%} chunks have section metadata (expected >= 90%)"
)
def test_section_titles_not_empty(self, structured_chunks):
"""Chunks with a section should also have a section_title."""
for meta in structured_chunks:
if meta["section"]:
assert meta["section_title"], (
f"Chunk {meta['index']} has section={meta['section']} but no title"
)
def test_paragraph_nums_are_integers(self, structured_chunks):
"""paragraph_num should be int or None, never str."""
for meta in structured_chunks:
pn = meta["paragraph_num"]
assert pn is None or isinstance(pn, int), (
f"Chunk {meta['index']}: paragraph_num={pn!r} (type={type(pn).__name__})"
)
def test_indices_sequential(self, structured_chunks):
"""Chunk indices should be 0, 1, 2, ... in order."""
for i, meta in enumerate(structured_chunks):
assert meta["index"] == i, (
f"Expected index {i}, got {meta['index']}"
)
# =========================================================================
# Edge cases
# =========================================================================
class TestEdgeCases:
def test_numbered_list_not_false_section(self, structured_chunks):
"""Numbered items (1., 2., 3.) inside a § should NOT create new sections."""
for meta in structured_chunks:
section = meta["section"]
# Section should always start with § or be empty
if section:
assert section.startswith("§"), (
f"Unexpected section format: {section!r}"
)
def test_subsection_letters_preserved(self, plain_chunks):
"""Lettered subsections (a, b, c, d, e) in § 312k(2) should be in the text."""
all_text = " ".join(plain_chunks)
# § 312k Abs 2 Nr 1 has a) through e)
for letter in ["a)", "b)", "c)", "d)", "e)"]:
assert letter in all_text, (
f"Subsection letter {letter} from § 312k(2) missing"
)
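The BGB tests above rely on each chunk carrying a `[§ ... Title]` prefix from which section and title are recovered. A minimal sketch of that step for the `§` case only is shown below. `split_prefix` and both regexes are hypothetical simplifications and are not the `_parse_section_metadata` implementation from the diff, which handles many more header styles.

```python
import re
from typing import Dict

# Bracketed prefix at the start of a chunk, e.g. "[§ 312k Title] text ..."
_PREFIX_RE = re.compile(r'^\[(.+?)\]\s*')
# German section number such as "§ 312" or "§ 312k"
_SECTION_RE = re.compile(r'§\s*\d+[a-z]*')


def split_prefix(chunk: str) -> Dict[str, str]:
    """Split a chunk's bracketed prefix into section number and title."""
    m = _PREFIX_RE.match(chunk)
    if not m:
        return {"section": "", "section_title": ""}
    header = m.group(1)
    num = _SECTION_RE.search(header)
    if not num:
        # Prefix without a recognizable "§" number: treat it all as the title.
        return {"section": "", "section_title": header.strip()}
    section = num.group(0).strip()
    # Title is whatever follows the number, minus leading punctuation.
    title = header[num.end():].lstrip(" .-:").strip()
    return {"section": section, "section_title": title}
```

For example, `split_prefix("[§ 312k Kuendigung von Verbrauchervertraegen] (1) ...")` yields the section `§ 312k` and the title `Kuendigung von Verbrauchervertraegen`.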
@@ -0,0 +1,248 @@
"""
Tests for NIST/BSI/ENISA PDF text normalization and section detection.
Covers:
- _normalize_pdf_text() fixing broken multi-column PDF artifacts
- Section detection after normalization
- NIST CSF 2.0 compound IDs (GV.OC-01)
- NIST SP 800-53 control IDs (AC-1, AC-1(1))
- OWASP Top 10 IDs (A01:2021)
- Unicode normalization (ligatures, soft hyphens)
"""
from main import (
_normalize_pdf_text,
_extract_section_header,
_parse_section_metadata,
chunk_text_legal,
chunk_text_legal_structured,
)
# =========================================================================
# _normalize_pdf_text — broken spacing fixes
# =========================================================================
class TestNormalizePdfText:
def test_broken_section_number(self):
assert _normalize_pdf_text("1 . 1 Risk Framing") == "1.1 Risk Framing"
def test_nested_section_number(self):
assert _normalize_pdf_text("2 . 3 . 1 Subtitle") == "2.3.1 Subtitle"
def test_broken_nist_control_id(self):
assert _normalize_pdf_text("AC - 1 Account Management") == "AC-1 Account Management"
def test_broken_nist_control_au(self):
assert _normalize_pdf_text("AU - 2 Audit Events") == "AU-2 Audit Events"
def test_broken_csf_compound_id(self):
assert _normalize_pdf_text("GV . OC - 01 Context") == "GV.OC-01 Context"
def test_broken_enhancement_parens(self):
assert _normalize_pdf_text("AC-1( 1 ) Enhancement") == "AC-1(1) Enhancement"
def test_soft_hyphen_removed(self):
assert _normalize_pdf_text("infor\u00admation") == "information"
def test_zero_width_space_removed(self):
assert _normalize_pdf_text("data\u200bprotection") == "dataprotection"
def test_ligature_fi_normalized(self):
# U+FB01 = fi ligature
assert _normalize_pdf_text("con\ufb01dential") == "confidential"
def test_ligature_fl_normalized(self):
# U+FB02 = fl ligature
assert _normalize_pdf_text("over\ufb02ow") == "overflow"
def test_multiple_spaces_collapsed(self):
assert _normalize_pdf_text("too   many    spaces") == "too many spaces"
def test_newlines_preserved(self):
result = _normalize_pdf_text("line one\nline two\n\nline three")
assert "\n" in result
assert "line one" in result
assert "line three" in result
def test_normal_text_unchanged(self):
text = "AC-1 Account Management requires proper controls."
assert _normalize_pdf_text(text) == text
def test_combined_artifacts(self):
"""Multiple broken artifacts in one text block."""
broken = "1 . 1 Overview\nAC - 1 Account Management\nGV . OC - 01 Context"
fixed = _normalize_pdf_text(broken)
assert "1.1 Overview" in fixed
assert "AC-1 Account Management" in fixed
assert "GV.OC-01 Context" in fixed
# =========================================================================
# Section detection after normalization
# =========================================================================
class TestNistSectionDetection:
def test_nist_control_ac1(self):
assert _extract_section_header("AC-1 Account Management") is not None
def test_nist_control_au2(self):
assert _extract_section_header("AU-2 Audit Events") is not None
def test_nist_csf_compound(self):
assert _extract_section_header("GV.OC-01 Organizational Context") is not None
def test_nist_enhancement(self):
assert _extract_section_header("AC-1(1) Policy and Procedures") is not None
def test_owasp_top10(self):
assert _extract_section_header("A01:2021 Broken Access Control") is not None
def test_owasp_without_year(self):
assert _extract_section_header("A03 Injection") is not None
def test_numbered_section(self):
assert _extract_section_header("2.1 Risk Framing") is not None
def test_deep_numbered_section(self):
assert _extract_section_header("3.2.1 Assessment Methodology") is not None
def test_broken_then_normalized_detects(self):
"""After normalization, broken NIST IDs should be detected as sections."""
broken = "AC - 1 Account Management"
normalized = _normalize_pdf_text(broken)
assert _extract_section_header(normalized) is not None
def test_broken_csf_then_normalized_detects(self):
broken = "GV . OC - 01 Organizational Context"
normalized = _normalize_pdf_text(broken)
assert _extract_section_header(normalized) is not None
def test_broken_section_num_then_normalized(self):
broken = "2 . 1 Risk Framing"
normalized = _normalize_pdf_text(broken)
assert _extract_section_header(normalized) is not None
# =========================================================================
# Section metadata extraction (_parse_section_metadata)
# =========================================================================
class TestNistSectionMetadata:
def test_nist_control_ac1_section(self):
meta = _parse_section_metadata("AC-1 POLICY AND PROCEDURES")
assert meta["section"] == "AC-1"
def test_nist_control_au2_section(self):
meta = _parse_section_metadata("AU-2 Audit Events")
assert meta["section"] == "AU-2"
def test_nist_enhancement_section(self):
meta = _parse_section_metadata("AC-1(1) Policy and Procedures")
assert meta["section"] == "AC-1(1)"
def test_nist_csf_compound_section(self):
meta = _parse_section_metadata("GV.OC-01 Organizational Context")
assert meta["section"] == "GV.OC-01"
def test_numbered_section(self):
meta = _parse_section_metadata("3.1 ACCESS CONTROL")
assert meta["section"] == "3.1"
def test_deep_numbered_section(self):
meta = _parse_section_metadata("2.3.1 Subtitle")
assert meta["section"] == "2.3.1"
def test_owasp_section(self):
meta = _parse_section_metadata("A01:2021 Broken Access Control")
assert meta["section"] == "A01:2021"
def test_section_title_extracted(self):
meta = _parse_section_metadata("AC-1 POLICY AND PROCEDURES")
assert meta["section_title"] == "POLICY AND PROCEDURES"
def test_numbered_section_title(self):
meta = _parse_section_metadata("3.1 ACCESS CONTROL")
assert meta["section_title"] == "ACCESS CONTROL"
def test_single_number_allcaps_section(self):
"""ENISA-style: '1. INTRODUCTION'"""
assert _extract_section_header("1. INTRODUCTION") is not None
def test_single_number_section_metadata(self):
meta = _parse_section_metadata("1. INTRODUCTION")
assert meta["section"] == "1"
assert meta["section_title"] == "INTRODUCTION"
def test_single_number_lowercase_not_matched(self):
"""'1. First item' should NOT be a section (lowercase title)."""
assert _extract_section_header("1. First item in a list") is None
def test_structured_chunks_have_section(self):
text = (
"3.1 ACCESS CONTROL\n"
"Overview of access control family.\n\n"
"AC-1 POLICY AND PROCEDURES\n"
"The organization develops, documents, and disseminates an access "
"control policy that addresses purpose, scope, roles, responsibilities, "
"management commitment, coordination among entities.\n\n"
"AC-2 ACCOUNT MANAGEMENT\n"
"The information system enforces approved authorizations for logical "
"access to information and system resources.\n"
)
result = chunk_text_legal_structured(text, chunk_size=300, overlap=50)
sections = [r.get("section", "") for r in result]
assert any(s == "AC-1" for s in sections)
assert any(s == "AC-2" for s in sections)
# =========================================================================
# Chunking with NIST-style text
# =========================================================================
class TestNistChunking:
NIST_SAMPLE = (
"AC-1 Account Management\n"
"The organization develops, documents, and disseminates an access "
"control policy that addresses purpose, scope, roles, responsibilities, "
"management commitment, coordination among organizational entities, "
"and compliance.\n\n"
"AC-2 Access Enforcement\n"
"The information system enforces approved authorizations for logical "
"access to information and system resources in accordance with "
"applicable access control policies.\n\n"
"AC-3 Information Flow Enforcement\n"
"The system enforces approved authorizations for controlling the flow "
"of information within the system and between interconnected systems.\n"
)
def test_chunks_have_section_prefix(self):
chunks = chunk_text_legal(self.NIST_SAMPLE, chunk_size=300, overlap=50)
assert any("[AC-1" in c for c in chunks)
assert any("[AC-2" in c for c in chunks)
def test_sections_detected(self):
chunks = chunk_text_legal(self.NIST_SAMPLE, chunk_size=500, overlap=50)
assert len(chunks) >= 2
def test_normalized_broken_text_chunks_correctly(self):
"""Broken PDF text should chunk correctly after normalization."""
broken = (
"AC - 1 Account Management\n"
"The organization develops, documents, and disseminates an access "
"control policy that addresses purpose, scope, roles, responsibilities, "
"management commitment, coordination among organizational entities, "
"and compliance with applicable regulations and standards.\n\n"
"AC - 2 Access Enforcement\n"
"The information system enforces approved authorizations for logical "
"access to information and system resources in accordance with "
"applicable access control policies and procedures.\n"
)
normalized = _normalize_pdf_text(broken)
chunks = chunk_text_legal(normalized, chunk_size=300, overlap=50)
assert any("[AC-1" in c for c in chunks)
assert any("[AC-2" in c for c in chunks)
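The tests above rely on `_normalize_pdf_text` collapsing the stray spaces that PDF extraction leaves inside identifiers. Its implementation is not part of this diff, so the following is only a minimal sketch of how such a step could work. The helper name `collapse_spaced_identifiers` and the three regexes are assumptions chosen to fit the test inputs, not the project's actual code.

```python
import re


def collapse_spaced_identifiers(text: str) -> str:
    """Hypothetical helper: rejoin identifiers split by PDF extraction."""
    # "AC - 1" -> "AC-1" (two-letter family, spaced hyphen, control number)
    text = re.sub(r"\b([A-Z]{2})\s+-\s+(\d{1,2})\b", r"\1-\2", text)
    # "GV . OC" -> "GV.OC" (CSF function and category split around the dot)
    text = re.sub(r"\b([A-Z]{2})\s+\.\s+([A-Z]{2})\b", r"\1.\2", text)
    # "2 . 1" -> "2.1" (numbered section split around the dot)
    text = re.sub(r"\b(\d+)\s+\.\s+(\d+)\b", r"\1.\2", text)
    return text


fixed = collapse_spaced_identifiers("GV . OC - 01 Organizational Context")
```

Requiring whitespace on both sides of the separator leaves already-clean identifiers such as `AC-2` untouched, so the step can be applied to every document without a prior "is this broken?" check.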
+62
@@ -0,0 +1,62 @@
§ 312 Anwendungsbereich
(1) Die Vorschriften der Kapitel 1 und 2 dieses Untertitels sind auf Verbrauchervertraege anzuwenden, bei denen sich der Verbraucher zu der Zahlung eines Preises verpflichtet.
(1a) Die Vorschriften der Kapitel 1 und 2 dieses Untertitels sind auch auf Verbrauchervertraege anzuwenden, bei denen der Verbraucher dem Unternehmer personenbezogene Daten bereitstellt oder sich hierzu verpflichtet. Dies gilt nicht, wenn der Unternehmer die vom Verbraucher bereitgestellten personenbezogenen Daten ausschliesslich verarbeitet, um seine Leistungspflicht oder an ihn gestellte rechtliche Anforderungen zu erfuellen, und sie zu keinem anderen Zweck verarbeitet.
(2) Von den Vorschriften der Kapitel 1 und 2 dieses Untertitels ist nur § 312a Absatz 1, 3, 4 und 6 auf folgende Vertraege anzuwenden:
1. notariell beurkundete Vertraege
2. Vertraege ueber die Begruendung, den Erwerb oder die Uebertragung von Eigentum oder anderen Rechten an Grundstuecken
3. Vertraege ueber den Bau von neuen Gebaeuden oder erhebliche Umbaumassnahmen an bestehenden Gebaeuden
4. Vertraege ueber Reiseleistungen nach § 651a
5. Vertraege ueber die Befoerderung von Personen
6. Vertraege, die unter Einsatz von Warenautomaten oder automatisierten Geschaeftsraeumen geschlossen werden
§ 312a Allgemeine Pflichten und Grundsaetze bei Verbrauchervertraegen
(1) Ruft der Unternehmer oder eine Person, die in seinem Namen oder Auftrag handelt, den Verbraucher an, um mit diesem einen Vertrag zu schliessen, hat der Anrufer zu Beginn des Gespraechs seine Identitaet und gegebenenfalls die Identitaet der Person, fuer die er anruft, sowie den geschaeftlichen Zweck des Anrufs offenzulegen.
(2) Der Unternehmer ist verpflichtet, den Verbraucher nach Massgabe des Artikels 246 des Einfuehrungsgesetzes zum Buergerlichen Gesetzbuche zu informieren. Der Unternehmer kann von dem Verbraucher Fracht-, Liefer- oder Versandkosten und sonstige Kosten nur verlangen, soweit er den Verbraucher ueber diese Kosten entsprechend den Anforderungen aus Artikel 246 Absatz 1 Nummer 3 des Einfuehrungsgesetzes zum Buergerlichen Gesetzbuche informiert hat. Die Saetze 1 und 2 sind weder auf ausserhalb von Geschaeftsraeumen geschlossene Vertraege noch auf Fernabsatzvertraege noch auf Vertraege ueber Finanzdienstleistungen anzuwenden.
(3) Eine Vereinbarung, die auf eine ueber das vereinbarte Entgelt fuer die Hauptleistung hinausgehende Zahlung des Verbrauchers gerichtet ist, kann ein Unternehmer mit einem Verbraucher nur ausdruecklich treffen. Schliesst der Unternehmer und der Verbraucher einen Vertrag im elektronischen Geschaeftsverkehr, wird eine solche Vereinbarung nur Vertragsbestandteil, wenn der Unternehmer die Vereinbarung nicht durch eine Voreinstellung herbeifuehrt.
(4) Eine Vereinbarung, durch die ein Verbraucher verpflichtet wird, ein Entgelt dafuer zu zahlen, dass der Verbraucher fuer die Erfuellung seiner vertraglichen Pflichten ein bestimmtes Zahlungsmittel nutzt, ist unwirksam, wenn fuer den Verbraucher keine zumutbare und gaengige unentgeltliche Zahlungsmoeglichkeit besteht oder das vereinbarte Entgelt ueber die Kosten hinausgeht, die dem Unternehmer durch die Nutzung des Zahlungsmittels entstehen.
(5) Eine Vereinbarung, durch die ein Verbraucher verpflichtet wird, ein Entgelt dafuer zu zahlen, dass der Verbraucher den Unternehmer wegen Fragen oder Erklaerungen zu einem zwischen ihnen geschlossenen Vertrag ueber eine Rufnummer anruft, die der Unternehmer fuer solche Zwecke bereithaelt, ist unwirksam, wenn das vereinbarte Entgelt das Entgelt fuer die blosse Nutzung des Telekommunikationsdienstes uebersteigt.
(6) Ist eine Vereinbarung nach den Absaetzen 3 bis 5 nicht Vertragsbestandteil geworden oder ist sie unwirksam, bleibt der Vertrag im Uebrigen wirksam.
§ 312g Widerrufsrecht
(1) Dem Verbraucher steht bei ausserhalb von Geschaeftsraeumen geschlossenen Vertraegen und bei Fernabsatzvertraegen ein Widerrufsrecht gemaess § 355 zu.
(2) Das Widerrufsrecht besteht, soweit die Parteien nichts anderes vereinbart haben, nicht bei folgenden Vertraegen:
1. Vertraege zur Lieferung von Waren, die nicht vorgefertigt sind und fuer deren Herstellung eine individuelle Auswahl oder Bestimmung durch den Verbraucher massgeblich ist oder die eindeutig auf die persoenlichen Beduerfnisse des Verbrauchers zugeschnitten sind,
2. Vertraege zur Lieferung von Waren, die schnell verderben koennen oder deren Verfallsdatum schnell ueberschritten wuerde,
3. Vertraege zur Lieferung versiegelter Waren, die aus Gruenden des Gesundheitsschutzes oder der Hygiene nicht zur Rueckgabe geeignet sind, wenn ihre Versiegelung nach der Lieferung entfernt wurde.
(3) Das Widerrufsrecht besteht ferner nicht bei Vertraegen, bei denen dem Verbraucher bereits auf Grund der §§ 495, 506 bis 513 ein Widerrufsrecht zusteht.
§ 312k Kuendigung von Verbrauchervertraegen im elektronischen Geschaeftsverkehr
(1) Wird Verbrauchern ueber eine Webseite ermoeglicht, einen Vertrag im elektronischen Geschaeftsverkehr zu schliessen, der auf die Begruendung eines Dauerschuldverhaeltnisses gerichtet ist, das einen Unternehmer zu einer entgeltlichen Leistung verpflichtet, so treffen den Unternehmer die Pflichten nach dieser Vorschrift. Dies gilt nicht
1. fuer Vertraege, fuer deren Kuendigung gesetzlich ausschliesslich eine strengere Form als die Textform vorgesehen ist, und
2. in Bezug auf Webseiten, die Finanzdienstleistungen betreffen, oder fuer Vertraege ueber Finanzdienstleistungen.
(2) Der Unternehmer hat sicherzustellen, dass der Verbraucher auf der Webseite eine Erklaerung zur ordentlichen oder ausserordentlichen Kuendigung eines auf der Webseite abschliessbaren Vertrags nach Absatz 1 Satz 1 ueber eine Kuendigungsschaltflaeche abgeben kann. Die Kuendigungsschaltflaeche muss gut lesbar mit nichts anderem als den Woertern "Vertraege hier kuendigen" oder mit einer entsprechenden eindeutigen Formulierung beschriftet sein. Sie muss den Verbraucher unmittelbar zu einer Bestaetigungsseite fuehren, die
1. den Verbraucher auffordert und ihm ermoeglicht Angaben zu machen
a) zur Art der Kuendigung sowie im Falle der ausserordentlichen Kuendigung zum Kuendigungsgrund,
b) zu seiner eindeutigen Identifizierbarkeit,
c) zur eindeutigen Bezeichnung des Vertrags,
d) zum Zeitpunkt, zu dem die Kuendigung das Vertragsverhaeltnis beenden soll,
e) zur schnellen elektronischen Uebermittlung der Kuendigungsbestaetigung an ihn und
2. eine Bestaetigungsschaltflaeche enthaelt, ueber deren Betaetigung der Verbraucher die Kuendigungserklaerung abgeben kann und die gut lesbar mit nichts anderem als den Woertern "jetzt kuendigen" oder mit einer entsprechenden eindeutigen Formulierung beschriftet ist.
Die Schaltflaechen und die Bestaetigungsseite muessen staendig verfuegbar sowie unmittelbar und leicht zugaenglich sein.
(3) Der Verbraucher muss seine durch das Betaetigen der Bestaetigungsschaltflaeche abgegebene Kuendigungserklaerung mit dem Datum und der Uhrzeit der Abgabe auf einem dauerhaften Datentraeger so speichern koennen, dass erkennbar ist, dass die Kuendigungserklaerung durch das Betaetigen der Bestaetigungsschaltflaeche abgegeben wurde.
(4) Der Unternehmer hat dem Verbraucher den Inhalt sowie Datum und Uhrzeit des Zugangs der Kuendigungserklaerung sowie den Zeitpunkt, zu dem das Vertragsverhaeltnis durch die Kuendigung beendet werden soll, sofort auf elektronischem Wege in Textform zu bestaetigen. Es wird vermutet, dass eine durch das Betaetigen der Bestaetigungsschaltflaeche abgegebene Kuendigungserklaerung dem Unternehmer unmittelbar nach ihrer Abgabe zugegangen ist.
(5) Wenn der Verbraucher bei der Abgabe der Kuendigungserklaerung keinen Zeitpunkt angibt, zu dem die Kuendigung das Vertragsverhaeltnis beenden soll, wirkt die Kuendigung im Zweifel zum fruehestmoeglichen Zeitpunkt.
(6) Werden die Schaltflaechen und die Bestaetigungsseite nicht entsprechend den Absaetzen 1 und 2 zur Verfuegung gestellt, kann ein Verbraucher einen Vertrag, fuer dessen Kuendigung die Schaltflaechen und die Bestaetigungsseite zur Verfuegung zu stellen sind, jederzeit und ohne Einhaltung einer Kuendigungsfrist kuendigen. Die Moeglichkeit des Verbrauchers zur ausserordentlichen Kuendigung bleibt hiervon unberuehrt.
+2
@@ -318,6 +318,8 @@ server {
set $upstream_admin_compliance bp-compliance-admin:3000;
proxy_pass http://$upstream_admin_compliance;
proxy_http_version 1.1;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
+23 -2
@@ -7,6 +7,7 @@ from pydantic import BaseModel
from api.auth import optional_jwt_auth
from embedding_client import embedding_client
from html_utils import decode_html_bytes, looks_like_html, strip_html
from minio_client_wrapper import minio_wrapper
from qdrant_client_wrapper import qdrant_wrapper
@@ -14,6 +15,9 @@ logger = logging.getLogger("rag-service.api.documents")
router = APIRouter(prefix="/api/v1/documents")
# Structural metadata fields from embedding-service chunks_with_metadata (D2)
_STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
# ---- Request / Response models --------------------------------------------
@@ -98,9 +102,16 @@ async def upload_document(
try:
if content_type == "application/pdf" or filename.lower().endswith(".pdf"):
text = await embedding_client.extract_pdf(file_bytes)
elif filename.lower().endswith((".html", ".htm")):
text = decode_html_bytes(file_bytes)
text = strip_html(text)
logger.info("Decoded + stripped HTML from %s", filename)
else:
# Try to decode as text
text = file_bytes.decode("utf-8", errors="replace")
# Strip HTML if content looks like HTML despite extension
if looks_like_html(text):
text = strip_html(text)
logger.info("Stripped HTML tags from %s", filename)
except Exception as exc:
logger.error("Text extraction failed: %s", exc)
raise HTTPException(status_code=500, detail=f"Text extraction failed: {exc}")
@@ -110,7 +121,7 @@ async def upload_document(
# --- Chunk ---
try:
-chunks = await embedding_client.chunk_text(
+chunk_result = await embedding_client.chunk_text(
text=text,
strategy=chunk_strategy,
chunk_size=chunk_size,
@@ -120,6 +131,9 @@ async def upload_document(
logger.error("Chunking failed: %s", exc)
raise HTTPException(status_code=500, detail=f"Chunking failed: {exc}")
chunks = chunk_result.chunks
chunks_meta = chunk_result.chunks_with_metadata
if not chunks:
raise HTTPException(status_code=400, detail="Chunking produced zero chunks")
@@ -154,6 +168,13 @@ async def upload_document(
"year": year,
**extra_metadata,
}
# Merge structural metadata from embedding service (D2)
if i < len(chunks_meta):
meta = chunks_meta[i]
for field in _STRUCT_FIELDS:
value = meta.get(field)
if value is not None and value != "":
payload[field] = value
payloads.append(payload)
# --- Index in Qdrant ---
+15 -4
@@ -1,6 +1,6 @@
import logging
import os
-from typing import Optional
+from dataclasses import dataclass
import httpx
@@ -19,6 +19,14 @@ _OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "bge-m3")
_EMBED_BATCH_SIZE = int(os.getenv("EMBED_BATCH_SIZE", "32"))
+@dataclass
+class ChunkResult:
+    """Result from the embedding service /chunk endpoint."""
+
+    chunks: list[str]
+    chunks_with_metadata: list[dict]
class EmbeddingClient:
"""
Hybrid client:
@@ -120,10 +128,10 @@ class EmbeddingClient:
strategy: str = "recursive",
chunk_size: int = 512,
overlap: int = 50,
-) -> list[str]:
+) -> ChunkResult:
"""
Ask the embedding service to chunk a long text.
-Returns a list of chunk strings.
+Returns ChunkResult with plain chunks and structural metadata.
"""
async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
response = await client.post(
@@ -137,7 +145,10 @@ class EmbeddingClient:
)
response.raise_for_status()
data = response.json()
-return data.get("chunks", [])
+return ChunkResult(
+    chunks=data.get("chunks", []),
+    chunks_with_metadata=data.get("chunks_with_metadata") or [],
+)
# ------------------------------------------------------------------
# PDF extraction (via embedding-service)
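The new return statement uses `data.get("chunks_with_metadata") or []` rather than a plain default. A short sketch of why that matters: a `dict.get` default only applies when the key is missing, not when the service sends an explicit JSON `null`. The function name `extract_metadata` is hypothetical and used here for illustration only.

```python
def extract_metadata(data: dict) -> list:
    """Return the metadata list, treating a missing key and JSON null alike."""
    return data.get("chunks_with_metadata") or []


# A plain default only covers the missing-key case; null passes through as None.
naive = {"chunks_with_metadata": None}.get("chunks_with_metadata", [])
```

With the `or []` form the caller can always run `len(chunks_meta)` and index into the list without a separate `None` check, which is what the payload-merging loop in `upload_document` depends on.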
+66
@@ -0,0 +1,66 @@
"""HTML detection and stripping for legal document ingestion."""

import re
from html import unescape

_HTML_TAG_RE = re.compile(r'<(html|head|body|div|p|span|table)\b', re.IGNORECASE)
_CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_-]+)', re.IGNORECASE,
)


def looks_like_html(text: str) -> bool:
    """Check if text contains HTML tags."""
    return bool(_HTML_TAG_RE.search(text[:500]))


def decode_html_bytes(raw: bytes) -> str:
    """Decode HTML bytes with charset detection from meta tags.

    Tries UTF-8 first, falls back to charset from HTML meta tag, then latin-1.
    """
    try:
        text = raw.decode("utf-8")
        # Check if UTF-8 decode produced replacement characters
        if "\ufffd" not in text:
            return text
    except UnicodeDecodeError:
        pass

    # Peek at ASCII-safe portion to find charset
    ascii_head = raw[:2000].decode("ascii", errors="ignore")
    m = _CHARSET_RE.search(ascii_head)
    if m:
        charset = m.group(1).lower().replace("_", "-")
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            pass

    # Last resort: iso-8859-1 (covers all byte values)
    return raw.decode("iso-8859-1")


def strip_html(html_text: str) -> str:
    """Convert HTML to plain text preserving legal document structure."""
    text = html_text
    # Remove script/style blocks
    text = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', text, flags=re.DOTALL | re.IGNORECASE)
    # Block elements → newline (preserves § paragraph structure)
    # Opening block tags also get newline (e.g., <h3> before § signs)
    text = re.sub(
        r'<(div|p|h[1-6]|li|tr|dt|dd|section|article|blockquote)\b[^>]*>',
        '\n', text, flags=re.IGNORECASE,
    )
    text = re.sub(
        r'</(div|p|h[1-6]|li|tr|dt|dd|section|article|blockquote)>',
        '\n', text, flags=re.IGNORECASE,
    )
    text = re.sub(r'<br\s*/?>', '\n', text, flags=re.IGNORECASE)
    # Strip remaining tags
    text = re.sub(r'<[^>]+>', '', text)
    # Decode HTML entities (&#246; → ö, &sect; → §)
    text = unescape(text)
    # Clean up excessive whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
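To illustrate why `decode_html_bytes` needs a fallback beyond UTF-8, here is a small standalone sketch using only the standard library. It does not call the module above; the sample string is made up for the example.

```python
from html import unescape

# "§" and "ü" are single bytes (0xA7, 0xFC) in ISO-8859-1, which are not valid UTF-8.
raw = "§ 312 Kündigung".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte value to a code point, so this decode cannot fail.
latin1_text = raw.decode("iso-8859-1")

# Entity decoding is a separate step: it works on already-decoded text.
entity_text = unescape("&sect; 312 K&#252;ndigung")
```

Because ISO-8859-1 never raises, it is a safe last resort, but it will silently mis-decode bytes from other encodings; that is presumably why the module tries the declared `<meta>` charset first.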
+7
@@ -0,0 +1,7 @@
"""Shared test fixtures for rag-service tests."""
import os
import sys
# Ensure rag-service root is on sys.path so imports resolve
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+172
@@ -0,0 +1,172 @@
"""Tests for document upload payload building — structural metadata (D2)."""
# Mirror the constant from api/documents.py to avoid heavy import chain
# (api → jose, qdrant_client, minio, etc.)
_STRUCT_FIELDS = ("section", "section_title", "paragraph", "paragraph_num", "page")
def _build_payload(
    chunk: str,
    index: int,
    chunks_meta: list[dict],
    extra_metadata: "dict | None" = None,
) -> dict:
    """Replicate the payload-building logic from documents.py for unit testing."""
    payload = {
        "document_id": "test-doc-id",
        "object_name": "test/path.pdf",
        "filename": "path.pdf",
        "chunk_index": index,
        "chunk_text": chunk,
        "data_type": "law",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        **(extra_metadata or {}),
    }
    if index < len(chunks_meta):
        meta = chunks_meta[index]
        for field in _STRUCT_FIELDS:
            value = meta.get(field)
            if value is not None and value != "":
                payload[field] = value
    return payload
class TestPayloadStructuralMetadata:
"""Tests for structural metadata merging into Qdrant payloads."""
def test_payload_contains_structural_metadata(self):
"""Metadata fields from chunks_with_metadata land in the payload."""
meta = [
{
"text": "chunk text",
"section": "§ 312k",
"section_title": "Kuendigungsbutton",
"paragraph": "Abs. 1",
"paragraph_num": 1,
"page": 847,
"index": 0,
}
]
payload = _build_payload("chunk text", 0, meta)
assert payload["section"] == "§ 312k"
assert payload["section_title"] == "Kuendigungsbutton"
assert payload["paragraph"] == "Abs. 1"
assert payload["paragraph_num"] == 1
assert payload["page"] == 847
def test_payload_without_metadata_backwards_compat(self):
"""Empty metadata list → payload has no structural fields."""
payload = _build_payload("chunk text", 0, [])
for field in _STRUCT_FIELDS:
assert field not in payload
def test_payload_skips_empty_values(self):
"""Empty string and None values are NOT added to payload."""
meta = [
{
"text": "chunk text",
"section": "",
"section_title": "",
"paragraph": "",
"paragraph_num": None,
"page": None,
"index": 0,
}
]
payload = _build_payload("chunk text", 0, meta)
for field in _STRUCT_FIELDS:
assert field not in payload
def test_metadata_overrides_extra_metadata(self):
"""Auto-extracted metadata takes precedence over manual extra_metadata."""
meta = [
{
"text": "chunk text",
"section": "§ 25",
"section_title": "",
"paragraph": "",
"paragraph_num": None,
"page": None,
"index": 0,
}
]
extra = {"section": "manual-value"}
payload = _build_payload("chunk text", 0, meta, extra_metadata=extra)
assert payload["section"] == "§ 25"
def test_partial_metadata_alignment(self):
"""3 chunks but only 2 metadata entries → third payload has no structural fields."""
meta = [
{
"text": "c1",
"section": "§ 1",
"section_title": "",
"paragraph": "",
"paragraph_num": None,
"page": None,
"index": 0,
},
{
"text": "c2",
"section": "§ 2",
"section_title": "",
"paragraph": "",
"paragraph_num": None,
"page": None,
"index": 1,
},
]
p0 = _build_payload("c1", 0, meta)
p1 = _build_payload("c2", 1, meta)
p2 = _build_payload("c3", 2, meta)
assert p0["section"] == "§ 1"
assert p1["section"] == "§ 2"
assert "section" not in p2
def test_zero_paragraph_num_is_kept(self):
"""paragraph_num=0 is a valid value and should be stored."""
meta = [
{
"text": "chunk",
"section": "",
"section_title": "",
"paragraph": "",
"paragraph_num": 0,
"page": None,
"index": 0,
}
]
payload = _build_payload("chunk", 0, meta)
# 0 is not None and not "" → should be stored
assert payload["paragraph_num"] == 0
def test_page_zero_is_kept(self):
"""page=0 is a valid value (first page) and should be stored."""
meta = [
{
"text": "chunk",
"section": "",
"section_title": "",
"paragraph": "",
"paragraph_num": None,
"page": 0,
"index": 0,
}
]
payload = _build_payload("chunk", 0, meta)
assert payload["page"] == 0
+135
@@ -0,0 +1,135 @@
"""Tests for EmbeddingClient.chunk_text() — ChunkResult with metadata (D2)."""
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from embedding_client import ChunkResult, EmbeddingClient
@pytest.fixture
def client():
    with patch("embedding_client.settings") as mock_settings:
        mock_settings.EMBEDDING_SERVICE_URL = "http://localhost:8087"
        return EmbeddingClient()


def _mock_response(json_data: dict, status_code: int = 200):
    """Create a mock httpx response (sync methods like .json() and .raise_for_status())."""
    resp = MagicMock()
    resp.status_code = status_code
    resp.json.return_value = json_data
    return resp
@pytest.mark.asyncio
async def test_chunk_text_returns_chunk_result(client):
"""chunk_text returns ChunkResult with both chunks and metadata."""
mock_json = {
"chunks": ["chunk1 text", "chunk2 text"],
"chunks_with_metadata": [
{
"text": "chunk1 text",
"section": "§ 25",
"section_title": "Informationspflichten",
"paragraph": "Abs. 1",
"paragraph_num": 1,
"page": None,
"index": 0,
},
{
"text": "chunk2 text",
"section": "§ 25",
"section_title": "Informationspflichten",
"paragraph": "Abs. 2",
"paragraph_num": 2,
"page": None,
"index": 1,
},
],
"count": 2,
"strategy": "recursive",
}
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.post.return_value = _mock_response(mock_json)
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client_cls.return_value = mock_client
result = await client.chunk_text("some legal text")
assert isinstance(result, ChunkResult)
assert result.chunks == ["chunk1 text", "chunk2 text"]
assert len(result.chunks_with_metadata) == 2
assert result.chunks_with_metadata[0]["section"] == "§ 25"
assert result.chunks_with_metadata[1]["paragraph"] == "Abs. 2"
@pytest.mark.asyncio
async def test_chunk_text_without_metadata_field(client):
"""Embedding service response without chunks_with_metadata → empty list."""
mock_json = {
"chunks": ["chunk1"],
"count": 1,
"strategy": "semantic",
}
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.post.return_value = _mock_response(mock_json)
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client_cls.return_value = mock_client
result = await client.chunk_text("text", strategy="semantic")
assert isinstance(result, ChunkResult)
assert result.chunks == ["chunk1"]
assert result.chunks_with_metadata == []
@pytest.mark.asyncio
async def test_chunk_text_with_null_metadata(client):
"""chunks_with_metadata: null in response → empty list."""
mock_json = {
"chunks": ["chunk1"],
"chunks_with_metadata": None,
"count": 1,
"strategy": "recursive",
}
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.post.return_value = _mock_response(mock_json)
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client_cls.return_value = mock_client
result = await client.chunk_text("text")
assert result.chunks_with_metadata == []
@pytest.mark.asyncio
async def test_chunk_text_empty(client):
"""Empty text → empty chunks and metadata."""
mock_json = {
"chunks": [],
"chunks_with_metadata": [],
"count": 0,
"strategy": "recursive",
}
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.post.return_value = _mock_response(mock_json)
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client_cls.return_value = mock_client
result = await client.chunk_text("")
assert result.chunks == []
assert result.chunks_with_metadata == []
+161
@@ -0,0 +1,161 @@
"""Tests for HTML detection and stripping in document upload."""
from html_utils import (
decode_html_bytes,
looks_like_html as _looks_like_html,
strip_html as _strip_html,
)
class TestLooksLikeHtml:
    def test_html_document(self):
        assert _looks_like_html("<html><body><p>Text</p></body></html>")

    def test_html_div(self):
        assert _looks_like_html('<div class="jurAbsatz">§ 312</div>')

    def test_html_with_doctype(self):
        assert _looks_like_html("<!DOCTYPE html><html><head></head><body>")

    def test_plain_text(self):
        assert not _looks_like_html("§ 312 Anwendungsbereich\n\n(1) Die Vorschriften...")

    def test_legal_text_with_angle_brackets(self):
        # Legal text might use < or > but not as HTML tags
        assert not _looks_like_html("Wert < 100 EUR und > 50 EUR ist zulaessig.")

    def test_markdown(self):
        assert not _looks_like_html("# § 312 Anwendungsbereich\n\n(1) Die Vorschriften...")
class TestStripHtml:
def test_basic_div_tags(self):
html = "<div>§ 312 Anwendungsbereich</div>"
result = _strip_html(html)
assert result.startswith("§ 312 Anwendungsbereich")
def test_paragraph_tags_become_newlines(self):
html = "<p>Absatz 1</p><p>Absatz 2</p>"
result = _strip_html(html)
assert "Absatz 1" in result
assert "Absatz 2" in result
# Paragraphs should be on separate lines
lines = [ln.strip() for ln in result.split("\n") if ln.strip()]
assert len(lines) >= 2
def test_preserves_section_headers(self):
"""§ signs must be at line starts after stripping."""
html = '<div class="jurAbsatz">§ 312 Anwendungsbereich</div>'
result = _strip_html(html)
# § should be at the start of a line
for line in result.split("\n"):
if "§ 312" in line:
assert line.strip().startswith("§ 312")
break
else:
raise AssertionError("§ 312 not found in stripped text")
def test_decodes_html_entities(self):
html = "Gel&#246;scht und ge&#228;ndert und &#167; 312"
result = _strip_html(html)
assert "Gelöscht" in result
assert "geändert" in result
assert "§ 312" in result
def test_decodes_named_entities(self):
html = "&sect; 312 &amp; &sect; 313"
result = _strip_html(html)
assert "§ 312" in result
assert "§ 313" in result
def test_removes_script_style(self):
html = '<style>body{color:red}</style><script>alert("x")</script><p>§ 1 Text</p>'
result = _strip_html(html)
assert "color" not in result
assert "alert" not in result
assert "§ 1 Text" in result
def test_br_becomes_newline(self):
html = "Zeile 1<br/>Zeile 2<br>Zeile 3"
result = _strip_html(html)
assert "Zeile 1" in result
assert "Zeile 2" in result
def test_no_excessive_whitespace(self):
html = "<div></div><div></div><div></div><div>Text</div>"
result = _strip_html(html)
assert "\n\n\n" not in result
def test_gesetze_im_internet_format(self):
"""Realistic HTML from gesetze-im-internet.de."""
html = """<div class="jnhtml">
<div>
<div class="jurAbsatz">
§ 312k Kündigung von Verbraucherverträgen im elektronischen Geschäftsverkehr
</div>
<div class="jurAbsatz">
(1) Wird Verbrauchern über eine Webseite ermöglicht, einen Vertrag im elektronischen Geschäftsverkehr zu schließen, der auf die Begründung eines Dauerschuldverhältnisses gerichtet ist, das einen Unternehmer zu einer entgeltlichen Leistung verpflichtet, so treffen den Unternehmer die Pflichten nach dieser Vorschrift.
</div>
<div class="jurAbsatz">
(2) Der Unternehmer hat sicherzustellen, dass der Verbraucher auf der Webseite eine Erklärung zur ordentlichen oder außerordentlichen Kündigung abgeben kann.
</div>
</div></div>"""
result = _strip_html(html)
# § 312k should be at start of a line
found_312k = False
for line in result.split("\n"):
stripped = line.strip()
if stripped.startswith("§ 312k"):
found_312k = True
break
assert found_312k, f"§ 312k not at line start. Text:\n{result[:500]}"
# Content should be present without tags
assert "Dauerschuldverhältnisses" in result
assert "<div>" not in result
assert "class=" not in result
def test_plain_text_passthrough(self):
"""Non-HTML text should pass through unchanged."""
text = "§ 312 Anwendungsbereich\n\n(1) Die Vorschriften..."
result = _strip_html(text)
assert "§ 312 Anwendungsbereich" in result
assert "(1) Die Vorschriften" in result
def test_opening_h3_creates_newline(self):
"""Opening <h3> must create newline so § is at line start."""
html = '<a href="#">Inhaltsverzeichnis</a><h3><span>§ 1</span> Titel</h3>'
result = _strip_html(html)
found = any(line.strip().startswith("§ 1") for line in result.split("\n"))
assert found, f"§ 1 not at line start: {result!r}"
class TestDecodeHtmlBytes:
    def test_utf8_file(self):
        raw = "<div>§ 312 Anwendungsbereich</div>".encode("utf-8")
        text = decode_html_bytes(raw)
        assert "§ 312" in text

    def test_iso_8859_1_with_meta(self):
        html = '<html><head><meta charset="iso-8859-1"></head><body>§ 1 Test</body></html>'
        raw = html.encode("iso-8859-1")
        text = decode_html_bytes(raw)
        assert "§ 1 Test" in text

    def test_iso_8859_1_without_meta(self):
        """Even without meta tag, iso-8859-1 is fallback."""
        raw = "§ 312 Anwendungsbereich".encode("iso-8859-1")
        text = decode_html_bytes(raw)
        assert "§ 312" in text

    def test_gesetze_im_internet_encoding(self):
        """gesetze-im-internet.de uses iso-8859-1 with &#167; entities."""
        html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
        html += '<div>Kündigungsschutzgesetz</div>'
        raw = html.encode("iso-8859-1")
        text = decode_html_bytes(raw)
        assert "Kündigungsschutzgesetz" in text
+65
@@ -0,0 +1,65 @@
#!/bin/bash
# Qdrant Snapshot: creates snapshots of all collections
#
# Usage:
# bash scripts/qdrant-snapshot.sh # Create snapshots
# bash scripts/qdrant-snapshot.sh --list # List existing snapshots
# bash scripts/qdrant-snapshot.sh --restore <file> # Restore (interactive)
#
# Snapshots are stored in the Qdrant volume under /qdrant/storage/snapshots/.
# They are additionally copied to ./backups/qdrant/.
set -euo pipefail
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
BACKUP_DIR="${BACKUP_DIR:-./backups/qdrant}"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# --- List existing snapshots ---
if [[ "${1:-}" == "--list" ]]; then
    echo "=== Qdrant Snapshots ==="
    for coll in $(curl -sf "$QDRANT_URL/collections" | python3 -c "import sys,json; [print(c['name']) for c in json.load(sys.stdin)['result']['collections']]"); do
        echo ""
        echo "Collection: $coll"
        curl -sf "$QDRANT_URL/collections/$coll/snapshots" | python3 -c "
import sys, json
snaps = json.load(sys.stdin).get('result', [])
if not snaps:
    print(' (no snapshots)')
else:
    for s in snaps:
        print(f\" {s['name']} size={s.get('size',0)/(1024*1024):.1f}MB\")
"
    done
    exit 0
fi
# --- Create snapshots ---
echo "=== Creating Qdrant Snapshots ($TIMESTAMP) ==="
mkdir -p "$BACKUP_DIR"
COLLECTIONS=$(curl -sf "$QDRANT_URL/collections" | python3 -c "import sys,json; [print(c['name']) for c in json.load(sys.stdin)['result']['collections']]")
for coll in $COLLECTIONS; do
    echo ""
    echo "[$coll] Creating snapshot..."
    # "|| true" keeps "set -e" from aborting here, so the empty check below can report the failure
    SNAP=$(curl -sf -X POST "$QDRANT_URL/collections/$coll/snapshots" | python3 -c "import sys,json; print(json.load(sys.stdin)['result']['name'])") || true
    if [[ -z "$SNAP" ]]; then
        echo "[$coll] ERROR: snapshot creation failed"
        continue
    fi
    echo "[$coll] Snapshot: $SNAP"
    # Download snapshot to backup dir
    OUTFILE="$BACKUP_DIR/${coll}_${TIMESTAMP}.snapshot"
    curl -sf "$QDRANT_URL/collections/$coll/snapshots/$SNAP" -o "$OUTFILE"
    SIZE=$(du -h "$OUTFILE" | cut -f1)
    echo "[$coll] Saved: $OUTFILE ($SIZE)"
done
echo ""
echo "=== Done ==="
ls -lh "$BACKUP_DIR"/*_${TIMESTAMP}.snapshot 2>/dev/null || echo "No snapshots created"
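The usage header lists `--restore`, but the script as shown only handles `--list` and snapshot creation. As a rough sketch of what a restore step could look like, the snippet below builds (without sending) a request to Qdrant's snapshot recovery endpoint, `PUT /collections/{name}/snapshots/recover`. The helper name `build_recover_request`, the example collection name, and the snapshot location are illustrative assumptions and not part of this commit; the endpoint should be checked against the Qdrant version in use.

```python
import json
import urllib.request


def build_recover_request(qdrant_url: str, collection: str, location: str) -> urllib.request.Request:
    """Build (but do not send) a snapshot-recovery request for one collection.

    `location` is a URL or file URI that the Qdrant server itself can read.
    """
    body = json.dumps({"location": location}).encode("utf-8")
    return urllib.request.Request(
        url=f"{qdrant_url.rstrip('/')}/collections/{collection}/snapshots/recover",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values for illustration only.
req = build_recover_request(
    "http://localhost:6333",
    "example_collection",
    "file:///qdrant/storage/snapshots/example_collection/example.snapshot",
)
# Sending would be: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to add a confirmation prompt before the call, which matters because recovery overwrites the collection's current contents.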