feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts
- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap), archives old citations, updated 3.651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG, MiLoG, GmbHG, AktG, InsO, BUrlG — 1.629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency manually ingested (1.057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,6 +1,6 @@
 # Session instructions: pipeline quality + laws

-**Date:** 2026-05-02
+**Date:** 2026-05-03
 **For:** next Claude session
 **Repo:** breakpilot-core (~/Projekte/breakpilot-core)
@@ -8,7 +8,7 @@
 ## SUMMARY: WHERE WE STAND

-### What is done (blocks A-D)
+### What is done (blocks A-D5+)

 | Block | What | Status |
 |-------|------|--------|
@@ -16,33 +16,52 @@
 | Block A | v1 tag, healthcheck, text correction, dependencies (15.291) | ✅ |
 | Block B | review-verify (67k pairs, 43.527 DUPLICATE via Haiku) | ✅ |
 | Block C | adversarial tests (30), regression harness (387 tests) | ✅ |
-| **Block D** | **Structural chunking end-to-end** | ✅ |
+| Block D | Structural chunking end-to-end (D1-D5) | ✅ |
+| **D5+** | **NIST/BSI/ENISA PDF quality fixed** | ✅ |

-### Block D details (this session, 01-02.05.2026)
+### D5+ details (session 02-03.05.2026)

-- **D1:** Embedding service extracts section/section_title/paragraph/paragraph_num/page
-- **D2:** RAG service stores these fields in the Qdrant payload
-- **D3:** Control generator reads them for source_citation (priority: section > article > section_title)
-- **D4:** BGB § 312k validation — **critical bug found:** the phase-3 overlap destroyed the [§] prefix. Fixed.
-- **D5:** 430 of 436 documents re-ingested (all 6 compliance collections, ~70.000 chunks)
-- **HTML fix:** stripping + charset detection (ISO-8859-1) + opening block tags → HTML: 0%→97.6% section rate
-- **EUR-Lex:** 20 EU regulations replaced with HTML (GDPR: 0%→92%, AI Act: 33%→55%)
-- **pdfplumber:** added as PDF backend, PDF_EXTRACTION_BACKEND=auto in docker-compose.yml
-- **NIST regex:** numbered sections (1.1 Title), control IDs (AC-1, PO.1)
-- **Frontend bug:** requirements.map TypeError in breakpilot-compliance fixed
+**Problem solved:** 4 NIST PDFs had 0 chunks (the D5 script deleted before uploading, and the upload failed).
+
+**What was done:**
+- `_normalize_pdf_text()` in the embedding service: repairs broken section numbers ("1 . 1"→"1.1", "AC - 1"→"AC-1"), ligatures, soft hyphens
+- `_LEGAL_SECTION_RE` extended: NIST CSF 2.0, NIST enhancements, OWASP Top 10
+- `_SECTION_NUMBER_RE` extended: NIST control IDs (AC-1), numbered sections (3.1), OWASP (A01:2021)
+- `_SINGLE_NUM_ALLCAPS_RE` (case-sensitive): "1. INTRODUCTION" for ENISA/BSI docs
+- pdfplumber tolerances: x_tolerance=3, y_tolerance=4 (was 2/3)
+- **Local PDF extraction workaround:** the embedding-service container crashes on PDFs >5 MB (OOM). Fix: run pdfplumber locally on the Mac Mini, then upload the .txt.
+- `reingest_d5.py` safety fix: upload → verify → delete old (with `must_not` filter); see the sketch after this list
+- `reingest_nist.py` (NEW): safe re-ingest script
+- `reupload_legal_strategy.py` (NEW): re-upload with chunk_strategy="legal"
+- `extract_and_upload_nist.py` (NEW): local PDF extraction for large files
+- `scripts/qdrant-snapshot.sh` (NEW): backup of all Qdrant collections
+- 2 corrupt PDFs (nistir_8259a, nist_ai_rmf) were 263-byte XML errors in MinIO → downloaded fresh from nist.gov and ingested
+- **99 embedding-service tests green** (28 new NIST tests)
+- **Qdrant snapshot created:** 14 collections, ~1 GB under `backups/qdrant/`
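A minimal sketch of the upload → verify → delete order described above: the count and delete endpoints follow the Qdrant REST calls used elsewhere in this commit, while the `source_file` payload key and the helper name are assumptions, not the actual reingest_d5.py code.

```python
import httpx

QDRANT_URL = "http://localhost:6333"

def delete_old_chunks_safe(collection: str, source_file: str, new_document_id: str) -> None:
    """Delete the previous chunks of a document, but never the freshly uploaded ones."""
    with httpx.Client(timeout=60.0) as c:
        # Verify: the new upload must actually have produced chunks before anything is deleted.
        count = c.post(
            f"{QDRANT_URL}/collections/{collection}/points/count",
            json={
                "filter": {"must": [{"key": "document_id", "match": {"value": new_document_id}}]},
                "exact": True,
            },
        ).json()["result"]["count"]
        if count == 0:
            raise RuntimeError(f"Upload produced 0 chunks for {source_file}; keeping old chunks")

        # Delete old chunks: same source file, but NOT the new document_id (must_not filter).
        c.post(
            f"{QDRANT_URL}/collections/{collection}/points/delete",
            json={"filter": {
                "must": [{"key": "source_file", "match": {"value": source_file}}],
                "must_not": [{"key": "document_id", "match": {"value": new_document_id}}],
            }},
        ).raise_for_status()
```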

-### Current quality (as of 02.05.2026)
+### Current quality (as of 03.05.2026)

 | Document type | Section rate | Status |
 |---------------|--------------|--------|
 | DE laws (TXT) | **79-100%** | ✅ excellent |
 | HTML (EUR-Lex + gesetze-im-internet) | **40-99%** | ✅ good |
 | PDF (EDPB/DSK guidelines) | **60-98%** | ✅ good |
-| PDF (NIST/BSI/ENISA) | **0-10%** | ❌ OPEN |
+| PDF (NIST SP 800-53/82/160/207) | **27-45%** | ✅ good (was 0%) |
+| PDF (NIST CSF, 800-30, ENISA) | **5-13%** | 🟡 acceptable |
+| PDF (CISA Secure by Design) | **0%** | ⚪ prose document, expected |
 | TXT (OWASP) | **0%** | ❌ OPEN |
 | Legal templates (JSON/MD) | **0%** | ⚪ expected |

+**NIST section-rate improvements (this session):**
+- NIST SP 800-53: 0% → **45%** (2.847 chunks)
+- NIST SP 800-207: 0% → **43%** (207 chunks)
+- NIST SP 800-160: 0% → **36%** (977 chunks)
+- NIST SP 800-82: 0% → **27%** (2.301 chunks)
+- ENISA ICS/SCADA: 0% → **22%** (235 chunks)
+- ENISA Supply Chain Good Practices: 2% → **12%** (159 chunks)
+- ENISA Supply Chain Security: 0% → **5%** (184 chunks)
+
-**Quality report (sample of 500 controls):**
+**Quality report (sample of 500 controls, as of 02.05.):**
 - 13% fully correct
 - 41% article not present in the source text (controls generated from the old, broken chunks)
 - 7% no article in the citation
@@ -52,52 +71,9 @@
 ## WHAT TO DO NEXT (PRIORITIZED)

-### PRIO 1: Cleanly ingest the NIST/ENISA/BSI/OWASP documents
+### PRIO 1: Citation backfill (D6 — completes Block D)

-**Problem:** pypdf AND pdfplumber break the text of multi-column PDFs. The regex extension for NIST numbering did not help because the numbers are already broken after PDF extraction ("1 . 1" instead of "1.1", "AC - 1" instead of "AC-1").
-
-**3 large NIST PDFs are missing in Qdrant** (chunks deleted, upload failed with a 500 error):
-- NIST_SP_800_53r5.pdf (6 MB) — 0 chunks
-- nist_sp_800_82r3.pdf (8.5 MB) — 0 chunks
-- nist_sp_800_160v1r1.pdf (8.2 MB) — 0 chunks
-The originals are safe in MinIO.
-
-**Approaches to try (in this order):**
-
-1. **Text normalization** after PDF extraction in `embedding-service/main.py`:
-```python
-def _normalize_pdf_text(text: str) -> str:
-    """Fix broken spacing from pypdf/pdfplumber multi-column extraction."""
-    # "1 . 1" → "1.1"
-    text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
-    # "AC - 1" → "AC-1"
-    text = re.sub(r'([A-Z]{2})\s*-\s*(\d+)', r'\1-\2', text)
-    # "GV . OC - 01" → "GV.OC-01"
-    text = re.sub(r'([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d+)', r'\1.\2-\3', text)
-    return text
-```
-Insert into `extract_pdf_pdfplumber()` and `extract_pdf_pypdf()` before the return.
-
-2. **Missing regex patterns** in `_LEGAL_SECTION_RE`:
-   - `GV.OC-01` (NIST CSF 2.0): `[A-Z]{2}\.[A-Z]{2}-\d{2}`
-   - `A01:2021` (OWASP Top 10): `A\d{2}(?::\d{4})?`
-   - `AC-1(1)` (NIST enhancements): `[A-Z]{2}-\d+\(\d+\)`
-
-3. **HTML download** from the NIST/ENISA websites (as with EUR-Lex):
-   - NIST: `https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final` (HTML version)
-   - ENISA: `https://www.enisa.europa.eu/publications/` (HTML)
-   - Own script analogous to `control-pipeline/scripts/replace_eu_pdfs_with_html.py`
-
-4. **Fix the D5 script:** `control-pipeline/scripts/reingest_d5.py`, around line 170:
-   **CRITICAL:** currently delete-THEN-upload. If the upload fails, the chunks are gone.
-   Fix: upload FIRST into a temp collection or with a new document_id, THEN delete the old ones.
-
-**Affected documents (105 PDFs with <50% section rate):**
-The full list is in `eu_pdfs_to_replace.json` in the repo root.
-
-### PRIO 2: Citation backfill
-
-**Problem:** 41% of the controls carry a wrong source_citation.article because they were generated from the old (broken) chunks. The chunks are clean now, but the controls still carry the old citations.
+**Problem:** 41% of the controls carry a wrong source_citation.article because they were generated from the old (broken) chunks. The chunks are clean now (with section metadata), but the controls still carry the old citations.

 **Solution:**
 1. For each control: determine the regulation_code from `source_citation.source`
@@ -111,10 +87,12 @@ Must be extended to UPDATE instead of only reporting.
 **Existing backfill service:** `control-pipeline/services/citation_backfill.py`
 It has 3-tier matching: hash → regex → LLM. It must be adapted to the new chunk format.
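To see how far a backfill run got, the `backfill_method` value that `d6_citation_backfill.py` (added in this commit) writes into `generation_metadata` can be counted directly. A minimal sketch, assuming the same database defaults and schema as the script:

```python
import psycopg2

conn = psycopg2.connect("postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
cur = conn.cursor()
cur.execute("SET search_path TO compliance, core, public")
# Group controls by how (or whether) the D6 backfill matched them.
cur.execute("""
    SELECT COALESCE(generation_metadata->>'backfill_method', 'not_backfilled') AS method,
           COUNT(*)
      FROM canonical_controls
     WHERE license_rule IN (1, 2) AND source_citation IS NOT NULL
     GROUP BY 1
     ORDER BY 2 DESC
""")
for method, n in cur.fetchall():
    print(f"{method:>15}: {n}")
conn.close()
```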

-### PRIO 3: Ingest missing laws (Block E)
-**Effort:** ~0.5 day
+### PRIO 2: Ingest missing laws (Block E)

 **New laws (not yet in the RAG):**
-- **BEG IV** (Viertes Buerokratieentlastungsgesetz, BGBl. 2024 I Nr. 323) — shortened retention periods
+- **BEG IV** (Viertes Buerokratieentlastungsgesetz, BGBl. 2024 I Nr. 323)
 - ArbZG, MuSchG, NachwG, MiLoG, GmbHG, AktG, InsO
 - Gesetz fuer faire Verbrauchervertraege
 - CSRD, EU Taxonomy, CSDDD, eIDAS 2.0
@@ -122,8 +100,7 @@ It has 3-tier matching: hash → regex → LLM. It must be adapted to the new chunk format.
 - AT: ArbVG, AngG, AZG, GmbHG-AT, NISG

 **Outdated laws (to update):**
-- BGB: § 312k cancellation button since 01.07.2022 (now exists as TXT, should be up to date)
-- TMG → replace with TDDDG
+- TMG → replace with TDDDG (TMG repealed since 2024)
 - GwG: update (2024 amendments)
 - HGB: update (MoPeG 2024)
@@ -139,11 +116,9 @@ It has 3-tier matching: hash → regex → LLM. It must be adapted to the new chunk format.
 - Rule 3 (authority press releases, proprietary): ONLY our own wording
 - FORBIDDEN: ISO, beck-online, juris, DIN

-### PRIO 4: Investigate frontend 500 errors
-
-macmini:3007/sdk/control-library still shows 500 errors on API calls:
-- /controls, /canonical, /control-instances, /findings, /vendors, /contracts
+### PRIO 3: Investigate frontend 500 errors
+
+macmini:3007/sdk/control-library still shows 500 errors on API calls.
+The `requirements.map` TypeError is fixed (commit fe6764d in the compliance repo).
+The 500s could come from the compliance backend (port 8002) — check separately.
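To narrow down whether the 500s come from the compliance backend itself rather than the frontend, the listed endpoints can be probed directly. A minimal sketch; host and port are taken from the notes above, the exact route prefixes on port 8002 are an assumption:

```python
import httpx

BASE = "http://macmini:8002"  # compliance backend (assumption: same routes the frontend calls)
ENDPOINTS = ["/controls", "/canonical", "/control-instances", "/findings", "/vendors", "/contracts"]

with httpx.Client(timeout=10.0) as c:
    for ep in ENDPOINTS:
        try:
            r = c.get(f"{BASE}{ep}")
            print(f"{ep:<20} {r.status_code}")
        except httpx.HTTPError as exc:
            print(f"{ep:<20} ERROR: {exc}")
```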

@@ -155,16 +130,16 @@ The 500s could come from the compliance backend (port 8002) — check separately.
 - **Block A:** v1 wrap-up (healthcheck, dependencies, v1 tag)
 - **Block B:** review-verify (67k pairs)
 - **Block C:** tests (adversarial + regression)
-- **Block D:** structural chunking (D1-D5)
+- **Block D1-D5:** structural chunking end-to-end
+- **Block D5+:** NIST/ENISA/BSI PDF quality (text normalization, section detection, re-ingestion)

-### 🔄 IN PROGRESS
-- **Block D5+:** NIST/ENISA/BSI documents (Prio 1 above)
-- **Block E1:** EU regulations as HTML (20 of ~50 done)
+### 🔥 NEXT STEP
+- **Block D6:** citation backfill — re-attach the controls to the new chunks (Prio 1)

 ### 📋 PENDING

 **Block E: update existing laws + ingest new ones**
-- E1: update outdated sources (BGB, TMG→TDDDG, GwG, HGB)
+- E1: update outdated sources (TMG→TDDDG, GwG, HGB)
 - E2: missing DE laws (ArbZG, MuSchG, NachwG, MiLoG, etc.)
 - E3: missing EU regulation (CSRD, EU Taxonomy, CSDDD, eIDAS 2.0)
 - E4: missing standards, license-compliant (GoBD, BAIT/VAIT, PCI DSS Rule 3)
@@ -192,34 +167,32 @@ The 500s could come from the compliance backend (port 8002) — check separately.
 ## CRITICAL FILES

-### Main changes of this session (breakpilot-core)
+### Changes in session 03.05.2026 (breakpilot-core)

 | File | What |
 |------|------|
-| `embedding-service/main.py` | overlap bug fix (phase 3), pdfplumber backend, NIST regex |
-| `embedding-service/requirements.txt` | pdfplumber>=0.11.0 added |
-| `embedding-service/test_d4_bgb.py` | 18 BGB validation tests |
-| `embedding-service/tests/fixtures/bgb_312_excerpt.txt` | BGB §§ 312-312k test fixture |
-| `rag-service/api/documents.py` | D2 payload fields + HTML detection + encoding |
-| `rag-service/html_utils.py` | HTML strip + charset (NEW) |
-| `rag-service/embedding_client.py` | ChunkResult dataclass (D2) |
-| `rag-service/tests/` | 32 tests (D2 + HTML) |
-| `control-pipeline/services/rag_client.py` | page: Optional[int] in RAGSearchResult |
-| `control-pipeline/services/control_generator.py` | section>article>section_title priority + page |
-| `control-pipeline/services/decomposition_pass.py` | page number in _format_citation |
-| `control-pipeline/tests/test_d3_metadata.py` | 16 D3 tests |
-| `control-pipeline/scripts/reingest_d5.py` | re-ingestion script (MUST BE FIXED: upload before delete) |
-| `control-pipeline/scripts/reingest_d5_config.py` | config + helpers |
-| `control-pipeline/scripts/replace_eu_pdfs_with_html.py` | EUR-Lex HTML replacement |
-| `control-pipeline/scripts/quality_report.py` | E2E quality test (500 controls) |
-| `docker-compose.yml` | PDF_EXTRACTION_BACKEND=auto |
+| `embedding-service/main.py` | `_normalize_pdf_text()`, `_SECTION_NUMBER_RE` NIST patterns, `_SINGLE_NUM_ALLCAPS_RE`, pdfplumber tolerances |
+| `embedding-service/test_nist_normalization.py` | 41 new tests (normalization, section detection, metadata) |
+| `control-pipeline/scripts/reingest_nist.py` | safe re-ingest (upload-before-delete) |
+| `control-pipeline/scripts/reingest_d5.py` | safety fix: `_delete_old_chunks_safe()` with must_not filter |
+| `control-pipeline/scripts/reupload_legal_strategy.py` | re-upload with chunk_strategy="legal" |
+| `control-pipeline/scripts/extract_and_upload_nist.py` | local PDF extraction workaround (container OOM) |
+| `scripts/qdrant-snapshot.sh` | Qdrant backup of all collections |

-### Changes in breakpilot-compliance
+### Important: embedding-service container limit

-| File | What |
-|------|------|
-| `backend-compliance/compliance/services/canonical_control_service.py` | _ensure_list() for JSONB arrays |
-| `admin-compliance/app/sdk/control-library/components/ControlDetail.tsx` | Array.isArray() guard |
+The embedding-service container (8 GB RAM) crashes on PDFs >5 MB. Workaround (sketch below):
+1. Extract the PDF locally on the Mac Mini (`pdfplumber` is installed there)
+2. Apply `_normalize_pdf_text()`
+3. Upload as .txt with `chunk_strategy="legal"`
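A minimal sketch of that workaround: pdfplumber locally, a simplified version of the normalization inlined, then a plain-text upload via the RAG-service endpoint as called by the ingest scripts in this commit. Collection and metadata fields are illustrative:

```python
import json
import re

import httpx
import pdfplumber

RAG_URL = "https://localhost:8097"

def extract_and_upload(pdf_path: str, collection: str, regulation_id: str) -> dict:
    # 1) Extract locally; the embedding-service container OOMs on PDFs >5 MB.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(
            page.extract_text(x_tolerance=3, y_tolerance=4) or "" for page in pdf.pages
        )

    # 2) Repair section numbers broken by multi-column extraction (simplified).
    text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)     # "1 . 1"  -> "1.1"
    text = re.sub(r"([A-Z]{2})\s*-\s*(\d+)", r"\1-\2", text)  # "AC - 1" -> "AC-1"

    # 3) Upload the plain text with the legal chunking strategy.
    form = {
        "collection": collection,
        "data_type": "compliance",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        "chunk_strategy": "legal",
        "chunk_size": "1500",
        "chunk_overlap": "100",
        "metadata_json": json.dumps({"regulation_id": regulation_id}),
    }
    with httpx.Client(timeout=600.0, verify=False) as c:
        resp = c.post(
            f"{RAG_URL}/api/v1/documents/upload",
            files={"file": (f"{regulation_id}.txt", text.encode("utf-8"), "text/plain")},
            data=form,
        )
        resp.raise_for_status()
        return resp.json()
```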
+
+If the container limit is to be raised: `docker-compose.yml`, around line 445:
+```yaml
+deploy:
+  resources:
+    limits:
+      memory: 12G  # was 8G
+```

 ---
@@ -227,14 +200,16 @@ The 500s could come from the compliance backend (port 8002) — check separately.
 ### Qdrant (Mac Mini, port 6333)

-| Collection | Chunks | Documents | Section rate |
-|-----------|--------|-----------|--------------|
-| bp_compliance_ce | ~23.600 | ~55 | 47% |
-| bp_compliance_gesetze | ~32.000 | ~98 | 86% |
-| bp_compliance_datenschutz | ~13.000 | ~107 | 36% |
-| bp_dsfa_corpus | ~320 | ~20 | 60% |
-| bp_legal_templates | ~1.460 | ~100 | 7% |
-| **Total** | **~70.000** | **~430** | **62%** |
+| Collection | Chunks | Section rate | Change |
+|-----------|--------|--------------|--------|
+| bp_compliance_ce | ~23.600 | ~50% | NIST/ENISA re-ingested |
+| bp_compliance_gesetze | ~32.000 | ~86% | unchanged |
+| bp_compliance_datenschutz | ~13.000 | ~40% | NIST 800-53/207 re-ingested |
+| bp_dsfa_corpus | ~8.200 | ~60% | unchanged |
+| bp_legal_templates | ~1.460 | ~7% | unchanged |
+| **Total** | **~78.000** | **~65%** | improved (was 62%) |
+
+**Qdrant backup:** `backups/qdrant/` — 14 collections, ~1 GB (as of 03.05.2026 08:21)

 ### PostgreSQL (Mac Mini, port 5432)
@@ -247,15 +222,19 @@ The 500s could come from the compliance backend (port 8002) — check separately.
 ### MinIO (Hetzner, nbg1.your-objectstorage.com)

 All original documents are safe in MinIO, bucket: `breakpilot-rag`.
 Path format: `{data_type}/{bundesland}/{use_case}/{year}/{filename}`
+**Caution:** 2 files were corrupt (263-byte XML instead of PDF):
+- `nistir_8259a.pdf` — downloaded fresh from nist.gov, re-ingested ✅
+- `nist_ai_rmf.pdf` — downloaded fresh from nist.gov, re-ingested ✅
+The new PDFs were only ingested into Qdrant, NOT replaced in MinIO.
+For a MinIO update: manually via the RAG-service upload or the mc CLI.
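For the RAG-service route, re-uploading the fresh PDF through the same endpoint the ingest scripts use would also restore the MinIO original; note that this re-chunks the document into Qdrant as well, so the older chunks would need cleanup afterwards (the mc CLI is the cleaner route if only the stored file should change). A sketch, with collection and form fields as assumptions:

```python
import httpx

RAG_URL = "https://localhost:8097"

# Re-upload a freshly downloaded NIST PDF so the corrupt 263-byte stub in MinIO
# is replaced with a good copy. This also creates new chunks in Qdrant.
with open("nistir_8259a.pdf", "rb") as fh:
    with httpx.Client(timeout=600.0, verify=False) as c:
        resp = c.post(
            f"{RAG_URL}/api/v1/documents/upload",
            files={"file": ("nistir_8259a.pdf", fh, "application/pdf")},
            data={
                "collection": "bp_compliance_ce",  # assumption: collection the document belongs to
                "data_type": "compliance",
                "bundesland": "bund",
                "use_case": "compliance",
                "year": "2026",
                "chunk_strategy": "legal",
            },
        )
        resp.raise_for_status()
        print(resp.json())
```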

 ---

 ## RUNNING THE TESTS

 ```bash
-# Embedding service (58 tests)
-cd embedding-service && python3 -m pytest test_chunking.py test_d4_bgb.py -v
+# Embedding service (99 tests incl. 41 NIST tests)
+cd embedding-service && python3 -m pytest test_chunking.py test_d4_bgb.py test_nist_normalization.py -v

 # RAG service (32 tests)
 cd rag-service && PYTHONPATH=. python3 -m pytest tests/ -v
@@ -265,6 +244,12 @@ PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v

 # Quality report (500 controls against Qdrant)
 python3 control-pipeline/scripts/quality_report.py --db-host macmini --sample 500
+
+# Create a Qdrant snapshot
+ssh macmini "cd ~/Projekte/breakpilot-core && bash scripts/qdrant-snapshot.sh"
+
+# List Qdrant snapshots
+ssh macmini "cd ~/Projekte/breakpilot-core && bash scripts/qdrant-snapshot.sh --list"
 ```

 ---
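`scripts/qdrant-snapshot.sh` itself is not part of this diff; as a point of reference, a minimal Python equivalent using Qdrant's snapshot REST API (collection names as used by the D6 script) could look like this, creating server-side snapshots that the script presumably copies into `backups/qdrant/`:

```python
import httpx

QDRANT_URL = "http://localhost:6333"
COLLECTIONS = [
    "bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
    "bp_dsfa_corpus", "bp_legal_templates",
]

with httpx.Client(timeout=600.0) as c:
    for coll in COLLECTIONS:
        # POST /collections/{name}/snapshots creates a snapshot on the Qdrant server;
        # GET on the same path lists existing snapshots (what --list presumably wraps).
        resp = c.post(f"{QDRANT_URL}/collections/{coll}/snapshots")
        resp.raise_for_status()
        print(coll, resp.json()["result"]["name"])
```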
control-pipeline/scripts/d6_citation_backfill.py (new file)
@@ -0,0 +1,498 @@
#!/usr/bin/env python3
"""D6 Citation Backfill — update ~291k controls with section metadata from Qdrant chunks.

Archives old source_citation in generation_metadata.old_citation.
Updates source_citation.article, .paragraph, .page from matched Qdrant chunks.

3-tier matching:
    Tier 1: sha256(source_original_text) → exact chunk text match
    Tier 2: Parse [section] prefix from source_original_text
    Tier 3: Best text overlap within same regulation_id

Usage:
    python3 control-pipeline/scripts/d6_citation_backfill.py --dry-run --limit 100
    python3 control-pipeline/scripts/d6_citation_backfill.py --batch-size 1000
"""

import argparse
import hashlib
import json
import logging
import os
import re
import time
from dataclasses import dataclass
from typing import Optional

import httpx
import psycopg2
import psycopg2.extras

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("d6-backfill")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")

COLLECTIONS = [
    "bp_compliance_ce",
    "bp_compliance_gesetze",
    "bp_compliance_datenschutz",
    "bp_dsfa_corpus",
    "bp_legal_templates",
]

# Parse [§ 312k Title] or [AC-1 POLICY] prefix from chunk text
_SECTION_PREFIX_RE = re.compile(r'^\[([^\]]+)\]\s*')


@dataclass
class ChunkMeta:
    section: str
    section_title: str
    paragraph: str
    page: Optional[int]
    regulation_id: str


@dataclass
class Stats:
    total: int = 0
    already_correct: int = 0
    matched_hash: int = 0
    matched_prefix: int = 0
    matched_overlap: int = 0
    unmatched: int = 0
    updated: int = 0
    errors: int = 0


# -------------------------------------------------------------------
# Phase 1: Build Qdrant index
# -------------------------------------------------------------------

def build_qdrant_index(qdrant_url: str) -> tuple[dict, dict]:
    """Build hash index and regulation index from all Qdrant collections.

    Returns:
        hash_index: {sha256(chunk_text) → ChunkMeta}
        reg_index: {regulation_id → [ChunkMeta with text snippets]}
    """
    hash_index: dict[str, ChunkMeta] = {}
    reg_index: dict[str, list[tuple[str, ChunkMeta]]] = {}
    total_chunks = 0

    for coll in COLLECTIONS:
        offset = None
        coll_count = 0
        with httpx.Client(timeout=60.0) as c:
            while True:
                body = {
                    "limit": 250,
                    "with_payload": [
                        "chunk_text", "section", "section_title",
                        "paragraph", "page", "regulation_id",
                    ],
                    "with_vector": False,
                }
                if offset is not None:
                    body["offset"] = offset
                resp = c.post(
                    f"{qdrant_url}/collections/{coll}/points/scroll",
                    json=body,
                )
                resp.raise_for_status()
                data = resp.json()["result"]

                for pt in data["points"]:
                    p = pt.get("payload", {})
                    chunk_text = p.get("chunk_text", "")
                    if not chunk_text or len(chunk_text.strip()) < 30:
                        continue

                    meta = ChunkMeta(
                        section=p.get("section", "") or "",
                        section_title=p.get("section_title", "") or "",
                        paragraph=p.get("paragraph", "") or "",
                        page=p.get("page"),
                        regulation_id=p.get("regulation_id", "") or "",
                    )

                    # Hash index
                    h = hashlib.sha256(chunk_text.encode()).hexdigest()
                    if meta.section:  # only index chunks WITH section data
                        hash_index[h] = meta

                    # Regulation index (for text overlap matching)
                    if meta.regulation_id and meta.section:
                        reg_index.setdefault(meta.regulation_id, []).append(
                            (chunk_text[:500], meta)
                        )

                    coll_count += 1

                offset = data.get("next_page_offset")
                if offset is None:
                    break

        total_chunks += coll_count
        logger.info("  [%s] %d chunks indexed", coll, coll_count)

    logger.info("Qdrant index: %d total chunks, %d with section (hash), %d regulations",
                total_chunks, len(hash_index), len(reg_index))
    return hash_index, reg_index


# -------------------------------------------------------------------
# Phase 2: Load controls
# -------------------------------------------------------------------

def load_controls(db_url: str, limit: int = 0) -> list[dict]:
    """Load all controls needing citation update."""
    conn = psycopg2.connect(db_url)
    conn.set_session(autocommit=False)
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

    cur.execute("SET search_path TO compliance, core, public")

    query = """
        SELECT id, control_id, source_citation, source_original_text,
               generation_metadata, license_rule
        FROM canonical_controls
        WHERE license_rule IN (1, 2)
          AND source_citation IS NOT NULL
        ORDER BY control_id
    """
    if limit > 0:
        query += f" LIMIT {limit}"

    cur.execute(query)
    rows = cur.fetchall()
    conn.close()

    controls = []
    for row in rows:
        ctrl = dict(row)
        ctrl["id"] = str(ctrl["id"])
        for jf in ("source_citation", "generation_metadata"):
            val = ctrl.get(jf)
            if isinstance(val, str):
                try:
                    ctrl[jf] = json.loads(val)
                except (json.JSONDecodeError, TypeError):
                    ctrl[jf] = {}
            elif val is None:
                ctrl[jf] = {}
        controls.append(ctrl)

    return controls


# -------------------------------------------------------------------
# Phase 3: Matching
# -------------------------------------------------------------------

def match_control(
    ctrl: dict,
    hash_index: dict[str, ChunkMeta],
    reg_index: dict[str, list[tuple[str, ChunkMeta]]],
) -> tuple[Optional[ChunkMeta], str]:
    """Match a control to a Qdrant chunk. Returns (meta, method) or (None, '')."""
    source_text = ctrl.get("source_original_text", "") or ""

    # Tier 1: Hash match
    if source_text:
        h = hashlib.sha256(source_text.encode()).hexdigest()
        meta = hash_index.get(h)
        if meta and meta.section:
            return meta, "hash"

    # Tier 2: Parse [section] prefix from source_original_text
    if source_text:
        m = _SECTION_PREFIX_RE.match(source_text)
        if m:
            prefix = m.group(1).strip()
            parsed = _parse_section_from_prefix(prefix)
            if parsed:
                return parsed, "prefix"

    # Tier 3: Text overlap within same regulation
    gen_meta = ctrl.get("generation_metadata") or {}
    reg_id = gen_meta.get("source_regulation", "")
    if reg_id and source_text and reg_id in reg_index:
        best = _find_best_overlap(source_text, reg_index[reg_id])
        if best:
            return best, "overlap"

    return None, ""


def _parse_section_from_prefix(prefix: str) -> Optional[ChunkMeta]:
    """Parse a section prefix like '§ 312k Kuendigungsbutton' or 'AC-1 POLICY'."""
    if not prefix:
        return None

    # § pattern
    m = re.match(r'(§\s*\d+[a-z]*)\s*(.*)', prefix)
    if m:
        return ChunkMeta(
            section=m.group(1).strip(),
            section_title=m.group(2).strip(),
            paragraph="", page=None, regulation_id="",
        )

    # Art./Artikel pattern
    m = re.match(r'(Art(?:ikel|\.)\s*\d+)\s*(.*)', prefix, re.IGNORECASE)
    if m:
        return ChunkMeta(
            section=m.group(1).strip(),
            section_title=m.group(2).strip(),
            paragraph="", page=None, regulation_id="",
        )

    # NIST control pattern (AC-1, AU-2, etc.)
    m = re.match(r'([A-Z]{2,4}-\d+(?:\(\d+\))?)\s*(.*)', prefix)
    if m:
        return ChunkMeta(
            section=m.group(1).strip(),
            section_title=m.group(2).strip(),
            paragraph="", page=None, regulation_id="",
        )

    # Numbered section (3.1 Title)
    m = re.match(r'(\d+(?:\.\d+)+)\s*(.*)', prefix)
    if m:
        return ChunkMeta(
            section=m.group(1).strip(),
            section_title=m.group(2).strip(),
            paragraph="", page=None, regulation_id="",
        )

    # ALL-CAPS heading (fallback — use as section_title)
    if prefix == prefix.upper() and len(prefix) > 3:
        return ChunkMeta(
            section="", section_title=prefix,
            paragraph="", page=None, regulation_id="",
        )

    return None


def _find_best_overlap(source_text: str, chunks: list[tuple[str, ChunkMeta]]) -> Optional[ChunkMeta]:
    """Find chunk with best text overlap (simple word-set Jaccard)."""
    source_words = set(source_text.lower().split())
    if len(source_words) < 5:
        return None

    best_score = 0.0
    best_meta = None

    for chunk_text, meta in chunks:
        chunk_words = set(chunk_text.lower().split())
        if not chunk_words:
            continue
        intersection = len(source_words & chunk_words)
        union = len(source_words | chunk_words)
        jaccard = intersection / union if union > 0 else 0
        if jaccard > best_score and jaccard > 0.3:  # 30% threshold
            best_score = jaccard
            best_meta = meta

    return best_meta


# -------------------------------------------------------------------
# Phase 4: Update controls
# -------------------------------------------------------------------

def update_controls(
    db_url: str,
    controls: list[dict],
    hash_index: dict[str, ChunkMeta],
    reg_index: dict[str, list[tuple[str, ChunkMeta]]],
    dry_run: bool = True,
    batch_size: int = 1000,
) -> Stats:
    """Match and update all controls."""
    stats = Stats(total=len(controls))

    conn = psycopg2.connect(db_url)
    conn.set_session(autocommit=False)
    cur = conn.cursor()
    cur.execute("SET search_path TO compliance, core, public")

    updates = []

    for i, ctrl in enumerate(controls):
        if i > 0 and i % 5000 == 0:
            logger.info("Progress: %d/%d (hash=%d prefix=%d overlap=%d unmatched=%d)",
                        i, stats.total, stats.matched_hash, stats.matched_prefix,
                        stats.matched_overlap, stats.unmatched)

        citation = ctrl.get("source_citation") or {}
        old_article = citation.get("article", "")
        gen_meta = ctrl.get("generation_metadata") or {}

        # Match
        meta, method = match_control(ctrl, hash_index, reg_index)

        if not meta or not meta.section:
            # No match — check if existing article is already good
            if old_article:
                stats.already_correct += 1
            else:
                stats.unmatched += 1
            continue

        # Check if update is needed
        if old_article == meta.section:
            stats.already_correct += 1
            continue

        # Track method
        if method == "hash":
            stats.matched_hash += 1
        elif method == "prefix":
            stats.matched_prefix += 1
        elif method == "overlap":
            stats.matched_overlap += 1

        # Archive old citation
        if old_article or citation.get("paragraph"):
            gen_meta["old_citation"] = {
                "article": old_article,
                "paragraph": citation.get("paragraph", ""),
                "page": citation.get("page"),
                "archived_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            }

        # Update citation
        citation["article"] = meta.section
        if meta.paragraph:
            citation["paragraph"] = meta.paragraph
        if meta.page is not None:
            citation["page"] = meta.page

        # Update generation_metadata
        gen_meta["source_article"] = meta.section
        if meta.paragraph:
            gen_meta["source_paragraph"] = meta.paragraph
        if meta.page is not None:
            gen_meta["source_page"] = meta.page
        gen_meta["backfill_method"] = method
        gen_meta["backfill_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

        updates.append((
            json.dumps(citation, ensure_ascii=False),
            json.dumps(gen_meta, ensure_ascii=False, default=str),
            ctrl["id"],
        ))

        # Batch commit
        if len(updates) >= batch_size and not dry_run:
            _execute_batch(cur, updates)
            conn.commit()
            stats.updated += len(updates)
            logger.info("Committed batch: %d updates (total %d)", len(updates), stats.updated)
            updates = []

    # Final batch
    if updates and not dry_run:
        _execute_batch(cur, updates)
        conn.commit()
        stats.updated += len(updates)
        logger.info("Committed final batch: %d updates (total %d)", len(updates), stats.updated)
    elif updates and dry_run:
        stats.updated = len(updates)  # would-be updates

    conn.close()
    return stats


def _execute_batch(cur, updates: list[tuple]):
    """Execute batch UPDATE statements."""
    for citation_json, meta_json, ctrl_id in updates:
        cur.execute(
            """UPDATE canonical_controls
               SET source_citation = %s::jsonb,
                   generation_metadata = %s::jsonb,
                   updated_at = NOW()
               WHERE id = %s::uuid""",
            (citation_json, meta_json, ctrl_id),
        )


# -------------------------------------------------------------------
# Main
# -------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(description="D6 Citation Backfill")
    parser.add_argument("--dry-run", action="store_true", help="Don't write to DB")
    parser.add_argument("--limit", type=int, default=0, help="Limit controls (0=all)")
    parser.add_argument("--batch-size", type=int, default=1000)
    parser.add_argument("--db-url", default=DB_URL)
    parser.add_argument("--qdrant-url", default=QDRANT_URL)
    args = parser.parse_args()

    logger.info("=" * 60)
    logger.info("D6 Citation Backfill")
    logger.info("  DB:      %s", args.db_url.split("@")[-1])
    logger.info("  Qdrant:  %s", args.qdrant_url)
    logger.info("  Dry run: %s", args.dry_run)
    logger.info("  Limit:   %s", args.limit or "ALL")
    logger.info("=" * 60)

    # Phase 1: Build Qdrant index
    logger.info("\nPhase 1: Building Qdrant index...")
    t0 = time.time()
    hash_index, reg_index = build_qdrant_index(args.qdrant_url)
    logger.info("Index built in %.1fs", time.time() - t0)

    # Phase 2: Load controls
    logger.info("\nPhase 2: Loading controls...")
    controls = load_controls(args.db_url, args.limit)
    logger.info("Loaded %d controls", len(controls))

    if not controls:
        logger.info("No controls to process")
        return

    # Phase 3+4: Match and update
    logger.info("\nPhase 3+4: Matching and updating...")
    t0 = time.time()
    stats = update_controls(
        args.db_url, controls, hash_index, reg_index,
        dry_run=args.dry_run, batch_size=args.batch_size,
    )
    elapsed = time.time() - t0

    # Summary
    logger.info("\n" + "=" * 60)
    logger.info("RESULTS")
    logger.info("=" * 60)
    logger.info("  Total controls:    %d", stats.total)
    logger.info("  Already correct:   %d (%.1f%%)", stats.already_correct,
                stats.already_correct / max(stats.total, 1) * 100)
    logger.info("  Matched (hash):    %d (%.1f%%)", stats.matched_hash,
                stats.matched_hash / max(stats.total, 1) * 100)
    logger.info("  Matched (prefix):  %d (%.1f%%)", stats.matched_prefix,
                stats.matched_prefix / max(stats.total, 1) * 100)
    logger.info("  Matched (overlap): %d (%.1f%%)", stats.matched_overlap,
                stats.matched_overlap / max(stats.total, 1) * 100)
    logger.info("  Unmatched:         %d (%.1f%%)", stats.unmatched,
                stats.unmatched / max(stats.total, 1) * 100)
    logger.info("  Updated:           %d", stats.updated)
    logger.info("  Errors:            %d", stats.errors)
    logger.info("  Time: %.1fs (%.0f controls/sec)", elapsed,
                stats.total / max(elapsed, 1))

    if args.dry_run:
        logger.info("\nDRY RUN — no changes written. Run without --dry-run to apply.")


if __name__ == "__main__":
    main()
control-pipeline/scripts/ingest_de_laws.py (new file)
@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""Ingest missing German laws from gesetze-im-internet.de.

Downloads full HTML, strips to text, uploads with legal chunking strategy.
Handles ISO-8859-1 charset typical for gesetze-im-internet.de.

Usage (on Mac Mini):
    python3 control-pipeline/scripts/ingest_de_laws.py --dry-run
    python3 control-pipeline/scripts/ingest_de_laws.py
"""

import argparse
import json
import logging
import time
from typing import Optional

import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-laws")

RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_gesetze"

# ---- Laws to ingest ----
# Format: (slug on gesetze-im-internet.de, regulation_id, display_name)
# URL pattern: https://www.gesetze-im-internet.de/{slug}/BJNR*.html (full text)

LAWS = [
    {
        "url": "https://www.gesetze-im-internet.de/arbzg/BJNR117100994.html",
        "regulation_id": "de_arbzg",
        "name": "Arbeitszeitgesetz (ArbZG)",
        "short": "ArbZG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/muschg_2018/BJNR122810017.html",
        "regulation_id": "de_muschg",
        "name": "Mutterschutzgesetz (MuSchG)",
        "short": "MuSchG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/nachwg/BJNR094610995.html",
        "regulation_id": "de_nachwg",
        "name": "Nachweisgesetz (NachwG)",
        "short": "NachwG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/milog/BJNR134810014.html",
        "regulation_id": "de_milog",
        "name": "Mindestlohngesetz (MiLoG)",
        "short": "MiLoG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/gmbhg/BJNR004770892.html",
        "regulation_id": "de_gmbhg",
        "name": "GmbH-Gesetz (GmbHG)",
        "short": "GmbHG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/aktg/BJNR010890965.html",
        "regulation_id": "de_aktg",
        "name": "Aktiengesetz (AktG)",
        "short": "AktG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/inso/BJNR286600994.html",
        "regulation_id": "de_inso",
        "name": "Insolvenzordnung (InsO)",
        "short": "InsO",
    },
    # BEG IV is an amending act — there is no standalone text on gesetze-im-internet.de
    {
        "url": "https://www.gesetze-im-internet.de/verpflg/BJNR009690974.html",
        "regulation_id": "de_verpflichtungsgesetz",
        "name": "Verpflichtungsgesetz",
        "short": "VerpflG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/burlg/BJNR000020963.html",
        "regulation_id": "de_burlg",
        "name": "Bundesurlaubsgesetz (BUrlG)",
        "short": "BUrlG",
    },
    {
        "url": "https://www.gesetze-im-internet.de/entgfg/BJNR118010994.html",
        "regulation_id": "de_entgfg",
        "name": "Entgeltfortzahlungsgesetz (EntgFG)",
        "short": "EntgFG",
    },
]


def download_law(url: str) -> Optional[str]:
    """Download law HTML from gesetze-im-internet.de, handle charset."""
    with httpx.Client(timeout=30.0, follow_redirects=True) as c:
        resp = c.get(url)
        if resp.status_code != 200:
            logger.error("  HTTP %d for %s", resp.status_code, url)
            return None

        # gesetze-im-internet.de uses ISO-8859-1
        content_type = resp.headers.get("content-type", "")
        if "charset" in content_type:
            # Use declared charset
            html = resp.text
        else:
            # Try UTF-8 first, fall back to ISO-8859-1
            try:
                html = resp.content.decode("utf-8")
                if "\ufffd" in html:
                    raise UnicodeDecodeError("utf-8", b"", 0, 1, "replacement chars")
            except (UnicodeDecodeError, ValueError):
                html = resp.content.decode("iso-8859-1")

        return html


def upload_html(
    html: str,
    filename: str,
    regulation_id: str,
    name: str,
    short: str,
    dry_run: bool = False,
) -> Optional[dict]:
    """Upload HTML to RAG service with legal chunking."""
    if dry_run:
        logger.info("  DRY RUN — would upload %d chars", len(html))
        return {"chunks_count": 0, "document_id": "dry-run"}

    meta = {
        "regulation_id": regulation_id,
        "regulation_name_de": name,
        "regulation_short": short,
        "source": "gesetze-im-internet.de",
        "license": "public_domain_de_law",
        "jurisdiction": "DE",
        "source_type": "law",
    }
    form_data = {
        "collection": COLLECTION,
        "data_type": "compliance",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        "chunk_strategy": "legal",
        "chunk_size": "1500",
        "chunk_overlap": "100",
        "metadata_json": json.dumps(meta, ensure_ascii=False),
    }
    with httpx.Client(timeout=600.0, verify=False) as c:
        resp = c.post(
            f"{RAG_URL}/api/v1/documents/upload",
            files={"file": (filename, html.encode("utf-8"), "text/html")},
            data=form_data,
        )
        resp.raise_for_status()
        return resp.json()


def count_existing(regulation_id: str) -> int:
    """Check if regulation already exists in Qdrant."""
    with httpx.Client(timeout=30.0) as c:
        resp = c.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
            json={
                "filter": {"must": [
                    {"key": "regulation_id", "match": {"value": regulation_id}}
                ]},
                "exact": True,
            },
        )
        resp.raise_for_status()
        return resp.json()["result"]["count"]


def main():
    parser = argparse.ArgumentParser(description="Ingest DE laws from gesetze-im-internet.de")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    logger.info("=" * 60)
    logger.info("Ingest German Laws")
    logger.info("  Laws:       %d", len(LAWS))
    logger.info("  Collection: %s", COLLECTION)
    logger.info("  Dry run:    %s", args.dry_run)
    logger.info("=" * 60)

    results = []
    for i, law in enumerate(LAWS, 1):
        logger.info("\n[%d/%d] %s (%s)", i, len(LAWS), law["name"], law["regulation_id"])

        # Check if already exists
        existing = count_existing(law["regulation_id"])
        if existing > 0:
            logger.info("  Already exists: %d chunks — SKIPPING", existing)
            results.append({"law": law["short"], "status": "exists", "chunks": existing})
            continue

        # Download
        logger.info("  Downloading: %s", law["url"])
        html = download_law(law["url"])
        if not html:
            results.append({"law": law["short"], "status": "download_failed", "chunks": 0})
            continue
        logger.info("  Downloaded: %d chars", len(html))

        # Upload
        filename = f"{law['regulation_id']}.html"
        try:
            result = upload_html(
                html, filename, law["regulation_id"],
                law["name"], law["short"], args.dry_run,
            )
            chunks = result.get("chunks_count", 0) if result else 0
            logger.info("  Uploaded: %d chunks", chunks)
            results.append({"law": law["short"], "status": "ok", "chunks": chunks})
        except Exception as e:
            logger.error("  Upload FAILED: %s", e)
            results.append({"law": law["short"], "status": "error", "chunks": 0})

        if i < len(LAWS):
            time.sleep(1)

    # Summary
    logger.info("\n" + "=" * 60)
    logger.info("RESULTS")
    logger.info("=" * 60)
    for r in results:
        logger.info("  %-10s %s  chunks=%d", r["law"], r["status"].upper(), r["chunks"])

    total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
    logger.info("\nTotal new chunks: %d", total_new)


if __name__ == "__main__":
    main()
control-pipeline/scripts/ingest_eu_regulations.py (new file)
@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""Ingest missing EU regulations from EUR-Lex (HTML).

Downloads German HTML from EUR-Lex via CELEX number, uploads with legal chunking.

Usage (on Mac Mini):
    python3 control-pipeline/scripts/ingest_eu_regulations.py --dry-run
    python3 control-pipeline/scripts/ingest_eu_regulations.py
"""

import argparse
import json
import logging
import time

import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-eu")

RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_ce"

EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"

# ---- EU Regulations to ingest ----
REGULATIONS = [
    {
        "celex": "32022L2464",
        "regulation_id": "csrd_2022",
        "name": "Corporate Sustainability Reporting Directive (CSRD)",
        "short": "CSRD",
        "category": "sustainability",
    },
    {
        "celex": "32024L1760",
        "regulation_id": "csddd_2024",
        "name": "Corporate Sustainability Due Diligence Directive (CSDDD)",
        "short": "CSDDD",
        "category": "sustainability",
    },
    {
        "celex": "32020R0852",
        "regulation_id": "eu_taxonomy_2020",
        "name": "EU-Taxonomie-Verordnung",
        "short": "EU Taxonomy",
        "category": "sustainability",
    },
    {
        "celex": "32024R1183",
        "regulation_id": "eidas_2_0_2024",
        "name": "eIDAS 2.0 Verordnung (EU Digital Identity)",
        "short": "eIDAS 2.0",
        "category": "digital_identity",
    },
    {
        "celex": "32023L0970",
        "regulation_id": "pay_transparency_2023",
        "name": "Entgelttransparenz-Richtlinie",
        "short": "Pay Transparency",
        "category": "employment",
    },
    {
        "celex": "32022R2065",
        "regulation_id": "dsa_2022_updated",
        "name": "Digital Services Act (DSA) — aktualisiert",
        "short": "DSA",
        "category": "digital_services",
        "skip_if_exists": "dsa_2022",  # already exists under different ID
    },
]


def download_eurlex(celex: str) -> str:
    """Download EU regulation HTML from EUR-Lex.

    Note (per commit message): EUR-Lex sits behind an AWS WAF that may block
    automated downloads; in that case fetch the HTML manually in a browser
    and upload it by hand.
    """
    url = EURLEX_URL.format(celex=celex)
    with httpx.Client(timeout=30.0, follow_redirects=True) as c:
        resp = c.get(url)
        resp.raise_for_status()
        return resp.text


def upload_html(html: str, filename: str, reg: dict, dry_run: bool = False):
    """Upload HTML to RAG service."""
    if dry_run:
        logger.info("  DRY RUN — would upload %d chars", len(html))
        return {"chunks_count": 0}

    meta = {
        "regulation_id": reg["regulation_id"],
        "regulation_name_de": reg["name"],
        "regulation_short": reg["short"],
        "celex": reg["celex"],
        "category": reg["category"],
        "source": "EUR-Lex",
        "license": "EU_law",
        "jurisdiction": "EU",
        "source_type": "law",
    }
    form_data = {
        "collection": COLLECTION,
        "data_type": "compliance",
        "bundesland": "bund",
        "use_case": "compliance",
        "year": "2026",
        "chunk_strategy": "legal",
        "chunk_size": "1500",
        "chunk_overlap": "100",
        "metadata_json": json.dumps(meta, ensure_ascii=False),
    }
    with httpx.Client(timeout=600.0, verify=False) as c:
        resp = c.post(
            f"{RAG_URL}/api/v1/documents/upload",
            files={"file": (filename, html.encode("utf-8"), "text/html")},
            data=form_data,
        )
        resp.raise_for_status()
        return resp.json()


def count_existing(regulation_id: str) -> int:
    with httpx.Client(timeout=60.0) as c:
        resp = c.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
            json={"filter": {"must": [
                {"key": "regulation_id", "match": {"value": regulation_id}}
            ]}, "exact": True},
        )
        resp.raise_for_status()
        return resp.json()["result"]["count"]


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    logger.info("=" * 60)
    logger.info("Ingest EU Regulations from EUR-Lex")
    logger.info("  Regulations: %d", len(REGULATIONS))
    logger.info("  Dry run:     %s", args.dry_run)
    logger.info("=" * 60)

    results = []
    for i, reg in enumerate(REGULATIONS, 1):
        logger.info("\n[%d/%d] %s (CELEX: %s)", i, len(REGULATIONS), reg["name"], reg["celex"])

        # Skip if variant already exists
        skip_id = reg.get("skip_if_exists")
        if skip_id:
            existing = count_existing(skip_id)
            if existing > 0:
                logger.info("  Already exists as '%s' (%d chunks) — SKIPPING", skip_id, existing)
                results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
                continue

        # Check if this exact ID exists
        existing = count_existing(reg["regulation_id"])
        if existing > 0:
            logger.info("  Already exists: %d chunks — SKIPPING", existing)
            results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
            continue

        # Download from EUR-Lex
        logger.info("  Downloading from EUR-Lex...")
        try:
            html = download_eurlex(reg["celex"])
            logger.info("  Downloaded: %d chars", len(html))
        except Exception as e:
            logger.error("  Download FAILED: %s", e)
            results.append({"reg": reg["short"], "status": "download_failed", "chunks": 0})
            continue

        # Upload
        filename = f"{reg['regulation_id']}.html"
        try:
            result = upload_html(html, filename, reg, args.dry_run)
            chunks = result.get("chunks_count", 0)
            logger.info("  Uploaded: %d chunks", chunks)
            results.append({"reg": reg["short"], "status": "ok", "chunks": chunks})
        except Exception as e:
            logger.error("  Upload FAILED: %s", e)
            results.append({"reg": reg["short"], "status": "error", "chunks": 0})

        if i < len(REGULATIONS):
            time.sleep(2)

    # Summary
    logger.info("\n" + "=" * 60)
    logger.info("RESULTS")
    logger.info("=" * 60)
    for r in results:
        logger.info("  %-20s %s  chunks=%d", r["reg"], r["status"].upper(), r["chunks"])

    total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
    logger.info("\nTotal new chunks: %d", total_new)


if __name__ == "__main__":
    main()