feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts

- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap), archives old citations, updated 3,651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG, MiLoG, GmbHG, AktG, InsO, BUrlG; 1,629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency manually ingested (1,057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 13:19:27 +02:00
parent a9671a572b
commit 118be3540d
4 changed files with 1033 additions and 109 deletions
@@ -1,6 +1,6 @@
 # Session Instructions: Pipeline Quality + Laws

-**Date:** 2026-05-02
+**Date:** 2026-05-03
 **For:** Next Claude session
 **Repo:** breakpilot-core (~/Projekte/breakpilot-core)

@@ -8,7 +8,7 @@

 ## SUMMARY: WHERE WE STAND

-### What is done (Blocks A-D)
+### What is done (Blocks A-D5+)

 | Block | What | Status |
 |-------|-----|--------|
@@ -16,33 +16,52 @@
 | Block A | v1 tag, healthcheck, text correction, dependencies (15,291) | ✅ |
 | Block B | Review-Verify (67k pairs, 43,527 DUPLIKAT via Haiku) | ✅ |
 | Block C | Adversarial tests (30), regression harness (387 tests) | ✅ |
-| **Block D** | **Structural chunking end-to-end** | ✅ |
+| Block D | Structural chunking end-to-end (D1-D5) | ✅ |
+| **D5+** | **NIST/BSI/ENISA PDF quality fixed** | ✅ |

-### Block D details (this session, 2026-05-01 to 2026-05-02)
+### D5+ details (session 2026-05-02 to 2026-05-03)

- **D1:** Embedding service extracts section/section_title/paragraph/paragraph_num/page
- **D2:** RAG service stores these fields in the Qdrant payload
- **D3:** Control generator reads them for source_citation (priority: section > article > section_title)
- **D4:** BGB § 312k validation: **critical bug found:** the Phase 3 overlap destroyed the [§] prefix. Fixed.
- **D5:** 430 of 436 documents re-ingested (all 6 compliance collections, ~70,000 chunks)
- **HTML fix:** Stripping + charset detection (ISO-8859-1) + opening block tags → HTML: 0%→97.6% section rate
- **EUR-Lex:** 20 EU regulations replaced with HTML (DSGVO: 0%→92%, AI Act: 33%→55%)
- **pdfplumber:** As PDF backend, PDF_EXTRACTION_BACKEND=auto in docker-compose.yml
- **NIST regex:** Numbered sections (1.1 Title), control IDs (AC-1, PO.1)
- **Frontend bug:** requirements.map TypeError in breakpilot-compliance fixed
+**Problem solved:** 4 NIST PDFs had 0 chunks (the D5 script deleted before uploading, and the upload failed).

-### Current quality (as of 2026-05-02)
+**What was done:**
+- `_normalize_pdf_text()` in embedding-service: repairs broken section numbers ("1 . 1"→"1.1", "AC - 1"→"AC-1"), ligatures, soft hyphens
+- `_LEGAL_SECTION_RE` extended: NIST CSF 2.0, NIST enhancements, OWASP Top 10
+- `_SECTION_NUMBER_RE` extended: NIST control IDs (AC-1), numbered sections (3.1), OWASP (A01:2021)
+- `_SINGLE_NUM_ALLCAPS_RE` (case-sensitive): "1. INTRODUCTION" for ENISA/BSI docs
+- pdfplumber tolerances: x_tolerance=3, y_tolerance=4 (was 2/3)
+- **Local PDF extraction workaround:** The embedding-service container crashes on PDFs >5 MB (OOM). Fix: run pdfplumber locally on the Mac Mini, then upload the .txt.
+- `reingest_d5.py` safety fix: upload → verify → delete old (with `must_not` filter)
+- `reingest_nist.py` (NEW): safe re-ingest script
+- `reupload_legal_strategy.py` (NEW): re-upload with chunk_strategy="legal"
+- `extract_and_upload_nist.py` (NEW): local PDF extraction for large files
+- `scripts/qdrant-snapshot.sh` (NEW): backup of all Qdrant collections
+- 2 corrupt PDFs (nistir_8259a, nist_ai_rmf) were 263-byte XML error responses in MinIO → re-downloaded from nist.gov and ingested
+- **99 embedding-service tests green** (41 new NIST tests)
+- **Qdrant snapshot created:** 14 collections, ~1 GB under `backups/qdrant/`
+
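The normalization step listed above can be illustrated with a minimal sketch. It is not the code from `embedding-service/main.py`: the function name `normalize_pdf_text_sketch`, the ligature table, and the exact regexes are assumptions, chosen only to reproduce the repairs named in the list (broken section numbers, ligatures, soft hyphens).

```python
import re

# Hypothetical ligature table; the real service may cover more code points.
_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def normalize_pdf_text_sketch(text: str) -> str:
    """Repair common PDF extraction damage before section detection runs."""
    # Remove soft hyphens (U+00AD) left over from line-break hyphenation.
    text = text.replace("\u00ad", "")
    # Expand typographic ligatures into plain letters.
    for ligature, plain in _LIGATURES.items():
        text = text.replace(ligature, plain)
    # "GV . OC - 01" -> "GV.OC-01" (NIST CSF 2.0 style identifiers)
    text = re.sub(r"\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d+)", r"\1.\2-\3", text)
    # "AC - 1" -> "AC-1" (NIST control IDs)
    text = re.sub(r"\b([A-Z]{2})\s*-\s*(\d+)", r"\1-\2", text)
    # "1 . 1" -> "1.1" (numbered sections)
    text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)
    return text
```

The identifier patterns run before the generic numbered-section pattern so that the more specific forms are joined first. Two-letter prefixes followed by a dash and a number can also occur in ordinary prose, so a production version would probably need tighter anchoring.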
+### Current quality (as of 2026-05-03)

 | Document type | Section rate | Status |
 |-------------|-------------|--------|
 | DE laws (TXT) | **79-100%** | ✅ Excellent |
 | HTML (EUR-Lex + gesetze-im-internet) | **40-99%** | ✅ Good |
 | PDF (EDPB/DSK guidelines) | **60-98%** | ✅ Good |
-| PDF (NIST/BSI/ENISA) | **0-10%** | ❌ OPEN |
+| PDF (NIST SP 800-53/82/160/207) | **27-45%** | ✅ Good (was 0%) |
+| PDF (NIST CSF, 800-30, ENISA) | **5-13%** | 🟡 Acceptable |
+| PDF (CISA Secure by Design) | **0%** | ⚪ Prose document, expected |
 | TXT (OWASP) | **0%** | ❌ OPEN |
 | Legal Templates (JSON/MD) | **0%** | ⚪ Expected |

-**Quality report (500-control sample):**
+**NIST section-rate improvements (this session):**
+- NIST SP 800-53: 0% → **45%** (2,847 chunks)
+- NIST SP 800-207: 0% → **43%** (207 chunks)
+- NIST SP 800-160: 0% → **36%** (977 chunks)
+- NIST SP 800-82: 0% → **27%** (2,301 chunks)
+- ENISA ICS/SCADA: 0% → **22%** (235 Chunks)
+- ENISA Supply Chain Good Practices: 2% → **12%** (159 Chunks)
+- ENISA Supply Chain Security: 0% → **5%** (184 Chunks)
+
+**Quality report (500-control sample, as of 2026-05-02):**
 - 13% fully correct
 - 41% article not found in source text (controls from old broken chunks)
 - 7% no article in citation
@@ -52,52 +71,9 @@

 ## WHAT TO DO NEXT (PRIORITIZED)

-### PRIO 1: Cleanly ingest NIST/ENISA/BSI/OWASP documents
+### PRIO 1: Citation backfill (D6, completes Block D)

-**Problem:** pypdf AND pdfplumber both break the text of multi-column PDFs. The regex extension for NIST numbers did not help because the numbers are broken after PDF extraction ("1 . 1" instead of "1.1", "AC - 1" instead of "AC-1").
-
-**3 large NIST PDFs are missing from Qdrant** (chunks deleted, upload returned a 500 error):
- NIST_SP_800_53r5.pdf (6 MB): 0 chunks
- nist_sp_800_82r3.pdf (8.5 MB): 0 chunks
- nist_sp_800_160v1r1.pdf (8.2 MB): 0 chunks
-The originals are safe in MinIO.
-
-**Approaches (test in this order):**
-
-1. **Text normalization** after PDF extraction in `embedding-service/main.py`:
-   ```python
-   def _normalize_pdf_text(text: str) -> str:
-       """Fix broken spacing from pypdf/pdfplumber multi-column extraction."""
-       # "1 . 1" → "1.1"
-       text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
-       # "AC - 1" → "AC-1"
-       text = re.sub(r'([A-Z]{2})\s*-\s*(\d+)', r'\1-\2', text)
-       # "GV . OC - 01" → "GV.OC-01"
-       text = re.sub(r'([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d+)', r'\1.\2-\3', text)
-       return text
-   ```
-   Insert into `extract_pdf_pdfplumber()` and `extract_pdf_pypdf()` before the return.
-
-2. **Missing regex patterns** in `_LEGAL_SECTION_RE`:
-   - `GV.OC-01` (NIST CSF 2.0): `[A-Z]{2}\.[A-Z]{2}-\d{2}`
-   - `A01:2021` (OWASP Top 10): `A\d{2}(?::\d{4})?`
-   - `AC-1(1)` (NIST Enhancements): `[A-Z]{2}-\d+\(\d+\)`
-
-3. **HTML download** from the NIST/ENISA websites (as with EUR-Lex):
-   - NIST: `https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final` (HTML version)
-   - ENISA: `https://www.enisa.europa.eu/publications/` (HTML)
-   - Custom script analogous to `control-pipeline/scripts/replace_eu_pdfs_with_html.py`
-
-4. **Fix the D5 script:** `control-pipeline/scripts/reingest_d5.py` line ~170:
-   **CRITICAL:** Currently delete THEN upload. If the upload fails, the chunks are gone.
-   Fix: upload FIRST into a temp collection or with a new document_id, THEN delete the old chunks.
-
-**Affected documents (105 PDFs with <50% section rate):**
-Full list in `eu_pdfs_to_replace.json` in the repo root.
-
-### PRIO 2: Citation backfill
-
-**Problem:** 41% of controls have an incorrect source_citation.article because they were generated from old (broken) chunks. The chunks are now clean, but the controls still carry the old citations.
+**Problem:** 41% of controls have an incorrect source_citation.article because they were generated from old (broken) chunks. The chunks are now clean (with section metadata), but the controls still carry the old citations.

 **Solution:**
 1. For each control: resolve `source_citation.source` → regulation_code
@@ -111,10 +87,12 @@ Must be extended to UPDATE instead of report only.
 **Existing backfill service:** `control-pipeline/services/citation_backfill.py`
 Has 3-tier matching: Hash → Regex → LLM. Must be adapted to the new chunk format.
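
A tiered matcher of the kind named in the commit message (hash/prefix/overlap) could look roughly like the sketch below. It is not taken from `citation_backfill.py` or `d6_citation_backfill.py`: the function `match_chunk`, the chunk dictionary shape (`text` key), and the thresholds are assumptions, and it further assumes each control still carries the source text it was generated from.

```python
import hashlib
import re
from typing import Optional


def _norm(text: str) -> str:
    """Collapse whitespace and lowercase so formatting noise does not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", _norm(text)))


def match_chunk(
    control_text: str,
    chunks: list,
    prefix_len: int = 80,       # hypothetical threshold
    min_overlap: float = 0.6,   # hypothetical threshold
) -> Optional[tuple]:
    """Return (chunk, tier) for the first tier that finds a match, else None."""
    target = _norm(control_text)
    target_hash = hashlib.sha256(target.encode("utf-8")).hexdigest()

    # Tier 1: exact match on the hash of the normalized text.
    for chunk in chunks:
        digest = hashlib.sha256(_norm(chunk["text"]).encode("utf-8")).hexdigest()
        if digest == target_hash:
            return chunk, "hash"

    # Tier 2: the chunk starts with the first prefix_len characters of the target.
    prefix = target[:prefix_len]
    if prefix:
        for chunk in chunks:
            if _norm(chunk["text"]).startswith(prefix):
                return chunk, "prefix"

    # Tier 3: best Jaccard token overlap, accepted only above min_overlap.
    target_tokens = _tokens(control_text)
    best, best_score = None, 0.0
    for chunk in chunks:
        chunk_tokens = _tokens(chunk["text"])
        union = target_tokens | chunk_tokens
        score = len(target_tokens & chunk_tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = chunk, score
    if best is not None and best_score >= min_overlap:
        return best, "overlap"
    return None
```

Returning the tier alongside the chunk makes it possible to report how many controls were matched by each tier, which helps when deciding whether the overlap threshold is too permissive.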

-### PRIO 3: Ingest missing laws (Block E)
+**Effort:** ~0.5 day
+
+### PRIO 2: Ingest missing laws (Block E)

 **New laws (not yet in the RAG):**
- **BEG IV** (Viertes Buerokratieentlastungsgesetz, BGBl. 2024 I Nr. 323): shortened retention periods
+- **BEG IV** (Viertes Buerokratieentlastungsgesetz, BGBl. 2024 I Nr. 323)
 - ArbZG, MuSchG, NachwG, MiLoG, GmbHG, AktG, InsO
 - Gesetz fuer faire Verbrauchervertraege
 - CSRD, EU Taxonomy, CSDDD, eIDAS 2.0
@@ -122,8 +100,7 @@ Hat 3-Tier-Matching: Hash → Regex → LLM. Muss fuer neues Chunk-Format angepa
 - AT: ArbVG, AngG, AZG, GmbHG-AT, NISG

 **Outdated laws (update):**
- BGB: § 312k cancellation button since 2022-07-01 (now exists as TXT, should be current)
- TMG → replace with TDDDG
+- TMG → replace with TDDDG (TMG repealed since 2024)
 - Update GwG (2024 amendments)
 - Update HGB (MoPeG 2024)

@@ -139,11 +116,9 @@ Hat 3-Tier-Matching: Hash → Regex → LLM. Muss fuer neues Chunk-Format angepa
 - Rule 3 (authority press material, proprietary): ONLY our own wording
 - FORBIDDEN: ISO, beck-online, juris, DIN

-### PRIO 4: Investigate frontend 500 errors
-
-macmini:3007/sdk/control-library still shows 500 errors on API calls:
- /controls, /canonical, /control-instances, /findings, /vendors, /contracts
+### PRIO 3: Investigate frontend 500 errors

+macmini:3007/sdk/control-library still shows 500 errors on API calls.
 The `requirements.map` TypeError is fixed (commit fe6764d in the compliance repo).
 The 500s may come from the compliance backend (port 8002); check separately.

@@ -155,16 +130,16 @@ The 500s may come from the compliance backend (port 8002); check separately
 - **Block A:** v1 completion (healthcheck, dependencies, v1 tag)
 - **Block B:** Review-Verify (67k pairs)
 - **Block C:** Tests (adversarial + regression)
- **Block D:** Structural chunking (D1-D5)
+- **Block D1-D5:** Structural chunking end-to-end
+- **Block D5+:** NIST/ENISA/BSI PDF quality (text normalization, section detection, re-ingestion)

-### 🔄 IN PROGRESS
- **Block D5+:** NIST/ENISA/BSI documents (Prio 1 above)
- **Block E1:** EU regulations as HTML (20 of ~50 done)
+### 🔥 NEXT STEP
+- **Block D6:** Citation backfill: re-point controls to the new chunks (Prio 1)

 ### 📋 PENDING

 **Block E: Update laws + ingest new ones**
- E1: Update outdated sources (BGB, TMG→TDDDG, GwG, HGB)
+- E1: Update outdated sources (TMG→TDDDG, GwG, HGB)
 - E2: Missing DE laws (ArbZG, MuSchG, NachwG, MiLoG, etc.)
 - E3: Missing EU regulation (CSRD, EU Taxonomy, CSDDD, eIDAS 2.0)
 - E4: Missing standards, license-compliant (GoBD, BAIT/VAIT, PCI DSS Rule 3)
@@ -192,34 +167,32 @@ The 500s may come from the compliance backend (port 8002); check separately

 ## CRITICAL FILES

-### Main changes this session (breakpilot-core)
+### Changes, session 2026-05-03 (breakpilot-core)

 | File | What |
 |-------|-----|
-| `embedding-service/main.py` | Overlap bug fix (Phase 3), pdfplumber backend, NIST regex |
-| `embedding-service/requirements.txt` | pdfplumber>=0.11.0 added |
-| `embedding-service/test_d4_bgb.py` | 18 BGB validation tests |
-| `embedding-service/tests/fixtures/bgb_312_excerpt.txt` | BGB §§ 312-312k test fixture |
-| `rag-service/api/documents.py` | D2 payload fields + HTML detection + encoding |
-| `rag-service/html_utils.py` | HTML strip + charset (NEW) |
-| `rag-service/embedding_client.py` | ChunkResult dataclass (D2) |
-| `rag-service/tests/` | 32 tests (D2 + HTML) |
-| `control-pipeline/services/rag_client.py` | page: Optional[int] in RAGSearchResult |
-| `control-pipeline/services/control_generator.py` | section>article>section_title priority + page |
-| `control-pipeline/services/decomposition_pass.py` | Page number in _format_citation |
-| `control-pipeline/tests/test_d3_metadata.py` | 16 D3 tests |
-| `control-pipeline/scripts/reingest_d5.py` | Re-ingestion script (MUST BE FIXED: upload-before-delete) |
-| `control-pipeline/scripts/reingest_d5_config.py` | Config + helpers |
-| `control-pipeline/scripts/replace_eu_pdfs_with_html.py` | EUR-Lex HTML replacement |
-| `control-pipeline/scripts/quality_report.py` | E2E quality test (500 controls) |
-| `docker-compose.yml` | PDF_EXTRACTION_BACKEND=auto |
+| `embedding-service/main.py` | `_normalize_pdf_text()`, `_SECTION_NUMBER_RE` NIST patterns, `_SINGLE_NUM_ALLCAPS_RE`, pdfplumber tolerances |
+| `embedding-service/test_nist_normalization.py` | 41 new tests (normalization, section detection, metadata) |
+| `control-pipeline/scripts/reingest_nist.py` | Safe re-ingest (upload-before-delete) |
+| `control-pipeline/scripts/reingest_d5.py` | Safety fix: `_delete_old_chunks_safe()` with must_not filter |
+| `control-pipeline/scripts/reupload_legal_strategy.py` | Re-upload with chunk_strategy="legal" |
+| `control-pipeline/scripts/extract_and_upload_nist.py` | Local PDF extraction workaround (container OOM) |
+| `scripts/qdrant-snapshot.sh` | Qdrant backup of all collections |
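
The upload → verify → delete order with a `must_not` filter can be sketched as follows. This is an illustration, not the body of `_delete_old_chunks_safe()`: the payload field names `document_key` and `upload_id`, both function names, and the injected callables are hypothetical. Only the filter shape (`must` / `must_not` clauses with `key` and `match.value`) follows Qdrant's filter format.

```python
from typing import Callable


def build_delete_old_filter(document_key: str, new_upload_id: str) -> dict:
    """Select all chunks of a document EXCEPT those from the new upload."""
    return {
        "filter": {
            # Hypothetical payload field identifying the source document.
            "must": [{"key": "document_key", "match": {"value": document_key}}],
            # Hypothetical payload field tagging the fresh upload; these are kept.
            "must_not": [{"key": "upload_id", "match": {"value": new_upload_id}}],
        }
    }


def reingest_safely(
    upload: Callable[[str, str], None],
    count_new: Callable[[str, str], int],
    delete_by_filter: Callable[[dict], None],
    document_key: str,
    new_upload_id: str,
    expected_min_chunks: int = 1,
) -> int:
    """Upload first, verify the new chunks exist, and only then delete old ones."""
    upload(document_key, new_upload_id)
    found = count_new(document_key, new_upload_id)
    if found < expected_min_chunks:
        # Abort before deleting anything so the old chunks stay searchable.
        raise RuntimeError(
            f"verification failed for {document_key}: {found} new chunks found"
        )
    delete_by_filter(build_delete_old_filter(document_key, new_upload_id))
    return found
```

Passing the upload, count, and delete operations in as callables keeps the ordering logic testable without a running Qdrant instance. A failed or empty upload raises before the delete is ever issued, which is the failure mode that previously left documents with 0 chunks.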

-### Changes in breakpilot-compliance
+### Important: embedding-service container limit

-| File | What |
-|-------|-----|
-| `backend-compliance/compliance/services/canonical_control_service.py` | _ensure_list() for JSONB arrays |
-| `admin-compliance/app/sdk/control-library/components/ControlDetail.tsx` | Array.isArray() guard |
+The embedding-service container (8 GB RAM) crashes on PDFs >5 MB. Workaround:
+1. Extract the PDF locally on the Mac Mini (`pdfplumber` is installed there)
+2. Apply `_normalize_pdf_text()`
+3. Upload as .txt with `chunk_strategy="legal"`
+
+To raise the container limit: `docker-compose.yml` line ~445:
+```yaml
+deploy:
+  resources:
+    limits:
+      memory: 12G  # was 8G
+```

 ---

@@ -227,14 +200,16 @@ The 500s may come from the compliance backend (port 8002); check separately

 ### Qdrant (Mac Mini, Port 6333)

-| Collection | Chunks | Documents | Section rate |
-|-----------|--------|-----------|-------------|
-| bp_compliance_ce | ~23,600 | ~55 | 47% |
-| bp_compliance_gesetze | ~32,000 | ~98 | 86% |
-| bp_compliance_datenschutz | ~13,000 | ~107 | 36% |
-| bp_dsfa_corpus | ~320 | ~20 | 60% |
-| bp_legal_templates | ~1,460 | ~100 | 7% |
-| **Total** | **~70,000** | **~430** | **62%** |
+| Collection | Chunks | Section rate | Change |
+|-----------|--------|-------------|-----------|
+| bp_compliance_ce | ~23,600 | ~50% | NIST/ENISA re-ingested |
+| bp_compliance_gesetze | ~32,000 | ~86% | Unchanged |
+| bp_compliance_datenschutz | ~13,000 | ~40% | NIST 800-53/207 re-ingested |
+| bp_dsfa_corpus | ~8,200 | ~60% | Unchanged |
+| bp_legal_templates | ~1,460 | ~7% | Unchanged |
+| **Total** | **~78,000** | **~65%** | Improved (was 62%) |
+
+**Qdrant backup:** `backups/qdrant/`, 14 collections, ~1 GB (as of 2026-05-03 08:21)

 ### PostgreSQL (Mac Mini, Port 5432)

@@ -247,15 +222,19 @@ The 500s may come from the compliance backend (port 8002); check separately
 ### MinIO (Hetzner, nbg1.your-objectstorage.com)

 All original documents are safe in MinIO, bucket: `breakpilot-rag`.
-Path format: `{data_type}/{bundesland}/{use_case}/{year}/{filename}`
+**Caution:** 2 files were corrupt (263-byte XML instead of PDF):
+- `nistir_8259a.pdf`: re-downloaded from nist.gov, re-ingested ✅
+- `nist_ai_rmf.pdf`: re-downloaded from nist.gov, re-ingested ✅
+The new PDFs were only ingested into Qdrant, NOT replaced in MinIO.
+To update MinIO: manually via the RAG service upload or the mc CLI.

 ---

 ## RUNNING TESTS

 ```bash
-# Embedding service (58 tests)
-cd embedding-service && python3 -m pytest test_chunking.py test_d4_bgb.py -v
+# Embedding service (99 tests incl. 41 NIST tests)
+cd embedding-service && python3 -m pytest test_chunking.py test_d4_bgb.py test_nist_normalization.py -v

 # RAG service (32 tests)
 cd rag-service && PYTHONPATH=. python3 -m pytest tests/ -v
@@ -265,6 +244,12 @@ PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v

 # Quality report (500 controls against Qdrant)
 python3 control-pipeline/scripts/quality_report.py --db-host macmini --sample 500
+
+# Create a Qdrant snapshot
+ssh macmini "cd ~/Projekte/breakpilot-core && bash scripts/qdrant-snapshot.sh"
+
+# List Qdrant snapshots
+ssh macmini "cd ~/Projekte/breakpilot-core && bash scripts/qdrant-snapshot.sh --list"
 ```

 ---