docs: session handover — D2-D5 complete, quality report, NIST plan

Major session achievements:
- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill, D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:55 +02:00
parent 3009f3d13a
commit ff21bc258a
2 changed files with 423 additions and 87 deletions
@@ -1,63 +1,25 @@
-# Session Handover: Structural Chunking + Re-Ingestion
+# Session Handover: Structural Chunking + Quality Assurance

 **Date:** 2026-05-02
-**Handed over by:** Pipeline session (01.05 - 02.05.2026)
+**Handed over by:** Pipeline session (01.05 - 02.05.2026, ~20h)

 ## What was completed

 | Block | What | Status |
 |-------|-----|--------|
-| **D2** | RAG service stores section/section_title/paragraph/paragraph_num/page in Qdrant | ✅ |
-| **D3** | Control Generator reads structural metadata, page in source_citation | ✅ |
-| **D4** | BGB § 312k validation: overlap bug found + fixed | ✅ |
-| **D5** | 430/436 documents re-ingested with new chunking | ✅ |
-| **HTML fix** | HTML stripping + charset detection (ISO-8859-1) | ✅ |
-| **pdfplumber** | Added as PDF backend, PDF_EXTRACTION_BACKEND=auto | ✅ |
+| **D2** | RAG service stores section/section_title/paragraph/page in Qdrant | ✅ |
+| **D3** | Control Generator uses structural metadata in source_citation | ✅ |
+| **D4** | BGB § 312k validation: critical overlap bug found + fixed | ✅ |
+| **D5** | 430/436 documents re-ingested (all 6 collections) | ✅ |
+| **HTML fix** | Stripping + charset detection (ISO-8859-1), opening block tags | ✅ |
+| **EUR-Lex** | 20 EU regulations replaced with HTML (DSGVO: 0%→92%) | ✅ |
+| **pdfplumber** | Added as PDF backend + PDF_EXTRACTION_BACKEND=auto | ✅ |
+| **NIST regex** | Numbered sections (1.1), control IDs (AC-1, PO.1) | ✅ |
+| **3 missing PDFs** | EDPB Controller/Processor, GL 7, RTBF uploaded | ✅ |
+| **Frontend bug** | requirements.map TypeError in Compliance Admin fixed | ✅ |
+| **Quality report** | 500 controls checked: 13% OK, 41% article mismatch | ✅ |

-## Re-ingestion results
-
-| Document type | Section rate | Note |
-|-------------|-------------|-----------|
-| **DE laws (TXT)** | **95-100%** | Excellent |
-| **HTML (gesetze-im-internet.de)** | **97.6%** | Was 0%, perfect after the fix |
-| **EDPB/DSK guidelines (PDF)** | **80-98%** | Good |
-| **EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO)** | **13-35%** | OPEN, see below |
-| **Tech specs (JSON, MD)** | **0%** | Expected, no §/Article |
-
-**Total: 69,888 chunks in 6 collections, 430 of 436 documents.**
-
-## Open problems
-
-### 1. EU Official Journal PDFs (~40 documents, 13-35% section rate)
-
-**Cause:** EU Official Journal PDFs use multi-column layouts. Both pypdf and pdfplumber extract broken words (`"Ar tik el"` instead of `"Artikel"`). No PDF extractor solves this reliably.
-
-**Recommended solution:** Download EU regulations as HTML from EUR-Lex instead of PDF. EUR-Lex offers all regulations as clean HTML. Our HTML stripping + Legal Chunker works perfectly for it.
-
-**Affected documents (examples):**
- ai_act_2024_1689.pdf (33%) → HTML from EUR-Lex
- cra_2024_2847.pdf (34%) → HTML from EUR-Lex
- nis2_2022_2555.pdf (13%) → HTML from EUR-Lex
- dsgvo_2016_679.pdf (0%) → HTML from EUR-Lex
- amlr_2024_1624.pdf (12%) → HTML from EUR-Lex
- All files matching the `_20XX_XXXX.pdf` pattern in the bp_compliance_ce collection
-
-### 2. 3 completely missing PDFs
-
-These were NEVER successfully stored in Qdrant:
- `edpb_controller_processor_07_2020.pdf`: NO CHUNKS
- `edpb_gl_7_2020.pdf`: NO CHUNKS
- `edpb_rtbf_05_2019.pdf`: NO CHUNKS
-
-**Cause:** Upload timeout (even with 3600s). The PDFs are large and embedding generation takes too long.
-
-**Solution:** Split manually (into sections) or upload as smaller parts.
-
-### 3. 1 corrupt PDF
-
- `dsk_kpnr_3.pdf`: 500 Internal Server Error during extraction
-
-## Commits in this session
+## Commits (breakpilot-core)

 ```
 93099b2 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
@@ -65,44 +27,110 @@ ddad58f fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
 a459636 fix(rag): HTML charset detection + opening block tag newlines
 75dda9a feat(embedding): add pdfplumber backend for multi-column PDF extraction
 41183ff fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
+5a6e588 docs: update session handover
+3009f3d feat(embedding): add NIST/ENISA/standard section numbering to chunker
 ```

+## Current quality
+
+### Section rate by file type
+| Type | Avg section rate | Note |
+|-----|-----------------|-----------|
+| TXT (DE laws) | **79%** | Excellent |
+| HTML (EUR-Lex + gesetze-im-internet) | **56%** | Good (preambles have no articles) |
+| PDF (EDPB/DSK guidelines) | **60-98%** | Good |
+| PDF (EU Official Journal) | N/A | Replaced by EUR-Lex HTML |
+| PDF (NIST/BSI) | **0-10%** | Problematic, see below |
+| TXT (OWASP) | **0%** | Own format, no §/Article |
+| Legal Templates (JSON/MD) | **0%** | Expected, no legal structure |
+
+### Quality report (sample of 500 controls)
+- **13% fully correct** (article found, matches the chunk)
+- **41% article not in source text**. Main cause: controls generated from old, broken PDF chunks
+- **7% no article in citation**
+- **39% other** (regulation found, but section-matching issues)
+
+## Open problems (next session)
+
+### 1. NIST/ENISA/BSI PDFs (CRITICAL)
+
+**Problem:** Both pypdf AND pdfplumber break up the text of multi-column documents. Section numbers do not end up at the start of a line. Extending the regex did not help.
+
+**3 NIST PDFs were briefly lost** (chunks deleted, upload timeout). Recovery from MinIO has been started.
+
+**Solution (prioritized):**
+
+1. **Text normalization** after PDF extraction:
+   ```python
+   import re
+
+   def _normalize_multicolumn_text(text):
+       # Rejoin IDs split by multi-column extraction:
+       # "1 . 1" → "1.1", "AC - 1" → "AC-1", "GV . OC - 01" → "GV.OC-01"
+       # [ \t]* instead of \s* so that matches never span a line break
+       text = re.sub(r'(\d+)[ \t]*\.[ \t]*(\d+)', r'\1.\2', text)
+       text = re.sub(r'([A-Z]{2})[ \t]*[-\.][ \t]*([A-Z]{2})[ \t]*-[ \t]*(\d+)', r'\1.\2-\3', text)
+       text = re.sub(r'([A-Z]{2})[ \t]*-[ \t]*(\d+)', r'\1-\2', text)
+       return text
+   ```
+
+2. **Missing regex patterns:**
+   - `GV.OC-01` (NIST CSF 2.0): `[A-Z]{2}\.[A-Z]{2}-\d{2}`
+   - `A01:2021` (OWASP): `A\d{2}(?::\d{4})?`
+   - `AC-1(1)` (NIST enhancements): `[A-Z]{2}-\d+\(\d+\)`
+
+3. **Alternative: HTML from the NIST/ENISA websites** (like the EUR-Lex approach):
+   - NIST: csrc.nist.gov offers HTML versions
+   - ENISA: enisa.europa.eu offers HTML versions
+   - Best approach for maximum quality
+
+4. **Fix the D5 script:** upload FIRST, delete ONLY on success (prevents data loss on timeout)
+
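A minimal sketch of how the three missing patterns listed above could be checked against extracted tokens. The regexes are copied from the list; the table name `CONTROL_ID_PATTERNS`, the labels, and the function `classify_control_id` are hypothetical and not part of the chunker in `embedding-service/main.py`.

```python
import re
from typing import Optional

# Regexes copied from the "missing patterns" list; labels are illustrative only.
CONTROL_ID_PATTERNS = [
    ("nist_csf_2", re.compile(r"[A-Z]{2}\.[A-Z]{2}-\d{2}")),
    ("nist_enhancement", re.compile(r"[A-Z]{2}-\d+\(\d+\)")),
    ("owasp_top10", re.compile(r"A\d{2}(?::\d{4})?")),
]


def classify_control_id(token: str) -> Optional[str]:
    """Return the label of the first pattern that matches the whole token."""
    for label, pattern in CONTROL_ID_PATTERNS:
        # fullmatch avoids partial hits such as "AC-1" inside "AC-1(1)".
        if pattern.fullmatch(token):
            return label
    return None
```

Using `fullmatch` on already-tokenized IDs keeps the patterns from firing on substrings; a chunker that scans whole lines would need anchors or word boundaries instead.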
+### 2. Citation backfill (IMPORTANT)
+
+**Problem:** 41% of the controls have wrong article citations (from old, broken PDF chunks).
+
+**Solution:** Retroactive reconciliation:
+1. For each control with a source_citation: look up the regulation in Qdrant
+2. Find chunks with a matching section
+3. Update source_citation.article if there is a better match
+
+**Script:** `control-pipeline/scripts/quality_report.py` already exists as a basis.
+
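The three backfill steps can be sketched as a pure function over chunks that have already been fetched. This is an illustration under assumptions, not the planned implementation: the chunk keys `regulation` and `text`, the control key `text`, the Jaccard token overlap used as the "better match" criterion, and the `min_score` threshold are all hypothetical. Only `section` and `source_citation.article` come from the handover itself.

```python
import copy
import re


def _tokens(text: str) -> set:
    """Lowercased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def _jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def backfill_article(control: dict, chunks: list, min_score: float = 0.2) -> dict:
    """Return a copy of the control with a better-matching article, if any."""
    citation = control.get("source_citation")
    if not citation:
        return control  # step 1: only controls with a citation are considered
    control_tokens = _tokens(control.get("text", ""))
    best_section, best_score = None, 0.0
    for chunk in chunks:
        # step 2: same regulation and a non-empty section
        if chunk.get("regulation") != citation.get("regulation"):
            continue
        if not chunk.get("section"):
            continue
        score = _jaccard(control_tokens, _tokens(chunk.get("text", "")))
        if score > best_score:
            best_section, best_score = chunk["section"], score
    # step 3: update only on a sufficiently good, different match
    if best_section is None or best_score < min_score:
        return control
    if best_section == citation.get("article"):
        return control
    updated = copy.deepcopy(control)
    updated["source_citation"]["article"] = best_section
    return updated
```

Returning a copy instead of mutating in place makes it easy to log old and new citations side by side before anything is written back.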
+### 3. Missing laws (Block E3)
+
+- BEG IV (Viertes Buerokratieentlastungsgesetz, the Fourth Bureaucracy Relief Act, 2024)
+- Others from `project_missing_legal_sources.md`
+
+### 4. Frontend 500 errors
+
+The compliance frontend (macmini:3007) still shows 500 errors on:
+- /controls, /canonical, /control-instances, /findings, /vendors, /contracts
+
+The `requirements.map` TypeError is fixed, but the API 500s point to backend problems that need to be investigated separately.
+
 ## Critical files

-| File | Change |
-|-------|-----------|
-| `embedding-service/main.py` | Overlap bug fix, pdfplumber backend |
-| `rag-service/api/documents.py` | D2 payload fields + HTML detection |
-| `rag-service/html_utils.py` | HTML stripping + charset detection (NEW) |
-| `rag-service/embedding_client.py` | ChunkResult dataclass (D2) |
-| `control-pipeline/services/rag_client.py` | page field in RAGSearchResult (D3) |
-| `control-pipeline/services/control_generator.py` | section priority + page (D3) |
-| `control-pipeline/scripts/reingest_d5.py` | Re-ingestion script (NEW) |
-| `control-pipeline/scripts/reingest_d5_config.py` | Config + helpers (NEW) |
-| `docker-compose.yml` | PDF_EXTRACTION_BACKEND=auto |
-
-## Next steps (Block E)
-
-### E1: Replace EU regulations with HTML from EUR-Lex
-1. Create a list of all EU Official Journal PDFs with <50% section rate
-2. Download the EUR-Lex HTML versions (CELEX numbers are in the Qdrant payloads)
-3. Delete old PDF chunks, upload HTML versions
-4. Quality check → expected section rate >90%
-
-### E2: Split + upload the 3 missing EDPB PDFs
-
-### E3: Ingest missing laws (current BGB, ArbZG, MuSchG, etc.)
-See master plan Block E in `jazzy-snacking-creek.md`
+| File | Repo | Change |
+|-------|------|-----------|
+| `embedding-service/main.py` | core | Overlap fix, pdfplumber, NIST regex |
+| `rag-service/api/documents.py` | core | D2 payloads + HTML stripping |
+| `rag-service/html_utils.py` | core | HTML strip + charset (NEW) |
+| `rag-service/embedding_client.py` | core | ChunkResult (D2) |
+| `control-pipeline/services/rag_client.py` | core | page field (D3) |
+| `control-pipeline/services/control_generator.py` | core | section priority + page (D3) |
+| `control-pipeline/scripts/reingest_d5.py` | core | Re-ingestion script |
+| `control-pipeline/scripts/replace_eu_pdfs_with_html.py` | core | EUR-Lex replacement |
+| `control-pipeline/scripts/quality_report.py` | core | Quality test |
+| `docker-compose.yml` | core | PDF_EXTRACTION_BACKEND=auto |
+| `canonical_control_service.py` | compliance | _ensure_list fix |
+| `ControlDetail.tsx` | compliance | Array.isArray guard |

 ## DB state

 | Collection | Chunks | Documents |
 |-----------|--------|-----------|
-| bp_compliance_ce | ~18,000 | ~60 |
-| bp_compliance_gesetze | ~31,500 | ~98 |
-| bp_compliance_datenschutz | ~15,000 | ~107 |
-| bp_dsfa_corpus | ~3,500 | ~30 |
-| bp_legal_templates | ~2,000 | ~100 |
+| bp_compliance_ce | ~23,600 | ~55 |
+| bp_compliance_gesetze | ~32,000 | ~98 |
+| bp_compliance_datenschutz | ~13,000 | ~107 |
+| bp_dsfa_corpus | ~320 | ~20 |
+| bp_legal_templates | ~1,460 | ~100 |
 | **Total** | **~70,000** | **~430** |

 ## Tests
@@ -116,12 +144,17 @@ cd rag-service && PYTHONPATH=. python3 -m pytest tests/ -v

 # Control-Pipeline (387 Tests)
 PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v
+
+# Quality report (500 controls)
+python3 control-pipeline/scripts/quality_report.py --db-host macmini --sample 500
 ```

-## Memory files (read these!)
+## Recommended order for next session

-All under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`:
- `MEMORY.md`: Index
- `project_structural_chunking.md`: Architecture decision
- `feedback_legal_source_licensing.md`: Rule 1/2/3
- `project_control_pipeline_masterplan.md`: Overall plan A-G
+1. Verify NIST recovery (3 PDFs from MinIO)
+2. Fix D5 script (upload-before-delete)
+3. Text normalization for PDF extraction
+4. NIST/ENISA HTML downloads (like EUR-Lex)
+5. Citation backfill for existing controls
+6. Investigate frontend 500 errors
+7. Ingest BEG IV + missing laws
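
The upload-before-delete ordering named in the list above can be sketched independently of the real D5 script. The function name `replace_document` and both callables are hypothetical stand-ins for whatever `reingest_d5.py` uses to upload a document and delete chunks. The sketch shows only the ordering guarantee: old chunks survive any upload failure, including a timeout.

```python
from typing import Callable, List


def replace_document(
    upload_new: Callable[[], List[str]],
    delete_chunks: Callable[[List[str]], object],
    old_chunk_ids: List[str],
) -> bool:
    """Upload the new version first; delete old chunks only after success."""
    try:
        new_chunk_ids = upload_new()
    except Exception as exc:  # broad on purpose: any failure must keep old data
        print(f"upload failed, keeping {len(old_chunk_ids)} old chunks: {exc}")
        return False
    if not new_chunk_ids:
        # An upload that produced no chunks is treated like a failure.
        return False
    # Never delete an ID that the new upload reused.
    fresh = set(new_chunk_ids)
    stale = [cid for cid in old_chunk_ids if cid not in fresh]
    delete_chunks(stale)
    return True
```

The trade-off is that old and new chunks coexist briefly, so searches during that window may return duplicates; that is usually preferable to the data loss described for the three NIST PDFs.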