docs: session handover — D2-D5 complete, quality report, NIST plan
Major session achievements:

- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill, D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Session Handover: Structural Chunking + Quality Assurance

**Date:** 2026-05-02

**Handed over by:** Pipeline session (2026-05-01 to 2026-05-02, ~20h)

## What was done

| Block | What | Status |
|-------|------|--------|
| **D2** | RAG service stores section/section_title/paragraph/page in Qdrant | ✅ |
| **D3** | Control Generator uses structural metadata in source_citation | ✅ |
| **D4** | BGB § 312k validation — critical overlap bug found + fixed | ✅ |
| **D5** | 430/436 documents re-ingested (all 6 collections) | ✅ |
| **HTML fix** | Stripping + charset detection (ISO-8859-1), opening block tags | ✅ |
| **EUR-Lex** | 20 EU regulations replaced with HTML versions (DSGVO: 0% → 92%) | ✅ |
| **pdfplumber** | Added as PDF backend + PDF_EXTRACTION_BACKEND=auto | ✅ |
| **NIST regex** | Numbered sections (1.1), control IDs (AC-1, PO.1) | ✅ |
| **3 missing PDFs** | EDPB Controller/Processor, GL 7, RTBF uploaded | ✅ |
| **Frontend bug** | requirements.map TypeError in Compliance Admin fixed | ✅ |
| **Quality report** | 500 controls checked: 13% OK, 41% article mismatch | ✅ |
## Commits (breakpilot-core)
```
93099b2 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
ddad58f fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
a459636 fix(rag): HTML charset detection + opening block tag newlines
75dda9a feat(embedding): add pdfplumber backend for multi-column PDF extraction
41183ff fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
5a6e588 docs: update session handover
3009f3d feat(embedding): add NIST/ENISA/standard section numbering to chunker
```
## Current Quality

### Section rate by file type

| Type | Avg. section rate | Note |
|------|-------------------|------|
| TXT (DE laws) | **79%** | Excellent |
| HTML (EUR-Lex + gesetze-im-internet) | **56%** | Good (preambles have no articles) |
| PDF (EDPB/DSK guidelines) | **60-98%** | Good |
| PDF (EU Official Journal) | N/A | Replaced by EUR-Lex HTML |
| PDF (NIST/BSI) | **0-10%** | Problematic — see below |
| TXT (OWASP) | **0%** | Its own format, no §/articles |
| Legal templates (JSON/MD) | **0%** | Expected — no legal structure |
### Quality report (sample of 500 controls)

- **13% fully correct** (article found, matches the chunk)
- **41% article not in source text** — main cause: controls generated from old, broken PDF chunks
- **7% no article in the citation**
- **39% other** (regulation found, but section-matching issues)
## Open Problems (Next Session)

### 1. NIST/ENISA/BSI PDFs (CRITICAL)

**Problem:** Both pypdf AND pdfplumber break the text of multi-column documents. Section numbers do not land at the start of a line, so extending the regexes did not help.

**3 NIST PDFs were briefly lost** (chunks deleted, upload timeout). Restoration from MinIO has been started.

**Solution (prioritized):**

1. **Text normalization** after PDF extraction:
```python
import re

def _normalize_multicolumn_text(text: str) -> str:
    # Re-join IDs that PDF extraction split with spaces:
    # "1 . 1" → "1.1", "AC - 1" → "AC-1", "GV . OC - 01" → "GV.OC-01"
    text = re.sub(r'(\d+)\s*\.\s*(\d+)', r'\1.\2', text)
    text = re.sub(r'([A-Z]{2})\s*[-\.]\s*([A-Z]{2})\s*-\s*(\d+)', r'\1.\2-\3', text)
    text = re.sub(r'([A-Z]{2})\s*-\s*(\d+)', r'\1-\2', text)
    return text
```
2. **Missing regex patterns** (a sanity check follows below):
   - `GV.OC-01` (NIST CSF 2.0): `[A-Z]{2}\.[A-Z]{2}-\d{2}`
   - `A01:2021` (OWASP): `A\d{2}(?::\d{4})?`
   - `AC-1(1)` (NIST enhancements): `[A-Z]{2}-\d+\(\d+\)`
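A minimal check that these patterns match the example IDs above (a sketch; the dict keys are illustrative, not the chunker's real variable names):

```python
import re

# Candidate patterns for the chunker's section-ID detection (names assumed).
PATTERNS = {
    "nist_csf": re.compile(r'[A-Z]{2}\.[A-Z]{2}-\d{2}'),     # GV.OC-01
    "owasp_top10": re.compile(r'A\d{2}(?::\d{4})?'),         # A01:2021
    "nist_enhancement": re.compile(r'[A-Z]{2}-\d+\(\d+\)'),  # AC-1(1)
}

for sample in ("GV.OC-01", "A01:2021", "AC-1(1)"):
    # Each example ID must be fully matched by at least one pattern.
    assert any(p.fullmatch(sample) for p in PATTERNS.values()), sample
```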
3. **Alternative: HTML from the NIST/ENISA websites** (same approach as EUR-Lex; sketch below):
   - NIST: csrc.nist.gov offers HTML versions
   - ENISA: enisa.europa.eu offers HTML versions
   - Best approach for maximum quality
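A rough sketch of that download step (`strip_html` is an assumed name; the real helper lives in `rag-service/html_utils.py`, see the file table below):

```python
import httpx

from html_utils import strip_html  # assumed helper name


def fetch_html_standard(url: str) -> str:
    """Download the HTML version of a standard and strip markup before ingestion."""
    r = httpx.get(url, follow_redirects=True, timeout=60.0)
    r.raise_for_status()
    return strip_html(r.text)
```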
4. **Fix the D5 script:** upload FIRST, delete ONLY on success, so a timeout can no longer destroy data (sketch below)
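A minimal sketch of the safe ordering (the client calls are hypothetical; only the order of operations matters):

```python
def reingest_document(doc: dict, rag_client) -> list[str]:
    # 1. Upload the re-chunked document FIRST; the old chunks stay in place.
    new_chunk_ids = rag_client.upload_document(doc)  # hypothetical call
    if not new_chunk_ids:
        # Upload failed or timed out: abort and keep the old chunks queryable.
        raise RuntimeError(f"upload failed for {doc['filename']}; old chunks kept")

    # 2. Delete the old chunks ONLY after the upload is confirmed.
    rag_client.delete_chunks(doc["old_chunk_ids"])  # hypothetical call
    return new_chunk_ids
```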
### 2. Citation Backfill (IMPORTANT)

**Problem:** 41% of the controls have wrong article citations (generated from old, broken PDF chunks).

**Solution:** reconcile retroactively (see the sketch below):

1. For each control with a source_citation: look up its regulation in Qdrant
2. Find chunks with a matching section
3. Update source_citation.article when there is a better match

**Script:** `control-pipeline/scripts/quality_report.py` already exists as a basis.
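A hedged sketch of the backfill step, reusing the regulation index that `quality_report.py` (below) builds; the UPDATE statement and the function name are assumptions, not existing code:

```python
from sqlalchemy import text


def backfill_citation(ctrl: dict, qdrant_index: dict, db) -> bool:
    """Overwrite source_citation.article when a chunk section matches better.

    ctrl and qdrant_index use the shapes built by quality_report.py.
    """
    chunks = qdrant_index.get(ctrl["metadata"].get("source_regulation", ""), [])
    cited = (ctrl["citation"].get("article") or "").strip().lower()
    # First chunk whose section contains the cited article (same loose
    # matching as check_control in quality_report.py).
    best = next((c for c in chunks if c["section"] and cited in c["section"].lower()), None)
    if best is None or best["section"].strip().lower() == cited:
        return False  # no better match; leave the citation untouched
    db.execute(text(
        "UPDATE compliance.canonical_controls "
        "SET source_citation = jsonb_set(source_citation, '{article}', to_jsonb(CAST(:a AS text))) "
        "WHERE id = CAST(:id AS uuid)"
    ), {"a": best["section"], "id": ctrl["id"]})
    return True
```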
### 3. Missing Laws (Block E3)

- BEG IV (Viertes Buerokratieentlastungsgesetz, 2024)
- More from `project_missing_legal_sources.md`
### 4. Frontend 500 Errors

The compliance frontend (macmini:3007) still shows 500 errors on:

- /controls, /canonical, /control-instances, /findings, /vendors, /contracts

The `requirements.map` TypeError is fixed, but the API 500s point to backend problems that need to be investigated separately.
## Critical Files

| File | Repo | Change |
|------|------|--------|
| `embedding-service/main.py` | core | Overlap fix, pdfplumber, NIST regex |
| `rag-service/api/documents.py` | core | D2 payloads + HTML stripping |
| `rag-service/html_utils.py` | core | HTML strip + charset (NEW) |
| `rag-service/embedding_client.py` | core | ChunkResult (D2) |
| `control-pipeline/services/rag_client.py` | core | page field (D3) |
| `control-pipeline/services/control_generator.py` | core | section priority + page (D3) |
| `control-pipeline/scripts/reingest_d5.py` | core | Re-ingestion script |
| `control-pipeline/scripts/replace_eu_pdfs_with_html.py` | core | EUR-Lex replacement |
| `control-pipeline/scripts/quality_report.py` | core | Quality test |
| `docker-compose.yml` | core | PDF_EXTRACTION_BACKEND=auto |
| `canonical_control_service.py` | compliance | _ensure_list fix |
| `ControlDetail.tsx` | compliance | Array.isArray guard |
## DB Status

| Collection | Chunks | Documents |
|-----------|--------|-----------|
| bp_compliance_ce | ~23,600 | ~55 |
| bp_compliance_gesetze | ~32,000 | ~98 |
| bp_compliance_datenschutz | ~13,000 | ~107 |
| bp_dsfa_corpus | ~320 | ~20 |
| bp_legal_templates | ~1,460 | ~100 |
| **Total** | **~70,000** | **~430** |
## Tests

```
# RAG service
cd rag-service && PYTHONPATH=. python3 -m pytest tests/ -v

# Control pipeline (387 tests)
PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v

# Quality report (500 controls)
python3 control-pipeline/scripts/quality_report.py --db-host macmini --sample 500
```
## Recommended Order for the Next Session

1. Check the NIST restoration (3 PDFs from MinIO)
2. Fix the D5 script (upload before delete)
3. Text normalization for PDF extraction
4. NIST/ENISA HTML downloads (same approach as EUR-Lex)
5. Citation backfill for the existing controls
6. Investigate the frontend 500 errors
7. Ingest BEG IV + the missing laws

**New file in this commit:** `control-pipeline/scripts/quality_report.py`
```python
#!/usr/bin/env python3
"""
E2E Quality Report: Verify controls have correct source citations.

Loads N random controls from PostgreSQL, cross-references with Qdrant chunks,
and reports mismatches between source_citation and actual chunk metadata.

Usage:
    # Against Mac Mini
    python3 scripts/quality_report.py --db-host macmini --qdrant-url http://macmini:6333

    # Smaller sample
    python3 scripts/quality_report.py --db-host macmini --sample 100
"""

import argparse
import json
import logging
import sys

import httpx
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("quality-report")

COLLECTIONS = [
    "bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
    "bp_dsfa_corpus", "bp_legal_templates",
]


def load_controls(db_url: str, sample_size: int) -> list[dict]:
    """Load random controls with source_citation from PostgreSQL."""
    engine = create_engine(db_url)
    Session = sessionmaker(bind=engine)

    with Session() as db:
        rows = db.execute(text("""
            SELECT id::text, control_id, title,
                   source_citation::text, source_original_text,
                   generation_metadata::text, release_state
            FROM compliance.canonical_controls
            WHERE source_citation IS NOT NULL
              AND source_original_text IS NOT NULL
              AND release_state = 'draft'
            ORDER BY RANDOM()
            LIMIT :n
        """), {"n": sample_size}).fetchall()

    controls = []
    for row in rows:
        citation = json.loads(row[3]) if row[3] else {}
        metadata = json.loads(row[5]) if row[5] else {}
        controls.append({
            "id": row[0],
            "control_id": row[1],
            "title": row[2],
            "citation": citation,
            "source_text": row[4],
            "metadata": metadata,
            "release_state": row[6],
        })
    return controls


def build_qdrant_index(qdrant_url: str) -> dict:
    """Build regulation_id → list[chunk] index from Qdrant.

    Controls were generated from OLD chunks (512 chars). Qdrant now has
    NEW chunks (1500 chars). Hash matching won't work — use regulation +
    section matching instead.
    """
    logger.info("Building Qdrant chunk index by regulation_id...")
    index = {}  # regulation_id → [{"section": ..., "text_snippet": ..., ...}]
    client = httpx.Client(timeout=60.0)

    for coll in COLLECTIONS:
        offset = None
        for _ in range(600):
            body = {"limit": 250, "with_payload": True, "with_vector": False}
            if offset:
                body["offset"] = offset
            r = client.post(f"{qdrant_url}/collections/{coll}/points/scroll", json=body)
            if r.status_code != 200:
                break
            data = r.json()["result"]
            for pt in data["points"]:
                reg_id = pt["payload"].get("regulation_id", "")
                if not reg_id:
                    continue
                chunk = {
                    "section": pt["payload"].get("section", ""),
                    "section_title": pt["payload"].get("section_title", ""),
                    "paragraph": pt["payload"].get("paragraph", ""),
                    "text_snippet": pt["payload"].get("chunk_text", "")[:200],
                    "filename": pt["payload"].get("filename", ""),
                    "collection": coll,
                }
                index.setdefault(reg_id, []).append(chunk)
            offset = data.get("next_page_offset")
            if not offset:
                break

    client.close()
    total = sum(len(v) for v in index.values())
    logger.info("Qdrant index: %d regulations, %d chunks", len(index), total)
    return index


def check_control(ctrl: dict, qdrant_index: dict) -> dict:
    """Check a single control's source_citation against Qdrant chunks.

    Strategy: Find chunks by regulation_id from generation_metadata,
    then check if any chunk has a matching section/article.
    """
    result = {
        "control_id": ctrl["control_id"],
        "title": (ctrl["title"] or "")[:60],
        "citation_source": ctrl["citation"].get("source", ""),
        "citation_article": ctrl["citation"].get("article", ""),
        "citation_paragraph": ctrl["citation"].get("paragraph", ""),
        "citation_page": ctrl["citation"].get("page"),
        "issues": [],
    }

    # Get regulation_id from generation_metadata
    reg_code = ctrl["metadata"].get("source_regulation", "")
    citation_article = ctrl["citation"].get("article", "")

    # Check 1: Does the control have a regulation reference?
    if not reg_code:
        result["issues"].append("NO_REGULATION_CODE")
        return result

    # Check 2: Does this regulation exist in Qdrant?
    chunks = qdrant_index.get(reg_code, [])
    if not chunks:
        result["issues"].append(f"REGULATION_NOT_IN_QDRANT: {reg_code}")
        result["reg_found"] = False
        return result

    result["reg_found"] = True
    result["reg_chunks"] = len(chunks)

    # Check 3: Does the control have an article citation?
    if not citation_article:
        result["issues"].append("NO_ARTICLE_IN_CITATION")
        # Still check if chunks have section metadata at all
        has_section = any(c["section"] for c in chunks)
        if has_section:
            result["issues"].append("CHUNKS_HAVE_SECTIONS_BUT_CONTROL_MISSING")
        return result

    # Check 4: Is the cited article found in any chunk's section?
    norm_article = citation_article.strip().lower()
    matching_chunks = [
        c for c in chunks
        if c["section"] and (
            norm_article == c["section"].strip().lower()
            or norm_article in c["section"].strip().lower()
            or c["section"].strip().lower() in norm_article
        )
    ]

    if matching_chunks:
        result["article_match"] = True
        result["matched_section"] = matching_chunks[0]["section"]
    else:
        # Check if ANY chunk has sections (the article might just not match)
        sections_in_regulation = sorted(set(c["section"] for c in chunks if c["section"]))
        if sections_in_regulation:
            result["issues"].append(
                f"ARTICLE_NOT_FOUND_IN_CHUNKS: '{citation_article}' not in {sections_in_regulation[:5]}"
            )
        else:
            result["issues"].append("NO_SECTIONS_IN_REGULATION_CHUNKS")

    # Check 5: Does source_original_text contain the cited article?
    source_text = ctrl["source_text"] or ""
    if citation_article and source_text:
        if citation_article.lower() not in source_text.lower():
            if f"[{citation_article}" not in source_text:
                result["issues"].append("ARTICLE_NOT_IN_SOURCE_TEXT")

    if not result["issues"]:
        result["issues"] = ["OK"]

    return result


def generate_report(results: list[dict]):
    """Print the quality report. Counts are keyed to the issue strings
    that check_control actually emits."""
    total = len(results)
    ok = sum(1 for r in results if r["issues"] == ["OK"])
    reg_found = sum(1 for r in results if r.get("reg_found", False))
    no_reg = sum(1 for r in results if any(i.startswith("REGULATION_NOT_IN_QDRANT") for i in r["issues"]))
    no_article = sum(1 for r in results if "NO_ARTICLE_IN_CITATION" in r["issues"])
    no_section = sum(1 for r in results if "NO_SECTIONS_IN_REGULATION_CHUNKS" in r["issues"])
    not_matched = sum(1 for r in results if any(i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS") for i in r["issues"]))
    not_in_text = sum(1 for r in results if "ARTICLE_NOT_IN_SOURCE_TEXT" in r["issues"])

    print("\n" + "=" * 100)
    print("QUALITAETSREPORT: CONTROL SOURCE CITATION VERIFICATION")
    print("=" * 100)

    print(f"\nStichprobe: {total} Controls")
    print(f"\n{'Metrik':<45} {'Anzahl':>8} {'Anteil':>8}")
    print("-" * 65)
    print(f"{'OK (keine Probleme)':<45} {ok:>8} {ok*100//max(total,1):>7}%")
    print(f"{'Regulation in Qdrant gefunden':<45} {reg_found:>8} {reg_found*100//max(total,1):>7}%")
    print(f"{'Regulation NICHT gefunden':<45} {no_reg:>8} {no_reg*100//max(total,1):>7}%")
    print(f"{'Kein article in source_citation':<45} {no_article:>8} {no_article*100//max(total,1):>7}%")
    print(f"{'Keine sections in Regulation-Chunks':<45} {no_section:>8} {no_section*100//max(total,1):>7}%")
    print(f"{'Article nicht in Chunk-Sections':<45} {not_matched:>8} {not_matched*100//max(total,1):>7}%")
    print(f"{'Article nicht im Source-Text':<45} {not_in_text:>8} {not_in_text*100//max(total,1):>7}%")

    # Show sample article mismatches
    mismatches = [r for r in results if any(i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS") for i in r["issues"])]
    if mismatches:
        print("\n=== ARTICLE NOT FOUND IN CHUNKS (erste 10) ===\n")
        for r in mismatches[:10]:
            issues = [i for i in r["issues"] if i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS")]
            print(f"  {r['control_id']:20s} {r['title'][:40]:40s}")
            for i in issues:
                print(f"    → {i}")

    # Show sample regulations missing from Qdrant
    not_found = [r for r in results if any(i.startswith("REGULATION_NOT_IN_QDRANT") for i in r["issues"])]
    if not_found:
        print("\n=== REGULATION NOT IN QDRANT (erste 10) ===\n")
        for r in not_found[:10]:
            src = r.get("citation_source", "?")
            art = r.get("citation_article", "?")
            print(f"  {r['control_id']:20s} {src[:25]:25s} {art}")

    # Distribution by source
    print("\n=== NACH QUELLE ===\n")
    source_stats = {}
    for r in results:
        src = r.get("citation_source", "?")[:30]
        if src not in source_stats:
            source_stats[src] = {"total": 0, "ok": 0, "no_reg": 0, "no_section": 0}
        source_stats[src]["total"] += 1
        if r["issues"] == ["OK"]:
            source_stats[src]["ok"] += 1
        if any(i.startswith("REGULATION_NOT_IN_QDRANT") for i in r["issues"]):
            source_stats[src]["no_reg"] += 1
        if "NO_SECTIONS_IN_REGULATION_CHUNKS" in r["issues"]:
            source_stats[src]["no_section"] += 1

    print(f"  {'Quelle':<32} {'Total':>6} {'OK':>6} {'OK%':>6} {'NoReg':>8} {'NoSect':>8}")
    print(f"  {'-'*72}")
    for src in sorted(source_stats.keys(), key=lambda s: -source_stats[s]["total"]):
        s = source_stats[src]
        pct = s["ok"] * 100 // max(s["total"], 1)
        print(f"  {src:<32} {s['total']:>6} {s['ok']:>6} {pct:>5}% {s['no_reg']:>8} {s['no_section']:>8}")

    print(f"\n{'='*100}")
    verdict = "PASS" if ok * 100 // max(total, 1) >= 50 else "NEEDS IMPROVEMENT"
    print(f"ERGEBNIS: {verdict} — {ok}/{total} Controls ({ok*100//max(total,1)}%) vollstaendig korrekt")
    print(f"{'='*100}")


def main():
    parser = argparse.ArgumentParser(description="Control Source Citation Quality Report")
    parser.add_argument("--db-host", default="macmini")
    parser.add_argument("--db-port", type=int, default=5432)
    parser.add_argument("--db-name", default="breakpilot_db")
    parser.add_argument("--db-user", default="breakpilot")
    parser.add_argument("--db-pass", default="breakpilot123")
    parser.add_argument("--qdrant-url", default="http://macmini:6333")
    parser.add_argument("--sample", type=int, default=500)
    args = parser.parse_args()

    db_url = f"postgresql://{args.db_user}:{args.db_pass}@{args.db_host}:{args.db_port}/{args.db_name}"

    # Load controls
    logger.info("Loading %d random controls from DB...", args.sample)
    controls = load_controls(db_url, args.sample)
    logger.info("Loaded %d controls with source_citation", len(controls))

    if not controls:
        print("ERROR: No controls found with source_citation")
        sys.exit(1)

    # Build Qdrant index
    qdrant_index = build_qdrant_index(args.qdrant_url)

    # Check each control
    logger.info("Checking %d controls against Qdrant...", len(controls))
    results = []
    for ctrl in controls:
        result = check_control(ctrl, qdrant_index)
        results.append(result)

    # Report
    generate_report(results)


if __name__ == "__main__":
    main()
```