diff --git a/control-pipeline/INSTRUCTION-session-handover.md b/control-pipeline/INSTRUCTION-session-handover.md
index 202413f..d9285f1 100644
--- a/control-pipeline/INSTRUCTION-session-handover.md
+++ b/control-pipeline/INSTRUCTION-session-handover.md
@@ -1,63 +1,25 @@
-# Session Handover: Structural Chunking + Re-Ingestion
+# Session Handover: Structural Chunking + Quality Assurance
 
 **Date:** 2026-05-02
-**Handed over by:** pipeline session (2026-05-01 to 2026-05-02)
+**Handed over by:** pipeline session (2026-05-01 to 2026-05-02, ~20h)
 
 ## What was done
 
 | Block | What | Status |
 |-------|------|--------|
-| **D2** | RAG service stores section/section_title/paragraph/paragraph_num/page in Qdrant | ✅ |
-| **D3** | Control generator reads structural metadata, page in source_citation | ✅ |
-| **D4** | BGB § 312k validation — overlap bug found + fixed | ✅ |
-| **D5** | 430/436 documents re-ingested with the new chunking | ✅ |
-| **HTML fix** | HTML stripping + charset detection (ISO-8859-1) | ✅ |
-| **pdfplumber** | Added as PDF backend, PDF_EXTRACTION_BACKEND=auto | ✅ |
+| **D2** | RAG service stores section/section_title/paragraph/page in Qdrant | ✅ |
+| **D3** | Control generator uses structural metadata in source_citation | ✅ |
+| **D4** | BGB § 312k validation — critical overlap bug found + fixed | ✅ |
+| **D5** | 430/436 documents re-ingested (all 6 collections) | ✅ |
+| **HTML fix** | Stripping + charset detection (ISO-8859-1), opening block tags | ✅ |
+| **EUR-Lex** | 20 EU regulations replaced with HTML (DSGVO: 0%→92%) | ✅ |
+| **pdfplumber** | Added as PDF backend + PDF_EXTRACTION_BACKEND=auto | ✅ |
+| **NIST regex** | Numbered sections (1.1), control IDs (AC-1, PO.1) | ✅ |
+| **3 missing PDFs** | EDPB Controller/Processor, GL 7, RTBF uploaded | ✅ |
+| **Frontend bug** | Fixed requirements.map TypeError in Compliance Admin | ✅ |
+| **Quality report** | 500 controls checked: 13% OK, 41% article mismatch | ✅ |
 
-## Re-ingestion results
-
-| Document type | Section rate | Note |
-|---------------|--------------|------|
-| **DE laws (TXT)** | **95-100%** | Excellent |
-| **HTML (gesetze-im-internet.de)** | **97.6%** | Was 0%, perfect after the fix |
-| **EDPB/DSK guidelines (PDF)** | **80-98%** | Good |
-| **EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO)** | **13-35%** | OPEN — see below |
-| **Tech specs (JSON, MD)** | **0%** | Expected, no §/articles |
-
-**Total: 69,888 chunks in 6 collections, 430 of 436 documents.**
-
-## Open issues
-
-### 1. EU Official Journal PDFs (~40 documents, 13-35% section rate)
-
-**Cause:** EU Official Journal PDFs use multi-column layouts. Both pypdf and pdfplumber extract broken words (`"Ar tik el"` instead of `"Artikel"`). No PDF extractor handles this reliably.
-
-**Recommended fix:** Download EU regulations from EUR-Lex as HTML instead of PDF. EUR-Lex offers all regulations as clean HTML, and our HTML stripping + legal chunker works perfectly on it.
-
-**Affected documents (examples):**
-- ai_act_2024_1689.pdf (33%) → HTML from EUR-Lex
-- cra_2024_2847.pdf (34%) → HTML from EUR-Lex
-- nis2_2022_2555.pdf (13%) → HTML from EUR-Lex
-- dsgvo_2016_679.pdf (0%) → HTML from EUR-Lex
-- amlr_2024_1624.pdf (12%) → HTML from EUR-Lex
-- All files matching the `_20XX_XXXX.pdf` pattern in the bp_compliance_ce collection
-
-### 2. 3 completely missing PDFs
-
-These were NEVER successfully stored in Qdrant:
-- `edpb_controller_processor_07_2020.pdf` — NO CHUNKS
-- `edpb_gl_7_2020.pdf` — NO CHUNKS
-- `edpb_rtbf_05_2019.pdf` — NO CHUNKS
-
-**Cause:** Timeout during upload (even at 3600s). The PDFs are large and embedding generation takes too long.
-
-**Fix:** Split manually into sections or upload in smaller parts.
-
-### 3. 1 corrupt PDF
-
-- `dsk_kpnr_3.pdf` — 500 Internal Server Error during extraction
-
-## Commits from this session
+## Commits (breakpilot-core)
 
 ```
 93099b2 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
@@ -65,44 +27,110 @@
 ddad58f fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
 a459636 fix(rag): HTML charset detection + opening block tag newlines
 75dda9a feat(embedding): add pdfplumber backend for multi-column PDF extraction
 41183ff fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
+5a6e588 docs: update session handover
+3009f3d feat(embedding): add NIST/ENISA/standard section numbering to chunker
 ```
 
+## Current quality
+
+### Section rate by file type
+| Type | Avg section rate | Note |
+|------|------------------|------|
+| TXT (DE laws) | **79%** | Excellent |
+| HTML (EUR-Lex + gesetze-im-internet) | **56%** | Good (preambles have no articles) |
+| PDF (EDPB/DSK guidelines) | **60-98%** | Good |
+| PDF (EU Official Journal) | N/A | Replaced by EUR-Lex HTML |
+| PDF (NIST/BSI) | **0-10%** | Problematic — see below |
+| TXT (OWASP) | **0%** | Own format, no §/articles |
+| Legal templates (JSON/MD) | **0%** | Expected — no legal structure |
+
+### Quality report (sample of 500 controls)
+- **13% fully correct** (article found, matches the chunk)
+- **41% article not in the source text** — main cause: controls generated from old broken PDF chunks
+- **7% no article in the citation**
+- **39% other** (regulation found, but section-matching issues)
+
+## Open issues (next session)
+
+### 1. NIST/ENISA/BSI PDFs (CRITICAL)
+
+**Problem:** BOTH pypdf AND pdfplumber break the text of multi-column documents. Section numbers do not end up at the start of a line, so the regex extension did not help.
+
+**3 NIST PDFs were briefly lost** (chunks deleted, upload timed out). Restoration from MinIO has been started.
+
+**Fix (prioritized):**
+
+1. **Text normalization** after PDF extraction:
+   ```python
+   import re
+
+   def _normalize_multicolumn_text(text):
+       # "1 . 1" → "1.1", "AC - 1" → "AC-1", "GV . OC - 01" → "GV.OC-01"
+       text = re.sub(r'(\d+)\s*\.\s*(\d+)', r'\1.\2', text)
+       text = re.sub(r'([A-Z]{2})\s*[-\.]\s*([A-Z]{2})\s*-\s*(\d+)', r'\1.\2-\3', text)
+       text = re.sub(r'([A-Z]{2})\s*-\s*(\d+)', r'\1-\2', text)
+       return text
+   ```
+
+2. **Missing regex patterns:**
+   - `GV.OC-01` (NIST CSF 2.0): `[A-Z]{2}\.[A-Z]{2}-\d{2}`
+   - `A01:2021` (OWASP): `A\d{2}(?::\d{4})?`
+   - `AC-1(1)` (NIST enhancements): `[A-Z]{2}-\d+\(\d+\)`
+
+3. **Alternative: HTML from the NIST/ENISA websites** (same approach as EUR-Lex):
+   - NIST: csrc.nist.gov offers HTML versions
+   - ENISA: enisa.europa.eu offers HTML versions
+   - Best option for maximum quality
+
+4. **Fix the D5 script:** upload FIRST, delete ONLY on success (prevents data loss on timeout)
+
+### 2. Citation backfill (IMPORTANT)
+
+**Problem:** 41% of controls carry wrong article citations (generated from old broken PDF chunks).
+
+**Fix:** Reconcile after the fact:
+1. For each control with a source_citation: look up the regulation in Qdrant
+2. Find chunks with a matching section
+3. Update source_citation.article when there is a better match
+
+**Script:** `control-pipeline/scripts/quality_report.py` already exists as a starting point.
+
+### 3. Missing laws (Block E3)
+
+- BEG IV (Viertes Buerokratieentlastungsgesetz, 2024)
+- More in `project_missing_legal_sources.md`
+
+### 4. Frontend 500 errors
+
+The compliance frontend (macmini:3007) still returns 500 errors on:
+- /controls, /canonical, /control-instances, /findings, /vendors, /contracts
+
+The `requirements.map` TypeError is fixed, but the API 500s point to backend problems that need to be investigated separately.
+
 ## Critical files
 
-| File | Change |
-|------|--------|
-| `embedding-service/main.py` | Overlap bug fix, pdfplumber backend |
-| `rag-service/api/documents.py` | D2 payload fields + HTML detection |
-| `rag-service/html_utils.py` | HTML stripping + charset detection (NEW) |
-| `rag-service/embedding_client.py` | ChunkResult dataclass (D2) |
-| `control-pipeline/services/rag_client.py` | page field in RAGSearchResult (D3) |
-| `control-pipeline/services/control_generator.py` | section priority + page (D3) |
-| `control-pipeline/scripts/reingest_d5.py` | re-ingestion script (NEW) |
-| `control-pipeline/scripts/reingest_d5_config.py` | config + helpers (NEW) |
-| `docker-compose.yml` | PDF_EXTRACTION_BACKEND=auto |
-
-## Next steps (Block E)
-
-### E1: Replace EU regulations with HTML from EUR-Lex
-1. List all EU Official Journal PDFs with a section rate below 50%
-2. Download the EUR-Lex HTML versions (the CELEX numbers are in the Qdrant payloads)
-3. Delete the old PDF chunks, upload the HTML versions
-4. Quality check → expected section rate >90%
-
-### E2: Split + upload the 3 missing EDPB PDFs
-
-### E3: Ingest missing laws (current BGB, ArbZG, MuSchG, etc.)
-See master plan Block E in `jazzy-snacking-creek.md`
+| File | Repo | Change |
+|------|------|--------|
+| `embedding-service/main.py` | core | overlap fix, pdfplumber, NIST regex |
+| `rag-service/api/documents.py` | core | D2 payloads + HTML stripping |
+| `rag-service/html_utils.py` | core | HTML stripping + charset (NEW) |
+| `rag-service/embedding_client.py` | core | ChunkResult (D2) |
+| `control-pipeline/services/rag_client.py` | core | page field (D3) |
+| `control-pipeline/services/control_generator.py` | core | section priority + page (D3) |
+| `control-pipeline/scripts/reingest_d5.py` | core | re-ingestion script |
+| `control-pipeline/scripts/replace_eu_pdfs_with_html.py` | core | EUR-Lex replacement |
+| `control-pipeline/scripts/quality_report.py` | core | quality check |
+| `docker-compose.yml` | core | PDF_EXTRACTION_BACKEND=auto |
+| `canonical_control_service.py` | compliance | _ensure_list fix |
+| `ControlDetail.tsx` | compliance | Array.isArray guard |
 
 ## DB state
 
 | Collection | Chunks | Documents |
 |-----------|--------|-----------|
-| bp_compliance_ce | ~18,000 | ~60 |
-| bp_compliance_gesetze | ~31,500 | ~98 |
-| bp_compliance_datenschutz | ~15,000 | ~107 |
-| bp_dsfa_corpus | ~3,500 | ~30 |
-| bp_legal_templates | ~2,000 | ~100 |
+| bp_compliance_ce | ~23,600 | ~55 |
+| bp_compliance_gesetze | ~32,000 | ~98 |
+| bp_compliance_datenschutz | ~13,000 | ~107 |
+| bp_dsfa_corpus | ~320 | ~20 |
+| bp_legal_templates | ~1,460 | ~100 |
 | **Total** | **~70,000** | **~430** |
 
 ## Tests
 
@@ -116,12 +144,17 @@
 cd rag-service && PYTHONPATH=. python3 -m pytest tests/ -v
 
 # Control pipeline (387 tests)
 PYTHONPATH=control-pipeline python3 -m pytest control-pipeline/tests/ -v
+
+# Quality report (500 controls)
+python3 control-pipeline/scripts/quality_report.py --db-host macmini --sample 500
 ```
-## Memory files (read these!)
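The chunk counts in the DB table can be re-checked against Qdrant's collection-info endpoint (`GET /collections/{name}` returns `points_count` in its result). A minimal stdlib sketch (urllib instead of the httpx used elsewhere; host and collection names taken from this handover):

```python
import json
from urllib.request import urlopen

COLLECTIONS = [
    "bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
    "bp_dsfa_corpus", "bp_legal_templates",
]


def points_count(info: dict) -> int:
    """Extract the chunk count from a Qdrant collection-info response."""
    return int(info["result"]["points_count"])


def collection_counts(qdrant_url: str) -> dict[str, int]:
    """Fetch the current chunk count per collection.

    Example: collection_counts("http://macmini:6333")
    """
    counts = {}
    for coll in COLLECTIONS:
        with urlopen(f"{qdrant_url}/collections/{coll}") as resp:
            counts[coll] = points_count(json.load(resp))
    return counts
```

Comparing these live counts with the table after each re-ingestion run would catch silent chunk loss early.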
+## Recommended order for the next session
 
-All under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`:
-- `MEMORY.md` — index
-- `project_structural_chunking.md` — architecture decision
-- `feedback_legal_source_licensing.md` — Rule 1/2/3
-- `project_control_pipeline_masterplan.md` — overall plan A-G
+1. Check the NIST restoration (3 PDFs from MinIO)
+2. Fix the D5 script (upload before delete)
+3. Text normalization for PDF extraction
+4. NIST/ENISA HTML downloads (same approach as EUR-Lex)
+5. Citation backfill for existing controls
+6. Investigate the frontend 500 errors
+7. Ingest BEG IV + missing laws
diff --git a/control-pipeline/scripts/quality_report.py b/control-pipeline/scripts/quality_report.py
new file mode 100644
index 0000000..37ad792
--- /dev/null
+++ b/control-pipeline/scripts/quality_report.py
@@ -0,0 +1,303 @@
+#!/usr/bin/env python3
+"""
+E2E Quality Report: Verify controls have correct source citations.
+
+Loads N random controls from PostgreSQL, cross-references with Qdrant chunks,
+and reports mismatches between source_citation and actual chunk metadata.
+
+Usage:
+    # Against Mac Mini
+    python3 scripts/quality_report.py --db-host macmini --qdrant-url http://macmini:6333
+
+    # Smaller sample
+    python3 scripts/quality_report.py --db-host macmini --sample 100
+"""
+
+import argparse
+import json
+import logging
+import sys
+
+import httpx
+from sqlalchemy import create_engine, text
+from sqlalchemy.orm import sessionmaker
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+logger = logging.getLogger("quality-report")
+
+COLLECTIONS = [
+    "bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
+    "bp_dsfa_corpus", "bp_legal_templates",
+]
+
+
+def load_controls(db_url: str, sample_size: int) -> list[dict]:
+    """Load random controls with source_citation from PostgreSQL."""
+    engine = create_engine(db_url)
+    Session = sessionmaker(bind=engine)
+
+    with Session() as db:
+        rows = db.execute(text("""
+            SELECT id::text, control_id, title,
+                   source_citation::text, source_original_text,
+                   generation_metadata::text, release_state
+            FROM compliance.canonical_controls
+            WHERE source_citation IS NOT NULL
+              AND source_original_text IS NOT NULL
+              AND release_state = 'draft'
+            ORDER BY RANDOM()
+            LIMIT :n
+        """), {"n": sample_size}).fetchall()
+
+    controls = []
+    for row in rows:
+        citation = json.loads(row[3]) if row[3] else {}
+        metadata = json.loads(row[5]) if row[5] else {}
+        controls.append({
+            "id": row[0],
+            "control_id": row[1],
+            "title": row[2],
+            "citation": citation,
+            "source_text": row[4],
+            "metadata": metadata,
+            "release_state": row[6],
+        })
+    return controls
+
+
+def build_qdrant_index(qdrant_url: str) -> dict:
+    """Build regulation_id → list[chunk] index from Qdrant.
+
+    Controls were generated from OLD chunks (512 chars). Qdrant now has
+    NEW chunks (1500 chars). Hash matching won't work — use regulation + section matching instead.
+ """ + logger.info("Building Qdrant chunk index by regulation_id...") + index = {} # regulation_id → [{"section": ..., "text_snippet": ..., ...}] + client = httpx.Client(timeout=60.0) + + for coll in COLLECTIONS: + offset = None + for _ in range(600): + body = {"limit": 250, "with_payload": True, "with_vector": False} + if offset: + body["offset"] = offset + r = client.post(f"{qdrant_url}/collections/{coll}/points/scroll", json=body) + if r.status_code != 200: + break + data = r.json()["result"] + for pt in data["points"]: + reg_id = pt["payload"].get("regulation_id", "") + if not reg_id: + continue + chunk = { + "section": pt["payload"].get("section", ""), + "section_title": pt["payload"].get("section_title", ""), + "paragraph": pt["payload"].get("paragraph", ""), + "text_snippet": pt["payload"].get("chunk_text", "")[:200], + "filename": pt["payload"].get("filename", ""), + "collection": coll, + } + index.setdefault(reg_id, []).append(chunk) + offset = data.get("next_page_offset") + if not offset: + break + + client.close() + total = sum(len(v) for v in index.values()) + logger.info("Qdrant index: %d regulations, %d chunks", len(index), total) + return index + + +def check_control(ctrl: dict, qdrant_index: dict) -> dict: + """Check a single control's source_citation against Qdrant chunks. + + Strategy: Find chunks by regulation_id from generation_metadata, + then check if any chunk has a matching section/article. 
+ """ + result = { + "control_id": ctrl["control_id"], + "title": (ctrl["title"] or "")[:60], + "citation_source": ctrl["citation"].get("source", ""), + "citation_article": ctrl["citation"].get("article", ""), + "citation_paragraph": ctrl["citation"].get("paragraph", ""), + "citation_page": ctrl["citation"].get("page"), + "issues": [], + } + + # Get regulation_id from generation_metadata + reg_code = ctrl["metadata"].get("source_regulation", "") + citation_article = ctrl["citation"].get("article", "") + + # Check 1: Does the control have a regulation reference? + if not reg_code: + result["issues"].append("NO_REGULATION_CODE") + return result + + # Check 2: Does this regulation exist in Qdrant? + chunks = qdrant_index.get(reg_code, []) + if not chunks: + result["issues"].append(f"REGULATION_NOT_IN_QDRANT: {reg_code}") + result["reg_found"] = False + return result + + result["reg_found"] = True + result["reg_chunks"] = len(chunks) + + # Check 3: Does the control have an article citation? + if not citation_article: + result["issues"].append("NO_ARTICLE_IN_CITATION") + # Still check if chunks have section metadata at all + has_section = any(c["section"] for c in chunks) + if has_section: + result["issues"].append("CHUNKS_HAVE_SECTIONS_BUT_CONTROL_MISSING") + return result + + # Check 4: Is the cited article found in any chunk's section? 
+    norm_article = citation_article.strip().lower()
+    matching_chunks = [
+        c for c in chunks
+        if c["section"] and (
+            norm_article == c["section"].strip().lower()
+            or norm_article in c["section"].strip().lower()
+            or c["section"].strip().lower() in norm_article
+        )
+    ]
+
+    if matching_chunks:
+        result["article_match"] = True
+        result["matched_section"] = matching_chunks[0]["section"]
+    else:
+        # Check if ANY chunk has sections (the article might just not match)
+        sections_in_regulation = sorted(set(c["section"] for c in chunks if c["section"]))
+        if sections_in_regulation:
+            result["issues"].append(
+                f"ARTICLE_NOT_FOUND_IN_CHUNKS: '{citation_article}' not in {sections_in_regulation[:5]}"
+            )
+        else:
+            result["issues"].append("NO_SECTIONS_IN_REGULATION_CHUNKS")
+
+    # Check 5: Does source_original_text contain the cited article?
+    source_text = ctrl["source_text"] or ""
+    if citation_article and source_text:
+        if citation_article.lower() not in source_text.lower():
+            if f"[{citation_article}" not in source_text:
+                result["issues"].append("ARTICLE_NOT_IN_SOURCE_TEXT")
+
+    if not result["issues"]:
+        result["issues"] = ["OK"]
+
+    return result
+
+
+def generate_report(results: list[dict]):
+    """Print the quality report.
+
+    Note: the issue keys counted here must match the strings emitted by
+    check_control (REGULATION_NOT_IN_QDRANT, ARTICLE_NOT_FOUND_IN_CHUNKS,
+    NO_SECTIONS_IN_REGULATION_CHUNKS, ...).
+    """
+    total = len(results)
+    ok = sum(1 for r in results if r["issues"] == ["OK"])
+    reg_found = sum(1 for r in results if r.get("reg_found", False))
+    no_reg = sum(1 for r in results if any(i.startswith("REGULATION_NOT_IN_QDRANT") for i in r["issues"]))
+    no_article = sum(1 for r in results if "NO_ARTICLE_IN_CITATION" in r["issues"])
+    no_section = sum(1 for r in results if "NO_SECTIONS_IN_REGULATION_CHUNKS" in r["issues"])
+    mismatch = sum(1 for r in results if any(i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS") for i in r["issues"]))
+    not_in_text = sum(1 for r in results if "ARTICLE_NOT_IN_SOURCE_TEXT" in r["issues"])
+
+    print("\n" + "=" * 100)
+    print("QUALITY REPORT: CONTROL SOURCE CITATION VERIFICATION")
+    print("=" * 100)
+
+    print(f"\nSample: {total} controls")
+    print(f"\n{'Metric':<45} {'Count':>8} {'Share':>8}")
+    print("-" * 65)
+    print(f"{'OK (no issues)':<45} {ok:>8} {ok*100//max(total,1):>7}%")
+    print(f"{'Regulation found in Qdrant':<45} {reg_found:>8} {reg_found*100//max(total,1):>7}%")
+    print(f"{'Regulation NOT found in Qdrant':<45} {no_reg:>8} {no_reg*100//max(total,1):>7}%")
+    print(f"{'No article in source_citation':<45} {no_article:>8} {no_article*100//max(total,1):>7}%")
+    print(f"{'No sections in regulation chunks':<45} {no_section:>8} {no_section*100//max(total,1):>7}%")
+    print(f"{'Article not found in chunk sections':<45} {mismatch:>8} {mismatch*100//max(total,1):>7}%")
+    print(f"{'Article not in source text':<45} {not_in_text:>8} {not_in_text*100//max(total,1):>7}%")
+
+    # Show sample article mismatches
+    mismatches = [r for r in results if any(i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS") for i in r["issues"])]
+    if mismatches:
+        print("\n=== ARTICLE MISMATCHES (first 10) ===\n")
+        for r in mismatches[:10]:
+            issues = [i for i in r["issues"] if i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS")]
+            print(f"  {r['control_id']:20s} {r['title'][:40]:40s}")
+            for i in issues:
+                print(f"    → {i}")
+
+    # Show sample missing regulations
+    not_found = [r for r in results if any(i.startswith("REGULATION_NOT_IN_QDRANT") for i in r["issues"])]
+    if not_found:
+        print("\n=== REGULATION NOT IN QDRANT (first 10) ===\n")
+        for r in not_found[:10]:
+            src = r.get("citation_source", "?")
+            art = r.get("citation_article", "?")
+            print(f"  {r['control_id']:20s} {src[:25]:25s} {art}")
+
+    # Distribution by source
+    print("\n=== BY SOURCE ===\n")
+    source_stats = {}
+    for r in results:
+        src = r.get("citation_source", "?")[:30]
+        if src not in source_stats:
+            source_stats[src] = {"total": 0, "ok": 0, "no_reg": 0, "no_section": 0}
+        source_stats[src]["total"] += 1
+        if r["issues"] == ["OK"]:
+            source_stats[src]["ok"] += 1
+        if any(i.startswith("REGULATION_NOT_IN_QDRANT") for i in r["issues"]):
+            source_stats[src]["no_reg"] += 1
+        if "NO_SECTIONS_IN_REGULATION_CHUNKS" in r["issues"]:
+            source_stats[src]["no_section"] += 1
+
+    print(f"  {'Source':<32} {'Total':>6} {'OK':>6} {'OK%':>6} {'NoReg':>8} {'NoSect':>8}")
+    print(f"  {'-'*72}")
+    for src in sorted(source_stats.keys(), key=lambda s: -source_stats[s]["total"]):
+        s = source_stats[src]
+        pct = s["ok"] * 100 // max(s["total"], 1)
+        print(f"  {src:<32} {s['total']:>6} {s['ok']:>6} {pct:>5}% {s['no_reg']:>8} {s['no_section']:>8}")
+
+    print(f"\n{'='*100}")
+    verdict = "PASS" if ok * 100 // max(total, 1) >= 50 else "NEEDS IMPROVEMENT"
+    print(f"RESULT: {verdict} — {ok}/{total} controls ({ok*100//max(total,1)}%) fully correct")
+    print(f"{'='*100}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Control Source Citation Quality Report")
+    parser.add_argument("--db-host", default="macmini")
+    parser.add_argument("--db-port", type=int, default=5432)
+    parser.add_argument("--db-name", default="breakpilot_db")
+    parser.add_argument("--db-user", default="breakpilot")
+    parser.add_argument("--db-pass", default="breakpilot123")
+    parser.add_argument("--qdrant-url", default="http://macmini:6333")
+    parser.add_argument("--sample", type=int, default=500)
+    args = parser.parse_args()
+
+    db_url = f"postgresql://{args.db_user}:{args.db_pass}@{args.db_host}:{args.db_port}/{args.db_name}"
+
+    # Load controls
+    logger.info("Loading %d random controls from DB...", args.sample)
+    controls = load_controls(db_url, args.sample)
+    logger.info("Loaded %d controls with source_citation", len(controls))
+
+    if not controls:
+        print("ERROR: No controls found with source_citation")
+        sys.exit(1)
+
+    # Build Qdrant index
+    qdrant_index = build_qdrant_index(args.qdrant_url)
+
+    # Check each control
+    logger.info("Checking %d controls against Qdrant...", len(controls))
+    results = []
+    for ctrl in controls:
+        result = check_control(ctrl, qdrant_index)
+        results.append(result)
+
+    # Report
+    generate_report(results)
+
+
+if __name__ == "__main__":
+    main()
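The "upload first, delete only on success" fix for the D5 re-ingestion script (open issue 1, point 4 in the handover) boils down to an ordering guarantee that can be sketched independently of the actual storage clients. `safe_replace`, `upload`, and `delete` are hypothetical stand-ins for the real MinIO/Qdrant calls, not existing pipeline code:

```python
def safe_replace(upload, delete) -> bool:
    """Replace chunks without risking data loss: run the upload first,
    and delete the old chunks only after the upload succeeded.

    `upload` and `delete` are callables (stand-ins for the real client
    calls); upload() is assumed to return True on success. A timeout or
    any other exception leaves the old chunks untouched.
    """
    try:
        ok = upload()
    except Exception:
        ok = False  # e.g. upload timeout: keep the old chunks
    if ok:
        delete()
    return ok
```

This inverts the current delete-then-upload order that caused the 3 NIST PDFs to be briefly lost when the upload timed out.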