feat: Control Library UI, dedup migration, QA tooling, docs
Some checks failed
CI/CD / go-lint (push) Has been skipped
CI/CD / python-lint (push) Has been skipped
CI/CD / nodejs-lint (push) Has been skipped
CI/CD / test-go-ai-compliance (push) Failing after 31s
CI/CD / test-python-backend-compliance (push) Successful in 1m35s
CI/CD / test-python-document-crawler (push) Successful in 20s
CI/CD / test-python-dsms-gateway (push) Successful in 17s
CI/CD / validate-canonical-controls (push) Successful in 10s
CI/CD / Deploy (push) Has been skipped
- Control Library: parent control display, ObligationTypeBadge, GenerationStrategyBadge variants, evidence string fallback
- API: expose parent_control_uuid/id/title in canonical controls
- Fix: DSFA SQLAlchemy 2.0 Row._mapping compatibility
- Migration 074: control_parent_links + control_dedup_reviews tables
- QA scripts: benchmark, gap analysis, OSCAL import, OWASP cleanup, phase5 normalize, phase74 gap fill, sync_db, run_job
- Docs: dedup engine, RAG benchmark, lessons learned, pipeline docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs-src/development/rag-pipeline-benchmark.md (new file, 206 lines)
@@ -0,0 +1,206 @@
# RAG Pipeline Benchmark & Optimizations

As of 2026-03-21. Comparison of our implementation against the state of the art, with recommendations prioritized by impact/effort.

---
## Current Pipeline (As-Is)

```mermaid
flowchart LR
    A[Documents] -->|Document Crawler| B[Chunks 512/50]
    B -->|bge-m3| C[Qdrant Dense]
    C -->|Cosine Search| D[Control Generator v2]
    D -->|LLM| E[Rich Controls 6,373]
    E -->|Pass 0a| F[Obligations]
    F -->|Pass 0b| G[Atomic Controls]
    G -->|4-Stage Dedup| H[Master Controls ~18K]
```
| Component | Implementation | SOTA assessment |
|-----------|----------------|-----------------|
| **Chunking** | Recursive, 512 characters, 50 overlap | Too small for legal texts |
| **Embedding** | bge-m3 (1024-dim, Ollama) | Good, but only dense vectors used |
| **Vector DB** | Qdrant with payload filtering | Hybrid search not enabled |
| **Retrieval** | Pure dense cosine similarity | No re-ranking, no BM25 |
| **Extraction** | 3-tier (Exact → Embedding → LLM) | Solid architecture |
| **Dedup** | 4-stage (Pattern → Action → Object → Embedding) | Above average |
| **QA** | 5-metric similarity + PDF QA matching | Good, but RAGAS missing |

---
## Tier 1: Quick Wins (days, not weeks)

### 1. Increase chunk size: 512 → 1024, overlap 50 → 128

**Problem:** The NAACL 2025 Vectara study shows that 512-1024 tokens are optimal for analytical/legal queries. Our 512-character chunks (~128 tokens) are far too small.

**Our lessons learned:** "Chunks are cut off mid-paragraph. Article and section numbers are missing."

**Change:** Adjust the config parameters in `ingest-phase-h.sh` (see the sketch after the table).

| Metric | Before | After |
|--------|--------|-------|
| Chunk size | 512 chars (~128 tokens) | 1024 chars (~256 tokens) |
| Overlap | 50 chars (10%) | 128 chars (12.5%) |

**Impact:** HIGH | **Effort:** LOW
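A minimal sketch of the new parameters; the real values live in `ingest-phase-h.sh`, and this sliding-window chunker is a simplified stand-in for the recursive splitter the crawler actually uses.

```python
# Hypothetical sketch; constants mirror the table above, the function is a
# simplified stand-in for the document crawler's recursive splitter.
CHUNK_SIZE = 1024    # characters (~256 tokens), up from 512
CHUNK_OVERLAP = 128  # characters (12.5%), up from 50

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Sliding-window character chunking with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```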
### 2. Ollama JSON mode for obligation extraction

**Problem:** `_parse_json` in `decomposition_pass.py` has a regex fallback, a sign that the LLM output is not reliably valid JSON.

**Change:** Set `format: "json"` in the Ollama API calls, as sketched below.

**Impact:** MEDIUM | **Effort:** LOW (one parameter)
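A minimal sketch of enabling Ollama's JSON mode; the endpoint, model name, and prompt are illustrative, not our actual configuration.

```python
# Sketch: Ollama's /api/generate with format="json" constrains the model
# to emit valid JSON, making the regex fallback unnecessary.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # assumed local Ollama instance
    json={
        "model": "llama3.1",                # placeholder model name
        "prompt": "Extract the obligations from the following article ...",
        "format": "json",                   # the one-parameter change
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])              # parseable JSON string
```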
### 3. Chain-of-thought prompting for Pass 0a/0b

**Problem:** The LegalGPT framework shows that explicit reasoning chains ("first identify the addressee, then the action, then the normative strength") significantly improve extraction quality.

**Impact:** MEDIUM | **Effort:** LOW (prompt engineering)
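A hedged sketch of such a reasoning-chain prompt; the wording is illustrative and not the prompt currently used by Pass 0a.

```python
# Illustrative chain-of-thought extraction prompt for Pass 0a; the exact
# wording is an assumption, not the pipeline's current prompt.
OBLIGATION_COT_PROMPT = (
    "You extract obligations from legal text. Reason step by step:\n"
    "1. First identify the addressee (controller, processor, member state).\n"
    "2. Then identify the required action.\n"
    "3. Then classify the normative strength (shall / should / may).\n"
    "Finally output one JSON object per obligation:\n"
    '{"addressee": ..., "action": ..., "strength": ...}\n\n'
    "Text:\n"
)

def build_prompt(chunk: str) -> str:
    return OBLIGATION_COT_PROMPT + chunk
```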
---

## Tier 2: High Impact, Medium Effort (1-2 weeks)

### 4. Hybrid search (dense + sparse) via Qdrant

**Problem:** Pure dense search. Legal queries contain specific terms ("DSGVO Art. 35", "Abs. 3") that BM25/sparse retrieval finds more reliably.

**Approach:** BGE-M3 already generates sparse vectors; we currently throw them away!

```
Qdrant Query API:
- Dense: bge-m3 cosine (as before)
- Sparse: bge-m3 sparse vectors (new)
- Fusion: Reciprocal Rank Fusion (RRF)
```

**Benchmarks (Anthropic):** 49% fewer failed retrievals with contextual retrieval, 67% with re-ranking.

**Impact:** VERY HIGH | **Effort:** MEDIUM
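A sketch of an RRF hybrid query with the qdrant-client Query API, assuming a collection with named vectors `dense` and `sparse`; the collection name, vector names, and query vectors are illustrative, not our actual schema.

```python
# Hybrid dense+sparse query fused with Reciprocal Rank Fusion (RRF).
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

dense_vec = [0.0] * 1024                         # bge-m3 dense query embedding
sparse_idx, sparse_val = [17, 4242], [0.8, 0.3]  # bge-m3 sparse query output

hits = client.query_points(
    collection_name="chunks",                    # illustrative collection name
    prefetch=[
        models.Prefetch(query=dense_vec, using="dense", limit=20),
        models.Prefetch(
            query=models.SparseVector(indices=sparse_idx, values=sparse_val),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # rank fusion
    limit=5,
)
```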
### 5. Cross-encoder re-ranking

**Problem:** The top-5 results go straight to the LLM; there is no quality check on the retrieval results.

**Approach:** Run BGE Reranker v2 (MIT license) over the top-20 results, then pass the top-5 to the LLM (see the sketch after the table).

| Re-ranker | License | Recommendation |
|-----------|---------|----------------|
| BGE Reranker v2 | MIT | Recommended |
| Jina Reranker v2 | Apache-2.0 | Alternative |
| ColBERT v2 | MIT | Later |

**Impact:** HIGH | **Effort:** MEDIUM
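A minimal re-ranking sketch using the FlagEmbedding package and the `BAAI/bge-reranker-v2-m3` model; the query and candidate texts are placeholders.

```python
# Re-rank retrieval candidates with a cross-encoder, keep the best five.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "Which controls does DSGVO Art. 35 require?"
candidates = ["chunk one ...", "chunk two ...", "chunk three ..."]  # top-20 from Qdrant

scores = reranker.compute_score([[query, c] for c in candidates])
top5 = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:5]]
```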
### 6. Cross-regulation dedup pass

**Problem:** Dedup always filters by `pattern_id`, so controls from DSGVO Art. 25 and NIS2 Art. 21 (both security-by-design) are never compared against each other.

**Approach:** Run a second Qdrant search without the `pattern_id` filter after the regular dedup pass, as sketched below.

**Impact:** HIGH | **Effort:** MEDIUM
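A hedged sketch of the cross-regulation pass: re-query Qdrant while excluding the control's own `pattern_id` so near-duplicates from other regulations surface. The collection name, vector shape, and threshold are illustrative.

```python
# Second-pass dedup search that deliberately crosses pattern boundaries.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def cross_regulation_candidates(control_vec: list[float], own_pattern_id: str):
    """Return high-similarity controls from *other* patterns/regulations."""
    return client.query_points(
        collection_name="controls",            # illustrative collection name
        query=control_vec,
        query_filter=models.Filter(
            must_not=[models.FieldCondition(
                key="pattern_id",
                match=models.MatchValue(value=own_pattern_id),
            )]
        ),
        score_threshold=0.92,                  # illustrative dedup threshold
        limit=10,
    )
```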
### 7. Automated regression tests (golden set)

**Problem:** No systematic quality measurement after pipeline changes.

**Approach:** A 20-chunk golden set → control generation → check output stability, e.g. via a pytest harness like the sketch below.

**Impact:** HIGH | **Effort:** LOW
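A hypothetical pytest harness: generate controls for a fixed golden set and compare against stored snapshots. `generate_controls`, the import path, and the file layout are assumptions, not our actual module API.

```python
# Golden-set regression harness (sketch).
import json
import pathlib

import pytest

from pipeline import generate_controls  # assumed pipeline entry point

GOLDEN_DIR = pathlib.Path("tests/golden")  # 20 curated chunks + expected outputs
CASES = sorted(GOLDEN_DIR.glob("chunk_*.json"))

@pytest.mark.parametrize("case", CASES, ids=lambda p: p.stem)
def test_control_generation_is_stable(case):
    data = json.loads(case.read_text())
    generated = generate_controls(data["chunk"])
    assert generated == data["expected_controls"], f"regression in {case.name}"
```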
---

## Tier 3: Strategic Investments (weeks to months)

### 8. Article-boundary chunking

A custom splitter for EU regulations and German statutes: split at "Art.", "Artikel", and "Paragraph" boundaries instead of by character count. A regex-based sketch follows.
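A hedged sketch of such a splitter; the boundary pattern is illustrative and would need tuning against real documents.

```python
# Split legal text at article/section starts so chunk boundaries never
# fall mid-provision. Matches "Artikel 5", "Art. 5", "§ 5" at line start.
import re

BOUNDARY = re.compile(r"(?=^\s*(?:Artikel\s+\d+|Art\.\s*\d+|§\s*\d+))", re.MULTILINE)

def split_by_article(text: str) -> list[str]:
    """Zero-width split: each article heading stays with its body."""
    parts = [p.strip() for p in BOUNDARY.split(text)]
    return [p for p in parts if p]
```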
### 9. RAGAS evaluation pipeline

[RAGAS](https://docs.ragas.io/) with a golden dataset (50-100 manually verified control-to-source mappings). Metrics: faithfulness, answer relevancy, context precision, context recall. A minimal sketch follows.
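A minimal sketch against the classic ragas API; column and metric names follow older ragas releases and may differ between versions, and all field contents are placeholders.

```python
# Evaluate a golden dataset with the four standard RAGAS metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

ds = Dataset.from_dict({
    "question": ["What does DSGVO Art. 35 require?"],
    "answer": ["A data protection impact assessment for high-risk processing."],
    "contexts": [["Art. 35: Where a type of processing is likely to ..."]],
    "ground_truth": ["Conduct a DPIA for high-risk processing."],
})

report = evaluate(ds, metrics=[faithfulness, answer_relevancy,
                               context_precision, context_recall])
print(report)
```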
### 10. BGE-M3 fine-tuning

Fine-tune on our compliance corpus (~6,373 control title/objective pairs). Research reports +10-30% improvement in domain retrieval.
### 11. LLM-as-judge

Have Claude Sonnet score every generated control for faithfulness to its source text (~$0.01/control); a prompt sketch follows.
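A hedged judge sketch using the Anthropic Python SDK; the model id, prompt wording, and score scale are illustrative.

```python
# Score one control's faithfulness to its source text via Claude.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_faithfulness(control_text: str, source_text: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Rate from 1-5 how faithfully this control reflects the source.\n"
                f"Source:\n{source_text}\n\nControl:\n{control_text}\n"
                'Answer as JSON: {"score": <1-5>, "reason": "..."}'
            ),
        }],
    )
    return msg.content[0].text
```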
### 12. Active learning from the review queue

Use the human decisions from the dedup review queue to tune the similarity thresholds over time, e.g. as in the sketch below.
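A hedged sketch of the idea: pick the embedding-similarity threshold that best agrees with human "duplicate"/"distinct" verdicts from the review queue. The sample tuples are placeholders for real review data.

```python
# Fit a dedup threshold to human review decisions.
reviews = [(0.97, True), (0.95, True), (0.91, False), (0.88, False)]

def best_threshold(reviews: list[tuple[float, bool]]) -> float:
    """Return the candidate threshold with the highest agreement with
    reviewers, treating similarity >= threshold as 'duplicate'."""
    candidates = sorted({sim for sim, _ in reviews})
    return max(
        candidates,
        key=lambda t: sum((sim >= t) == is_dup for sim, is_dup in reviews),
    )
```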
---

## Not recommended (low ROI or conflicts)

| Approach | Reason |
|----------|--------|
| Jina v3 embeddings | **CC-BY-NC-4.0**: violates our open-source policy |
| Voyage-law-2 | API-only, proprietary: no self-hosting |
| Semantic chunking | Benchmarks show no advantage over recursive splitting for structured documents |
| HyDE as the primary strategy | Latency (+43-60%) plus hallucination risk |
| Knowledge graph RAG | Massive effort, unclear payoff on a structured legal corpus |

---
## Embedding Model Comparison

| Model | MTEB score | Multilingual | Context | License | Assessment |
|-------|-----------|--------------|---------|---------|------------|
| **BGE-M3** (current) | 63.0 | 100+ languages | 8192 tokens | MIT | Good; dense + sparse + ColBERT |
| Jina v3 | 65.5 | 89 languages | 8192 tokens | CC-BY-NC | Not usable (license!) |
| E5-Mistral-7B | ~65 | Good | 4096 tokens | MIT | Large, high RAM |
| Voyage-law-2 | Best legal | EN legal | 16K tokens | Proprietary | Not usable (API-only) |

**Bottom line:** BGE-M3 remains the best choice for our stack. Enabling sparse vectors and fine-tuning yield more than a model switch.

---
## Test Coverage Analysis

### Pipeline modules (567 tests)

| Module | Tests | Assessment | Missing tests |
|--------|-------|------------|---------------|
| Control Generator | 110 | Excellent | 10-15 edge cases |
| Obligation Extractor | 107 | Excellent | 8-10 edge cases |
| Decomposition Pass | 90 | Excellent | 5-8 edge cases |
| Pattern Matcher | 72 | Good | 10-15 edge cases |
| Control Dedup | 56 | Excellent | 5-8 edge cases |
| Control Composer | 54 | Good | 8-10 edge cases |
| Pipeline Adapter | 36 | Good | 10-15 edge cases |
| Citation Backfill | 20 | Moderate | 5-8 edge cases |
| License Gate | 12 | Minimal | 5-8 edge cases |
| RAG Client | 10 | Minimal | 5-8 edge cases |

### Critical gaps (missing tests)

| Service | File | Priority |
|---------|------|----------|
| AI Compliance Assistant | `ai_compliance_assistant.py` | HIGH (25-30 tests needed) |
| PDF Extractor | `pdf_extractor.py` | HIGH (20-25 tests needed) |
| LLM Provider | `llm_provider.py` | HIGH (15-20 tests needed) |
| Similarity Detector | `similarity_detector.py` | MEDIUM (20-25 tests needed) |
| Anchor Finder | `anchor_finder.py` | MEDIUM |

### Test infrastructure

**Missing:** A shared `conftest.py` with common fixtures (LLM mock, DB mock, embedding mock). Fixtures are currently duplicated in every test file. A sketch follows.
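A hedged sketch of a shared `tests/conftest.py`; the fixture names and mocked interfaces are assumptions about our suite, not its current state.

```python
# Shared fixtures so individual test files stop re-declaring their mocks.
from unittest.mock import MagicMock

import pytest

@pytest.fixture
def llm_mock():
    """LLM client stub returning canned JSON instead of calling Ollama."""
    llm = MagicMock()
    llm.generate.return_value = '{"obligations": []}'
    return llm

@pytest.fixture
def embedding_mock():
    """Embedding stub: fixed-size zero vector, no model download needed."""
    emb = MagicMock()
    emb.embed.return_value = [0.0] * 1024  # bge-m3 dimensionality
    return emb

@pytest.fixture
def db_mock():
    """In-memory stand-in for the DB session used by pipeline tests."""
    return MagicMock()
```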
---
## Sources

- [NAACL 2025 Vectara Chunking Study](https://blog.premai.io/rag-chunking-strategies-the-2026-benchmark-guide/)
- [Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval)
- [Qdrant Hybrid Search Query API](https://qdrant.tech/articles/hybrid-search/)
- [Structure-Aware Chunking for Legal (ACL 2025)](https://aclanthology.org/2025.justnlp-main.19/)
- [RAGAS Evaluation Framework](https://docs.ragas.io/)
- [BGE Reranker v2 (MIT)](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- [LegalGPT / CALLM Framework](https://www.emergentmind.com/topics/compliance-alignment-llm-callm)