# AI Data Pipeline Architecture
This page documents the technical architecture of the AI data pipeline in detail.
## System Overview
```mermaid
graph TB
subgraph Users["Users"]
U1[Developers]
U2[Data Scientists]
U3[Teachers]
end
subgraph Frontend["Frontend (admin-v2)"]
direction TB
F1["OCR-Labeling
/ai/ocr-labeling"]
F2["RAG Pipeline
/ai/rag-pipeline"]
F3["Data & RAG
/ai/rag"]
F4["Exam Correction
/ai/klausur-korrektur"]
end
subgraph Backend["Backend Services"]
direction TB
B1["klausur-service
Port 8086"]
B2["embedding-service
Port 8087"]
end
subgraph Storage["Persistence"]
direction TB
D1[(PostgreSQL
Metadata)]
D2[(Qdrant
Vectors)]
D3[(MinIO
Images/PDFs)]
end
subgraph External["External APIs"]
E1[OpenAI API]
E2[Ollama]
end
U1 --> F1
U2 --> F2
U3 --> F4
F1 --> B1
F2 --> B1
F3 --> B1
F4 --> B1
B1 --> D1
B1 --> D2
B1 --> D3
B1 --> B2
B2 --> E1
B1 --> E2
```
## Component Details
### OCR Labeling Module
```mermaid
flowchart TB
subgraph Upload["Upload Process"]
U1[Upload images] --> U2[Store in MinIO]
U2 --> U3[Create session]
end
subgraph OCR["OCR Processing"]
O1[Load image] --> O2{Select model}
O2 -->|llama3.2-vision| O3a[Vision LLM]
O2 -->|trocr| O3b[Transformer]
O2 -->|paddleocr| O3c[PaddleOCR]
O2 -->|donut| O3d[Document AI]
O3a --> O4[OCR-Text]
O3b --> O4
O3c --> O4
O3d --> O4
end
subgraph Labeling["Labeling Process"]
L1[Load queue] --> L2[Display item]
L2 --> L3{Decision}
L3 -->|correct| L4[Confirm]
L3 -->|incorrect| L5[Correct]
L3 -->|unclear| L6[Skip]
L4 --> L7[PostgreSQL]
L5 --> L7
L6 --> L7
end
subgraph Export["Export"]
E1[Labeled items] --> E2{Format}
E2 -->|TrOCR| E3a[Transformer Format]
E2 -->|Llama| E3b[Vision Format]
E2 -->|Generic| E3c[JSON]
end
Upload --> OCR
OCR --> Labeling
Labeling --> Export
```
### RAG Pipeline Modul
```mermaid
flowchart TB
subgraph Sources["Data Sources"]
S1[NiBiS PDFs]
S2[Uploads]
S3[Rechtskorpus]
S4[Schulordnungen]
end
subgraph Processing["Processing"]
direction TB
P1[PDF Parser] --> P2[OCR if needed]
P2 --> P3[Text Cleaning]
P3 --> P4[Chunking
1000 chars, 200 overlap]
P4 --> P5[Metadata Extraction]
end
subgraph Embedding["Embedding"]
E1[embedding-service] --> E2[OpenAI API]
E2 --> E3[1536-dim Vektor]
end
subgraph Indexing["Indexierung"]
I1{Select collection}
I1 -->|EH| I2a[bp_nibis_eh]
I1 -->|Custom| I2b[bp_eh]
I1 -->|Legal| I2c[bp_legal_corpus]
I1 -->|Schul| I2d[bp_schulordnungen]
I2a --> I3[Qdrant upsert]
I2b --> I3
I2c --> I3
I2d --> I3
end
Sources --> Processing
Processing --> Embedding
Embedding --> Indexing
```
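The collection-selection step can be expressed as a simple routing table. The collection names come from the diagram above; the source-type keys are assumptions for illustration only:

```python
def route_collection(source: str) -> str:
    """Return the Qdrant collection a document is indexed into (per the diagram)."""
    mapping = {
        "nibis": "bp_nibis_eh",               # NiBiS PDFs
        "upload": "bp_eh",                    # user uploads
        "legal": "bp_legal_corpus",           # legal corpus
        "schulordnung": "bp_schulordnungen",  # school regulations
    }
    try:
        return mapping[source]
    except KeyError:
        raise ValueError(f"no collection configured for source {source!r}")
```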
### Data & RAG Module
```mermaid
flowchart TB
subgraph Query["Search Query"]
Q1[User Query] --> Q2[Query Embedding]
Q2 --> Q3[1536-dim Vektor]
end
subgraph Search["Qdrant Search"]
S1[Select collection] --> S2[Vector Search]
S2 --> S3[Top-k Results]
S3 --> S4[Score Filtering]
end
subgraph Results["Results"]
R1[Chunks] --> R2[Enrich metadata]
R2 --> R3[Source URLs]
R3 --> R4[Response]
end
Query --> Search
Search --> Results
```
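The top-k and score-filtering steps can be sketched as a pure function over Qdrant hits. The defaults (`top_k=5`, `min_score=0.7`) are illustrative assumptions, not the service's actual configuration:

```python
def filter_results(hits: list[dict], top_k: int = 5, min_score: float = 0.7) -> list[dict]:
    """Rank hits by similarity score, keep the top k, then drop low-score matches."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    return [h for h in ranked[:top_k] if h["score"] >= min_score]
```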
## Data Models
### OCR Labeling
```typescript
interface OCRSession {
id: string
name: string
source_type: 'klausur' | 'handwriting_sample' | 'scan'
ocr_model: 'llama3.2-vision:11b' | 'trocr' | 'paddleocr' | 'donut'
total_items: number
labeled_items: number
status: 'active' | 'completed' | 'archived'
created_at: string
}
interface OCRItem {
id: string
session_id: string
image_path: string
ocr_text: string | null
ocr_confidence: number | null
ground_truth: string | null
status: 'pending' | 'confirmed' | 'corrected' | 'skipped'
label_time_seconds: number | null
}
```
### RAG Pipeline
```typescript
interface TrainingJob {
id: string
name: string
status: 'queued' | 'preparing' | 'training' | 'validating' | 'completed' | 'failed' | 'paused'
progress: number
current_epoch: number
total_epochs: number
documents_processed: number
total_documents: number
config: {
batch_size: number
bundeslaender: string[]
mixed_precision: boolean
}
}
interface DataSource {
id: string
name: string
collection: string
document_count: number
chunk_count: number
status: 'active' | 'pending' | 'error'
last_updated: string | null
}
```
### Legal Corpus
```typescript
interface RegulationStatus {
code: string
name: string
fullName: string
type: 'eu_regulation' | 'eu_directive' | 'de_law' | 'bsi_standard'
chunkCount: number
status: 'ready' | 'empty' | 'error'
}
interface SearchResult {
text: string
regulation_code: string
regulation_name: string
article: string | null
paragraph: string | null
source_url: string
score: number
}
```
## Qdrant Collections
### Configuration
| Collection | Vector Dimension | Distance Metric | Payload |
|------------|------------------|-----------------|---------|
| `bp_nibis_eh` | 1536 | COSINE | bundesland, fach, aufgabe |
| `bp_eh` | 1536 | COSINE | user_id, klausur_id |
| `bp_legal_corpus` | 1536 | COSINE | regulation, article, source_url |
| `bp_schulordnungen` | 1536 | COSINE | bundesland, typ, datum |
### Chunking Strategy
```
┌─────────────────────────────────────────────────────────────┐
│                      Original document                      │
│ Lorem ipsum dolor sit amet, consectetur adipiscing elit... │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ 0-1000 chars │ │ 800-1800 chars │ │ 1600-2600 chars │
│ │ │ (200 overlap) │ │ (200 overlap) │
└──────────────────────┘ └──────────────────────┘ └──────────────────────┘
```
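The strategy above can be sketched as a small helper. With the documented defaults (1000 chars, 200 overlap), each chunk starts 800 characters after the previous one:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks (1000 chars with 200-char overlap by default)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # 800 by default: next chunk starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk reached the end of the document
    return chunks
```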
## API Authentication
All endpoints use the central auth middleware:
```mermaid
sequenceDiagram
participant C as Client
participant A as API Gateway
participant S as klausur-service
participant D as Database
C->>A: Request + JWT Token
A->>A: Validate token
A->>S: Forwarded Request
S->>D: Query data
D->>S: Response
S->>A: JSON Response
A->>C: JSON Response
```
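The actual middleware code is not shown here, but the token-validation step amounts to checking the JWT signature before forwarding the request. A minimal HS256 check using only the standard library (assuming a shared-secret setup, which the gateway may or may not actually use):

```python
import base64
import hashlib
import hmac
import json

def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip off."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def verify_jwt_hs256(token: str, secret: str) -> dict:
    """Validate an HS256-signed JWT and return its payload claims."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid token signature")
    return json.loads(_b64url_decode(payload_b64))
```

A production setup would additionally check the `exp` claim and the `alg` header; this sketch covers only the signature.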
## Monitoring & Metrics
### Available Metrics
| Metric | Description | Endpoint |
|--------|-------------|----------|
| `ocr_items_total` | Total number of OCR items | `/api/v1/ocr-label/stats` |
| `ocr_accuracy_rate` | OCR accuracy | `/api/v1/ocr-label/stats` |
| `rag_chunk_count` | Number of indexed chunks | `/api/legal-corpus/status` |
| `rag_collection_status` | Collection status | `/api/legal-corpus/status` |
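As an illustration of how `ocr_accuracy_rate` might be computed: the exact definition used by the stats endpoint is not documented here, so this sketch assumes "accuracy" means the share of labeled items whose OCR output was confirmed without correction:

```python
def ocr_accuracy_rate(items: list[dict]) -> float:
    """Fraction of labeled items confirmed as-is (skipped items are excluded)."""
    labeled = [i for i in items if i["status"] in ("confirmed", "corrected")]
    if not labeled:
        return 0.0  # avoid division by zero before any labeling has happened
    confirmed = sum(1 for i in labeled if i["status"] == "confirmed")
    return confirmed / len(labeled)
```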
### Logging
```python
# Structured logging in klausur-service
import logging

logger = logging.getLogger("klausur-service")

logger.info("OCR processing started", extra={
"session_id": session_id,
"item_count": item_count,
"model": ocr_model
})
```
## Error Handling
### Retry Strategies
| Operation | Max Retries | Backoff |
|-----------|-------------|---------|
| OCR processing | 3 | Exponential (1s, 2s, 4s) |
| Embedding API | 5 | Exponential with jitter |
| Qdrant upsert | 3 | Linear (1s) |
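The exponential schedules above can be sketched as follows. `backoff_delays` is a hypothetical helper, and "full jitter" (a uniform draw up to the exponential cap) is one common jitter variant that the table may or may not intend:

```python
import random

def backoff_delays(max_retries: int, base: float = 1.0, jitter: bool = False) -> list[float]:
    """Exponential backoff schedule: base * 2^attempt, optionally with full jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = base * (2 ** attempt)
        if jitter:
            delay = random.uniform(0, delay)  # full jitter spreads concurrent retries apart
        delays.append(delay)
    return delays
```

The linear Qdrant schedule is simply a constant 1s delay per attempt and needs no helper.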
### Fallback Behavior
```mermaid
flowchart TD
A[Embedding Request] --> B{OpenAI available?}
B -->|Yes| C[OpenAI API]
B -->|No| D{Local model?}
D -->|Yes| E[Ollama Embedding]
D -->|No| F[Error + Queue]
```
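The fallback chain can be sketched with the backends injected as callables. All names here are illustrative; the real service presumably calls the embedding-service and Ollama over HTTP:

```python
def embed_with_fallback(text, openai_embed, ollama_embed, retry_queue):
    """Try OpenAI first, fall back to the local Ollama model, else queue for retry."""
    try:
        return openai_embed(text)   # primary: OpenAI API
    except Exception:
        pass                        # fall through to the local model
    try:
        return ollama_embed(text)   # fallback: local Ollama embedding
    except Exception:
        retry_queue.append(text)    # neither backend reachable: queue for later
        raise RuntimeError("no embedding backend available; request queued")
```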
## Scaling
### Current State
- **Single node**: all services run on a Mac Mini
- **Qdrant**: standalone, ~50k chunks
- **PostgreSQL**: shared with other services
### Planned Extensions
1. **Qdrant cluster**: at > 1M chunks
2. **Worker queue**: Redis-based for batch jobs
3. **GPU offloading**: OCR on vast.ai GPU instances