Add CLAUDE.md, MkDocs docs, .claude/rules
- CLAUDE.md: comprehensive documentation for the Lehrer KI platform
- docs-src: Klausur, Voice, Agent-Core, KI-Pipeline docs
- mkdocs.yml: Lehrer-specific nav with blue theme
- docker-compose: added docs service (port 8010, profile: docs)
- .claude/rules: testing, docs, open-source, abiturkorrektur, vocab-worksheet, multi-agent, experimental-dashboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs-src/services/ki-daten-pipeline/architecture.md (new file, 353 lines)
# KI-Daten-Pipeline Architecture

This page documents the technical architecture of the KI-Daten-Pipeline in detail.

## System Overview

```mermaid
graph TB
    subgraph Users["Users"]
        U1[Developers]
        U2[Data Scientists]
        U3[Teachers]
    end

    subgraph Frontend["Frontend (admin-v2)"]
        direction TB
        F1["OCR-Labeling<br/>/ai/ocr-labeling"]
        F2["RAG Pipeline<br/>/ai/rag-pipeline"]
        F3["Daten & RAG<br/>/ai/rag"]
        F4["Klausur-Korrektur<br/>/ai/klausur-korrektur"]
    end

    subgraph Backend["Backend Services"]
        direction TB
        B1["klausur-service<br/>Port 8086"]
        B2["embedding-service<br/>Port 8087"]
    end

    subgraph Storage["Persistence"]
        direction TB
        D1[(PostgreSQL<br/>Metadata)]
        D2[(Qdrant<br/>Vectors)]
        D3[(MinIO<br/>Images/PDFs)]
    end

    subgraph External["External APIs"]
        E1[OpenAI API]
        E2[Ollama]
    end

    U1 --> F1
    U2 --> F2
    U3 --> F4

    F1 --> B1
    F2 --> B1
    F3 --> B1
    F4 --> B1

    B1 --> D1
    B1 --> D2
    B1 --> D3
    B1 --> B2

    B2 --> E1
    B1 --> E2
```

## Component Details

### OCR-Labeling Module

```mermaid
flowchart TB
    subgraph Upload["Upload Process"]
        U1[Upload images] --> U2[Store in MinIO]
        U2 --> U3[Create session]
    end

    subgraph OCR["OCR Processing"]
        O1[Load image] --> O2{Select model}
        O2 -->|llama3.2-vision| O3a[Vision LLM]
        O2 -->|trocr| O3b[Transformer]
        O2 -->|paddleocr| O3c[PaddleOCR]
        O2 -->|donut| O3d[Document AI]
        O3a --> O4[OCR text]
        O3b --> O4
        O3c --> O4
        O3d --> O4
    end

    subgraph Labeling["Labeling Process"]
        L1[Load queue] --> L2[Display item]
        L2 --> L3{Decision}
        L3 -->|correct| L4[Confirm]
        L3 -->|wrong| L5[Correct]
        L3 -->|unclear| L6[Skip]
        L4 --> L7[PostgreSQL]
        L5 --> L7
        L6 --> L7
    end

    subgraph Export["Export"]
        E1[Labeled items] --> E2{Format}
        E2 -->|TrOCR| E3a[Transformer format]
        E2 -->|Llama| E3b[Vision format]
        E2 -->|Generic| E3c[JSON]
    end

    Upload --> OCR
    OCR --> Labeling
    Labeling --> Export
```

### RAG Pipeline Module

```mermaid
flowchart TB
    subgraph Sources["Data Sources"]
        S1[NiBiS PDFs]
        S2[Uploads]
        S3[Legal corpus]
        S4[School regulations]
    end

    subgraph Processing["Processing"]
        direction TB
        P1[PDF parser] --> P2[OCR if needed]
        P2 --> P3[Text cleaning]
        P3 --> P4[Chunking<br/>1000 chars, 200 overlap]
        P4 --> P5[Metadata extraction]
    end

    subgraph Embedding["Embedding"]
        E1[embedding-service] --> E2[OpenAI API]
        E2 --> E3[1536-dim vector]
    end

    subgraph Indexing["Indexing"]
        I1{Select collection}
        I1 -->|EH| I2a[bp_nibis_eh]
        I1 -->|Custom| I2b[bp_eh]
        I1 -->|Legal| I2c[bp_legal_corpus]
        I1 -->|Schul| I2d[bp_schulordnungen]
        I2a --> I3[Qdrant upsert]
        I2b --> I3
        I2c --> I3
        I2d --> I3
    end

    Sources --> Processing
    Processing --> Embedding
    Embedding --> Indexing
```
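
The collection-selection branch in the diagram amounts to a plain lookup. The collection names below are the real Qdrant collections; `select_collection` itself is a hypothetical helper for illustration:

```python
# Branch labels from the indexing diagram mapped to Qdrant collections.
COLLECTIONS = {
    "EH": "bp_nibis_eh",
    "Custom": "bp_eh",
    "Legal": "bp_legal_corpus",
    "Schul": "bp_schulordnungen",
}

def select_collection(source: str) -> str:
    """Return the target collection for a data source, failing loudly
    rather than silently indexing into the wrong collection."""
    if source not in COLLECTIONS:
        raise ValueError(f"no collection mapped for source {source!r}")
    return COLLECTIONS[source]
```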

### Daten & RAG Module

```mermaid
flowchart TB
    subgraph Query["Search Query"]
        Q1[User query] --> Q2[Query embedding]
        Q2 --> Q3[1536-dim vector]
    end

    subgraph Search["Qdrant Search"]
        S1[Select collection] --> S2[Vector search]
        S2 --> S3[Top-k results]
        S3 --> S4[Score filtering]
    end

    subgraph Results["Results"]
        R1[Chunks] --> R2[Enrich metadata]
        R2 --> R3[Source URLs]
        R3 --> R4[Response]
    end

    Query --> Search
    Search --> Results
```
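
The search-then-filter steps can be sketched in plain Python. In production the scoring happens inside Qdrant, so `search` here only illustrates top-k selection plus score filtering on cosine similarity:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the distance metric used by all collections."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, points, k=3, min_score=0.0):
    """points: list of (payload, vector) tuples. Returns the top-k hits
    by cosine score, dropping anything below min_score (the 'score
    filtering' step in the diagram)."""
    scored = [(cosine(query_vec, vec), payload) for payload, vec in points]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [(s, p) for s, p in scored[:k] if s >= min_score]
```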

## Data Models

### OCR-Labeling

```typescript
interface OCRSession {
  id: string
  name: string
  source_type: 'klausur' | 'handwriting_sample' | 'scan'
  ocr_model: 'llama3.2-vision:11b' | 'trocr' | 'paddleocr' | 'donut'
  total_items: number
  labeled_items: number
  status: 'active' | 'completed' | 'archived'
  created_at: string
}

interface OCRItem {
  id: string
  session_id: string
  image_path: string
  ocr_text: string | null
  ocr_confidence: number | null
  ground_truth: string | null
  status: 'pending' | 'confirmed' | 'corrected' | 'skipped'
  label_time_seconds: number | null
}
```

### RAG Pipeline

```typescript
interface TrainingJob {
  id: string
  name: string
  status: 'queued' | 'preparing' | 'training' | 'validating' | 'completed' | 'failed' | 'paused'
  progress: number
  current_epoch: number
  total_epochs: number
  documents_processed: number
  total_documents: number
  config: {
    batch_size: number
    bundeslaender: string[]
    mixed_precision: boolean
  }
}

interface DataSource {
  id: string
  name: string
  collection: string
  document_count: number
  chunk_count: number
  status: 'active' | 'pending' | 'error'
  last_updated: string | null
}
```

### Legal Corpus

```typescript
interface RegulationStatus {
  code: string
  name: string
  fullName: string
  type: 'eu_regulation' | 'eu_directive' | 'de_law' | 'bsi_standard'
  chunkCount: number
  status: 'ready' | 'empty' | 'error'
}

interface SearchResult {
  text: string
  regulation_code: string
  regulation_name: string
  article: string | null
  paragraph: string | null
  source_url: string
  score: number
}
```

## Qdrant Collections

### Configuration

| Collection | Vector Dimension | Distance Metric | Payload |
|------------|------------------|-----------------|---------|
| `bp_nibis_eh` | 1536 | COSINE | bundesland, fach, aufgabe |
| `bp_eh` | 1536 | COSINE | user_id, klausur_id |
| `bp_legal_corpus` | 1536 | COSINE | regulation, article, source_url |
| `bp_schulordnungen` | 1536 | COSINE | bundesland, typ, datum |

### Chunking Strategy

```
┌─────────────────────────────────────────────────────────────┐
│                      Original document                      │
│ Lorem ipsum dolor sit amet, consectetur adipiscing elit...  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│       Chunk 1        │  │       Chunk 2        │  │       Chunk 3        │
│     0-1000 chars     │  │    800-1800 chars    │  │   1600-2600 chars    │
│                      │  │    (200 overlap)     │  │    (200 overlap)     │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘
```
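
The window arithmetic above (step = 1000 − 200 = 800) can be expressed directly. `chunk_text` is a hypothetical helper matching the diagram, not the pipeline's actual implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` chars, with `overlap` chars
    shared between neighbouring chunks (step = size - overlap)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            # Last window already covers the end of the text
            break
    return chunks
```

For a 2600-character document this yields exactly the three chunks shown in the diagram (0-1000, 800-1800, 1600-2600).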

## API Authentication

All endpoints use the central auth middleware:

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as klausur-service
    participant D as Database

    C->>A: Request + JWT token
    A->>A: Validate token
    A->>S: Forwarded request
    S->>D: Query data
    D->>S: Response
    S->>C: JSON response
```
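
The "validate token" step can be illustrated with a minimal HS256 sketch. This is an assumption about the mechanism, not the gateway's actual code; a real deployment would use a vetted JWT library and also validate claims such as `exp` and `aud`:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign(payload: dict, secret: bytes) -> str:
    """Build a minimal HS256 token: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    mac = hmac.new(secret, header + b"." + body, hashlib.sha256).digest()
    return (header + b"." + body + b"." + b64url(mac)).decode()

def verify(token: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    try:
        header, body, sig = token.encode().split(b".")
    except ValueError:
        return False
    mac = hmac.new(secret, header + b"." + body, hashlib.sha256).digest()
    return hmac.compare_digest(b64url(mac), sig)
```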

## Monitoring & Metrics

### Available Metrics

| Metric | Description | Endpoint |
|--------|-------------|----------|
| `ocr_items_total` | Total number of OCR items | `/api/v1/ocr-label/stats` |
| `ocr_accuracy_rate` | OCR accuracy | `/api/v1/ocr-label/stats` |
| `rag_chunk_count` | Number of indexed chunks | `/api/legal-corpus/status` |
| `rag_collection_status` | Collection status | `/api/legal-corpus/status` |
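
One plausible way to compute `ocr_accuracy_rate` from labeled items; this formula is an assumption for illustration, not necessarily the service's actual definition:

```python
def ocr_accuracy_rate(items: list[dict]) -> float:
    """Share of labeled items where the OCR output was confirmed as-is,
    i.e. confirmed / (confirmed + corrected). Pending and skipped items
    are excluded since they carry no verdict."""
    confirmed = sum(1 for i in items if i["status"] == "confirmed")
    corrected = sum(1 for i in items if i["status"] == "corrected")
    labeled = confirmed + corrected
    return confirmed / labeled if labeled else 0.0
```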

### Logging

```python
import logging

logger = logging.getLogger("klausur-service")

# Structured logging in klausur-service
logger.info("OCR processing started", extra={
    "session_id": session_id,
    "item_count": item_count,
    "model": ocr_model,
})
```

## Error Handling

### Retry Strategies

| Operation | Max Retries | Backoff |
|-----------|-------------|---------|
| OCR processing | 3 | Exponential (1s, 2s, 4s) |
| Embedding API | 5 | Exponential with jitter |
| Qdrant upsert | 3 | Linear (1s) |
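
The policies in the table can be covered by one generic loop; `retry` is a hypothetical helper sketching exponential backoff (1s, 2s, 4s, ...) with optional jitter:

```python
import random
import time

def retry(op, max_retries=3, base_delay=1.0, jitter=False):
    """Call op() up to max_retries+1 times with exponential backoff.

    With jitter=True, a random 0..base_delay offset is added to each
    delay, roughly matching the embedding-API row in the table.
    """
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, propagate the last error
            delay = base_delay * 2 ** attempt
            if jitter:
                delay += random.uniform(0, base_delay)
            time.sleep(delay)
```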

### Fallback Behavior

```mermaid
flowchart TD
    A[Embedding request] --> B{OpenAI available?}
    B -->|Yes| C[OpenAI API]
    B -->|No| D{Local model?}
    D -->|Yes| E[Ollama embedding]
    D -->|No| F[Error + queue]
```
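
The fallback chain can be sketched with injected backends (hypothetical signatures), which keeps the chain testable without either API being reachable:

```python
def embed_with_fallback(text, openai_embed=None, ollama_embed=None, queue=None):
    """Try OpenAI first, then the local Ollama model; if both are
    unavailable, enqueue the request and raise, mirroring the
    'Error + queue' branch above."""
    for backend in (openai_embed, ollama_embed):
        if backend is None:
            continue
        try:
            return backend(text)
        except Exception:
            continue  # backend down or erroring: fall through
    if queue is not None:
        queue.append(text)
    raise RuntimeError("no embedding backend available; request queued")
```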

## Scaling

### Current State

- **Single node**: all services run on a Mac Mini
- **Qdrant**: standalone, ~50k chunks
- **PostgreSQL**: shared with other services

### Planned Extensions

1. **Qdrant cluster**: at more than 1M chunks
2. **Worker queue**: Redis-based, for batch jobs
3. **GPU offloading**: OCR on vast.ai GPU instances