# AI Data Pipeline Architecture

This page documents the technical architecture of the AI data pipeline in detail.

## System Overview

```mermaid
graph TB
    subgraph Users["Users"]
        U1[Developers]
        U2[Data Scientists]
        U3[Teachers]
    end

    subgraph Frontend["Frontend (admin-v2)"]
        direction TB
        F1["OCR Labeling<br/>/ai/ocr-labeling"]
        F2["RAG Pipeline<br/>/ai/rag-pipeline"]
        F3["Data & RAG<br/>/ai/rag"]
        F4["Exam Correction<br/>/ai/klausur-korrektur"]
    end

    subgraph Backend["Backend Services"]
        direction TB
        B1["klausur-service<br/>Port 8086"]
        B2["embedding-service<br/>Port 8087"]
    end

    subgraph Storage["Persistence"]
        direction TB
        D1[("PostgreSQL<br/>Metadata")]
        D2[("Qdrant<br/>Vectors")]
        D3[("MinIO<br/>Images/PDFs")]
    end

    subgraph External["External APIs"]
        E1[OpenAI API]
        E2[Ollama]
    end

    U1 --> F1
    U2 --> F2
    U3 --> F4
    F1 --> B1
    F2 --> B1
    F3 --> B1
    F4 --> B1
    B1 --> D1
    B1 --> D2
    B1 --> D3
    B1 --> B2
    B2 --> E1
    B1 --> E2
```

## Component Details

### OCR Labeling Module

```mermaid
flowchart TB
    subgraph Upload["Upload Process"]
        U1[Upload images] --> U2[Store in MinIO]
        U2 --> U3[Create session]
    end

    subgraph OCR["OCR Processing"]
        O1[Load image] --> O2{Select model}
        O2 -->|llama3.2-vision| O3a[Vision LLM]
        O2 -->|trocr| O3b[Transformer]
        O2 -->|paddleocr| O3c[PaddleOCR]
        O2 -->|donut| O3d[Document AI]
        O3a --> O4[OCR text]
        O3b --> O4
        O3c --> O4
        O3d --> O4
    end

    subgraph Labeling["Labeling Process"]
        L1[Load queue] --> L2[Show item]
        L2 --> L3{Decision}
        L3 -->|correct| L4[Confirm]
        L3 -->|wrong| L5[Correct]
        L3 -->|unclear| L6[Skip]
        L4 --> L7[PostgreSQL]
        L5 --> L7
        L6 --> L7
    end

    subgraph Export["Export"]
        E1[Labeled items] --> E2{Format}
        E2 -->|TrOCR| E3a[Transformer format]
        E2 -->|Llama| E3b[Vision format]
        E2 -->|Generic| E3c[JSON]
    end

    Upload --> OCR
    OCR --> Labeling
    Labeling --> Export
```

### RAG Pipeline Module

```mermaid
flowchart TB
    subgraph Sources["Data Sources"]
        S1[NiBiS PDFs]
        S2[Uploads]
        S3[Legal corpus]
        S4[School regulations]
    end

    subgraph Processing["Processing"]
        direction TB
        P1[PDF parser] --> P2[OCR if needed]
        P2 --> P3[Text cleaning]
        P3 --> P4["Chunking<br/>1000 chars, 200 overlap"]
        P4 --> P5[Metadata extraction]
    end

    subgraph Embedding["Embedding"]
        E1[embedding-service] --> E2[OpenAI API]
        E2 --> E3[1536-dim vector]
    end

    subgraph Indexing["Indexing"]
        I1{Select collection}
        I1 -->|EH| I2a[bp_nibis_eh]
        I1 -->|Custom| I2b[bp_eh]
        I1 -->|Legal| I2c[bp_legal_corpus]
        I1 -->|Schul| I2d[bp_schulordnungen]
        I2a --> I3[Qdrant upsert]
        I2b --> I3
        I2c --> I3
        I2d --> I3
    end

    Sources --> Processing
    Processing --> Embedding
    Embedding --> Indexing
```

### Data & RAG Module

```mermaid
flowchart TB
    subgraph Query["Search Query"]
        Q1[User query] --> Q2[Query embedding]
        Q2 --> Q3[1536-dim vector]
    end

    subgraph Search["Qdrant Search"]
        S1[Select collection] --> S2[Vector search]
        S2 --> S3[Top-k results]
        S3 --> S4[Score filtering]
    end

    subgraph Results["Results"]
        R1[Chunks] --> R2[Enrich with metadata]
        R2 --> R3[Source URLs]
        R3 --> R4[Response]
    end

    Query --> Search
    Search --> Results
```

## Data Models

### OCR Labeling

```typescript
interface OCRSession {
  id: string
  name: string
  source_type: 'klausur' | 'handwriting_sample' | 'scan'
  ocr_model: 'llama3.2-vision:11b' | 'trocr' | 'paddleocr' | 'donut'
  total_items: number
  labeled_items: number
  status: 'active' | 'completed' | 'archived'
  created_at: string
}

interface OCRItem {
  id: string
  session_id: string
  image_path: string
  ocr_text: string | null
  ocr_confidence: number | null
  ground_truth: string | null
  status: 'pending' | 'confirmed' | 'corrected' | 'skipped'
  label_time_seconds: number | null
}
```

### RAG Pipeline

```typescript
interface TrainingJob {
  id: string
  name: string
  status: 'queued' | 'preparing' | 'training' | 'validating' | 'completed' | 'failed' | 'paused'
  progress: number
  current_epoch: number
  total_epochs: number
  documents_processed: number
  total_documents: number
  config: {
    batch_size: number
    bundeslaender: string[]
    mixed_precision: boolean
  }
}

interface DataSource {
  id: string
  name: string
  collection: string
  document_count: number
  chunk_count: number
  status: 'active' | 'pending' |
    'error'
  last_updated: string | null
}
```

### Legal Corpus

```typescript
interface RegulationStatus {
  code: string
  name: string
  fullName: string
  type: 'eu_regulation' | 'eu_directive' | 'de_law' | 'bsi_standard'
  chunkCount: number
  status: 'ready' | 'empty' | 'error'
}

interface SearchResult {
  text: string
  regulation_code: string
  regulation_name: string
  article: string | null
  paragraph: string | null
  source_url: string
  score: number
}
```

## Qdrant Collections

### Configuration

| Collection | Vector Dimension | Distance Metric | Payload |
|------------|------------------|-----------------|---------|
| `bp_nibis_eh` | 1536 | COSINE | bundesland, fach, aufgabe |
| `bp_eh` | 1536 | COSINE | user_id, klausur_id |
| `bp_legal_corpus` | 1536 | COSINE | regulation, article, source_url |
| `bp_schulordnungen` | 1536 | COSINE | bundesland, typ, datum |

### Chunking Strategy

```
┌─────────────────────────────────────────────────────────────┐
│ Original document                                           │
│ Lorem ipsum dolor sit amet, consectetur adipiscing elit...  │
└─────────────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│ Chunk 1              │  │ Chunk 2              │  │ Chunk 3              │
│ 0-1000 chars         │  │ 800-1800 chars       │  │ 1600-2600 chars      │
│                      │  │ (200 overlap)        │  │ (200 overlap)        │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘
```

## API Authentication

All endpoints use the central auth middleware:

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as klausur-service
    participant D as Database

    C->>A: Request + JWT token
    A->>A: Validate token
    A->>S: Forwarded request
    S->>D: Query data
    D->>S: Response
    S->>C: JSON response
```

## Monitoring & Metrics

### Available Metrics

| Metric | Description | Endpoint |
|--------|-------------|----------|
| `ocr_items_total` | Total number of OCR items | `/api/v1/ocr-label/stats` |
| `ocr_accuracy_rate` | OCR accuracy | `/api/v1/ocr-label/stats` |
| `rag_chunk_count` | Number of indexed chunks | `/api/legal-corpus/status` |
| `rag_collection_status` | Collection status | `/api/legal-corpus/status` |

### Logging

```python
# Structured logging in klausur-service
logger.info("OCR processing started", extra={
    "session_id": session_id,
    "item_count": item_count,
    "model": ocr_model
})
```

## Error Handling

### Retry Strategies

| Operation | Max Retries | Backoff |
|-----------|-------------|---------|
| OCR processing | 3 | Exponential (1s, 2s, 4s) |
| Embedding API | 5 | Exponential with jitter |
| Qdrant upsert | 3 | Linear (1s) |

### Fallback Behavior

```mermaid
flowchart TD
    A[Embedding request] --> B{OpenAI available?}
    B -->|Yes| C[OpenAI API]
    B -->|No| D{Local model?}
    D -->|Yes| E[Ollama embedding]
    D -->|No| F[Error + queue]
```

## Scaling

### Current State

- **Single node**: all services on a Mac mini
- **Qdrant**: standalone, ~50k chunks
- **PostgreSQL**: shared with other services

### Planned Extensions

1. **Qdrant cluster**: at > 1M chunks
2. **Worker queue**: Redis-based for batch jobs
3. **GPU offloading**: OCR on vast.ai GPU instances
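## Appendix: Reference Sketches

The chunking strategy described above (1000-character windows with 200 characters of overlap, i.e. an 800-character step) can be sketched as a sliding window. This is an illustrative snippet, not the klausur-service's actual implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size chunks (sliding window)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap  # the window advances by 800 chars per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window already reached the end
            break
    return chunks
```

With a 2600-character document this produces exactly the three chunks shown in the diagram: 0-1000, 800-1800, and 1600-2600.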
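The retry strategies in the table above follow the standard exponential-backoff pattern; for the embedding API, jitter is added to avoid synchronized retry storms. A minimal sketch (function and parameter names are illustrative, not taken from the codebase):

```python
import random
import time

def retry_with_backoff(call, max_retries=5, base_delay=1.0, jitter=True):
    """Retry `call` with exponentially growing delays (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted, surface the error
            delay = base_delay * 2 ** attempt
            if jitter:
                delay += random.uniform(0, delay)  # de-synchronize concurrent clients
            time.sleep(delay)
```

The OCR path from the table corresponds to `max_retries=3, jitter=False`; the embedding path to `max_retries=5, jitter=True`.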
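The search flow in the Data & RAG module (vector search over COSINE collections, top-k, score filtering) reduces to ranking chunks by cosine similarity against the 1536-dim query embedding. In production Qdrant performs this server-side; the plain-Python sketch below only illustrates the semantics, with a hypothetical `min_score` threshold:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_vec, index, top_k=3, min_score=0.5):
    """index: list of (chunk_id, vector) pairs. Return top-k hits above min_score."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(chunk_id, score) for chunk_id, score in scored[:top_k] if score >= min_score]
```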