AI Data Pipeline Architecture

This page documents the technical architecture of the AI data pipeline in detail.

System Overview

graph TB
    subgraph Users["Users"]
        U1[Developers]
        U2[Data Scientists]
        U3[Teachers]
    end

    subgraph Frontend["Frontend (admin-v2)"]
        direction TB
        F1["OCR Labeling<br/>/ai/ocr-labeling"]
        F2["RAG Pipeline<br/>/ai/rag-pipeline"]
        F3["Data & RAG<br/>/ai/rag"]
        F4["Exam Correction<br/>/ai/klausur-korrektur"]
    end

    subgraph Backend["Backend Services"]
        direction TB
        B1["klausur-service<br/>Port 8086"]
        B2["embedding-service<br/>Port 8087"]
    end

    subgraph Storage["Persistence"]
        direction TB
        D1[(PostgreSQL<br/>Metadata)]
        D2[(Qdrant<br/>Vectors)]
        D3[(MinIO<br/>Images/PDFs)]
    end

    subgraph External["External APIs"]
        E1[OpenAI API]
        E2[Ollama]
    end

    U1 --> F1
    U2 --> F2
    U3 --> F4

    F1 --> B1
    F2 --> B1
    F3 --> B1
    F4 --> B1

    B1 --> D1
    B1 --> D2
    B1 --> D3
    B1 --> B2

    B2 --> E1
    B1 --> E2

Component Details

OCR Labeling Module

flowchart TB
    subgraph Upload["Upload Process"]
        U1[Upload images] --> U2[Store in MinIO]
        U2 --> U3[Create session]
    end

    subgraph OCR["OCR Processing"]
        O1[Load image] --> O2{Select model}
        O2 -->|llama3.2-vision| O3a[Vision LLM]
        O2 -->|trocr| O3b[Transformer]
        O2 -->|paddleocr| O3c[PaddleOCR]
        O2 -->|donut| O3d[Document AI]
        O3a --> O4[OCR text]
        O3b --> O4
        O3c --> O4
        O3d --> O4
    end

    subgraph Labeling["Labeling Process"]
        L1[Load queue] --> L2[Display item]
        L2 --> L3{Decision}
        L3 -->|correct| L4[Confirm]
        L3 -->|wrong| L5[Correct]
        L3 -->|unclear| L6[Skip]
        L4 --> L7[PostgreSQL]
        L5 --> L7
        L6 --> L7
    end

    subgraph Export["Export"]
        E1[Labeled items] --> E2{Format}
        E2 -->|TrOCR| E3a[Transformer format]
        E2 -->|Llama| E3b[Vision format]
        E2 -->|Generic| E3c[JSON]
    end

    Upload --> OCR
    OCR --> Labeling
    Labeling --> Export
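The export step above boils down to a plain transformation over labeled items. The following is a minimal sketch: the function name and the exact TrOCR/generic schemas are assumptions, not the service's actual export format; the field names follow the OCRItem model documented on this page.

```python
# Sketch of the export step. The "trocr" and "generic" schemas shown here
# are illustrative assumptions, not the service's real output format.

def export_items(items, fmt="generic"):
    """Export confirmed/corrected items; ground_truth wins over raw OCR text."""
    labeled = [i for i in items if i["status"] in ("confirmed", "corrected")]
    if fmt == "trocr":
        # TrOCR fine-tuning expects (image, text) pairs
        return [{"image": i["image_path"],
                 "text": i["ground_truth"] or i["ocr_text"]}
                for i in labeled]
    # generic JSON export keeps the raw labeling fields
    return [{"image_path": i["image_path"],
             "ocr_text": i["ocr_text"],
             "ground_truth": i["ground_truth"],
             "status": i["status"]}
            for i in labeled]
```

Skipped items are deliberately excluded in both formats, since they carry no usable ground truth.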

RAG Pipeline Module

flowchart TB
    subgraph Sources["Data Sources"]
        S1[NiBiS PDFs]
        S2[Uploads]
        S3[Legal corpus]
        S4[School regulations]
    end

    subgraph Processing["Processing"]
        direction TB
        P1[PDF parser] --> P2[OCR if needed]
        P2 --> P3[Text cleaning]
        P3 --> P4[Chunking<br/>1000 chars, 200 overlap]
        P4 --> P5[Metadata extraction]
    end

    subgraph Embedding["Embedding"]
        E1[embedding-service] --> E2[OpenAI API]
        E2 --> E3[1536-dim vector]
    end

    subgraph Indexing["Indexing"]
        I1{Select collection}
        I1 -->|EH| I2a[bp_nibis_eh]
        I1 -->|Custom| I2b[bp_eh]
        I1 -->|Legal| I2c[bp_legal_corpus]
        I1 -->|Schul| I2d[bp_schulordnungen]
        I2a --> I3[Qdrant upsert]
        I2b --> I3
        I2c --> I3
        I2d --> I3
    end

    Sources --> Processing
    Processing --> Embedding
    Embedding --> Indexing

Data & RAG Module

flowchart TB
    subgraph Query["Search Query"]
        Q1[User query] --> Q2[Query embedding]
        Q2 --> Q3[1536-dim vector]
    end

    subgraph Search["Qdrant Search"]
        S1[Select collection] --> S2[Vector search]
        S2 --> S3[Top-k results]
        S3 --> S4[Score filtering]
    end

    subgraph Results["Results"]
        R1[Chunks] --> R2[Enrich metadata]
        R2 --> R3[Source URLs]
        R3 --> R4[Response]
    end

    Query --> Search
    Search --> Results
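The top-k selection, score filtering, and metadata enrichment steps can be sketched in plain Python. Assumptions here: results arrive as dicts with `score` and `payload` keys (roughly the shape of a Qdrant search hit), and `filter_and_enrich` with its default thresholds is illustrative, not the service's real API.

```python
# Sketch of the post-search steps. The input shape (dicts with "score" and
# "payload") and the default thresholds are assumptions for illustration.

def filter_and_enrich(results, top_k=5, min_score=0.7):
    """Keep the top_k highest-scoring chunks above min_score and
    flatten payload metadata (e.g. source_url) into the response."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
    return [
        {"text": r["payload"]["text"],
         "source_url": r["payload"].get("source_url"),
         "score": r["score"]}
        for r in ranked if r["score"] >= min_score
    ]
```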

Data Models

OCR Labeling

interface OCRSession {
  id: string
  name: string
  source_type: 'klausur' | 'handwriting_sample' | 'scan'
  ocr_model: 'llama3.2-vision:11b' | 'trocr' | 'paddleocr' | 'donut'
  total_items: number
  labeled_items: number
  status: 'active' | 'completed' | 'archived'
  created_at: string
}

interface OCRItem {
  id: string
  session_id: string
  image_path: string
  ocr_text: string | null
  ocr_confidence: number | null
  ground_truth: string | null
  status: 'pending' | 'confirmed' | 'corrected' | 'skipped'
  label_time_seconds: number | null
}

RAG Pipeline

interface TrainingJob {
  id: string
  name: string
  status: 'queued' | 'preparing' | 'training' | 'validating' | 'completed' | 'failed' | 'paused'
  progress: number
  current_epoch: number
  total_epochs: number
  documents_processed: number
  total_documents: number
  config: {
    batch_size: number
    bundeslaender: string[]
    mixed_precision: boolean
  }
}

interface DataSource {
  id: string
  name: string
  collection: string
  document_count: number
  chunk_count: number
  status: 'active' | 'pending' | 'error'
  last_updated: string | null
}
interface RegulationStatus {
  code: string
  name: string
  fullName: string
  type: 'eu_regulation' | 'eu_directive' | 'de_law' | 'bsi_standard'
  chunkCount: number
  status: 'ready' | 'empty' | 'error'
}

interface SearchResult {
  text: string
  regulation_code: string
  regulation_name: string
  article: string | null
  paragraph: string | null
  source_url: string
  score: number
}

Qdrant Collections

Configuration

| Collection | Vector Dimension | Distance Metric | Payload |
|---|---|---|---|
| bp_nibis_eh | 1536 | COSINE | bundesland, fach, aufgabe |
| bp_eh | 1536 | COSINE | user_id, klausur_id |
| bp_legal_corpus | 1536 | COSINE | regulation, article, source_url |
| bp_schulordnungen | 1536 | COSINE | bundesland, typ, datum |

Chunking Strategy

┌─────────────────────────────────────────────────────────────┐
│                      Original document                       │
│  Lorem ipsum dolor sit amet, consectetur adipiscing elit...  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│      Chunk 1         │  │      Chunk 2         │  │      Chunk 3         │
│  0-1000 chars        │  │  800-1800 chars      │  │  1600-2600 chars     │
│                      │  │  (200 overlap)       │  │  (200 overlap)       │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘
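The windowing shown above (1000-char chunks, each starting 800 chars after the previous one, giving 200 chars of overlap) can be reproduced with a small helper. `chunk_text` is an illustrative sketch, not the pipeline's actual chunker, which may additionally respect sentence boundaries.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into windows of `size` chars, each starting
    `size - overlap` chars after the previous one."""
    step = size - overlap  # 800 with the defaults above
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            # the current window already covers the end of the text
            break
    return chunks
```

For a 2600-char document this yields exactly the three windows from the diagram: 0-1000, 800-1800, and 1600-2600.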

API Authentication

All endpoints use the central auth middleware:

sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as klausur-service
    participant D as Database

    C->>A: Request + JWT token
    A->>A: Validate token
    A->>S: Forward request
    S->>D: Query data
    D->>S: Response
    S->>A: Response
    A->>C: JSON response

Monitoring & Metrics

Available Metrics

| Metric | Description | Endpoint |
|---|---|---|
| ocr_items_total | Total number of OCR items | /api/v1/ocr-label/stats |
| ocr_accuracy_rate | OCR accuracy | /api/v1/ocr-label/stats |
| rag_chunk_count | Number of indexed chunks | /api/legal-corpus/status |
| rag_collection_status | Collection status | /api/legal-corpus/status |
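As a rough sketch of how the OCR counters could be aggregated from labeled items: the accuracy definition used here (items confirmed as-is divided by all items with a final label) is an assumption, not necessarily what /api/v1/ocr-label/stats actually computes.

```python
def ocr_stats(items):
    """Aggregate OCR labeling counters from item dicts.
    Accuracy definition is an assumption: confirmed / (confirmed + corrected)."""
    confirmed = sum(1 for i in items if i["status"] == "confirmed")
    corrected = sum(1 for i in items if i["status"] == "corrected")
    labeled = confirmed + corrected
    return {
        "ocr_items_total": len(items),
        "ocr_accuracy_rate": confirmed / labeled if labeled else None,
    }
```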

Logging

# Structured logging in the klausur-service
logger.info("OCR processing started", extra={
    "session_id": session_id,
    "item_count": item_count,
    "model": ocr_model
})

Error Handling

Retry Strategies

| Operation | Max Retries | Backoff |
|---|---|---|
| OCR processing | 3 | Exponential (1s, 2s, 4s) |
| Embedding API | 5 | Exponential with jitter |
| Qdrant upsert | 3 | Linear (1s) |
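The exponential-with-jitter policy from the table can be sketched generically; `retry` is an illustrative helper, not the service's actual implementation.

```python
import random
import time

def retry(fn, max_retries=5, base_delay=1.0, jitter=True):
    """Call fn, retrying up to max_retries times in total, with exponential
    backoff (base_delay, 2x, 4x, ...) plus optional random jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # budget exhausted: surface the last error
            delay = base_delay * 2 ** attempt
            if jitter:
                # full jitter spreads retries out to avoid thundering herds
                delay += random.uniform(0, delay)
            time.sleep(delay)
```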

Fallback Behavior

flowchart TD
    A[Embedding request] --> B{OpenAI available?}
    B -->|Yes| C[OpenAI API]
    B -->|No| D{Local model?}
    D -->|Yes| E[Ollama embedding]
    D -->|No| F[Error + queue]
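The fallback chain above can be sketched with injected callables; `embed_with_fallback`, `embed_openai`, `embed_ollama`, and `enqueue` are hypothetical names for illustration, not the embedding-service's real functions.

```python
# Sketch of the fallback chain: OpenAI -> local Ollama model -> queue + error.
# All function names here are illustrative assumptions.

def embed_with_fallback(text, embed_openai, embed_ollama=None, enqueue=None):
    """Try OpenAI first; on failure fall back to a local Ollama model;
    if neither works, queue the request for later and raise."""
    try:
        return embed_openai(text)
    except Exception:
        if embed_ollama is not None:
            try:
                return embed_ollama(text)
            except Exception:
                pass  # local model also failed; fall through to the queue
        if enqueue is not None:
            enqueue(text)
        raise RuntimeError("embedding failed, request queued")
```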

Scaling

Current State

  • Single node: all services run on one Mac Mini
  • Qdrant: standalone, ~50k chunks
  • PostgreSQL: shared with other services

Planned Extensions

  1. Qdrant cluster: at > 1M chunks
  2. Worker queue: Redis-based, for batch jobs
  3. GPU offloading: OCR on vast.ai GPU instances