Initial commit: breakpilot-lehrer - Lehrer KI Platform
Services: Admin-Lehrer, Backend-Lehrer, Studio v2, Website, Klausur-Service, School-Service, Voice-Service, Geo-Service, BreakPilot Drive, Agent-Core

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
klausur-service/docs/Vocab-Worksheet-Architecture.md

# Vocabulary Worksheet Generator - Architecture

**Version:** 1.0.0
**Date:** 2026-01-23
**Status:** Production

---

## 1. Overview

The Vocabulary Worksheet Generator is a GDPR-compliant (DSGVO) tool for teachers. It extracts vocabulary from textbook pages and generates print-ready worksheets.

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           Studio v2 (Next.js)                           │
│                                Port 3001                                │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ /vocab-worksheet                                                │   │
│   │ - Session management (create, resume, delete)                   │   │
│   │ - PDF upload with page selection                                │   │
│   │ - Vocabulary editing (grid editor)                              │   │
│   │ - Worksheet configuration                                       │   │
│   │ - PDF export                                                    │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼ HTTP/REST
┌─────────────────────────────────────────────────────────────────────────┐
│                        Klausur-Service (FastAPI)                        │
│                                Port 8086                                │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ /api/v1/vocab/*                                                 │   │
│   │ - Session CRUD                                                  │   │
│   │ - PDF processing (PyMuPDF)                                      │   │
│   │ - Vocabulary extraction (Vision LLM / Hybrid OCR)               │   │
│   │ - Worksheet generation (WeasyPrint)                             │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                     ┌───────────────┴───────────────┐
                     ▼                               ▼
┌───────────────────────────────┐     ┌───────────────────────────────────┐
│       Ollama Vision LLM       │     │            LLM Gateway            │
│          Port 11434           │     │             Port 8002             │
│  ┌─────────────────────────┐  │     │  ┌─────────────────────────────┐  │
│  │ qwen2.5vl:32b           │  │     │  │ qwen2.5:14b                 │  │
│  │ (image → vocabulary)    │  │     │  │ (OCR text → structured)     │  │
│  └─────────────────────────┘  │     │  └─────────────────────────────┘  │
└───────────────────────────────┘     └───────────────────────────────────┘
```

---

## 2. Components

### 2.1 Frontend (studio-v2)

**File:** `/studio-v2/app/vocab-worksheet/page.tsx`

| Aspect | Details |
|--------|---------|
| Framework | Next.js 16.1.4 with React 19.0.0 |
| Styling | Tailwind CSS 3.4.17 |
| Language | TypeScript 5.7.0 |
| State | React Hooks (useState, useRef, useEffect) |

**Tab-based workflow:**

1. **Upload** - Name the session and select a file (image/PDF)
2. **Pages** - For PDFs: select pages using thumbnails
3. **Vocabulary** - Review and edit the extracted vocabulary
4. **Worksheet** - Choose the worksheet type and format
5. **Export** - Download the PDF

**Data structures:**

```typescript
interface VocabularyEntry {
  id: string
  english: string
  german: string
  example_sentence?: string
  word_type?: string
  source_page?: number
}

interface Session {
  id: string
  name: string
  status: 'pending' | 'processing' | 'extracted' | 'completed'
  vocabulary_count: number
}

type WorksheetType = 'en_to_de' | 'de_to_en' | 'copy' | 'gap_fill'
```

### 2.2 Backend API

**File:** `/klausur-service/backend/vocab_worksheet_api.py`

| Aspect | Details |
|--------|---------|
| Framework | FastAPI (async) |
| Router prefix | `/api/v1/vocab` |
| Storage | In-memory (dict) + filesystem |

**Endpoints:**

| Method | Path | Description |
|--------|------|-------------|
| POST | `/sessions` | Create a session |
| GET | `/sessions` | List sessions |
| GET | `/sessions/{id}` | Get session details |
| DELETE | `/sessions/{id}` | Delete a session |
| POST | `/sessions/{id}/upload` | Upload an image/PDF |
| POST | `/sessions/{id}/upload-pdf-info` | Retrieve PDF info |
| GET | `/sessions/{id}/pdf-thumbnail/{page}` | Get a page thumbnail |
| POST | `/sessions/{id}/process-single-page/{page}` | Process a single page |
| GET | `/sessions/{id}/vocabulary` | Retrieve the vocabulary |
| PUT | `/sessions/{id}/vocabulary` | Update the vocabulary |
| POST | `/sessions/{id}/generate` | Generate a worksheet |
| GET | `/worksheets/{id}/pdf` | Get the worksheet PDF |
| GET | `/worksheets/{id}/solution` | Get the solution PDF |

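
To illustrate how the paths in the table combine with the `/api/v1/vocab` router prefix, here is a minimal sketch that builds (but does not send) requests with the Python standard library. The base URL is the API address listed under Deployment. The JSON field `name`, the session id `abc123`, and the helper `build_request` are assumptions for illustration, since the request schemas are not documented here.

```python
import json
import urllib.request
from typing import Optional

# Base URL as listed in the deployment notes; adjust for your environment.
API_BASE = "http://macmini:8086/api/v1/vocab"


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for a path below the router prefix."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(API_BASE + path, data=data, headers=headers, method=method)


# Hypothetical payload: the real session schema may use different fields.
create = build_request("POST", "/sessions", {"name": "Unit 3 vocabulary"})

# "abc123" stands in for a session id returned by the create call.
# Whether {page} is 0- or 1-based is not specified in this document.
process = build_request("POST", "/sessions/abc123/process-single-page/0")
fetch = build_request("GET", "/sessions/abc123/vocabulary")

print(create.get_method(), create.full_url)
```

A prepared request can then be sent with `urllib.request.urlopen(create, timeout=30)`. The upload endpoint is left out because it most likely expects multipart form data, which the standard library does not build directly.
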
### 2.3 Vocabulary Extraction

**Two modes are available:**

#### A. Vision LLM (default)

```python
OLLAMA_URL = "http://host.docker.internal:11434"
VISION_MODEL = "qwen2.5vl:32b"
```

- The image is Base64-encoded and sent to Ollama
- The prompt is written in German for better recognition
- Timeout: 5 minutes per page
- Confidence: approx. 85%

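
The following sketch shows one way the Base64 step and the five-minute timeout could look against Ollama's `/api/generate` endpoint. It uses `urllib` to stay self-contained, whereas the service lists httpx as its HTTP client. The prompt text and the function names are assumptions; the production prompt is not reproduced in this document.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://host.docker.internal:11434"
VISION_MODEL = "qwen2.5vl:32b"
TIMEOUT_SECONDS = 5 * 60  # five minutes per page

# Hypothetical German prompt, used here only as a placeholder.
PROMPT = "Extrahiere alle Vokabeln (Englisch/Deutsch) aus dem Bild als JSON-Liste."


def build_vision_payload(image_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Create the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": VISION_MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # request one JSON document instead of a token stream
    }


def extract_page(image_bytes: bytes) -> str:
    """Send one page image to Ollama and return the raw model output."""
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_vision_payload(image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        return json.loads(response.read())["response"]
```

`extract_page` returns the model's text output, which still has to be parsed into vocabulary entries. That parsing step depends on the prompt and is not shown.
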
#### B. Hybrid OCR + LLM (optional)

**File:** `/klausur-service/backend/hybrid_vocab_extractor.py`

```
Image → PaddleOCR → text regions → LLM Gateway → structured JSON
```

- PaddleOCR 3.x for text recognition
- Automatic column detection (2 or 3 columns)
- qwen2.5:14b for structuring
- ~4x faster than the Vision LLM

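
How `hybrid_vocab_extractor.py` detects columns is not described here. As a rough illustration of the idea, the sketch below groups OCR text regions by their left edge and starts a new column whenever the horizontal gap is large. The `TextRegion` type, the `min_gap` threshold, and the sample coordinates are all assumptions, and the sketch does not limit itself to 2 or 3 columns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRegion:
    """Hypothetical OCR result: left edge, top edge (pixels) and recognised text."""
    x: float
    y: float
    text: str


def split_into_columns(regions: List[TextRegion], min_gap: float = 80.0) -> List[List[TextRegion]]:
    """Group regions into columns wherever the left edges jump by more than min_gap."""
    columns: List[List[TextRegion]] = []
    previous_x = None
    for region in sorted(regions, key=lambda r: r.x):
        if previous_x is None or region.x - previous_x > min_gap:
            columns.append([])  # a large horizontal jump starts a new column
        columns[-1].append(region)
        previous_x = region.x
    # Restore reading order (top to bottom) inside every column.
    return [sorted(column, key=lambda r: r.y) for column in columns]


regions = [
    TextRegion(52, 100, "house"), TextRegion(310, 101, "Haus"),
    TextRegion(50, 140, "tree"), TextRegion(312, 139, "Baum"),
]
columns = split_into_columns(regions)
print([[r.text for r in column] for column in columns])
# prints [['house', 'tree'], ['Haus', 'Baum']]
```

Pairing entries across columns by row position would be the next step before handing the text to the LLM Gateway for structuring.
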
### 2.4 PDF Processing

| Task | Library |
|------|---------|
| PDF → PNG | PyMuPDF (fitz) |
| Thumbnails | PyMuPDF with zoom 0.5 |
| OCR images | PyMuPDF with zoom 2.0 |
| PDF generation | WeasyPrint |

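
The zoom factors translate directly into resolution: PDF pages are measured at 72 units per inch, so zoom 0.5 renders at 36 DPI and zoom 2.0 at 144 DPI. The sketch below shows this together with a rendering helper based on PyMuPDF's `Matrix` and `get_pixmap`. The function and constant names are assumptions, not taken from the service code.

```python
def zoom_to_dpi(zoom: float) -> float:
    """PDF user space is 72 units per inch, so a zoom factor scales 72 DPI."""
    return 72.0 * zoom


def render_page_png(pdf_path: str, page_index: int, zoom: float) -> bytes:
    """Render one page to PNG bytes with PyMuPDF (requires `pip install pymupdf`)."""
    import fitz  # imported lazily so zoom_to_dpi works without PyMuPDF installed

    with fitz.open(pdf_path) as document:
        page = document.load_page(page_index)  # 0-based page index
        pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pixmap.tobytes("png")


THUMBNAIL_ZOOM = 0.5  # small preview for page selection
OCR_ZOOM = 2.0        # higher resolution for text recognition

print(zoom_to_dpi(THUMBNAIL_ZOOM), zoom_to_dpi(OCR_ZOOM))
# prints 36.0 144.0
```
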
---

## 3. Data Flow

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Upload  │───►│   OCR/   │───►│   Edit   │───►│  Export  │
│   PDF    │    │ Extract  │    │  Vocab   │    │   PDF    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │
     ▼               ▼               ▼               ▼
  /upload        /process-      /vocabulary      /generate
                 single-page
```

**Session status workflow:**

```
PENDING → PROCESSING → EXTRACTED → COMPLETED
   │          │            │           │
Upload    Extraction   Ready for   Worksheet
 done      running      editing    generated
```

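
The status workflow can be modelled as a small forward-only state machine. This is a sketch under the assumption that sessions only move left to right; the backend may well allow other transitions, such as re-processing or error states, that are not documented here. The names `SessionStatus`, `NEXT_STATUS` and `advance` are illustrative.

```python
from enum import Enum


class SessionStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    EXTRACTED = "extracted"
    COMPLETED = "completed"


# Forward-only transitions (assumed): each status has exactly one successor.
NEXT_STATUS = {
    SessionStatus.PENDING: SessionStatus.PROCESSING,
    SessionStatus.PROCESSING: SessionStatus.EXTRACTED,
    SessionStatus.EXTRACTED: SessionStatus.COMPLETED,
}


def advance(current: SessionStatus) -> SessionStatus:
    """Return the following status, or raise if the session is already completed."""
    if current not in NEXT_STATUS:
        raise ValueError(f"{current.value} is a terminal status")
    return NEXT_STATUS[current]


status = SessionStatus("pending")  # values match the frontend's status strings
status = advance(status)
print(status.value)
# prints processing
```
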
---

## 4. Worksheet Types

| Type | Description |
|------|-------------|
| `en_to_de` | Translate English → German |
| `de_to_en` | Translate German → English |
| `copy` | Copy each word out several times |
| `gap_fill` | Gap-fill text built from the example sentences |

**Options:**

- Line height: normal / large / extra-large
- Solutions: yes / no
- Repetitions (for Copy): 1-5

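
The types and options above could be validated on the backend roughly as follows. This is a sketch, not the service's actual request model: the class `WorksheetConfig`, its field names, and the default values are assumptions.

```python
from dataclasses import dataclass

WORKSHEET_TYPES = {"en_to_de", "de_to_en", "copy", "gap_fill"}
LINE_HEIGHTS = {"normal", "large", "extra-large"}


@dataclass
class WorksheetConfig:
    """Hypothetical request model; field names and defaults are illustrative."""
    worksheet_type: str
    line_height: str = "normal"
    include_solutions: bool = True
    repetitions: int = 3  # only meaningful for the "copy" type

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented options."""
        if self.worksheet_type not in WORKSHEET_TYPES:
            raise ValueError(f"unknown worksheet type: {self.worksheet_type}")
        if self.line_height not in LINE_HEIGHTS:
            raise ValueError(f"unknown line height: {self.line_height}")
        if self.worksheet_type == "copy" and not 1 <= self.repetitions <= 5:
            raise ValueError("repetitions must be between 1 and 5")


config = WorksheetConfig("copy", line_height="large", repetitions=5)
config.validate()  # passes silently: all values are within the documented ranges
```
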
---

## 5. Data Protection (GDPR/DSGVO)

| Aspect | Implementation |
|--------|----------------|
| Processing | 100% local (Mac Mini) |
| External APIs | None |
| LLM | Ollama (local) |
| Storage | Local filesystem |
| Data transfer | Within the LAN only |

**No data is sent to external servers.**

---

## 6. Configuration

**Environment variables:**

```bash
# Ollama Vision LLM
OLLAMA_URL=http://host.docker.internal:11434
OLLAMA_VISION_MODEL=qwen2.5vl:32b

# LLM Gateway (Hybrid Mode)
LLM_GATEWAY_URL=http://host.docker.internal:8002
LLM_MODEL=qwen2.5:14b

# Storage
VOCAB_STORAGE_PATH=/app/vocab-worksheets
```

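
On the Python side, these variables could be read into a single settings object, as sketched below. That the service falls back to the listed values when a variable is unset is an assumption, as are the names `VocabSettings` and `load_settings`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VocabSettings:
    ollama_url: str
    vision_model: str
    llm_gateway_url: str
    llm_model: str
    storage_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> VocabSettings:
    """Read the documented variables; the fallback defaults are an assumption."""
    return VocabSettings(
        ollama_url=env.get("OLLAMA_URL", "http://host.docker.internal:11434"),
        vision_model=env.get("OLLAMA_VISION_MODEL", "qwen2.5vl:32b"),
        llm_gateway_url=env.get("LLM_GATEWAY_URL", "http://host.docker.internal:8002"),
        llm_model=env.get("LLM_MODEL", "qwen2.5:14b"),
        storage_path=env.get("VOCAB_STORAGE_PATH", "/app/vocab-worksheets"),
    )


# Passing a plain dict makes the loader easy to exercise without touching os.environ.
settings = load_settings({"LLM_MODEL": "qwen2.5:7b"})
print(settings.llm_model, settings.storage_path)
# prints qwen2.5:7b /app/vocab-worksheets
```
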
---

## 7. Dependencies

### Backend (Python)

| Package | Version | Purpose |
|---------|---------|---------|
| FastAPI | 0.123.9 | Web framework |
| PyMuPDF | 1.25.4 | PDF processing |
| WeasyPrint | 66.0 | PDF generation |
| Pillow | 11.3.0 | Image processing |
| httpx | 0.28.1 | Async HTTP client |
| PaddleOCR | 3.x | OCR (optional) |

### Frontend (Node.js)

| Package | Version | Purpose |
|---------|---------|---------|
| Next.js | 16.1.4 | Framework |
| React | 19.0.0 | UI library |
| Tailwind CSS | 3.4.17 | Styling |
| TypeScript | 5.7.0 | Type safety |

---

## 8. Deployment

**Docker containers:**

- `klausur-service` (Port 8086) - Backend API
- `studio-v2` (Port 3001) - Frontend

**URLs:**

- Frontend: `http://macmini:3001/vocab-worksheet`
- API: `http://macmini:8086/api/v1/vocab/`

---

## 9. Possible Extensions

| Feature | Status |
|---------|--------|
| Additional languages (FR, ES) | Planned |
| Database persistence | Planned |
| Batch processing | Planned |
| Dictionary integration | Idea |
| Audio pronunciation exercises | Idea |

---

## 10. Related Documentation

- [BYOEH-Architecture.md](./BYOEH-Architecture.md)
- [OCR-Labeling-Spec.md](./OCR-Labeling-Spec.md)
- [DSGVO-Audit-OCR-Labeling.md](./DSGVO-Audit-OCR-Labeling.md)