# NiBiS Ingestion Pipeline

## Overview

The NiBiS Ingestion Pipeline processes Abitur marking schemes (Erwartungshorizonte) from Lower Saxony (Niedersachsen) and indexes them in Qdrant for RAG-based exam grading.

## Supported Data

### Directories

| Directory | Years | Naming convention |
|-----------|-------|-------------------|
| `docs/za-download` | 2024, 2025 | `{Jahr}_{Fach}_{niveau}_{Nr}_EWH.pdf` |
| `docs/za-download-2` | 2016 | `{Jahr}{Fach}{Niveau}Lehrer/{Jahr}{Fach}{Niveau}A{Nr}L.pdf` |
| `docs/za-download-3` | 2017 | `{Jahr}{Fach}{Niveau}Lehrer/{Jahr}{Fach}{Niveau}A{Nr}L.pdf` |

The placeholders are kept in their original German form: `Jahr` = year, `Fach` = subject, `Niveau`/`niveau` = level, `Nr` = task number.
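
Both naming conventions can be matched with regular expressions. The following is a minimal sketch, not the parser used in `nibis_ingestion.py`: the function name `parse_nibis_filename`, the returned keys, and the assumption that the old scheme spells the level as `EA` or `GA` are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Optional

# New scheme (2024/2025): {Jahr}_{Fach}_{niveau}_{Nr}_EWH.pdf
NEW_PATTERN = re.compile(
    r"^(?P<year>\d{4})_(?P<subject>.+?)_(?P<niveau>[A-Za-z]+)_(?P<nr>\d+)_EWH\.pdf$"
)

# Old scheme (2016/2017): {Jahr}{Fach}{Niveau}A{Nr}L.pdf inside a "...Lehrer" folder.
# Assumption: the level is written as "EA" or "GA" in these file names.
OLD_PATTERN = re.compile(
    r"^(?P<year>\d{4})(?P<subject>.+?)(?P<niveau>EA|GA)A(?P<nr>\d+)L\.pdf$"
)


def parse_nibis_filename(path: str) -> Optional[Dict[str, object]]:
    """Return year, subject, niveau and task number, or None if nothing matches."""
    name = PurePosixPath(path).name  # only the file name carries the metadata
    for pattern in (NEW_PATTERN, OLD_PATTERN):
        match = pattern.match(name)
        if match:
            return {
                "year": int(match["year"]),
                "subject": match["subject"],
                "niveau": match["niveau"],
                "task_number": int(match["nr"]),
            }
    return None
```

Files that match neither pattern yield `None`, so a caller can skip them or log them for review.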

### Document types

- **EWH** - Erwartungshorizont (marking scheme; primary target)
- **Aufgabe** - Exam tasks
- **Material** - Supplementary materials
- **GBU** - Gefährdungsbeurteilung (risk assessment; chemistry/biology)
- **Bewertungsbogen** - Standardized assessment sheets

### Subjects

Deutsch, Englisch, Mathematik, Informatik, Biologie, Chemie, Physik, Geschichte, Erdkunde, Kunst, Musik, Sport, Latein, Griechisch, Französisch, Spanisch, Katholische Religion, Evangelische Religion, Werte und Normen, BRC, BVW, Gesundheit-Pflege

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     NiBiS Ingestion Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. ZIP Extraction                                              │
│     └── Unpacks 2024.zip, 2025.zip, etc.                        │
│                                                                 │
│  2. Document Discovery                                          │
│     ├── Parses old naming convention (2016/2017)                │
│     └── Parses new naming convention (2024/2025)                │
│                                                                 │
│  3. PDF Processing                                              │
│     ├── Text extraction (PyPDF2)                                │
│     └── Chunking (1000 chars, 200 overlap)                      │
│                                                                 │
│  4. Embedding Generation                                        │
│     └── OpenAI text-embedding-3-small (1536 dim)                │
│                                                                 │
│  5. Qdrant Indexing                                             │
│     └── Collection: bp_nibis_eh                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
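
Step 3 splits the extracted text into chunks of 1000 characters with 200 characters of overlap. A minimal sketch of such a sliding-window chunker is shown below; the function name `chunk_text` and the handling of the final, shorter chunk are assumptions, and the actual implementation may differ.

```python
from typing import List


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # each window starts 800 characters after the previous one
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

With the default values, the last 200 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still fully contained in one of the two chunks.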

## Usage

### Via API (recommended)

```bash
# 1. Preview the available documents
curl http://localhost:8086/api/v1/admin/nibis/discover

# 2. Unpack the ZIP files
curl -X POST http://localhost:8086/api/v1/admin/nibis/extract-zips

# 3. Start the ingestion
curl -X POST http://localhost:8086/api/v1/admin/nibis/ingest \
  -H "Content-Type: application/json" \
  -d '{"ewh_only": true}'

# 4. Check the status
curl http://localhost:8086/api/v1/admin/nibis/status

# 5. Test the semantic search
curl -X POST http://localhost:8086/api/v1/admin/nibis/search \
  -H "Content-Type: application/json" \
  -d '{"query": "Analyse literarischer Texte", "subject": "Deutsch", "limit": 5}'
```
```
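
The same search call can be issued from Python using only the standard library. This is a sketch under assumptions: the helper names `build_search_request` and `send` are hypothetical, and the response is assumed to be JSON, which the documentation above does not state.

```python
import json
import urllib.request

# Base URL taken from the curl examples above.
BASE_URL = "http://localhost:8086/api/v1/admin/nibis"


def build_search_request(query: str, subject: str, limit: int = 5) -> urllib.request.Request:
    """Build the POST request for the search endpoint without sending it."""
    body = json.dumps({"query": query, "subject": subject, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_search_request("Analyse literarischer Texte", "Deutsch"))` requires the service to be running on port 8086.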

### Via CLI

```bash
# Dry run (analyze only)
cd klausur-service/backend
python nibis_ingestion.py --dry-run

# Full ingestion
python nibis_ingestion.py

# Only a specific year
python nibis_ingestion.py --year 2024

# Only a specific subject
python nibis_ingestion.py --subject Deutsch

# Create a manifest
python nibis_ingestion.py --manifest /tmp/nibis_manifest.json
```

### Via Shell Script

```bash
./klausur-service/scripts/run_nibis_ingestion.sh --dry-run
./klausur-service/scripts/run_nibis_ingestion.sh --year 2024 --subject Deutsch
```

## Qdrant Schema

### Collection: `bp_nibis_eh`

```json
{
  "id": "nibis_2024_deutsch_ea_1_abc123_chunk_0",
  "vector": [1536 dimensions],
  "payload": {
    "doc_id": "nibis_2024_deutsch_ea_1_abc123",
    "chunk_index": 0,
    "text": "Der Erwartungshorizont...",
    "year": 2024,
    "subject": "Deutsch",
    "niveau": "eA",
    "task_number": 1,
    "doc_type": "EWH",
    "bundesland": "NI",
    "variant": null,
    "source": "nibis",
    "training_allowed": true
  }
}
```
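
The point layout above can be assembled per chunk from the document metadata. The sketch below only illustrates the ID and payload structure; the function name `build_points` is hypothetical, and the embedding vector is deliberately left out because it comes from the embedding step.

```python
from typing import Any, Dict, List


def build_points(doc_id: str, chunks: List[str], metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Create one point per chunk, following the ID and payload layout shown above."""
    points = []
    for index, text in enumerate(chunks):
        payload = {"doc_id": doc_id, "chunk_index": index, "text": text}
        payload.update(metadata)  # year, subject, niveau, doc_type, ...
        points.append({"id": f"{doc_id}_chunk_{index}", "payload": payload})
    return points


# Example metadata mirroring the schema above.
example_metadata = {
    "year": 2024,
    "subject": "Deutsch",
    "niveau": "eA",
    "task_number": 1,
    "doc_type": "EWH",
    "bundesland": "NI",
    "variant": None,
    "source": "nibis",
    "training_allowed": True,
}
```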

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/v1/admin/nibis/status` | Ingestion status |
| POST | `/api/v1/admin/nibis/extract-zips` | Unpack ZIP files |
| GET | `/api/v1/admin/nibis/discover` | Discover documents |
| POST | `/api/v1/admin/nibis/ingest` | Start ingestion |
| POST | `/api/v1/admin/nibis/search` | Semantic search |
| GET | `/api/v1/admin/nibis/stats` | Statistics |
| GET | `/api/v1/admin/nibis/collections` | Qdrant collections |
| DELETE | `/api/v1/admin/nibis/collection` | Delete collection |

## Extending to other federal states

The pipeline is designed to be easy to extend:

### 1. Add a new federal state

```python
# In nibis_ingestion.py
# Needed for the type hints below
from pathlib import Path
from typing import Dict, Optional

# Federal state codes (ISO 3166-2:DE)
BUNDESLAND_CODES = {
    "NI": "Niedersachsen",
    "BE": "Berlin",
    "BY": "Bayern",
    # ...
}

# Parsing function for the new format
def parse_filename_berlin(filename: str, file_path: Path) -> Optional[Dict]:
    # Berlin-specific naming convention goes here
    pass
```

### 2. Register a new directory

```python
# Add docs/za-download-berlin/
ZA_DOWNLOAD_DIRS = [
    "za-download",
    "za-download-2",
    "za-download-3",
    "za-download-berlin",  # NEW
]
```

### 3. Extending document types

For report card generation or other document types:

```python
DOC_TYPES = {
    "EWH": "Erwartungshorizont",
    "ZEUGNIS_VORLAGE": "Zeugnisvorlage",
    "NOTENSPIEGEL": "Notenspiegel",
    "BEMERKUNG": "Bemerkungstexte",
}
```

## Legal notes

- NiBiS data may be used freely under the [NiBiS terms of use](https://nibis.de)
- `training_allowed: true` - structural knowledge may be used for AI training
- For teachers' own Erwartungshorizonte (BYOEH): `training_allowed: false`

## Troubleshooting

### Qdrant unreachable

```bash
# Check whether Qdrant is running
curl http://localhost:6333/healthz

# Start the Docker container
docker-compose up -d qdrant
```

### OpenAI API errors

```bash
# Set the API key
export OPENAI_API_KEY=sk-...
```

### PDF extraction failed

Some PDFs can be problematic (scanned documents without OCR). These are skipped and recorded in the error log.
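
One way to detect such documents is to check whether the extracted page texts contain any meaningful amount of text. The following sketch assumes that approach; the logger name, the `MIN_CHARS` threshold, and the function `usable_text` are hypothetical and not taken from the pipeline.

```python
import logging
from typing import Iterable, Optional

logger = logging.getLogger("nibis_ingestion")  # hypothetical logger name

MIN_CHARS = 50  # assumed threshold below which a PDF counts as "no text layer"


def usable_text(pdf_path: str, page_texts: Iterable[Optional[str]]) -> Optional[str]:
    """Join the per-page texts; return None and log an error if almost nothing was extracted."""
    text = "\n".join(page or "" for page in page_texts).strip()
    if len(text) < MIN_CHARS:
        logger.error("Skipping %s: no extractable text (scanned PDF without OCR?)", pdf_path)
        return None
    return text
```

Here `page_texts` stands for the strings returned by the PDF library for each page; empty or missing pages are treated as empty strings.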

## Performance

- ~500-1000 chunks per minute (depending on the OpenAI API)
- ~2-3 GB of Qdrant storage for all NiBiS data (2016-2025)
- Embeddings are generated only once (idempotent via hash)
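
Hash-based idempotency can be sketched as follows. What exactly the pipeline hashes and which algorithm it uses is not stated here, so hashing the chunk text with SHA-256 and the names `content_hash` and `select_new_chunks` are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List, Set


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk (assumption: SHA-256 over the UTF-8 text)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: Iterable[str], seen_hashes: Set[str]) -> List[str]:
    """Return only chunks whose hash is not yet known and record their hashes."""
    new_chunks = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_chunks.append(chunk)
    return new_chunks
```

Only the chunks returned by `select_new_chunks` would need an embedding request, so re-running the ingestion over unchanged files causes no additional API calls.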