docs: instructions for the RAG pipeline (document upload backend)

Complete specification:
- DB schema (iace_uploaded_documents)
- 3 Go endpoints (POST/GET/DELETE)
- Async pipeline: PDF → text → chunks → embed → Qdrant
- Tenant-isolated collections (bp_norms_tenant_{id})
- Multi-collection RAG search
- Frontend API contract
- Security (tenant isolation, file validation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Task: Document Upload Backend for IACE Standards Research

## Context

The IACE CE-Compliance frontend has a document upload area in the standards research (Normenrecherche) tab.
Customers can upload their own PDFs there (standards, technical specifications, test reports).
The documents must be processed and made searchable in a tenant-isolated way.

**The frontend is finished** and calls these endpoints:
```
POST   /sdk/v1/iace/projects/{projectId}/documents     (multipart/form-data, field: "file")
GET    /sdk/v1/iace/projects/{projectId}/documents
DELETE /sdk/v1/iace/projects/{projectId}/documents/{docId}
```

**Frontend headers:** `X-Tenant-ID` and `X-User-ID` are set automatically.

---

## What needs to be built

### 1. DB table for document tracking

```sql
CREATE TABLE IF NOT EXISTS public.iace_uploaded_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID NOT NULL REFERENCES public.iace_projects(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL,
    filename TEXT NOT NULL,
    original_filename TEXT NOT NULL,
    file_size BIGINT NOT NULL DEFAULT 0,
    mime_type TEXT NOT NULL DEFAULT 'application/pdf',
    storage_path TEXT NOT NULL,               -- path in object storage or on local disk
    status TEXT NOT NULL DEFAULT 'uploaded',  -- uploaded, processing, indexed, error
    error_message TEXT DEFAULT '',
    chunk_count INT DEFAULT 0,
    qdrant_collection TEXT DEFAULT '',        -- e.g. bp_norms_tenant_{tenant_id}
    uploaded_by TEXT DEFAULT '',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- IF NOT EXISTS keeps the migration re-runnable, matching the CREATE TABLE above
CREATE INDEX IF NOT EXISTS idx_iace_docs_project ON public.iace_uploaded_documents(project_id);
CREATE INDEX IF NOT EXISTS idx_iace_docs_tenant ON public.iace_uploaded_documents(tenant_id);
```

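As a rough illustration of how a row of this table might look on the Go side, here is a minimal sketch. The type, constant, and function names are hypothetical, and the lifecycle encoded in `CanTransition` (uploaded → processing → indexed or error) is an assumption inferred from the status comment in the schema, not something the spec mandates.

```go
package main

import (
	"fmt"
	"time"
)

// DocStatus mirrors the values listed in the status column comment.
type DocStatus string

const (
	StatusUploaded   DocStatus = "uploaded"
	StatusProcessing DocStatus = "processing"
	StatusIndexed    DocStatus = "indexed"
	StatusError      DocStatus = "error"
)

// UploadedDocument is a hypothetical Go view of one iace_uploaded_documents row.
type UploadedDocument struct {
	ID               string    `json:"id"`
	ProjectID        string    `json:"project_id"`
	TenantID         string    `json:"tenant_id"`
	Filename         string    `json:"filename"`
	OriginalFilename string    `json:"original_filename"`
	FileSize         int64     `json:"file_size"`
	MimeType         string    `json:"mime_type"`
	StoragePath      string    `json:"-"` // assumption: internal path is not exposed to the frontend
	Status           DocStatus `json:"status"`
	ErrorMessage     string    `json:"error_message,omitempty"`
	ChunkCount       int       `json:"chunk_count"`
	QdrantCollection string    `json:"-"` // assumption: internal detail
	UploadedBy       string    `json:"uploaded_by,omitempty"`
	CreatedAt        time.Time `json:"created_at"`
	UpdatedAt        time.Time `json:"updated_at"`
}

// CanTransition reports whether a status change follows the assumed lifecycle
// uploaded -> processing -> indexed | error. Terminal states allow no change.
func CanTransition(from, to DocStatus) bool {
	switch from {
	case StatusUploaded:
		return to == StatusProcessing || to == StatusError
	case StatusProcessing:
		return to == StatusIndexed || to == StatusError
	default:
		return false
	}
}

func main() {
	fmt.Println(CanTransition(StatusUploaded, StatusProcessing))
	fmt.Println(CanTransition(StatusIndexed, StatusUploaded))
}
```
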
### 2. Go backend endpoints (ai-compliance-sdk)

#### POST /projects/{projectId}/documents

Flow:
1. Parse the multipart form and extract the PDF file
2. Validate: PDF only, max 50 MB
3. Store the PDF locally (e.g. `/data/uploads/{tenant_id}/{uuid}.pdf`)
4. Create the DB record with status=`uploaded`
5. **Start an async job** for processing (or process directly in the handler if that is simpler)
6. Response: `{ "document": { "id": "...", "filename": "...", "status": "uploaded" } }`

#### Async processing (after upload)

```
PDF → text extraction (pdftotext or a Go library such as pdfcpu/unidoc)
    → chunking (500-1000 tokens per chunk, 100-token overlap)
    → embedding (bge-m3 via Ollama or the existing embedding service)
    → upsert into the tenant-specific Qdrant collection
    → status update in DB: status=`indexed`, chunk_count=N
```

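The chunking step with overlap might look like the sketch below. It approximates tokens by whitespace-separated words, which is an assumption made for illustration; the actual bge-m3 tokenizer counts differently, so real chunk sizes would need to be calibrated. `chunkWords` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWords splits text into chunks of at most size words, where each chunk
// repeats the last overlap words of the previous one (sliding window).
func chunkWords(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid window configuration
	}
	words := strings.Fields(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break // the last window already reached the end of the text
		}
	}
	return chunks
}

func main() {
	// Tiny window so the overlap is visible; the spec asks for 500-1000 / 100.
	for i, c := range chunkWords("a b c d e f g h i j", 4, 1) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

With the spec's numbers, a call such as `chunkWords(text, 800, 100)` would stay inside the 500-1000 range, subject to the word-versus-token caveat above.
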
**Qdrant collection name:** `bp_norms_tenant_{tenant_id_short}`
- Example: `bp_norms_tenant_9282a473`
- One collection per tenant → complete isolation
- Create the collection if it does not exist (on first upload)

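One way to derive the collection name is sketched below. The spec does not define `tenant_id_short`; taking the first 8 hex characters of the tenant UUID is an assumption inferred from the example name. `tenantCollection` is a hypothetical helper, and the UUID used in `main` is a generic placeholder, not a real tenant.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidRe accepts the canonical 8-4-4-4-12 hex representation only.
var uuidRe = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// tenantCollection builds the tenant-specific collection name.
// Assumption: tenant_id_short is the first 8 hex characters of the UUID.
func tenantCollection(tenantID string) (string, error) {
	if !uuidRe.MatchString(tenantID) {
		return "", fmt.Errorf("invalid tenant id %q", tenantID)
	}
	return "bp_norms_tenant_" + strings.ToLower(tenantID[:8]), nil
}

func main() {
	name, err := tenantCollection("123e4567-e89b-12d3-a456-426614174000")
	fmt.Println(name, err)

	_, err = tenantCollection("../other-tenant")
	fmt.Println(err)
}
```

Validating the header value before building the name keeps untrusted input out of collection names. Note that an 8-character prefix is not guaranteed to be unique across tenants; if collisions are a concern, using the full tenant ID would avoid them.
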
**Qdrant point payload:**
```json
{
  "text": "chunk_text",
  "document_id": "UUID of the document",
  "filename": "EN_692_2005.pdf",
  "page": 5,
  "chunk_index": 3,
  "tenant_id": "9282a473-...",
  "project_id": "a4c4031e-..."
}
```

#### GET /projects/{projectId}/documents

```sql
SELECT id, filename, original_filename, file_size, status, error_message,
       chunk_count, created_at
FROM iace_uploaded_documents
WHERE project_id = $1
ORDER BY created_at DESC
```

Response:
```json
{
  "documents": [
    {
      "id": "uuid",
      "filename": "EN_692_2005.pdf",
      "file_size": 2456789,
      "status": "indexed",
      "chunk_count": 42,
      "created_at": "2026-05-09T..."
    }
  ],
  "total": 1
}
```

#### DELETE /projects/{projectId}/documents/{docId}

1. Load the document metadata from the DB
2. Delete the file from storage
3. Delete the chunks from Qdrant (filter: `document_id == docId`)
4. Delete the DB record
5. Response: `{ "message": "document deleted" }`

### 3. Extend the RAG search

The existing RAG search (`POST /sdk/v1/rag/search`) must be extended to
include tenant-specific collections.

**Current flow:**
```
Query → embedding → search in bp_compliance_ce → results
```

**New flow:**
```
Query → embedding → search in [bp_compliance_ce, bp_norms_tenant_{tenant_id}]
      → merge results (RRF or score merge)
```

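For the RRF variant of the merge step, a minimal sketch of Reciprocal Rank Fusion follows. Each input list is assumed to be sorted by relevance (best first); a hit at 1-based rank `r` contributes `1 / (k + r)` to its fused score. The `Hit` type and `rrfMerge` name are hypothetical, and `k = 60` in the example is a commonly used default rather than a value taken from this spec.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical search result identified by a point or chunk ID.
type Hit struct {
	ID    string
	Score float64
}

// rrfMerge fuses several ranked lists with Reciprocal Rank Fusion.
// The original scores are ignored; only the rank within each list matters.
func rrfMerge(k float64, lists ...[]Hit) []Hit {
	fused := map[string]float64{}
	for _, list := range lists {
		for rank, h := range list {
			fused[h.ID] += 1.0 / (k + float64(rank+1))
		}
	}
	out := make([]Hit, 0, len(fused))
	for id, s := range fused {
		out = append(out, Hit{ID: id, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID // deterministic tie-break
	})
	return out
}

func main() {
	global := []Hit{{ID: "a"}, {ID: "b"}}
	tenant := []Hit{{ID: "b"}, {ID: "c"}}
	for _, h := range rrfMerge(60, global, tenant) {
		fmt.Printf("%s %.5f\n", h.ID, h.Score)
	}
}
```

Because RRF works on ranks, it does not require the scores of the two collections to be on the same scale. When the two collections never share IDs, it effectively interleaves the lists by rank.
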
**Option A: Multi-collection query (preferred)**
This option assumes that Qdrant can search several collections in a single request (stated here as supported since v1.7). Verify this against the deployed Qdrant version before relying on it, because the standard search endpoints operate on one collection at a time.
The tenant header determines which tenant-specific collection is added.

**Option B: Two separate queries + merge**
If multi-collection search is not available: run two queries in parallel and merge the results sorted by score.

### 4. Embedding service

The existing embedding service uses Ollama with `bge-m3`. Use the same configuration
for the document chunks:

```
OLLAMA_URL: http://host.docker.internal:11434  (or as configured in docker-compose)
MODEL: bge-m3
```

**Alternative:** If Ollama is not available, fall back to the existing
Qdrant FastEmbed setup (if enabled). Any fallback must produce vectors compatible with those already indexed (same model and dimension); otherwise old and new chunks cannot be searched together.


---

## Files to create or change

### New files:
```
ai-compliance-sdk/internal/iace/store_documents.go                   # CRUD for uploaded_documents
ai-compliance-sdk/internal/iace/document_processor.go                # PDF → text → chunks → embed
ai-compliance-sdk/internal/api/handlers/iace_handler_documents.go    # upload/list/delete handlers
ai-compliance-sdk/migrations/024_iace_uploaded_documents.sql
```

### To change:
```
ai-compliance-sdk/internal/app/routes.go               # register 3 new routes
ai-compliance-sdk/internal/ucca/legal_rag_client.go    # multi-collection search
```

---

## Dependencies

- **Qdrant:** Must be reachable (existing configuration: qdrant-dev.breakpilot.ai)
- **Ollama:** For embeddings (existing configuration)
- **PDF extraction:** Go library (e.g. `github.com/ledongthuc/pdf` or `pdfcpu`)
  or alternatively `pdftotext` as a system binary in the Docker container

---

## Frontend API contract

On upload, the frontend sends:
```javascript
const formData = new FormData()
formData.append('file', file)  // File object

fetch(`/api/sdk/v1/iace/projects/${projectId}/documents`, {
  method: 'POST',
  body: formData,
  // NO Content-Type header: the browser sets it automatically, including the boundary
})
```

For the list, the frontend expects:
```json
{
  "documents": [
    {
      "id": "string",
      "filename": "string",
      "file_size": number,
      "status": "uploaded" | "processing" | "indexed" | "error",
      "error_message": "string (only on error)",
      "chunk_count": number,
      "created_at": "ISO-8601"
    }
  ],
  "total": number
}
```

---

## Security

- **Tenant isolation:** Every tenant has its own Qdrant collection. NO cross-tenant access.
- **File validation:** PDF only, max 50 MB, check the magic bytes (not just the extension)
- **Storage isolation:** Files live under `/data/uploads/{tenant_id}/`; path traversal must not be possible
- **Delete cascade:** When a project is deleted, its documents and Qdrant chunks must be deleted as well. The `ON DELETE CASCADE` on `project_id` only removes the DB rows, so files and Qdrant chunks need explicit cleanup.
