docs: add Industry Compliance Ingestion documentation
- Document all 10 industry compliance PDFs and their sources
- Cover ingestion script usage, phases, chunking config
- Document IFRS timeout workaround and endorsement warning
- Add license overview for all document sources

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
115
docs-src/services/sdk-modules/industry-compliance-ingestion.md
Normal file
@@ -0,0 +1,115 @@
# Industry Compliance Ingestion

## Overview

The ingestion script `scripts/ingest-industry-compliance.sh` downloads publicly available industry compliance documents and ingests them into Qdrant via the Core RAG API (port 8097).

**Runs on:** Mac Mini
**Storage location:** `~/rag-ingestion/`
**RAG API:** `https://localhost:8097/api/v1/documents/upload`

---

## Documents (10 PDFs)

| # | Document | Source | Collection | Chunks |
|---|----------|--------|------------|--------|
| 1 | EU Machinery Regulation 2023/1230 | EUR-Lex | `bp_compliance_ce` | ~882 |
| 2 | EU Blue Guide 2022 | EUR-Lex | `bp_compliance_ce` | ~1600 |
| 3 | ENISA Advancing Software Security | enisa.europa.eu | `bp_compliance_datenschutz` | ~99 |
| 4 | ENISA Supply Chain Threat Landscape | enisa.europa.eu | `bp_compliance_datenschutz` | ~284 |
| 5 | NIST SP 800-218 (SSDF) | nist.gov | `bp_compliance_datenschutz` | ~242 |
| 6 | NIST Cybersecurity Framework 2.0 | nist.gov | `bp_compliance_datenschutz` | ~162 |
| 7 | OECD AI Principles | oecd.org | `bp_compliance_datenschutz` | ~76 |
| 8 | EU IFRS Regulation 2023/1803 (DE) | EUR-Lex | `bp_compliance_ce` | ~8942 |
| 9 | EU IFRS Regulation 2023/1803 (EN) | EUR-Lex | `bp_compliance_ce` | ~9000 |
| 10 | EFRAG Endorsement Status Report | efrag.org | `bp_compliance_datenschutz` | ~48 |

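The approximate per-document chunk counts in the table can be summed per collection to get a rough target for the later verification step. The sketch below does only that arithmetic; the helper names `expected_chunks` and `collection_total` are illustrative and not part of the ingestion script, and the totals inherit the "~" uncertainty of the table values.

```bash
# Approximate chunk counts per document, copied from the table (collection, chunks).
expected_chunks() {
  cat <<'EOF'
bp_compliance_ce 882
bp_compliance_ce 1600
bp_compliance_datenschutz 99
bp_compliance_datenschutz 284
bp_compliance_datenschutz 242
bp_compliance_datenschutz 162
bp_compliance_datenschutz 76
bp_compliance_ce 8942
bp_compliance_ce 9000
bp_compliance_datenschutz 48
EOF
}

# collection_total NAME: sum the estimates for one collection.
collection_total() {
  expected_chunks | awk -v c="$1" '$1 == c { sum += $2 } END { print sum + 0 }'
}

collection_total bp_compliance_ce           # prints 20424
collection_total bp_compliance_datenschutz  # prints 911
```
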
---

## Usage

```bash
# Full run (download + upload + verify)
bash ~/rag-ingestion/ingest-industry-compliance.sh

# Downloads only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only download

# Upload the CE collection only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only ce --skip-download

# Upload the Datenschutz collection only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only datenschutz --skip-download

# Verification only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only verify
```

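The flag combinations above suggest that `--only` selects a single phase and `--skip-download` suppresses the download step. A minimal sketch of how such flags could be parsed is shown here. The variable names, `parse_args`, `should_run`, and the rule that an upload phase implies a download unless it is skipped are all assumptions made for illustration; the actual script may handle this differently.

```bash
# Hypothetical flag handling; not taken from the real script.
ONLY=""            # empty means: run every phase
SKIP_DOWNLOAD=0

parse_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --only)
        [ "$#" -ge 2 ] || { echo "--only requires a value" >&2; return 1; }
        ONLY="$2"
        shift 2
        ;;
      --skip-download)
        SKIP_DOWNLOAD=1
        shift
        ;;
      *)
        echo "unknown option: $1" >&2
        return 1
        ;;
    esac
  done
}

# should_run PHASE: succeeds if PHASE is selected by the parsed flags.
should_run() {
  phase="$1"
  case "$phase" in
    download)
      [ "$SKIP_DOWNLOAD" -eq 0 ] || return 1
      # Assumption: uploads need the PDFs, so download also runs for ce/datenschutz.
      case "$ONLY" in
        ""|download|ce|datenschutz) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    *)
      [ -z "$ONLY" ] || [ "$ONLY" = "$phase" ]
      ;;
  esac
}

# Example: parse_args --only ce --skip-download; should_run ce && echo "upload CE"
```
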
---

## Phases

### Phase A: Downloads
- Downloads all 10 PDFs to `~/rag-ingestion/pdfs/`
- Skips files that already exist
- Sends a User-Agent header for ENISA compatibility

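The download behaviour listed for Phase A (skip existing files, send a User-Agent header) could look roughly like the following. This is a sketch, not the script's actual code: the function name `download_pdf`, the User-Agent string, and the `.part` temporary file are assumptions, and the check uses `-s` so that empty leftovers from a failed download are fetched again.

```bash
PDF_DIR="${PDF_DIR:-$HOME/rag-ingestion/pdfs}"
# Assumed value; the documentation only states that a User-Agent header is sent.
USER_AGENT="${USER_AGENT:-Mozilla/5.0 (compatible; rag-ingestion)}"

# download_pdf URL FILENAME
# Prints "skip FILENAME" if a non-empty file already exists, otherwise fetches it.
download_pdf() {
  url="$1"
  target="$PDF_DIR/$2"
  if [ -s "$target" ]; then
    echo "skip $2"
    return 0
  fi
  mkdir -p "$PDF_DIR"
  # -f: fail on HTTP errors, -sS: quiet but show errors,
  # -L: follow redirects, -A: set the User-Agent header
  if curl -fsSL -A "$USER_AGENT" -o "$target.part" "$url"; then
    mv "$target.part" "$target"
    echo "ok $2"
  else
    rm -f "$target.part"
    echo "failed $2" >&2
    return 1
  fi
}

# Example (URL left as a placeholder):
# download_pdf "<source-url>" ifrs_regulation_2023_1803_de.pdf
```

Writing to a temporary file first means an interrupted download never leaves a partial PDF that a later run would skip.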
### Phase B: CE collection (`bp_compliance_ce`)
- EU legal texts (Machinery Regulation, Blue Guide, IFRS)
- Metadata: CELEX number, category, language

### Phase C: Datenschutz collection (`bp_compliance_datenschutz`)
- Frameworks and guidance (ENISA, NIST, OECD, EFRAG)
- Metadata: source ID, type, attribution

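Phases B and C both post PDFs to the upload endpoint and differ in the target collection and the metadata attached. The sketch below covers only what this document shows explicitly: the endpoint and the `file` and `collection` form fields. The metadata form fields are not spelled out here, so they are left out rather than guessed. The helper names, the `DRY_RUN` switch, and every file-name prefix other than `ifrs` are hypothetical.

```bash
RAG_URL="${RAG_URL:-https://localhost:8097/api/v1/documents/upload}"

# collection_for FILENAME: map a PDF to its collection per the phase split.
# Only the ifrs_* name appears in this document; the other prefixes are assumed.
collection_for() {
  case "$1" in
    machinery*|blue_guide*|ifrs*) echo "bp_compliance_ce" ;;
    enisa*|nist*|oecd*|efrag*)    echo "bp_compliance_datenschutz" ;;
    *) echo "unknown document: $1" >&2; return 1 ;;
  esac
}

# upload_pdf FILE COLLECTION
# With DRY_RUN=1 the command is printed instead of executed.
upload_pdf() {
  file="$1"
  collection="$2"
  if [ "${DRY_RUN:-0}" -eq 1 ]; then
    echo "curl -fsS --max-time 1800 -X POST $RAG_URL -F file=@$file -F collection=$collection"
    return 0
  fi
  curl -fsS --max-time 1800 -X POST "$RAG_URL" \
    -F "file=@$file" \
    -F "collection=$collection"
}

# Example:
# f=ifrs_regulation_2023_1803_de.pdf
# DRY_RUN=1 upload_pdf "$HOME/rag-ingestion/pdfs/$f" "$(collection_for "$f")"
```

Running with `DRY_RUN=1` first is a cheap way to check the collection mapping before any embeddings are generated.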
### Phase D: Verification
- Check collection counts
- Run test searches

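One way to check collection counts is to ask Qdrant directly through its REST count endpoint (`POST /collections/{name}/points/count`). The sketch below assumes Qdrant is reachable on its default REST port 6333 on the same host, which this document does not state, and the helper names are illustrative. The test searches are not sketched because the search API is not described here.

```bash
# Assumption: Qdrant listens on its default REST port on this host.
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"

# parse_count JSON: extract the integer that follows "count": in a count response.
parse_count() {
  printf '%s\n' "$1" | sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# collection_count NAME: request the exact number of points in a collection.
collection_count() {
  response="$(curl -fsS -X POST "$QDRANT_URL/collections/$1/points/count" \
    -H 'Content-Type: application/json' \
    -d '{"exact": true}')" || return 1
  parse_count "$response"
}

# Example:
# collection_count bp_compliance_ce
# collection_count bp_compliance_datenschutz
```
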
---

## Chunking configuration

| Parameter | Value |
|-----------|-------|
| Strategy | `recursive` |
| Chunk size | 512 tokens |
| Chunk overlap | 50 tokens |
| Embedding model | BGE-M3 (1024-dim) |

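With a chunk size of 512 tokens and an overlap of 50, each chunk after the first contributes 512 - 50 = 462 new tokens. The sketch below turns that into a rough chunk estimate for a document of a given token length. It models fixed sliding windows only; the `recursive` strategy splits on text boundaries, so real counts will typically be somewhat higher. The function name is illustrative.

```bash
CHUNK_SIZE=512     # tokens per chunk
CHUNK_OVERLAP=50   # tokens shared between neighbouring chunks

# estimate_chunks TOKENS
# Fixed-window approximation: chunks after the first each add
# CHUNK_SIZE - CHUNK_OVERLAP new tokens.
estimate_chunks() {
  tokens="$1"
  stride=$((CHUNK_SIZE - CHUNK_OVERLAP))
  if [ "$tokens" -le 0 ]; then
    echo 0
  elif [ "$tokens" -le "$CHUNK_SIZE" ]; then
    echo 1
  else
    # ceil((tokens - overlap) / stride) using integer arithmetic
    echo $(( (tokens - CHUNK_OVERLAP + stride - 1) / stride ))
  fi
}

estimate_chunks 512     # prints 1
estimate_chunks 513     # prints 2
estimate_chunks 10000   # prints 22
```
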
---

## IFRS specifics

The IFRS Regulation (EU) 2023/1803 is very large at ~8 MB and produces ~9000 chunks. The upload takes 10-15 minutes because embeddings are generated sequentially.

**Workaround for timeouts:**
```bash
# Copy the PDF into the container and upload from there
docker cp ifrs_regulation_2023_1803_de.pdf bp-core-rag-service:/tmp/
docker exec -d bp-core-rag-service sh -c "curl -s --max-time 1800 -X POST http://localhost:8097/api/v1/documents/upload -F file=@/tmp/ifrs_regulation_2023_1803_de.pdf -F collection=bp_compliance_ce ..."
```

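The figures above give a rough per-chunk cost: the upper bound of 15 minutes (900 s) for ~9000 chunks is about 0.1 s per chunk. Doubling that as headroom yields 1800 s, which happens to match the `--max-time 1800` used in the workaround. The source does not say how that value was chosen, so the safety factor of 2 and the helper below are only one possible reading.

```bash
# suggest_max_time CHUNKS [MS_PER_CHUNK] [SAFETY_FACTOR]
# Defaults: 100 ms per chunk (15 min / ~9000 chunks) and a factor of 2.
# Both defaults are derived or assumed here, not taken from the script.
suggest_max_time() {
  chunks="$1"
  ms_per_chunk="${2:-100}"
  safety="${3:-2}"
  echo $(( chunks * ms_per_chunk * safety / 1000 ))
}

suggest_max_time 9000   # prints 1800 (seconds)
```
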
---

## Compliance Advisor Integration

The system prompt in `admin-compliance/app/api/sdk/compliance-advisor/chat/route.ts` references all ingested documents. For IFRS questions, a dedicated endorsement notice is shown:

> This guidance is based on the EU-endorsed IFRS (as of Regulation 2023/1803).
> Check the current EFRAG endorsement status for newer standards.

---

## Licenses

All documents are available under licenses that permit public reuse:

| Source | License |
|--------|---------|
| EUR-Lex | Official work of the EU (public domain) |
| ENISA | EUPL/Reuse Notice |
| NIST | Public Domain (US Government) |
| OECD | Reuse Notice |
| EFRAG | Public document |
@@ -74,6 +74,7 @@ nav:
- Document Crawler: services/sdk-modules/document-crawler.md
- Advisory Board: services/sdk-modules/advisory-board.md
- DSB Portal: services/sdk-modules/dsb-portal.md
- Industry Compliance Ingestion: services/sdk-modules/industry-compliance-ingestion.md
- Entwicklung:
- Testing: development/testing.md
- Dokumentation: development/documentation.md