docs: add Industry Compliance Ingestion documentation
- Document all 10 industry compliance PDFs and their sources
- Cover ingestion script usage, phases, chunking config
- Document IFRS timeout workaround and endorsement warning
- Add license overview for all document sources

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs-src/services/sdk-modules/industry-compliance-ingestion.md
@@ -0,0 +1,115 @@
# Industry Compliance Ingestion

## Overview

The ingestion script `scripts/ingest-industry-compliance.sh` downloads publicly available industry compliance documents and ingests them into Qdrant via the Core RAG API (port 8097).

**Runs on:** Mac Mini
**Location:** `~/rag-ingestion/`
**RAG API:** `https://localhost:8097/api/v1/documents/upload`
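
As a rough illustration of the upload step, the sketch below wraps a single `curl` call to the RAG API endpoint listed above. Only the `file` and `collection` form fields are taken from this page; the function name `upload_pdf`, the `DRY_RUN` switch, and the use of a 1800-second timeout for every upload are assumptions for illustration, and any further metadata fields the real script sends are not shown.

```bash
#!/usr/bin/env bash
# Sketch only: upload one PDF to the RAG API.
# upload_pdf and DRY_RUN are hypothetical names, not part of the real script.

RAG_URL="${RAG_URL:-https://localhost:8097/api/v1/documents/upload}"

# upload_pdf FILE COLLECTION
# With DRY_RUN=1 the curl command is printed instead of executed.
upload_pdf() {
  local file="$1" collection="$2"
  local cmd=(curl -s --max-time 1800 -X POST "$RAG_URL"
             -F "file=@${file}" -F "collection=${collection}")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Calling `DRY_RUN=1 upload_pdf /tmp/example.pdf bp_compliance_ce` (with a hypothetical file path) prints the request instead of sending it, which makes it easy to inspect before a real upload.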

---

## Documents (10 PDFs)

| # | Document | Source | Collection | Chunks |
|---|----------|--------|------------|--------|
| 1 | EU Machinery Regulation 2023/1230 | EUR-Lex | `bp_compliance_ce` | ~882 |
| 2 | EU Blue Guide 2022 | EUR-Lex | `bp_compliance_ce` | ~1600 |
| 3 | ENISA Advancing Software Security | enisa.europa.eu | `bp_compliance_datenschutz` | ~99 |
| 4 | ENISA Supply Chain Threat Landscape | enisa.europa.eu | `bp_compliance_datenschutz` | ~284 |
| 5 | NIST SP 800-218 (SSDF) | nist.gov | `bp_compliance_datenschutz` | ~242 |
| 6 | NIST Cybersecurity Framework 2.0 | nist.gov | `bp_compliance_datenschutz` | ~162 |
| 7 | OECD AI Principles | oecd.org | `bp_compliance_datenschutz` | ~76 |
| 8 | EU IFRS Regulation 2023/1803 (DE) | EUR-Lex | `bp_compliance_ce` | ~8942 |
| 9 | EU IFRS Regulation 2023/1803 (EN) | EUR-Lex | `bp_compliance_ce` | ~9000 |
| 10 | EFRAG Endorsement Status Report | efrag.org | `bp_compliance_datenschutz` | ~48 |

---

## Execution

```bash
# Full run (download + upload + verify)
bash ~/rag-ingestion/ingest-industry-compliance.sh

# Downloads only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only download

# Upload the CE collection only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only ce --skip-download

# Upload the data protection (datenschutz) collection only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only datenschutz --skip-download

# Verification only
bash ~/rag-ingestion/ingest-industry-compliance.sh --only verify
```
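
The script's internals are not reproduced on this page, so the following is only a hypothetical sketch of how the `--only` and `--skip-download` flags used above could be handled. The function names `parse_args` and `should_run`, the variable names, and the rule that upload-only selections still download unless `--skip-download` is given are all assumptions read off the usage examples, not confirmed behaviour.

```bash
#!/usr/bin/env bash
# Hypothetical flag handling; names and semantics are assumptions.

ONLY="all"          # all | download | ce | datenschutz | verify
SKIP_DOWNLOAD=0

# parse_args ARGS...: fill ONLY and SKIP_DOWNLOAD, return 2 on bad input.
parse_args() {
  ONLY="all"; SKIP_DOWNLOAD=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --only)
        [ "$#" -ge 2 ] || { echo "--only needs a value" >&2; return 2; }
        ONLY="$2"; shift 2 ;;
      --skip-download)
        SKIP_DOWNLOAD=1; shift ;;
      *)
        echo "unknown option: $1" >&2; return 2 ;;
    esac
  done
  case "$ONLY" in
    all|download|ce|datenschutz|verify) ;;
    *) echo "invalid --only value: $ONLY" >&2; return 2 ;;
  esac
}

# should_run PHASE: succeeds if the given phase is selected.
should_run() {
  local phase="$1"
  if [ "$phase" = "download" ]; then
    # Assumed rule: download unless skipped or only verifying.
    [ "$SKIP_DOWNLOAD" = "0" ] && [ "$ONLY" != "verify" ]
    return
  fi
  [ "$ONLY" = "all" ] || [ "$ONLY" = "$phase" ]
}
```

A main routine could then guard each phase with `should_run download`, `should_run ce`, and so on, after a single call to `parse_args "$@"`.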

---

## Phases

### Phase A: Downloads
- Downloads all 10 PDFs to `~/rag-ingestion/pdfs/`
- Skips files that already exist
- Sends a User-Agent header for ENISA compatibility
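
The download behaviour described in Phase A can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `download_pdf`, the User-Agent string, and the `.part` temporary file are assumptions.

```bash
#!/usr/bin/env bash
# Sketch of a skip-if-present download step. download_pdf and the
# User-Agent value are illustrative assumptions.

PDF_DIR="${PDF_DIR:-$HOME/rag-ingestion/pdfs}"
USER_AGENT="${USER_AGENT:-Mozilla/5.0 (compatible; rag-ingestion)}"

# download_pdf URL FILENAME
# Prints "skip" if the target already exists and is non-empty, otherwise
# fetches it with curl and prints "downloaded".
download_pdf() {
  local url="$1" target="$PDF_DIR/$2"
  mkdir -p "$PDF_DIR"
  if [ -s "$target" ]; then
    echo "skip"
    return 0
  fi
  # -f: fail on HTTP errors, -L: follow redirects, -A: set the User-Agent.
  if curl -fsSL -A "$USER_AGENT" -o "$target.part" "$url"; then
    mv "$target.part" "$target"
    echo "downloaded"
  else
    rm -f "$target.part"
    echo "failed" >&2
    return 1
  fi
}
```

Writing to a `.part` file first means an interrupted download never leaves a truncated PDF that a later run would skip.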

### Phase B: CE collection (`bp_compliance_ce`)
- EU legal texts (Machinery Regulation, Blue Guide, IFRS)
- Metadata: CELEX number, category, language

### Phase C: Data protection collection (`bp_compliance_datenschutz`)
- Frameworks and guidance (ENISA, NIST, OECD, EFRAG)
- Metadata: source ID, type, attribution

### Phase D: Verification
- Check collection counts
- Run test searches
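
One way to check collection counts is to ask Qdrant directly through its REST count endpoint. The sketch below assumes Qdrant is reachable at `http://localhost:6333` (its default port, not stated on this page); the helper names `extract_count`, `collection_count`, and `check_min_count` are hypothetical, and the real script may verify counts differently.

```bash
#!/usr/bin/env bash
# Sketch: count points per collection via the Qdrant REST API.
# QDRANT_URL and all function names are assumptions.

QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"

# extract_count: read a Qdrant count response on stdin, print the number.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# collection_count NAME: exact point count for one collection.
collection_count() {
  curl -fsS -X POST "$QDRANT_URL/collections/$1/points/count" \
    -H 'Content-Type: application/json' \
    -d '{"exact": true}' | extract_count
}

# check_min_count ACTUAL EXPECTED_MIN: succeeds if ACTUAL >= EXPECTED_MIN.
check_min_count() {
  [ "$1" -ge "$2" ]
}
```

Summing the approximate chunk counts from the document table gives about 20424 points for `bp_compliance_ce` and 911 for `bp_compliance_datenschutz`. If the collections also hold other documents, these figures are only lower bounds, for example `check_min_count "$(collection_count bp_compliance_ce)" 20424`.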

---

## Chunking Configuration

| Parameter | Value |
|-----------|-------|
| Strategy | `recursive` |
| Chunk size | 512 tokens |
| Chunk overlap | 50 tokens |
| Embedding model | BGE-M3 (1024-dim) |
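
To get a feel for these numbers: with a chunk size of 512 tokens and an overlap of 50, each additional chunk advances by 512 - 50 = 462 tokens. The sketch below estimates chunk counts for a plain sliding window. That is a simplifying assumption, since the `recursive` strategy splits at text boundaries and typically yields somewhat shorter chunks, so real counts will tend to be higher. The function name `estimate_chunks` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: rough chunk-count estimate for a plain sliding window.
# The recursive splitter will not match this exactly.

CHUNK_SIZE=512
CHUNK_OVERLAP=50

# estimate_chunks TOKENS: chunks needed to cover TOKENS tokens (TOKENS >= 1).
estimate_chunks() {
  local tokens="$1" stride=$((CHUNK_SIZE - CHUNK_OVERLAP))
  if [ "$tokens" -le "$CHUNK_SIZE" ]; then
    echo 1
  else
    # ceil((tokens - overlap) / stride) using integer arithmetic
    echo $(( (tokens - CHUNK_OVERLAP + stride - 1) / stride ))
  fi
}
```

Read the other way round, ~9000 chunks correspond to at most about 9000 × 462 + 50 ≈ 4.16 million tokens under this model.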

---

## IFRS Special Case

At ~8 MB, the IFRS Regulation (EU) 2023/1803 is very large and produces ~9000 chunks. The upload takes 10-15 minutes because embeddings are generated sequentially.

**Timeout workaround:**
```bash
# Copy the PDF into the container and upload from there
docker cp ifrs_regulation_2023_1803_de.pdf bp-core-rag-service:/tmp/
# -d runs the upload detached; --max-time 1800 allows up to 30 minutes
docker exec -d bp-core-rag-service sh -c "curl -s --max-time 1800 -X POST http://localhost:8097/api/v1/documents/upload -F file=@/tmp/ifrs_regulation_2023_1803_de.pdf -F collection=bp_compliance_ce ..."
```

---

## Compliance Advisor Integration

The system prompt in `admin-compliance/app/api/sdk/compliance-advisor/chat/route.ts` references all ingested documents. For IFRS questions, a dedicated endorsement notice is shown:

> This guidance is based on the EU-endorsed IFRS (as of Regulation 2023/1803).
> Check the current EFRAG endorsement status for newer standards.

---

## Licenses

All documents are available under licenses that permit public reuse:

| Source | License |
|--------|---------|
| EUR-Lex | Official work of the EU (public domain) |
| ENISA | EUPL/Reuse Notice |
| NIST | Public domain (US government) |
| OECD | Reuse Notice |
| EFRAG | Public document |
@@ -74,6 +74,7 @@ nav:
- Document Crawler: services/sdk-modules/document-crawler.md
- Advisory Board: services/sdk-modules/advisory-board.md
- DSB Portal: services/sdk-modules/dsb-portal.md
- Industry Compliance Ingestion: services/sdk-modules/industry-compliance-ingestion.md
- Entwicklung:
- Testing: development/testing.md
- Dokumentation: development/documentation.md