fix(rag): Dedup check, BGB split, GewO timeout, arithmetic fix

- Add Qdrant dedup check in upload_file(): skip the upload if regulation_id already exists
- Split BGB (2.7MB) into 5 targeted parts via XML extraction:
  AGB §§305-310, Fernabsatz §§312-312k, Kaufrecht §§433-480,
  Widerruf §§355-361, Digitale Produkte §§327-327u
- Lower large-file threshold 512KB→384KB (fixes GewO 432KB timeout)
- Fix arithmetic syntax error when collection_count returns "?"
- Replace EGBGB PDF (was empty) with XML extraction
- Add unzip to Alpine container for XML archives
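
Sketch of the arithmetic guard and the lowered threshold, for illustration only.
The helper names to_int and is_large_file are hypothetical and do not come from
the changed files; the sketch also assumes that "KB" means 1024 bytes and that a
non-numeric count such as "?" should be treated as 0.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and fallback behaviour are assumptions,
# not the code touched by this commit.

# 384KB threshold from the commit message (assuming 1KB = 1024 bytes).
LARGE_FILE_THRESHOLD=$((384 * 1024))

# Print the argument if it consists only of digits, otherwise print 0.
# This keeps values like "?" out of $(( ... )), where they would raise
# an arithmetic syntax error.
to_int() {
  case "$1" in
    ''|*[!0-9]*) echo 0 ;;
    *) echo "$1" ;;
  esac
}

# Succeed when the given size in bytes is above the threshold.
is_large_file() {
  [ "$(to_int "$1")" -gt "$LARGE_FILE_THRESHOLD" ]
}

# Usage: a failed count lookup ("?") no longer breaks the subtraction.
before=$(to_int "?")
after=$(to_int "42")
echo "added: $((after - before))"
```

With these assumptions a 432KB file (442368 bytes) counts as large, while it
would not have under the previous 512KB threshold.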

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-12 09:39:09 +01:00
parent 87d06c8b20
commit c88653b221
2 changed files with 220 additions and 21 deletions

@@ -78,7 +78,7 @@ jobs:
-e "SDK_URL=http://bp-compliance-ai-sdk:8090" \
alpine:3.19 \
sh -c "
-apk add --no-cache curl bash coreutils git python3 > /dev/null 2>&1
+apk add --no-cache curl bash coreutils git python3 unzip > /dev/null 2>&1
mkdir -p /tmp/rag-ingestion/{pdfs,repos,texts}
cd /workspace
if [ '${PHASE}' = 'all' ]; then