breakpilot-core

Benjamin_Boenisch/breakpilot-core

Commit Graph

Author	SHA1	Message	Date
Benjamin Admin

75dda9ac92

feat(embedding): add pdfplumber backend for multi-column PDF extraction

EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-02 15:42:25 +02:00
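
The backend priority described in commit 75dda9ac92 (unstructured > pdfplumber > pypdf in auto mode) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repository's actual code: `select_backend`, `BACKEND_PRIORITY`, and the `is_available` hook are hypothetical names, and detecting installed backends with `importlib.util.find_spec` is an assumption about how availability might be checked.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Priority order taken from the commit message (auto mode).
BACKEND_PRIORITY = ("unstructured", "pdfplumber", "pypdf")


def _installed(name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


def select_backend(
    requested: str = "auto",
    is_available: Optional[Callable[[str], bool]] = None,
) -> str:
    """Pick a PDF extraction backend (hypothetical helper).

    In "auto" mode the first available backend in BACKEND_PRIORITY wins.
    An explicitly requested backend must be both known and available.
    """
    check = is_available or _installed

    if requested != "auto":
        if requested not in BACKEND_PRIORITY:
            raise ValueError(f"unknown PDF backend: {requested!r}")
        if not check(requested):
            raise RuntimeError(f"PDF backend not installed: {requested!r}")
        return requested

    for name in BACKEND_PRIORITY:
        if check(name):
            return name
    raise RuntimeError("no PDF extraction backend is installed")
```

Passing `is_available` explicitly keeps the selection logic testable without installing any of the three libraries. For example, `select_backend("auto", is_available=lambda n: n != "unstructured")` falls through to `"pdfplumber"`.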

Benjamin Boenisch

ad111d5e69

Initial commit: breakpilot-core - Shared Infrastructure

Docker Compose with 24+ services:
- PostgreSQL (PostGIS), Valkey, MinIO, Qdrant
- Vault (PKI/TLS), Nginx (Reverse Proxy)
- Backend Core API, Consent Service, Billing Service
- RAG Service, Embedding Service
- Gitea, Woodpecker CI/CD
- Night Scheduler, Health Aggregator
- Jitsi (Web/XMPP/JVB/Jicofo), Mailpit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-11 23:47:13 +01:00

2 Commits