Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00
S
Description
No description provided
53 MiB
Languages
Python 38.7%
TypeScript 33.8%
Go 22.8%
HTML 2.9%
Shell 0.8%
Other 1%