feat(embedding): add pdfplumber backend for multi-column PDF extraction

EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use multi-column layouts that pypdf breaks into fragmented words ("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).

Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing.
pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
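
The backend priority stated above could be resolved roughly as in the following sketch. It is not the code from this commit: the names `BACKEND_PRIORITY`, `available_backends`, `choose_backend`, and `extract_text_pdfplumber` are hypothetical, and the sketch assumes a backend counts as available whenever its package is importable.

```python
from importlib.util import find_spec

# Priority order taken from the commit message (auto mode).
BACKEND_PRIORITY = ("unstructured", "pdfplumber", "pypdf")


def available_backends():
    """Return the backends whose packages are importable here (assumed criterion)."""
    return [name for name in BACKEND_PRIORITY if find_spec(name) is not None]


def choose_backend(requested="auto", available=None):
    """Resolve a backend name; 'auto' picks the highest-priority available one."""
    if available is None:
        available = available_backends()
    if requested == "auto":
        for name in BACKEND_PRIORITY:
            if name in available:
                return name
        raise RuntimeError("no PDF extraction backend installed")
    if requested not in available:
        raise RuntimeError(f"backend {requested!r} is not installed")
    return requested


def extract_text_pdfplumber(path):
    """Extract text page by page with pdfplumber (imported lazily)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # Guard against pages that yield no text (e.g. scanned images).
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)
```

With only pdfplumber and pypdf installed, `choose_backend("auto")` would resolve to `"pdfplumber"`; an explicit request such as `choose_backend("pypdf")` bypasses the priority order.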
2026-05-02 15:42:25 +02:00
parent a459636bc4
commit 75dda9ac92
3 changed files with 42 additions and 3 deletions
@@ -14,6 +14,7 @@ sentence-transformers>=2.2.0
 # PDF Extraction
 unstructured>=0.12.0
 pypdf>=4.0.0
+pdfplumber>=0.11.0
 python-magic>=0.4.27

 # HTTP Client (for OpenAI/Cohere API calls)