75dda9ac92
EU Official Journal PDFs (AI Act, CRA, NIS2, GDPR/DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.
Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.
58 embedding-service tests passing. pdfplumber: MIT license.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
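
The backend priority described in the commit message (unstructured, then pdfplumber, then pypdf) can be sketched as a simple fallback chain. This is a minimal illustration, not the service's actual code: the function names, the `AUTO_ORDER` constant, and the rule "fall through when a backend raises or returns empty text" are all assumptions made for the example. Only the library calls themselves (`partition_pdf`, `pdfplumber.open`, `PdfReader`) are real APIs.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: an extractor takes a file path and returns plain text.
Extractor = Callable[[str], str]


def extract_with_unstructured(path: str) -> str:
    # Imported lazily so a missing package only disables this one backend.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=path)
    return "\n".join(str(el) for el in elements)


def extract_with_pdfplumber(path: str) -> str:
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf(path: str) -> str:
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Assumed "auto mode" order, taken from the commit message.
AUTO_ORDER: Sequence[Tuple[str, Extractor]] = (
    ("unstructured", extract_with_unstructured),
    ("pdfplumber", extract_with_pdfplumber),
    ("pypdf", extract_with_pypdf),
)


def extract_text_auto(
    path: str, backends: Sequence[Tuple[str, Extractor]] = AUTO_ORDER
) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, text) from the first that works."""
    errors: List[str] = []
    for name, extractor in backends:
        try:
            text = extractor(path)
        except Exception as exc:  # missing package, parse failure, etc.
            errors.append(f"{name}: {exc!r}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all PDF backends failed: " + "; ".join(errors))
```

Passing the backend list as a parameter keeps the ordering testable without any PDF library installed; a caller that wants to force one backend could pass a single-entry list instead of `AUTO_ORDER`.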
25 lines
463 B
Plaintext
# Embedding Service Dependencies
# This service handles ML-heavy operations (embeddings, re-ranking, PDF extraction)

# Web Framework
fastapi>=0.109.0
uvicorn[standard]>=0.27.0
pydantic>=2.0.0
python-multipart>=0.0.6

# ML / Embeddings
torch>=2.0.0
sentence-transformers>=2.2.0

# PDF Extraction
unstructured>=0.12.0
pypdf>=4.0.0
pdfplumber>=0.11.0
python-magic>=0.4.27

# HTTP Client (for OpenAI/Cohere API calls)
httpx>=0.26.0

# Utilities
python-dotenv>=1.0.0