feat(embedding): add pdfplumber backend for multi-column PDF extraction

EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).

Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
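
The auto-mode priority named in the message (unstructured > pdfplumber > pypdf) can be read as a simple fallback chain. The sketch below is illustrative only and is not taken from this commit: `select_backend`, `BACKEND_PRIORITY`, and the `is_available` hook are hypothetical names, and it assumes a backend counts as available when its module can be imported.

```python
import importlib.util
from typing import Callable, Optional, Sequence

# Priority order as stated in the commit message (auto mode).
# The constant name is an assumption made for this sketch.
BACKEND_PRIORITY = ("unstructured", "pdfplumber", "pypdf")


def _is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def select_backend(
    requested: str = "auto",
    priority: Sequence[str] = BACKEND_PRIORITY,
    is_available: Callable[[str], bool] = _is_installed,
) -> Optional[str]:
    """Pick a PDF backend name (hypothetical helper, not the service's API).

    In "auto" mode the first available backend in `priority` wins.
    An explicitly requested backend is returned only if it is available.
    Returns None when nothing suitable is found.
    """
    if requested != "auto":
        return requested if is_available(requested) else None
    for name in priority:
        if is_available(name):
            return name
    return None
```

With only pdfplumber and pypdf installed, `select_backend()` would return `"pdfplumber"`. Injecting `is_available` keeps the selection logic testable without installing any of the three libraries.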
@@ -44,7 +44,7 @@ from reingest_d5_config import (
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
 logger = logging.getLogger("d5-reingest")
 
-UPLOAD_TIMEOUT = httpx.Timeout(timeout=1800.0, connect=30.0)
+UPLOAD_TIMEOUT = httpx.Timeout(timeout=3600.0, connect=30.0)
 SCROLL_TIMEOUT = httpx.Timeout(timeout=60.0, connect=10.0)