OCR engines don't detect the | pipe characters used as syllable dividers in dictionaries. After dictionary detection (is_dict=True), use pyphen (MIT) to insert syllable breaks into headword cells. Hyphenation tries DE first, then EN. It skips IPA content, short words, and cells that already contain |.

Also adds pyphen>=0.16.0 to requirements.txt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
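The pipe-insertion step described in the commit message could look roughly like the sketch below. It is not the code from this commit: the function names, the `MIN_WORD_LEN` threshold, and the IPA character pattern are hypothetical, and the whole cell is treated as a single word. The only pyphen calls used are `pyphen.Pyphen(lang=...)` and `Pyphen.inserted(word, hyphen=...)`; hyphenators are passed in as plain callables so the skip logic can be exercised without pyphen installed.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical values: the commit message does not state the exact
# length threshold or how IPA content is recognised.
MIN_WORD_LEN = 4
IPA_CHARS = re.compile(r"[\[\]/ˈˌːəɪʊɛɔæʌθðʃʒŋɑɒɜ]")

Hyphenate = Callable[[str], str]


def load_pyphen_hyphenators() -> List[Hyphenate]:
    """Return DE-then-EN hyphenators backed by pyphen, or [] if it is missing."""
    try:
        import pyphen
    except ImportError:
        return []
    # The default argument binds one Pyphen instance per language.
    return [
        (lambda w, d=pyphen.Pyphen(lang=lang): d.inserted(w, hyphen="|"))
        for lang in ("de_DE", "en_US")
    ]


def insert_syllable_pipes(cell: str, hyphenators: Iterable[Hyphenate]) -> str:
    """Insert | syllable dividers into a headword cell where appropriate."""
    text = cell.strip()
    # Skip cells that already have dividers, short words, and IPA content.
    if "|" in text or len(text) < MIN_WORD_LEN or IPA_CHARS.search(text):
        return cell
    # Try each language in order and keep the first result with a break.
    for hyphenate in hyphenators:
        result = hyphenate(text)
        if "|" in result:
            return result
    return cell
```

With pyphen installed, a call such as `insert_syllable_pipes("Wasser", load_pyphen_hyphenators())` would run the German dictionary first and fall back to English only if German yields no break point.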
60 lines
1.3 KiB
Plaintext
fastapi>=0.109.0
uvicorn[standard]>=0.27.0
python-multipart>=0.0.6
pyjwt>=2.8.0
httpx>=0.26.0
python-dotenv>=1.0.0

# BYOEH Dependencies
qdrant-client>=1.7.0
cryptography>=41.0.0
PyPDF2>=3.0.0
PyMuPDF>=1.24.0

# PyTorch CPU-only (smaller, no CUDA needed for Docker on Mac)
--extra-index-url https://download.pytorch.org/whl/cpu
torch>=2.0.0

# Local Embeddings (no API key needed)
sentence-transformers>=2.2.0

# MinIO Object Storage
minio>=7.2.0

# OpenCV for handwriting detection (headless = no GUI, smaller for CI)
opencv-python-headless>=4.8.0

# Tesseract OCR Python binding (requires system tesseract-ocr package)
pytesseract>=0.3.10
Pillow>=10.0.0

# RapidOCR (PaddleOCR models on ONNX Runtime — works on ARM64 natively)
rapidocr
onnxruntime

# IPA pronunciation dictionary lookup (MIT license, bundled CMU dict ~134k words)
eng-to-ipa

# Spell-checker for rule-based OCR correction (MIT license)
pyspellchecker>=0.8.1

# Syllable hyphenation for dictionary pipe-divider insertion (MIT license)
pyphen>=0.16.0

# PostgreSQL (for metrics storage)
psycopg2-binary>=2.9.0
asyncpg>=0.29.0

# Email validation for Pydantic
email-validator>=2.0.0

# DOCX export for reconstruction editor (MIT license)
python-docx>=1.1.0

# ONNX model export and optimization (Apache-2.0)
optimum[onnxruntime]>=1.17.0

# Testing
pytest>=8.0.0
pytest-asyncio>=0.23.0