Tesseract mangles IPA square brackets into curly braces or parentheses
(e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only
matched [...], missing all garbled variants.
- Match any bracket type: [...], {...}, (...) including mixed pairs
- Add _is_meaningful_bracket_content() to preserve legitimate German
prefixes like (zer)brechen and Tanz(veranstaltung)
- Trigger IPA replacement on any bracket character, not just [
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
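
The bracket handling described above could look roughly like the following. This is a minimal sketch, not the committed code: the pattern, the `fix_ipa` helper, the `ipa_by_word` lookup table, and the letters-only heuristic inside `is_meaningful_bracket_content` are all assumptions made for illustration. Only the idea (match any bracket type, including mixed pairs, and keep legitimate German word parts) comes from the commit message.

```python
import re
from typing import Mapping

# A word, optional whitespace, ANY opening bracket, short bracket-free content,
# ANY closing bracket. Mixed pairs such as "{...]" match on purpose, because
# OCR can garble the opening and closing bracket independently.
BRACKETED = re.compile(r"(\w+)\s*[\[{(]([^\[\]{}()]{1,40})[\]})]")


def is_meaningful_bracket_content(content: str) -> bool:
    """Hypothetical heuristic: letters-only content is treated as a real
    German word part, e.g. the "veranstaltung" in Tanz(veranstaltung)."""
    return re.fullmatch(r"[A-Za-zÄÖÜäöüß]+", content) is not None


def fix_ipa(text: str, ipa_by_word: Mapping[str, str]) -> str:
    """Replace garbled bracket content with a known IPA transcription."""

    def repl(match: "re.Match[str]") -> str:
        word, content = match.group(1), match.group(2)
        if is_meaningful_bracket_content(content):
            return match.group(0)  # legitimate text, leave untouched
        ipa = ipa_by_word.get(word.lower())
        if ipa is None:
            return match.group(0)  # no transcription known, leave untouched
        return f"{word} [{ipa}]"

    return BRACKETED.sub(repl, text)
```

With a lookup of `{"china": "ˈtʃaɪnə"}`, the garbled `China {'tfatno]` is rewritten to `China [ˈtʃaɪnə]`, while `Tanz(veranstaltung)` and `(zer)brechen` pass through unchanged. A letters-only check would miss garbled IPA that happens to contain only letters, so a real implementation likely needs a stronger test.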
RapidOCR (PaddleOCR) is optimized for full-page scene text and produces
artifacts on small isolated cell crops: extra characters ("Tanz z",
"er r wollte"), missing punctuation, garbled phonetic transcriptions.
Tesseract works much better on isolated binarized crops with upscaling,
which is exactly what cell-first OCR provides. RapidOCR remains available
as an explicit engine choice via the dropdown.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Batch OCR takes 30-60s with 3x upscaling. Without keepalive events,
proxy servers (Nginx) drop the SSE connection after their read timeout.
Now sends keepalive events every 5s to prevent timeout, with elapsed
time for debugging. Also checks for client disconnect between keepalives.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
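
The keepalive behaviour described above can be sketched as an async generator that waits on the OCR future in short slices. This is an illustration under stated assumptions, not the committed code: the function name `sse_with_keepalive`, the event names `keepalive` and `result`, and the `elapsed` payload field are hypothetical. The `is_disconnected` parameter stands in for something like `request.is_disconnected` in the endpoint.

```python
import asyncio
import json
import time
from typing import Any, AsyncIterator, Awaitable, Callable


async def sse_with_keepalive(
    task: "asyncio.Future[Any]",
    is_disconnected: Callable[[], Awaitable[bool]],
    interval: float = 5.0,
) -> AsyncIterator[str]:
    """Yield SSE keepalive frames until `task` finishes, then its result."""
    start = time.monotonic()
    while True:
        # Wait at most `interval` seconds for the long-running job.
        done, _pending = await asyncio.wait({task}, timeout=interval)
        if done:
            break
        # Job still running: stop early if nobody is listening any more.
        if await is_disconnected():
            task.cancel()
            return
        # Elapsed time is included purely as a debugging aid.
        payload = json.dumps({"elapsed": round(time.monotonic() - start, 1)})
        yield "event: keepalive\ndata: " + payload + "\n\n"
    # Assumes the job's result is JSON-serializable.
    yield "event: result\ndata: " + json.dumps(task.result()) + "\n\n"
```

With a 5 s interval, a 30-60 s batch would emit roughly 6 to 12 keepalive frames, enough to stay under a typical proxy read timeout. Note that cancelling a future that wraps a thread-pool job does not stop a worker thread that is already running; it only discards the result.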
2026-03-04 22:21:14 +01:00
2 changed files with 82 additions and 16 deletions
            logger.info(f"SSE batch: client disconnected during OCR for {session_id}")
            ocr_future.cancel()
            return
    else:
        cells, columns_meta = ocr_future.result()

    if await request.is_disconnected():
        logger.info(f"SSE batch: client disconnected after OCR for {session_id}")
        return