sed replacement left orphaned hostname references in story page
and empty lines in getApiBase functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTTPS pages cannot fetch from HTTP backend ports. Added Next.js
API route proxies for /api/vocabulary, /api/learning-units, /api/progress
that forward to backend-lehrer internally (same Docker network, HTTP).
All frontend pages now use same-origin requests (getApiBase = '')
instead of direct port:8001 connections.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trigram extension and index are now created in a separate try/catch
so table creation succeeds even without pg_trgm. Search falls back
to ILIKE when trigram functions are not available.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip SQLAlchemy dialect prefix from DATABASE_URL for asyncpg.
Set search_path via server_settings on pool creation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strategic pivot: Studio-v2 becomes a language learning platform.
Compliance guardrail added to CLAUDE.md — no scan/OCR of third-party
content in customer frontend. Upload of OWN materials remains allowed.
Phase 1.1 — vocabulary_db.py: PostgreSQL model for 160k+ words
with english, german, IPA, syllables, examples, images, audio,
difficulty, tags, translations (multilingual). Trigram search index.
Phase 1.2 — vocabulary_api.py: Search, browse, filters, bulk import,
learning unit creation from word selection. Creates QA items with
enhanced fields (IPA, syllables, image, audio) for flashcards.
Phase 1.3 — /vocabulary page: Search bar with POS/difficulty filters,
word cards with audio buttons, unit builder sidebar. Teacher selects
words → creates learning unit → redirects to flashcards.
Sidebar: Added "Woerterbuch" (/vocabulary) and "Lernmodule" (/learn).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qwen2.5vl:32b needs ~100GB RAM and crashes Ollama.
llama3.2-vision:11b is already installed and fits in memory.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New module vision_ocr_fusion.py: Sends scan image + OCR word
coordinates + document type to Qwen2.5-VL 32B. The LLM reads
the image visually while using OCR positions as structural hints.
Key features:
- Document-type-aware prompts (Vokabelseite, Woerterbuch, etc.)
- OCR words grouped into lines with x/y coordinates in prompt
- Low-confidence words marked with (?) for LLM attention
- Continuation row merging instructions in prompt
- JSON response parsing with markdown code block handling
- Fallback to original OCR on any error
Frontend (admin-lehrer Grid Review):
- "Vision-LLM" checkbox toggle
- "Typ" dropdown (Vokabelseite, Woerterbuch, etc.)
- Steps 1-3 defaults set to inactive
Activate: Check "Vision-LLM", select document type, click "OCR neu + Grid".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New endpoint POST /sessions/{id}/rerun-ocr-and-build-grid that:
1. Runs scan quality assessment
2. Applies CLAHE enhancement if degraded (controlled by enhance toggle)
3. Re-runs dual-engine OCR (RapidOCR + Tesseract) with min_conf filter
4. Merges OCR results and stores updated word_result
5. Builds grid with max_columns constraint
Frontend: Orange "OCR neu + Grid" button in GridToolbar.
Unlike "Neu berechnen" (which only rebuilds grid from existing words),
this button re-runs the full OCR pipeline with quality settings.
Now CLAHE toggle actually has an effect — it enhances the image
before OCR runs, not after.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The max_columns parameter was only implemented in cv_words_first.py
(vocab-worksheet path) but NOT in _build_grid_core which is what
the admin OCR Kombi pipeline uses. The Kombi pipeline uses
grid_editor_helpers._cluster_columns_by_alignment() which has its
own column detection.
Fix: Post-processing step 5k merges narrowest columns after grid
building when zone has more columns than max_columns. Cells from
merged columns get their text appended to the target column.
min_conf word filtering was already working (applied before grid build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev-only toggles belong in admin-lehrer (port 3002) only.
The customer frontend runs the pipeline with optimal defaults
and shows only the finished results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each quality improvement step can now be toggled independently:
- CLAHE checkbox (Step 3: image enhancement on/off)
- MaxCols dropdown (Step 2: 0=unlimited, 2-5)
- MinConf dropdown (Step 1: auto/20/30/40/50/60)
Backend: Query params enhance, max_cols, min_conf on process-single-page.
Response includes active_steps dict showing which steps are enabled.
Frontend: Toggle controls in VocabularyTab above the table.
This allows empirical A/B testing of each step on the same scan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 1: scan_quality.py — Laplacian blur + contrast scoring, adjusts
OCR confidence threshold (40 for good scans, 30 for degraded).
Quality report included in API response + shown in frontend.
Step 2: max_columns parameter in cv_words_first.py — limits column
detection to 3 for vocab tables, preventing phantom columns D/E
from degraded OCR fragments.
Step 3: ocr_image_enhance.py — CLAHE contrast + bilateral filter
denoising + unsharp mask, only for degraded scans (gated by
quality score). Pattern from handwriting_htr_api.py.
Frontend: quality info shown in extraction status after processing.
Reprocess button now derives pages from vocabulary data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Types from deleted ocr-pipeline/types.ts inlined into ocr-kombi/types.ts.
All imports updated across components/ocr-kombi/ and components/ocr-pipeline/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
1. reprocessPages() failed silently after session resume because
successfulPages was empty. Now derives pages from vocabulary
source_page or selectedPages as fallback.
2. process-single-page endpoint built vocabulary entries WITHOUT
applying merge logic (_merge_wrapped_rows, _merge_continuation_rows).
Now applies full merge pipeline after vocabulary extraction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows reprocessing pages from the vocabulary view to apply
new merge logic without navigating back to page selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When textbook authors wrap text within a cell (e.g. long German
translations), OCR treats each physical line as a separate row.
New _merge_wrapped_rows() detects this by checking if the primary
column (EN) is empty — indicating a continuation, not a new entry.
Handles: empty EN + DE text, empty EN + example text, parenthetical
continuations like "(bei)", triple wraps, comma-separated lists.
12 tests added covering all cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3.2 — MicrophoneInput.tsx: Browser Web Speech API for
speech-to-text recognition (EN+DE), integrated for pronunciation practice.
Phase 4.1 — Story Generator: LLM-powered mini-stories using vocabulary
words, with highlighted vocab in HTML output. Backend endpoint
POST /learning-units/{id}/generate-story + frontend /learn/[unitId]/story.
Phase 4.2 — SyllableBow.tsx: SVG arc component for syllable visualization
under words, clickable for per-syllable TTS.
Phase 4.3 — Gamification system:
- CoinAnimation.tsx: Floating coin rewards with accumulator
- CrownBadge.tsx: Crown/medal display for milestones
- ProgressRing.tsx: Circular progress indicator
- progress_api.py: Backend tracking coins, crowns, streaks per unit
Also adds "Geschichte" exercise type button to UnitCard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New feature: After OCR vocabulary extraction, users can generate interactive
learning modules (flashcards, quiz, type trainer) with one click.
Frontend (studio-v2):
- Fortune Sheet spreadsheet editor tab in vocab-worksheet
- "Lernmodule generieren" button in ExportTab
- /learn page with unit overview and exercise type cards
- /learn/[unitId]/flashcards — Flip-card trainer with Leitner spaced repetition
- /learn/[unitId]/quiz — Multiple choice quiz with explanations
- /learn/[unitId]/type — Type-in trainer with Levenshtein distance feedback
- AudioButton component using Web Speech API for EN+DE TTS
Backend (klausur-service):
- vocab_learn_bridge.py: Converts VocabularyEntry[] to analysis_data format
- POST /sessions/{id}/generate-learning-unit endpoint
Backend (backend-lehrer):
- generate-qa, generate-mc, generate-cloze endpoints on learning units
- get-qa/mc/cloze data retrieval endpoints
- Leitner progress update + next review items endpoints
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The tokenizer regex only matches alphabetic characters, so text
before the first word match (like "(= " in "(= I won...") was
silently dropped when reassembling the corrected text.
Now preserves text[:first_match_start] as a leading prefix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of keeping only specific symbols (_KEEP_SYMBOLS), now only
removes explicitly decorative symbols (_REMOVE_SYMBOLS: > < ~ \ ^ etc).
All other punctuation (= ( ) ; : - etc.) is preserved by default.
This is more robust: any new symbol used in textbooks will be kept
unless it's in the small block-list of known decorative artifacts.
Fixes: (= token still being removed on page 5 despite being in
the allow-list (possibly due to Unicode variants or whitespace).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rule (a2) in Step 5i removed word_boxes with no letters/digits as
"graphic OCR artifacts". This incorrectly removed = signs used as
definition markers in textbooks ("film = 1. Film; 2. filmen").
Added exception list _KEEP_SYMBOLS for meaningful punctuation:
= (= =) ; : - – — / + • · ( ) & * → ← ↔
The root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one
block, which gets split into word_boxes ["film", "=", "1.", ...].
The "=" word_box had no alphanumeric chars and was removed as artifact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
= signs are used as definition markers in textbooks ("film = 1. Film").
They were incorrectly removed by two filters:
1. grid_build_core.py Step 5j-pre: _PURE_JUNK_RE matched "=" as
artifact noise. Now exempts =, (=, ;, :, - and similar meaningful
punctuation tokens.
2. cv_ocr_engines.py _is_noise_tail_token: "pure non-alpha" check
removed trailing = tokens. Now exempts meaningful punctuation.
Fixes: "film = 1. Film; 2. filmen" losing the = sign,
"(= I won and he lost.)" losing the (=.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Words like "probieren)" or "Englisch)" were incorrectly flagged as
gutter OCR errors because the closing parenthesis wasn't stripped
before dictionary lookup. The spellchecker then suggested "probierend"
(replacing ) with d, edit distance 1).
Two fixes:
1. Strip trailing/leading parentheses in _try_spell_fix before checking
if the bare word is valid — skip correction if it is
2. Add )( to the rstrip characters in the analysis phase so
"probieren)" becomes "probieren" for the known-word check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Text like "Betonung auf der 1. Silbe: profit ['profit]" was
incorrectly detected as garbled IPA and replaced with generated
IPA transcription of the previous row's example sentence.
Added guard: if the cell text contains >=3 recognizable words
(3+ letter alpha tokens), it's normal text, not garbled IPA.
Garbled IPA is typically short and has no real dictionary words.
Fixes: Row 13 C3 showing IPA instead of pronunciation hint text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>