breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	f2521d2b9e	feat(ocr-pipeline): British/American IPA pronunciation choice - Integrate Britfone dictionary (MIT, 15k British English IPA entries) - Add pronunciation parameter: 'british' (default) or 'american' - British uses Britfone (Received Pronunciation), falls back to CMU - American uses eng_to_ipa/CMU, falls back to Britfone - Frontend: dropdown to switch pronunciation, default = British - API: ?pronunciation=british|american query parameter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:08:52 +01:00
Benjamin Admin	010616be5a	fix(ocr-pipeline): generic example attachment + cell padding 1. Semantic example matching: instead of attaching example sentences to the immediately preceding entry, find the vocab entry whose English word(s) appear in the example. "a broken arm" → matches "broken" via word overlap, not "egg/Ei". Uses stem matching for word form variants (break/broken share stem "bro"). 2. Cell padding: add 8px padding to each cell region so words at column/row edges don't get clipped by OCR (fixes "er wollte" missing at cell boundaries). 3. Treat very short DE text (≤2 chars) as OCR noise, not real translation — prevents false positives in example detection. All fixes are generic and deterministic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:24:28 +01:00
Benjamin Admin	ab294d5a6f	feat(ocr-pipeline): deterministic post-processing pipeline Add 4 post-processing steps after OCR (no LLM needed): 1. Character confusion fix: I/1/l/| correction using cross-language context (if DE has "Ich", EN "1" → "I") 2. IPA dictionary replacement: detect [phonetics] brackets, look up correct IPA from eng_to_ipa (MIT, 134k words); replaces OCR'd phonetic symbols with dictionary-correct transcription 3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen" → 3 individual entries when part counts match 4. Example sentence attachment: rows with EN but no DE translation get attached as examples to the preceding vocab entry All fixes are deterministic and generic, with no hardcoded word lists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:00:09 +01:00
Benjamin Admin	f7e0f2bb4f	feat(ocr-pipeline): line breaks, hyphen rejoin & oversized row splitting - Preserve \n between visual lines within cells (instead of joining with space) - Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden) - Split oversized rows (>1.5× median height) into sub-entries when EN/DE line counts match — deterministic fix for missed Step 4 row boundaries - Frontend: render \n as <br/>, use textarea for multiline editing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 18:49:28 +01:00
Benjamin Admin	859342300e	fix(ocr-pipeline): configure RapidOCR for German + tighter word detection - Switch to PP-OCRv5 Latin model (supports ä, ö, ü, ß) - Use SERVER model for better accuracy - Lower Det.unclip_ratio 1.6→1.3 to reduce word merging - Raise Det.box_thresh 0.5→0.6 for stricter detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 18:17:49 +01:00
Benjamin Admin	45435f226f	feat(ocr-pipeline): line grouping fix + RapidOCR integration Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to correctly order words in multi-line cells (fixes word reordering bug). RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX Runtime, native ARM64). Engine selectable via dropdown in UI or ?engine= query param. Auto mode prefers RapidOCR when available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 17:13:58 +01:00
Benjamin Admin	356d39d6ee	fix(ocr-pipeline): use PSM 6 (block) for multi-line cell OCR in word grid PSM 7 (single line) missed the second line in cells with two lines. PSM 6 handles multi-line content. Also fix sort order to Y-then-X for correct reading order. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 09:40:04 +01:00
Benjamin Admin	954103cdf2	feat(ocr-pipeline): add Step 5 word recognition (grid from columns × rows) Backend: build_word_grid() intersects column regions with content rows, OCRs each cell with language-specific Tesseract, and returns vocabulary entries with percent-based bounding boxes. New endpoints: POST /words, GET /image/words-overlay, ground-truth save/retrieve for words. Frontend: StepWordRecognition with overview + step-through labeling modes, goToStep callback for row correction feedback loop. MkDocs: OCR Pipeline documentation added. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 02:18:29 +01:00
Benjamin Admin	203b3c0e2d	fix(ocr-pipeline): mask out images in row detection horizontal projection Build a word-coverage mask so only pixels near Tesseract word bounding boxes contribute to the horizontal projection. Image regions (high ink but no words) are treated as white, preventing illustrations from merging multiple vocabulary rows into one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 01:39:20 +01:00
Benjamin Admin	04b83d5f46	feat(ocr-pipeline): add row detection step with horizontal gap analysis Add Step 4 (row detection) between column detection and word recognition. Uses horizontal projection profiles + whitespace gaps (same method as columns). Includes header/footer classification via gap-size heuristics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 01:14:31 +01:00
Benjamin Admin	ce0815007e	feat(ocr-pipeline): replace clustering column detection with whitespace-gap analysis Column detection now uses vertical projection profiles to find whitespace gaps between columns, then validates gaps against word bounding boxes to prevent splitting through words. Old clustering algorithm extracted as fallback (_detect_columns_by_clustering) for pages with < 2 detected gaps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 00:36:28 +01:00
Benjamin Admin	164b35c06a	fix(ocr-pipeline): tighten page_ref constraints based on live testing - Reduce left-side threshold from 35% to 20% of content width - Strong language signal (eng/deu > 0.3) now prevents page_ref assignment - Increase column_ignore word threshold from 3 to 8 for edge columns - Apply language guard to Level 1 and Level 2 classification Fixes: column with deu=0.921 was misclassified as page_ref because reference score check ran before language analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 23:33:11 +01:00
Benjamin Admin	db8327f039	fix(ocr-pipeline): tune column detection based on GT comparison Address 5 weaknesses found via ground-truth comparison on session df3548d1: - Add column_ignore for edge columns with < 3 words (margin detection) - Absorb tiny clusters (< 5% width) into neighbors post-merge - Restrict page_ref to left 35% of content area across all 3 levels - Loosen marker thresholds (width < 6%, words <= 15) and add strong marker score for very narrow non-edge columns (< 4%) - Add EN/DE position tiebreaker when language signals are both weak Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 23:16:31 +01:00
Benjamin Admin	03fa186fec	fix(ocr-pipeline): increase merge distance to 6% for better column merging Sub-alignments within a column (indented words, etc.) were 60-90px apart and not getting merged at 3%. On a typical 5-col page (~1500px), 6% = ~90px merges sub-alignments while keeping real column boundaries (~300px) separate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 20:19:09 +01:00
Benjamin Admin	1040729874	fix(ocr-pipeline): avoid backslash in f-string for Python 3.11 compat Use format() instead of nested f-strings with escaped quotes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 20:06:20 +01:00
Benjamin Admin	4f37afa222	feat(ocr-pipeline): verticality filter for column detection Clusters now track Y-positions of their words and filter by vertical coverage (>=30% primary, >=15%+5words secondary) to reject noise from indentations or page numbers. Merge distance widened to 3% content width. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 19:57:13 +01:00
Benjamin Admin	1393a994f9	Flexible content-based column detection (2 phases) Replaces hardcoded position rules with a two-stage system: Phase A detects column geometry (clustering), Phase B classifies column types by content (language/role) with a 3-level fallback chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 23:33:35 +01:00
Benjamin Admin	cf27a95308	feat(ocr-pipeline): word-based 5-column detection for vocabulary pages Replace projection-profile layout analysis with Tesseract word bounding box clustering to detect 5-column vocabulary layouts (page_ref, EN, DE, markers, examples). Falls back to projection profiles when < 3 clusters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 23:08:14 +01:00
Benjamin Admin	09b820efbe	refactor(dewarp): replace displacement map with affine shear correction The old displacement-map approach shifted entire rows by a parabolic profile, creating a circle/barrel distortion. The actual problem is a linear vertical shear: after deskew aligns horizontal lines, the vertical column edges are still tilted by ~0.5°. New approach: - Detect shear angle from strongest vertical edge slope (not curvature) - Apply cv2.warpAffine shear to straighten vertical features - Manual slider: -2.0° to +2.0° in 0.05° steps - Slider initializes to auto-detected shear angle - Ground truth question: "Spalten vertikal ausgerichtet?" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 18:23:04 +01:00
Benjamin Admin	ff2bb79a91	fix(dewarp): change manual slider to percentage (0-200%) instead of raw multiplier The old -3.0 to +3.0 scale multiplied the full displacement map (up to ~79px) directly, causing extreme distortion at values >1. New slider: - 0% = no correction - 100% = auto-detected correction (default) - 200% = double correction - Step size: 5% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 18:10:34 +01:00
Benjamin Admin	9df745574b	fix(ocr-pipeline): dewarp visibility, grid on both sides, session persistence - Fix dewarp method selection: prefer methods with >5px curvature over higher confidence (vertical_edge 79px was being ignored for text_baseline 2px) - Add grid overlay on left image in Dewarp step for side-by-side comparison - Add GET /sessions/{id} endpoint to reload session data - StepDeskew accepts sessionId prop to restore state when navigating back - SessionInfo type extended with optional deskew_result and dewarp_result Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 17:29:53 +01:00
Benjamin Admin	589d2f811a	feat: Dewarp correction as Step 2 in the OCR pipeline (7 steps) Implements book-curvature dewarping with two methods: - Method A: vertical edge analysis (Sobel + 2nd-degree polynomial) - Method B: text-line baseline (Tesseract + baseline curvature) The best method is chosen automatically; a manual slider (-3 to +3) is available. Backend: 3 new endpoints (auto/manual dewarp, ground truth) Frontend: StepDewarp + DewarpControls, pipeline extended from 6 to 7 steps Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 16:46:41 +01:00
Benjamin Boenisch	5a31f52310	Initial commit: breakpilot-lehrer - Lehrer KI Platform Services: Admin-Lehrer, Backend-Lehrer, Studio v2, Website, Klausur-Service, School-Service, Voice-Service, Geo-Service, BreakPilot Drive, Agent-Core Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 23:47:26 +01:00
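
Several commits above (ce0815007e for columns, 04b83d5f46 for rows) describe detection via projection profiles and whitespace gaps. The following is a minimal sketch of that general idea, not the repository's code: the function names `vertical_projection` and `find_whitespace_gaps`, the parameters `max_ink` and `min_gap`, and the list-of-lists image representation are all assumptions made for illustration. The validation of gaps against word bounding boxes mentioned in ce0815007e is not shown.

```python
from typing import List, Tuple


def vertical_projection(binary_rows: List[List[int]]) -> List[int]:
    """Sum ink pixels per image column (1 = ink, 0 = white).

    Hypothetical helper: the image is assumed to be a list of equal-length rows.
    """
    return [sum(column) for column in zip(*binary_rows)]


def find_whitespace_gaps(
    profile: List[int], max_ink: int = 0, min_gap: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, where the profile stays
    at or below max_ink for at least min_gap consecutive positions.

    Note: runs touching the page edge (margins) are returned as well;
    a caller interested only in column separators would drop those.
    """
    gaps: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value <= max_ink:
            if start is None:
                start = i  # a new low-ink run begins here
        else:
            if start is not None and i - start >= min_gap:
                gaps.append((start, i))
            start = None
    # Close a run that extends to the end of the profile.
    if start is not None and len(profile) - start >= min_gap:
        gaps.append((start, len(profile)))
    return gaps


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
        [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    ]
    profile = vertical_projection(page)
    print(profile)
    print(find_whitespace_gaps(profile))
```

The same gap search applies to row detection by summing along rows instead of columns; the thresholds here are placeholders, not the values used in the pipeline.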

123 Commits