breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	a6069631cc	feat: PaddleOCR remote engine (PP-OCRv5 Latin on Hetzner x86_64) PaddleOCR as a new engine=paddle option in the OCR pipeline. Microservice on Hetzner (paddleocr-service/), async HTTP client (paddleocr_remote.py), frontend dropdown, words_first selected automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 09:31:22 +01:00
Benjamin Admin	ced5bb3dd3	feat: Words-First grid builder (bottom-up alternative to cell_grid_v2) New algorithm in cv_words_first.py: clusters Tesseract word_boxes directly into columns (X-gap) and rows (Y-proximity) and builds cells at their intersections. No separate column/row detection needed. - cv_words_first.py: _cluster_columns, _cluster_rows, _build_cells, build_grid_from_words - ocr_pipeline_api.py: grid_method parameter (v2\|words_first) in the /words endpoint - StepWordRecognition.tsx: dropdown toggle for the grid method - OCR-Pipeline.md: docs v4.3.0 with the Words-First algorithm - 15 unit tests for cv_words_first Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 06:46:05 +01:00
Benjamin Admin	8a60f4bf30	fix: position overlay cells without _heal_row_gaps (skip_heal_gaps) _heal_row_gaps shifts cell positions after artifact rows are removed, which causes a visible offset in the overlay (e.g. 23px for "badge"). A new skip_heal_gaps parameter in build_cell_grid_v2 and the words endpoint preserves the exact row positions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 08:59:50 +01:00
Benjamin Admin	e60254bc75	fix: all post-crop steps use the cropped image instead of the dewarped one The column, row, and word overlays and all subsequent steps (LLM review, reconstruction) now read image/cropped with a fallback to image/dewarped. Added tests for page_crop.py (25 tests). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 09:10:10 +01:00
Benjamin Admin	dd16c88007	fix: retry words request on 400/404 + add backend diagnostic logging Frontend: retry /words POST once after 2s delay if it gets 400/404, which happens when navigating via wizard after container restart (session cache not yet warm). Backend: log when session needs DB reload and when dewarped_bgr is missing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 20:15:54 +01:00
Benjamin Admin	4d428980c1	refactor(word-step): make table fully generic and fix marker-only row filter Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic table driven by columns_used from backend. Labeling, confirmation, counts, and summary badges are now all cell-based instead of branching on isVocab. Backend: Change _cells_to_vocab_entries() entry filter from checking only english/german/example to checking ANY mapped field. This preserves rows with only marker or source_page content, fixing the issue where marker sub-columns disappeared at the end of OCR processing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 08:45:24 +01:00
Benjamin Admin	dea3349b23	fix(ocr-pipeline): preserve sub-column data in vocab table display Three fixes for sub-columns disappearing at end of streaming: 1. Backend: add column_marker mapping in _cells_to_vocab_entries() so marker text is included in vocab entries (not silently dropped) 2. Frontend types: add source_page and bbox_ref to WordEntry interface 3. Frontend table: show page_ref column (Seite) in vocab table when entries have source_page data, instead of only EN/DE/Example Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 08:06:15 +01:00
Benjamin Admin	50ad06f43a	fix(ocr-pipeline): always run fresh word detection, skip stale cache Word-lookup is now ~0.03s (vs seconds with per-cell Tesseract), so always re-run detection when entering Step 5 instead of showing potentially stale cached word_result from the session DB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 08:05:13 +01:00
Benjamin Admin	7f27783008	feat(ocr-pipeline): add SSE streaming for word recognition (Step 5) Cells now appear one-by-one in the UI as they are OCR'd, with a live progress bar, instead of waiting for the full result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 17:54:20 +01:00
Benjamin Admin	27b895a848	feat(ocr-pipeline): generic cell-grid with optional vocab mapping Extract build_cell_grid() as layout-agnostic foundation from build_word_grid(). Step 5 now produces a generic cell grid (columns x rows) and auto-detects whether vocab layout is present. Frontend dynamically switches between vocab table (EN/DE/Example) and generic cell table based on layout type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 17:22:56 +01:00
Benjamin Admin	f2521d2b9e	feat(ocr-pipeline): British/American IPA pronunciation choice - Integrate Britfone dictionary (MIT, 15k British English IPA entries) - Add pronunciation parameter: 'british' (default) or 'american' - British uses Britfone (Received Pronunciation), falls back to CMU - American uses eng_to_ipa/CMU, falls back to Britfone - Frontend: dropdown to switch pronunciation, default = British - API: ?pronunciation=british\|american query parameter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:08:52 +01:00
Benjamin Admin	f7e0f2bb4f	feat(ocr-pipeline): line breaks, hyphen rejoin & oversized row splitting - Preserve \n between visual lines within cells (instead of joining with space) - Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden) - Split oversized rows (>1.5× median height) into sub-entries when EN/DE line counts match — deterministic fix for missed Step 4 row boundaries - Frontend: render \n as <br/>, use textarea for multiline editing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 18:49:28 +01:00
Benjamin Admin	45435f226f	feat(ocr-pipeline): line grouping fix + RapidOCR integration Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to correctly order words in multi-line cells (fixes word reordering bug). RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX Runtime, native ARM64). Engine selectable via dropdown in UI or ?engine= query param. Auto mode prefers RapidOCR when available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 17:13:58 +01:00
Benjamin Admin	954103cdf2	feat(ocr-pipeline): add Step 5 word recognition (grid from columns × rows) Backend: build_word_grid() intersects column regions with content rows, OCRs each cell with language-specific Tesseract, and returns vocabulary entries with percent-based bounding boxes. New endpoints: POST /words, GET /image/words-overlay, ground-truth save/retrieve for words. Frontend: StepWordRecognition with overview + step-through labeling modes, goToStep callback for row correction feedback loop. MkDocs: OCR Pipeline documentation added. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 02:18:29 +01:00
Benjamin Admin	589d2f811a	feat: dewarp correction as step 2 of the OCR pipeline (7 steps) Implements book-curvature dewarping with two methods: - Method A: vertical edge analysis (Sobel + second-degree polynomial) - Method B: text-line baseline (Tesseract + baseline curvature) The best method is chosen automatically; manual slider (-3 to +3). Backend: 3 new endpoints (auto/manual dewarp, ground truth) Frontend: StepDewarp + DewarpControls, pipeline extended from 6 to 7 steps Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 16:46:41 +01:00
Benjamin Admin	d552fd8b6b	feat: OCR pipeline with 6-step wizard for page reconstruction New route /ai/ocr-pipeline with step-by-step straightening (deskew), grid overlay, and ground truth. Steps 2-6 as placeholders. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 15:38:08 +01:00

16 Commits