breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	b83b38e7f2	feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m55s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 17s Details RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 08:26:04 +01:00
Benjamin Admin	a994ddee83	feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 33s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m55s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 21s Details Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge, KombiCompareStep component for parallel execution and side-by-side overlay, and wordResultOverride prop on OverlayReconstruction for direct data injection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 07:59:06 +01:00
Benjamin Admin	d6f51e4418	fix: deduplicate overlapping OCR words and use per-word Y positions in overlay CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 33s Details CI / test-python-klausur (push) Failing after 2m9s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 24s Details Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge so words at the same position are deduplicated even when OCR text differs. Frontend: Add yPct/hPct to WordPosition so each word renders at its actual vertical position instead of all words collapsing to the cell center Y. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 20:27:08 +01:00
Benjamin Admin	703e110bab	fix: split PaddleOCR multi-word boxes before merge CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 32s Details CI / test-python-klausur (push) Failing after 2m1s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 19s Details PaddleOCR returns entire phrases as single boxes (e.g. "More than 200 singers took part in the"). The merge algorithm compared word-by-word but Paddle had multi-word boxes vs Tesseract's individual words, so nothing matched and all Tesseract words were added as "extras" causing duplicates. Now splits Paddle boxes into individual words before merge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 10:39:10 +01:00
Benjamin Admin	24e1e93b5b	fix: save raw paddle/tesseract words in kombi session for debugging CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 2m12s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 19s Details Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 09:03:01 +01:00
Benjamin Admin	846292f632	fix: rewrite Kombi merge with row-based sequence alignment CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 19s Details Replaces position-based word matching with row-based sequence alignment to fix doubled words and cross-line averaging in Kombi-Modus. New algorithm: 1. Group words into rows by Y-position clustering 2. Match rows between engines by vertical center proximity 3. Within each row: walk both sequences left-to-right, deduplicating 4. Unmatched rows kept as-is Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 08:45:03 +01:00
Benjamin Admin	4280298e02	fix: add _deduplicate_words safety net to Kombi merge CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 32s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m5s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 19s Details Even after multi-criteria matching, near-duplicate words can slip through (same text, centers within 30px horizontal / 15px vertical). The new _deduplicate_words() removes these, keeping the higher-confidence copy. Regression test with real session data (row 2 with 145 near-dupes) confirms no duplicates remain after merge + deduplication. Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 08:27:45 +01:00
Benjamin Admin	4f2fb0e94c	fix: Kombi-Modus merge now deduplicates same words from both engines CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 42s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m13s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 22s Details The merge algorithm now uses 3 criteria instead of just IoU > 0.3: 1. IoU > 0.15 (relaxed threshold) 2. Center proximity < word height AND same row 3. Text similarity > 0.7 AND same row This prevents doubled overlapping words when both PaddleOCR and Tesseract find the same word at similar positions. Unique words from either engine (e.g. bullets from Tesseract) are still added. Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity, _words_match tests + deduplication regression test). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 08:11:31 +01:00
Benjamin Admin	61c8169f9e	docs+test: add Kombi-Modus tests (19 passing) and MkDocs documentation CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 35s Details CI / test-go-edu-search (push) Successful in 32s Details CI / test-python-klausur (push) Failing after 2m33s Details CI / test-python-agent-core (push) Successful in 20s Details CI / test-nodejs-website (push) Successful in 24s Details - test_paddle_kombi.py: 6 IoU tests, 10 merge tests, 2 bullet-point tests - OCR-Pipeline.md: new "OCR Overlay" section with Paddle Direct/Kombi docs, merge algorithm flowchart, dateistruktur update, changelog v4.5.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:18:46 +01:00
Benjamin Admin	e9ccd1e35c	feat: add Kombi-Modus (PaddleOCR + Tesseract) for OCR Overlay CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 35s Details CI / test-go-edu-search (push) Successful in 33s Details CI / test-python-klausur (push) Failing after 2m20s Details CI / test-python-agent-core (push) Successful in 22s Details CI / test-nodejs-website (push) Successful in 41s Details Runs both OCR engines on the preprocessed image and merges results: word boxes matched by IoU, coordinates averaged by confidence weight. Unmatched Tesseract words (bullets, symbols) are added for better coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:05:50 +01:00
Benjamin Admin	1f527fcd49	fix: split PaddleOCR boxes at leading ! for overlay word positioning CI / go-lint (push) Has been cancelled Details CI / python-lint (push) Has been cancelled Details CI / nodejs-lint (push) Has been cancelled Details CI / test-go-school (push) Has been cancelled Details CI / test-go-edu-search (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details CI / test-python-agent-core (push) Has been cancelled Details CI / test-nodejs-website (push) Has been cancelled Details When PaddleOCR returns "!Betonung" as a single word box, the overlay positions text starting at the "!" instead of the actual word. Split such boxes into ["!", "Betonung"] with proportional position splitting, matching the existing IPA bracket splitting logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:46:17 +01:00
Benjamin Admin	8349c28f54	fix: paddle_direct reuses build_grid_from_words for correct overlay CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 37s Details CI / test-go-edu-search (push) Successful in 35s Details CI / test-python-klausur (push) Failing after 2m22s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 23s Details Replaces custom _paddle_words_to_grid_cells with the proven build_grid_from_words from cv_words_first.py — same function the regular pipeline uses with PaddleOCR. Handles phrase splitting, column clustering, and produces cells with word_boxes that the slide/cluster positioning hooks expect. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:19:52 +01:00
Benjamin Admin	71a1b5f058	fix: paddle_direct groups words per row (matching _build_cells format) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 34s Details CI / test-go-edu-search (push) Successful in 34s Details CI / test-python-klausur (push) Failing after 2m11s Details CI / test-python-agent-core (push) Successful in 20s Details CI / test-nodejs-website (push) Successful in 24s Details One cell per row with all words as word_boxes instead of one cell per word. Gives OverlayReconstruction a row-spanning bbox_pct for correct font sizing and per-word positions for slide/cluster placement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:10:10 +01:00
Benjamin Admin	c743a38eaf	fix: Paddle Direct keeps preprocessing (orient/deskew/dewarp/crop) CI / nodejs-lint (push) Has been cancelled Details CI / go-lint (push) Has been cancelled Details CI / python-lint (push) Has been cancelled Details CI / test-go-school (push) Has been cancelled Details CI / test-go-edu-search (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details CI / test-python-agent-core (push) Has been cancelled Details CI / test-nodejs-website (push) Has been cancelled Details Uses the cropped/dewarped image instead of the original so the overlay shows the correctly oriented page. 5 steps instead of 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:56:18 +01:00
Benjamin Admin	90c1efd9b0	feat: Paddle Direct — 1-click OCR without deskew/dewarp/crop CI / go-lint (push) Has been cancelled Details CI / python-lint (push) Has been cancelled Details CI / nodejs-lint (push) Has been cancelled Details CI / test-go-school (push) Has been cancelled Details CI / test-go-edu-search (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details CI / test-python-agent-core (push) Has been cancelled Details CI / test-nodejs-website (push) Has been cancelled Details New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing 7-step pipeline. Backend endpoint runs PaddleOCR on the original image and clusters words into rows/cells directly. Frontend adds a mode toggle and PaddleDirectStep component. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:41:55 +01:00
Benjamin Admin	3e65b14b83	fix: split PaddleOCR boxes at IPA brackets for overlay positioning CI / go-lint (push) Has been cancelled Details CI / python-lint (push) Has been cancelled Details CI / nodejs-lint (push) Has been cancelled Details CI / test-go-school (push) Has been cancelled Details CI / test-go-edu-search (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details CI / test-python-agent-core (push) Has been cancelled Details CI / test-nodejs-website (push) Has been cancelled Details PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer produces "badge [bˈædʒ]" with space, creating a token count mismatch between cell.text and word_boxes. Now also split at "[" boundaries so each IPA bracket gets its own sub-box. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:08:17 +01:00
Benjamin Admin	40ac593d28	fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide CI / test-nodejs-website (push) Has been cancelled Details CI / go-lint (push) Has been cancelled Details CI / python-lint (push) Has been cancelled Details CI / nodejs-lint (push) Has been cancelled Details CI / test-go-school (push) Has been cancelled Details CI / test-go-edu-search (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details CI / test-python-agent-core (push) Has been cancelled Details PaddleOCR returns phrase-level bounding boxes (e.g. "competition [kompa'tifn]" as one box) but the overlay slide mechanism expects one box per word for accurate positioning. Multi-word boxes are now split proportionally by character count with small gaps between words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:00:06 +01:00
Benjamin Admin	ea69239e06	fix: word_boxes in words_first use absolute pixels (consistent with v2 grid) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 39s Details CI / test-go-edu-search (push) Successful in 33s Details CI / test-python-klausur (push) Failing after 2m21s Details CI / test-python-agent-core (push) Successful in 22s Details CI / test-nodejs-website (push) Successful in 33s Details words_first was storing word_boxes in percent coordinates while cv_cell_grid.py uses absolute pixel coordinates. The overlay slide mechanism divides by imgW to get percentages, so percent-in-percent caused positions near zero. Now both grid builders use the same format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:04:04 +01:00
Benjamin Admin	685d135be5	fix: downscale large images before PaddleOCR (Traefik 60s limit) CI / go-lint (push) Has been cancelled Details CI / python-lint (push) Has been cancelled Details CI / nodejs-lint (push) Has been cancelled Details CI / test-go-school (push) Has been cancelled Details CI / test-go-edu-search (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details CI / test-python-agent-core (push) Has been cancelled Details CI / test-nodejs-website (push) Has been cancelled Details Bilder > 1500px werden vor dem Upload verkleinert. Koordinaten werden zurueckskaliert. JPEG statt PNG fuer schnelleren Upload. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:28:58 +01:00
Benjamin Admin	e2c2acdf86	fix: increase PaddleOCR remote timeout to 120s for large scans CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 34s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m14s Details CI / test-python-agent-core (push) Successful in 21s Details CI / test-nodejs-website (push) Successful in 24s Details Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 13:41:39 +01:00
Benjamin Admin	a6069631cc	feat: PaddleOCR Remote-Engine (PP-OCRv5 Latin auf Hetzner x86_64) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 31s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 2m7s Details CI / test-python-agent-core (push) Successful in 21s Details CI / test-nodejs-website (push) Successful in 21s Details PaddleOCR als neue engine=paddle Option in der OCR-Pipeline. Microservice auf Hetzner (paddleocr-service/), async HTTP-Client (paddleocr_remote.py), Frontend-Dropdown, automatisch words_first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 09:31:22 +01:00
Benjamin Admin	ced5bb3dd3	feat: Words-First Grid Builder (bottom-up alternative zu cell_grid_v2) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 54s Details CI / test-go-edu-search (push) Successful in 47s Details CI / test-python-klausur (push) Failing after 2m31s Details CI / test-python-agent-core (push) Successful in 23s Details CI / test-nodejs-website (push) Successful in 32s Details Neuer Algorithmus in cv_words_first.py: Clustert Tesseract word_boxes direkt zu Spalten (X-Gap) und Zeilen (Y-Proximity), baut Zellen an Schnittpunkten. Kein Spalten-/Zeilenerkennung noetig. - cv_words_first.py: _cluster_columns, _cluster_rows, _build_cells, build_grid_from_words - ocr_pipeline_api.py: grid_method Parameter (v2\|words_first) im /words Endpoint - StepWordRecognition.tsx: Dropdown Toggle fuer Grid-Methode - OCR-Pipeline.md: Doku v4.3.0 mit Words-First Algorithmus - 15 Unit-Tests fuer cv_words_first Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 06:46:05 +01:00
Benjamin Admin	2e21a4b6d0	fix: IPA nur einfügen wenn word_boxes Gap >80px zeigen (kein falsches IPA) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 55s Details CI / test-go-edu-search (push) Successful in 48s Details CI / test-python-klausur (push) Failing after 2m11s Details CI / test-python-agent-core (push) Successful in 23s Details CI / test-nodejs-website (push) Successful in 26s Details _has_ipa_gap() prüft ob Tesseract eine IPA-Klammer übersehen hat anhand des physischen Abstands zwischen Headword und nächstem Wort. Ohne Gap (z.B. "be good at sth.", "Focus on language") wird kein IPA eingefügt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:40:18 +01:00
Benjamin Admin	d98dba9098	fix: Headword-IPA auch in langen column_text Zeilen einfuegen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 53s Details CI / test-go-edu-search (push) Successful in 49s Details CI / test-python-klausur (push) Failing after 2m14s Details CI / test-python-agent-core (push) Successful in 22s Details CI / test-nodejs-website (push) Successful in 23s Details _insert_missing_ipa ueberspringe Texte mit >6 Woertern oder Klammern. Neue _insert_headword_ipa fuer column_text: prueft nur das erste Wort der Zeile, unabhaengig von Textlaenge oder vorhandenen Klammern. Ausserdem _sync_word_boxes_after_ipa_insert gefixt: Token-Vergleich nutzt jetzt paralleles Durchlaufen statt zip (verschobene Positionen). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:25:38 +01:00
Benjamin Admin	cd13eca290	fix: IPA-Einfuegung fuer column_text mit word_boxes Synchronisation CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 31s Details CI / test-go-edu-search (push) Successful in 32s Details CI / test-python-klausur (push) Failing after 2m9s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 20s Details Fuer column_text werden fehlende IPA-Lautschriften (challenge, profit, film, badge) wieder eingefuegt, aber gleichzeitig eine synthetische word_box erzeugt, damit die 1:1 Token-zu-Box Zuordnung im Overlay erhalten bleibt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:15:26 +01:00
Benjamin Admin	aa7db43f02	fix: column_text nur garbled IPA ersetzen, keine Einfuegung/Entfernung CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 34s Details CI / test-go-edu-search (push) Successful in 30s Details CI / test-python-klausur (push) Failing after 2m8s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 21s Details Fuer column_text (Full-Page Overlay mit gemischtem EN+DE Text): - Kein IPA einfuegen (wuerde Token-Count aendern, Overlay-Positionen brechen) - Keine orphan brackets entfernen (sind oft deutsche Bedeutungen wie (probieren)) - Nur garbled IPA ersetzen (z.B. [teıst] -> [tˈeɪst]) column_en behaelt volle Verarbeitung (replace + strip + insert). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 23:05:37 +01:00
Benjamin Admin	4afd5bd8e8	fix: Klammerwörter wie (probieren), (Profit) nicht mehr als garbled IPA entfernen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 50s Details CI / test-go-edu-search (push) Successful in 45s Details CI / test-python-klausur (push) Failing after 2m12s Details CI / test-python-agent-core (push) Successful in 23s Details CI / test-nodejs-website (push) Successful in 27s Details _strip_orphan_bracket entfernte deutsche Bedeutungsangaben in Klammern, weil sie weder als Grammar-Partikel noch als IPA erkannt wurden. Fix: Klammerinhalte mit echten Wörtern (>=4 Buchstaben) werden behalten. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 22:47:01 +01:00
Benjamin Admin	7d19145edb	fix: word_boxes auch fuer breite Spalten (Full-Page OCR) speichern CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 32s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 2m3s Details CI / test-python-agent-core (push) Successful in 20s Details CI / test-nodejs-website (push) Successful in 21s Details word_boxes wurden nur im Cell-Crop-Pfad (narrow columns) gesetzt, aber nicht im Full-Page Word-Assignment-Pfad (broad columns). Jetzt werden die Tesseract-Wort-Koordinaten in beiden Pfaden gespeichert. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 20:41:29 +01:00
Benjamin Admin	0ee92e7210	feat: OCR word_boxes fuer pixelgenaue Overlay-Positionierung CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 37s Details CI / test-go-edu-search (push) Successful in 32s Details CI / test-python-klausur (push) Failing after 2m10s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 20s Details Backend: _ocr_cell_crop speichert jetzt word_boxes mit exakten Tesseract/RapidOCR Wort-Koordinaten (left, top, width, height) im Cell-Ergebnis. Absolute Bildkoordinaten, bereits zurueckgemappt. Frontend: Slide-Hook nutzt word_boxes direkt wenn vorhanden — jedes Wort wird exakt an seiner OCR-Position platziert. Kein Pixel-Scanning noetig. Fallback auf alten Slide wenn keine Boxes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 19:39:49 +01:00
Benjamin Admin	2f51ac617f	feat: IPA-Lautschrift in Cell-Texte einfuegen (fuer Overlay-Modus) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 34s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m5s Details CI / test-python-agent-core (push) Successful in 23s Details CI / test-nodejs-website (push) Successful in 22s Details fix_cell_phonetics() ersetzt fehlerhafte IPA-Klammern UND fuegt fehlende Lautschrift fuer englische Woerter ein (z.B. badge, film, challenge, profit). Wird auf alle Zellen mit col_type column_en/column_text angewandt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 15:47:26 +01:00
Benjamin Admin	41e47baf13	fix: skip_heal_gaps Parameter an Stream-Generator durchreichen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m6s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 28s Details NameError behoben: skip_heal_gaps war nicht im Scope der _word_batch_stream_generator Funktion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 09:11:16 +01:00
Benjamin Admin	8a60f4bf30	fix: Overlay-Zellen ohne _heal_row_gaps positionieren (skip_heal_gaps) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 36s Details CI / test-go-edu-search (push) Successful in 35s Details CI / test-python-klausur (push) Failing after 2m12s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 21s Details _heal_row_gaps verschiebt Zell-Positionen nach Entfernung von Artefakt-Zeilen, was im Overlay zu sichtbarem Versatz fuehrt (z.B. 23px bei "badge"). Neuer skip_heal_gaps Parameter in build_cell_grid_v2 und words-Endpoint behaelt die exakten Zeilen-Positionen bei. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 08:59:50 +01:00
Benjamin Admin	e3ee1de790	Revert "fix: Zeilen-Regularisierung im Overlay ueberspringen (generisch fuer gemischte Inhalte)" CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m2s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 24s Details This reverts commit `b91f799ccf`.	2026-03-11 08:44:07 +01:00
Benjamin Admin	b91f799ccf	fix: Zeilen-Regularisierung im Overlay ueberspringen (generisch fuer gemischte Inhalte) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 49s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m21s Details CI / test-python-agent-core (push) Successful in 20s Details CI / test-nodejs-website (push) Successful in 26s Details Seiten mit Info-Boxen (andere Zeilenhoehe) fuehren dazu, dass _regularize_row_grid die Zeilenpositionen verzerrt. Neuer skip_regularize Parameter nutzt stattdessen die gap-basierten Zeilen, die der tatsaechlichen Seitengeometrie folgen. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 08:29:06 +01:00
Benjamin Admin	e2ad93fd57	fix: Word-Erkennung ohne Spalten ermoeglichen (Full-Page Pseudo-Column) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 34s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m14s Details CI / test-python-agent-core (push) Successful in 21s Details CI / test-nodejs-website (push) Successful in 22s Details Wenn column_result fehlt (z.B. OCR Overlay Pipeline), wird automatisch eine einzelne ganzseitige Pseudo-Spalte erzeugt statt einen Fehler zu werfen. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 00:16:31 +01:00
Benjamin Admin	618c82ef42	fix: Zeilen an Box-Grenze nicht mehr abschneiden (border_thickness Margin) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 32s Details CI / test-go-edu-search (push) Successful in 35s Details CI / test-python-klausur (push) Failing after 2m1s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 25s Details - detect_rows: Content-Strips nutzen jetzt box_ranges_inner (geschrumpft um border_thickness, min 5px) statt der vollen Box-Range - detect_words: _row_in_box Filter nutzt ebenfalls inner Range - Dadurch wird die letzte Zeile oberhalb einer Box nicht mehr faelschlicherweise der Box zugeordnet und ausgeschlossen Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 17:44:02 +01:00
Benjamin Admin	6bb023bdc1	fix: vocab_entries fuer column_text Sub-Sessions generieren CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 32s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m8s Details CI / test-python-agent-core (push) Successful in 21s Details CI / test-nodejs-website (push) Successful in 23s Details _cells_to_vocab_entries wurde nur bei is_vocab (column_en/column_de) aufgerufen. Fuer Sub-Sessions mit column_text wurden keine Eintraege erzeugt, daher blieb die Korrektur-Tabelle leer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 10:28:27 +01:00
Benjamin Admin	13553fc5e6	fix: column_text Typ fuer Sub-Sessions in Korrektur-Tabelle CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m9s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 20s Details _cells_to_vocab_entries kannte column_text nicht, daher wurden keine Eintraege erzeugt. Jetzt mappt column_text -> 'text' Feld. Frontend: column_text in FIELD_LABELS/COL_TYPE_TO_FIELD/COL_TYPE_COLOR. Label: "Tabelle" statt "Vokabeltabelle" fuer Sub-Sessions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 09:48:40 +01:00
Benjamin Admin	964c916a81	fix: _clean_cell_text entfernt Waehrungssymbole am Zeilenende CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 31s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 1m57s Details CI / test-python-agent-core (push) Successful in 20s Details CI / test-nodejs-website (push) Successful in 24s Details _is_noise_tail_token() stuft rein nicht-alphabetische Tokens wie €0.50, £1, €2.50 als OCR-Noise ein und entfernt sie. Zusaetzlich zerstoert ' '.join(tokens) das proportionale Spacing. Fuer Single-Column Sub-Sessions wird _clean_cell_text uebersprungen. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 09:41:25 +01:00
Benjamin Admin	13510b62cc	debug: Log-Level auf INFO fuer Sub-Session Zellinhalte CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 30s Details CI / test-python-klausur (push) Failing after 2m3s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 19s Details Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 09:33:56 +01:00
Benjamin Admin	3a791179af	debug: Logging fuer Sub-Session Woertererkennung CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 31s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-agent-core (push) Has been cancelled Details CI / test-nodejs-website (push) Has been cancelled Details CI / test-python-klausur (push) Has been cancelled Details Zeigt low-confidence Woerter (conf<30) und Zellinhalte pro Zeile, um fehlende Euro/Pfund-Betraege zu diagnostizieren. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 09:31:34 +01:00
Benjamin Admin	f65bd11919	fix: Sub-Session Zeilenerkennung nutzt Word-Grouping statt Gap-Detection CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m0s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 23s Details Gap-basierte Erkennung findet bei kleinen Box-Bildern zu wenige Gaps und mergt Zeilen (7 raw gaps -> 4 validated -> nur 3 rows statt 6). Sub-Sessions nutzen jetzt direkt _build_rows_from_word_grouping(), das Woerter nach Y-Position clustert — robuster fuer komplexe Box-Layouts. Zusaetzlich: alle zones=None Crashes gefixt (replace_all .get("zones") or []). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 09:05:24 +01:00
Benjamin Admin	785b4d7655	fix: zones=None crash bei Sub-Session Zeilenerkennung CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m1s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 20s Details column_result.get("zones", []) gibt None zurueck wenn der Key mit Wert None existiert. Geaendert zu .get("zones") or []. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 08:50:58 +01:00
Benjamin Admin	2716495250	fix: Sub-Session Zeilenerkennung — Tesseract+inv im Spalten-Schritt cachen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 2m9s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 20s Details Bisher wurden _word_dicts, _inv und _content_bounds fuer Sub-Sessions nicht gecacht, sodass detect_rows auf detect_column_geometry() zurueckfiel. Das konnte bei kleinen Box-Bildern mit <5 Woertern fehlschlagen. Jetzt laeuft Tesseract + Binarisierung direkt im Pseudo-Spalten-Block, und die Intermediates werden gecacht. Zusaetzlich ausfuehrliche Kommentare zur Zeilenerkennung (detect_row_geometry, _regularize_row_grid). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 08:43:26 +01:00
Benjamin Admin	23b7840ea7	feat: Full-Row OCR mit Spacing fuer Box-Sub-Sessions CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 40s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m16s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 22s Details Sub-Sessions ueberspringen Spaltenerkennung und nutzen stattdessen eine Pseudo-Spalte ueber die volle Breite. Text wird mit proportionalem Spacing aus Wort-Positionen rekonstruiert, um raeumliches Layout zu erhalten. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 08:28:29 +01:00
Benjamin Admin	34adb437d0	fix: Bild-Endpoints fallen auf original zurueck fuer Sub-Sessions CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m3s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 20s Details Alle Bild-Endpoints (cropped, columns-overlay, rows-overlay, words-overlay) suchten nur nach cropped/dewarped. Sub-Sessions haben nur ein original-Bild. Neue Hilfsfunktion _get_base_image_png() mit Fallback-Kette: cropped > dewarped > original. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 23:30:38 +01:00
Benjamin Admin	ceaef9c6a6	fix: Sub-Sessions original_bgr als cropped_bgr promoten CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m22s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 18s Details Spalten-/Zeilen-/Woerter-Erkennung suchen nach cropped_bgr oder dewarped_bgr. Bei Sub-Sessions existiert nur original_bgr (der Box-Ausschnitt). Jetzt wird original_bgr automatisch als cropped_bgr gesetzt, sowohl im Cache-Aufbau als auch bei der Erstellung. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 22:57:39 +01:00
Benjamin Admin	256efef3ea	feat: Box-Zonen durch gesamte Pipeline + Sub-Sessions fuer Box-Inhalt CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 2m0s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 19s Details - Rote semi-transparente Box-Markierung in allen Overlays (Spalten, Zeilen, Woerter) - Zeilenerkennung: Combined-Image-Ansatz schliesst Box-Bereiche aus - Woerter-Erkennung: Zeilen innerhalb von Box-Zonen werden gefiltert - Sub-Sessions: parent_session_id/box_index in DB-Schema - POST /sessions/{id}/create-box-sessions erstellt Sub-Sessions aus Box-Regionen - Session-Info zeigt Sub-Sessions bzw. Parent-Verknuepfung - Sessions-Liste blendet Sub-Sessions per Default aus - Rekonstruktion: Fabric-JSON merged Sub-Session-Zellen an Box-Positionen - Save-Reconstruction routet box{N}_* Updates an Sub-Sessions - GET /sessions/{id}/vocab-entries/merged fuer zusammengefuehrte Eintraege Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 18:24:34 +01:00
Benjamin Admin	4610137ecc	fix: Box-Bereiche aus Bild entfernen statt pro Zone separat Spalten erkennen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 1m54s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 18s Details Content-Streifen oberhalb/unterhalb von Boxen werden zu einem Bild zusammengefügt, Spaltenerkennung läuft einmal auf dem kombinierten Bild. Entfernt Step 5c (suspicion-based gap alignment), da der neue Ansatz das Problem an der Wurzel löst. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:03:05 +01:00
Benjamin Admin	fb46450802	fix: Alignment-Validierung nur fuer verdaechtige Gaps (>2x Median-Breite) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 20s Details Vorher wurden alle internen Gaps geprueft, was echte Spaltentrennungen (EN→DE) faelschlicherweise entfernte. Jetzt werden nur Gaps geprueft, die eine unverhaeltnismaessig breite rechte Spalte erzeugen wuerden (>2x Median-Spaltenbreite). Schwelle auf 15% gesenkt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:27:14 +01:00

1 2 3 4 5

214 Commits