breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	23b7840ea7	feat: Full-Row OCR mit Spacing fuer Box-Sub-Sessions CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 40s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m16s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 22s Details Sub-Sessions ueberspringen Spaltenerkennung und nutzen stattdessen eine Pseudo-Spalte ueber die volle Breite. Text wird mit proportionalem Spacing aus Wort-Positionen rekonstruiert, um raeumliches Layout zu erhalten. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 08:28:29 +01:00
Benjamin Admin	34adb437d0	fix: Bild-Endpoints fallen auf original zurueck fuer Sub-Sessions CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m3s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 20s Details Alle Bild-Endpoints (cropped, columns-overlay, rows-overlay, words-overlay) suchten nur nach cropped/dewarped. Sub-Sessions haben nur ein original-Bild. Neue Hilfsfunktion _get_base_image_png() mit Fallback-Kette: cropped > dewarped > original. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 23:30:38 +01:00
Benjamin Admin	ceaef9c6a6	fix: Sub-Sessions original_bgr als cropped_bgr promoten CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 30s Details CI / test-go-edu-search (push) Successful in 31s Details CI / test-python-klausur (push) Failing after 2m22s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 18s Details Spalten-/Zeilen-/Woerter-Erkennung suchen nach cropped_bgr oder dewarped_bgr. Bei Sub-Sessions existiert nur original_bgr (der Box-Ausschnitt). Jetzt wird original_bgr automatisch als cropped_bgr gesetzt, sowohl im Cache-Aufbau als auch bei der Erstellung. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 22:57:39 +01:00
Benjamin Admin	256efef3ea	feat: Box-Zonen durch gesamte Pipeline + Sub-Sessions fuer Box-Inhalt CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 2m0s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 19s Details - Rote semi-transparente Box-Markierung in allen Overlays (Spalten, Zeilen, Woerter) - Zeilenerkennung: Combined-Image-Ansatz schliesst Box-Bereiche aus - Woerter-Erkennung: Zeilen innerhalb von Box-Zonen werden gefiltert - Sub-Sessions: parent_session_id/box_index in DB-Schema - POST /sessions/{id}/create-box-sessions erstellt Sub-Sessions aus Box-Regionen - Session-Info zeigt Sub-Sessions bzw. Parent-Verknuepfung - Sessions-Liste blendet Sub-Sessions per Default aus - Rekonstruktion: Fabric-JSON merged Sub-Session-Zellen an Box-Positionen - Save-Reconstruction routet box{N}_* Updates an Sub-Sessions - GET /sessions/{id}/vocab-entries/merged fuer zusammengefuehrte Eintraege Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 18:24:34 +01:00
Benjamin Admin	4610137ecc	fix: Box-Bereiche aus Bild entfernen statt pro Zone separat Spalten erkennen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 1m54s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 18s Details Content-Streifen oberhalb/unterhalb von Boxen werden zu einem Bild zusammengefügt, Spaltenerkennung läuft einmal auf dem kombinierten Bild. Entfernt Step 5c (suspicion-based gap alignment), da der neue Ansatz das Problem an der Wurzel löst. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:03:05 +01:00
Benjamin Admin	fb46450802	fix: Alignment-Validierung nur fuer verdaechtige Gaps (>2x Median-Breite) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 20s Details Vorher wurden alle internen Gaps geprueft, was echte Spaltentrennungen (EN→DE) faelschlicherweise entfernte. Jetzt werden nur Gaps geprueft, die eine unverhaeltnismaessig breite rechte Spalte erzeugen wuerden (>2x Median-Spaltenbreite). Schwelle auf 15% gesenkt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:27:14 +01:00
Benjamin Admin	11126c4436	fix: UnboundLocalError edge_tolerance in Step 5c CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 31s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 1m58s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 19s Details Variable wurde vor ihrer Definition in Step 7 referenziert. Eigene margin_thresh Variable fuer Step 5c eingefuehrt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:18:47 +01:00
Benjamin Admin	7a0ded7562	fix: Left-Edge-Alignment-Validierung fuer Spalten-Gaps CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 27s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m7s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 19s Details Interiore Gaps werden jetzt geprueft: rechts des Gaps muessen mindestens 25% der Woerter eine gemeinsame linke Kante teilen. Verhindert falsche Spaltentrennungen innerhalb breiter Spalten (z.B. Example-Spalte mit kurzen und langen Eintraegen). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:11:58 +01:00
Benjamin Admin	04be24a89e	fix: fehlende Imports RAPIDOCR_AVAILABLE und _RE_ALPHA in cv_cell_grid.py CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 1m55s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 20s Details Weitere NameError-Probleme vom Modul-Refactoring: beide Symbole werden in cv_cell_grid.py benutzt, sind aber in cv_ocr_engines.py definiert und waren nicht importiert. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 15:59:24 +01:00
Benjamin Admin	cf9dde9876	fix: _group_words_into_lines nach cv_ocr_engines.py verschieben CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 30s Details CI / test-python-klausur (push) Failing after 2m4s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 21s Details Funktion war nur in cv_review.py definiert, wurde aber auch in cv_ocr_engines.py und cv_layout.py benutzt — NameError zur Laufzeit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 15:24:56 +01:00
Benjamin Admin	60c4138660	fix: _MIN_WORD_CONF als Modul-Konstante statt lokale Variable CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 2m12s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 20s Details NameError in build_cell_grid_v2 weil _MIN_WORD_CONF nur in _ocr_cell_crop und build_cell_grid lokal definiert war. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 15:12:02 +01:00
Benjamin Admin	7005b18561	feat: generische Box-Erkennung fuer zonenbasierte Spaltenerkennung CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 30s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 19s Details - Neue Datei cv_box_detect.py: 2-Stufen-Algorithmus (Linien + Farbe) - DetectedBox/PageZone Dataclasses in cv_vocab_types.py - detect_column_geometry_zoned() in cv_layout.py - API-Endpoints erweitert: zones/boxes_detected im column_result - Overlay-Funktionen zeichnen Box-Grenzen als gestrichelte Rechtecke - Fix: numpy array or-Verknuepfung an 7 Stellen in ocr_pipeline_api.py - 12 Unit-Tests fuer Box-Erkennung und Zone-Splitting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 15:06:23 +01:00
Benjamin Admin	e60254bc75	fix: alle Post-Crop-Schritte nutzen cropped statt dewarped Bild CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 27s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 24s Details Spalten-, Zeilen-, Woerter-Overlay und alle nachfolgenden Steps (LLM-Review, Rekonstruktion) lesen jetzt image/cropped mit Fallback auf image/dewarped. Tests fuer page_crop.py hinzugefuegt (25 Tests). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 09:10:10 +01:00
Benjamin Admin	156a818246	refactor: Crop nach Deskew/Dewarp verschieben + content-basierter Buchscan-Crop CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m56s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 17s Details Pipeline-Reihenfolge neu: Orientierung → Begradigung → Entzerrung → Zuschneiden → Spalten... Crop arbeitet jetzt auf dem bereits geraden Bild, was bessere Ergebnisse liefert. page_crop.py komplett ersetzt: Adaptive Threshold + 4-Kanten-Erkennung (Buchruecken-Schatten links, Ink-Projektion fuer alle Raender) statt Otsu + groesste Kontur. Backend: Step-Nummern, Input-Bilder, Reprocess-Kaskade angepasst. Frontend: PIPELINE_STEPS umgeordnet, Switch-Cases, Vorher-Bilder aktualisiert. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 08:52:11 +01:00
Benjamin Admin	eb45bb4879	fix: numpy array or-Verknuepfung in Crop/Deskew + ImageCompareView Labels CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 37s Details CI / test-go-edu-search (push) Successful in 30s Details CI / test-python-klausur (push) Failing after 2m17s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 24s Details - orientation_crop_api.py: `array or array` durch `is not None` ersetzt (ValueError bei numpy Arrays) - ocr_pipeline_api.py: gleicher Fix fuer Deskew-Fallback-Kette - ImageCompareView.tsx: Fallback-Text nutzt rightLabel statt "Begradigung" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 08:02:44 +01:00
Benjamin Admin	2763631711	feat: Orientierung + Zuschneiden als Schritte 1-2 in OCR-Pipeline CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 18s Details Zwei neue Wizard-Schritte vor Begradigung: - Step 1: Orientierungserkennung (0/90/180/270° via Tesseract OSD) - Step 2: Seitenrand-Erkennung und Zuschnitt (Scannerraender entfernen) Backend: - orientation_crop_api.py: POST /orientation, POST /crop, POST /crop/skip - page_crop.py: detect_and_crop_page() mit Format-Erkennung (A4/A5/Letter) - Session-Store: orientation_result, crop_result Felder - Pipeline nutzt zugeschnittenes Bild fuer Deskew/Dewarp Frontend: - StepOrientation.tsx: Upload + Auto-Orientierung + Vorher/Nachher - StepCrop.tsx: Auto-Crop + Format-Badge + Ueberspringen-Option - Pipeline-Stepper: 10 Schritte (war 8) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 23:55:23 +01:00
Benjamin Admin	9a5a35bff1	refactor: cv_vocab_pipeline.py in 6 Module aufteilen (8163 → 6 + Fassade) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 27s Details CI / test-go-edu-search (push) Successful in 30s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 18s Details Monolithische 8163-Zeilen-Datei aufgeteilt in fokussierte Module: - cv_vocab_types.py (156 Z.): Dataklassen, Konstanten, IPA, Feature-Flags - cv_preprocessing.py (1166 Z.): Bild-I/O, Orientierung, Deskew, Dewarp - cv_layout.py (3036 Z.): Dokumenttyp, Spalten, Zeilen, Klassifikation - cv_ocr_engines.py (1282 Z.): OCR-Engines, Vocab-Postprocessing, Text-Cleaning - cv_cell_grid.py (1510 Z.): Cell-Grid v2+Legacy, Vocab-Konvertierung - cv_review.py (1184 Z.): LLM/Spell Review, Pipeline-Orchestrierung cv_vocab_pipeline.py ist jetzt eine Re-Export-Fassade (35 Z.) — alle bestehenden Imports bleiben unveraendert. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 23:46:47 +01:00
Benjamin Admin	931ab92c92	feat: Orientierungserkennung in OCR-Pipeline-Deskew integrieren CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 38s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 21s Details detect_and_fix_orientation() wird jetzt vor dem Deskew-Schritt in der OCR-Pipeline ausgefuehrt, sodass 90/180/270°-gedrehte Scans automatisch korrigiert werden. Frontend zeigt Orientierungskorrektur als Info-Banner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 22:31:36 +01:00
Benjamin Admin	853638b03c	Revert "fix: _split_broad_columns nur bei maximal 1 breiter Spalte ausfuehren" CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m57s Details CI / test-python-agent-core (push) Successful in 14s Details CI / test-nodejs-website (push) Successful in 15s Details This reverts commit `d98359fceb`.	2026-03-07 22:55:24 +01:00
Benjamin Admin	d98359fceb	fix: _split_broad_columns nur bei maximal 1 breiter Spalte ausfuehren CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 2m26s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 18s Details Wenn bereits 2+ breite Content-Spalten existieren, ist das Layout wahrscheinlich korrekt in EN/DE getrennt. Split wird nur ausgefuehrt wenn eine einzelne breite Spalte EN+DE kombiniert enthaelt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 22:51:14 +01:00
Benjamin Admin	e1ae5d5fa9	fix: Edge-Gaps in _split_broad_columns ignorieren + return-Tuple bei leerem Ergebnis CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m57s Details CI / test-python-agent-core (push) Successful in 14s Details CI / test-nodejs-website (push) Successful in 16s Details Gaps die den Spaltenrand beruehren (Margins) werden jetzt ausgeschlossen, nur interne Gaps werden als Split-Kandidaten betrachtet. Behebt das Problem dass trailing whitespace faelschlich als groesster Gap gewaehlt wurde. Early-return in _run_ocr_pipeline_for_page gibt jetzt korrekt ([], rotation) statt [] zurueck. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 22:16:29 +01:00
Benjamin Admin	4e8ea77140	fix: leere Spalten als strukturell behandeln + 2-Spalten-Layout korrekt labeln CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 24s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m50s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 16s Details Spalten mit <=2 Woertern und <15% Breite werden jetzt als column_marker statt als content-Spalte klassifiziert. Bei 2 breiten Content-Spalten wird die rechte als column_example statt column_de gelabelt, da die linke Spalte EN+DE kombiniert enthaelt. OSD-Zoom von 1.0 auf 2.0 erhoeht fuer zuverlaessigere Orientierungserkennung. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 19:35:21 +01:00
Benjamin Admin	e8ba5ec073	fix: Orientierungserkennung beim PDF-Upload statt erst bei OCR CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 23s Details CI / test-go-edu-search (push) Successful in 23s Details CI / test-python-klausur (push) Failing after 1m47s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 17s Details Rotation wird jetzt in upload_pdf_get_info() erkannt, damit Thumbnails bei der Seitenauswahl bereits richtig herum angezeigt werden. Debug-Logging fuer _split_broad_columns hinzugefuegt. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 19:11:45 +01:00
Benjamin Admin	02631dc4e0	feat: breite Spalten per Word-Gap splitten + gedrehte Scans im Frontend anzeigen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m52s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 15s Details _split_broad_columns() erkennt EN/DE-Gemisch in breiten Spalten via Word-Coverage-Analyse und trennt sie am groessten Luecken-Gap. Thumbnails und Page-Images werden serverseitig per fitz rotiert, Frontend laedt Thumbnails nach OCR-Processing neu. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 18:16:32 +01:00
Benjamin Admin	a5635e0c43	feat: automatische Orientierungserkennung fuer umgedrehte Scans CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 23s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m50s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 15s Details Tesseract OSD erkennt 0/90/180/270° Rotation und korrigiert automatisch vor dem Deskew. Loest das Problem mit Buchscannern, bei denen jede 2. Seite auf dem Kopf steht. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 17:26:21 +01:00
Benjamin Admin	7a1bd5e82d	refactor: positional_column_regions auch in OCR Pipeline verwenden CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 24s Details CI / test-go-edu-search (push) Successful in 24s Details CI / test-python-klausur (push) Failing after 1m48s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 16s Details Shared Funktion positional_column_regions() in cv_vocab_pipeline.py, wird jetzt von beiden Pfaden (Vocab-Worksheet + OCR Pipeline Admin) genutzt. classify_column_types() bleibt als Legacy erhalten. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 17:20:51 +01:00
Benjamin Admin	a5df2b6e15	fix: Spaltenklassifikation im Vocab-Worksheet durch positionsbasierte Zuordnung ersetzen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 33s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m47s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 20s Details Sprachbasiertes Scoring (classify_column_types) verursachte vertauschte Spalten auf Seite 3 bei Beispielsaetzen mit vielen englischen Funktionswoertern. Neue _positional_column_regions() ordnet Spalten rein geometrisch (links→rechts) zu. OCR Pipeline Admin bleibt unveraendert. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 17:07:11 +01:00
Benjamin Admin	4532f68173	fix: Word-Validation auf Segment-Woerter beschraenken CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 1m55s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 17s Details Woerter aus Sub-Header-Bereichen ueberlappten korrekte Spaltenluecken und liessen die Word-Validation faelschlich Gaps verwerfen. Jetzt werden nur Woerter aus dem gewaehlten Segment fuer die Validation verwendet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 23:13:19 +01:00
Benjamin Admin	391449fedf	fix: Seite an Sub-Headern segmentieren, groesstes Segment fuer Projektion CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m58s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 17s Details Statt full-width Zeilen zu maskieren wird die Seite jetzt an grossen horizontalen Luecken (Sub-Header, Kapitelgrenzen) in Segmente unterteilt. Das groesste Segment wird fuer die vertikale Projektion verwendet. Dadurch stoeren Illustrationen und Ueberschriften nicht mehr. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 23:07:23 +01:00
Benjamin Admin	cb2b924a7b	fix: word-coverage gap detection als Fallback bei Illustrationen CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m53s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 18s Details Wenn pixel-basierte Projektion zu wenige Spaltenluecken findet (z.B. durch Illustrationen/Grafiken die Luecken fuellen), wird jetzt eine wort-basierte Gap-Detection als Zwischenschritt vor dem Clustering ausgefuehrt. Tesseract-Wort-BBs sind immun gegen dekorative Grafiken. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 22:58:27 +01:00
Benjamin Admin	8f3a50b981	fix: full-width Zeilen vor Spaltenerkennung maskieren CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m56s Details CI / test-python-agent-core (push) Successful in 14s Details CI / test-nodejs-website (push) Successful in 17s Details Farbige Sub-Header (z.B. "Unit 4: Bonnie Scotland") mit voller Breite fuellten die Spaltenluecken im vertikalen Projektionsprofil auf und fuehrten zu 11 statt 5 erkannten Spalten. Zeilen mit >40% Tintendichte werden jetzt vor der Projektion maskiert. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 22:50:27 +01:00
Benjamin Admin	2ad391e4e4	feat: Feinabstimmung mit 7 Schiebereglern fuer Deskew/Dewarp CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 27s Details CI / test-go-edu-search (push) Successful in 28s Details CI / test-python-klausur (push) Failing after 2m1s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 18s Details Neues aufklappbares Panel unter Entzerrung mit individuellen Reglern: - 3 Rotations-Regler (P1 Iterative, P2 Word-Alignment, P3 Textline) - 4 Scherungs-Regler (A-D Methoden) mit Radio-Auswahl - Kombinierte Vorschau und Ground-Truth-Speicherung - Backend: POST /sessions/{id}/adjust-combined Endpoint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 18:22:33 +01:00
Benjamin Admin	d39d249daa	feat: add pass 3 text-line regression to deskew pipeline CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 24s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m53s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 15s Details After iterative projection (pass 1) and word-alignment (pass 2), a third pass uses Tesseract word positions + linear regression per text line to measure and correct residual rotation. This catches cases where passes 1-2 leave significant slope (e.g. 1.7° residual on heavily skewed scans). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 17:53:11 +01:00
Benjamin Admin	538d5c732e	feat: two-pass deskew with wider angle range and residual correction CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 24s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m52s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 16s Details - Increase iterative deskew coarse_range from ±2° to ±5° to handle heavily skewed scans - New deskew_two_pass(): runs iterative projection first, then word-alignment on the corrected image to detect/fix residual skew (applied when residual ≥ 0.3°) - OCR pipeline API auto_deskew now uses deskew_two_pass by default - Vocab worksheet _run_ocr_pipeline_for_page uses deskew_two_pass - Deskew result now includes angle_residual and two_pass_debug Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 17:34:57 +01:00
Benjamin Admin	b7ae36e92b	feat: use OCR pipeline instead of LLM vision for vocab worksheet extraction CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m52s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 17s Details process-single-page now runs the full CV pipeline (deskew → dewarp → columns → rows → cell-first OCR v2 → LLM review) for much better extraction quality. Falls back to LLM vision if pipeline imports are unavailable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 15:35:44 +01:00
Benjamin Admin	b8a9493310	fix: deskew iterative — use vertical Sobel edges + vertical projection CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 28s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m54s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 16s Details Horizontal projection of binary image is insensitive at 0.5° because text rows look nearly identical. The real discriminator is vertical edge alignment: at the correct angle, word left-edges and column borders become truly vertical, producing sharp peaks in the vertical projection of Sobel-X edges. Also: BORDER_REPLICATE + trim to avoid artifacts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 14:23:43 +01:00
Benjamin Admin	68a6b97654	fix: use gradient score instead of variance for iterative deskew CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m46s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 17s Details Variance is insensitive to 0.5° differences. Gradient score (L2 norm of first derivative) detects sharp text-line transitions much better. Also: use horizontal profile in both phases, finer coarse step (0.1°). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 14:11:19 +01:00
Benjamin Admin	af1b12c97d	feat: iterative projection-profile deskew (2-phase variance optimization) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m53s Details CI / test-python-agent-core (push) Successful in 18s Details CI / test-nodejs-website (push) Successful in 17s Details Adds deskew_image_iterative() as 3rd deskew method that directly optimizes for projection-profile sharpness instead of proxy signals (Hough/word alignment). Coarse sweep on horizontal profile, fine sweep on vertical profile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 13:46:44 +01:00
Benjamin Admin	770aea611f	fix: correct example field (fixes iberqueren), disable cell-level bold CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m50s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 17s Details - Add "example" to spell correction loop — was only correcting "english" and "german" fields, missing umlauts in example sentences - Use "german" language for example field (mixed-language, umlauts needed) - Disable cell-level bold detection — cannot distinguish bold from non-bold in mixed-format cells (e.g. "cookie ['kuki]") - Keep _measure_stroke_width and _classify_bold_cells for future word-level bold detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 13:15:59 +01:00
Benjamin Admin	1a2efbf075	fix: relative bold detection (page median), fix save/finish buttons CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 27s Details CI / test-go-edu-search (push) Successful in 29s Details CI / test-python-klausur (push) Failing after 2m3s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 21s Details Bold detection: - Replace absolute threshold with page-level relative comparison - Measure stroke width for all cells, then mark cells >1.4× median as bold - Adapts automatically to font, DPI and scan quality Save buttons: - Fix status stuck on 'error' preventing re-click - Better error messages with response body - Fallback score to 0 when null Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 13:02:16 +01:00
Benjamin Admin	cd12755da6	feat: OCR umlaut confusion correction + bold detection via stroke-width CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 27s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 2m39s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 18s Details - Add umlaut confusion rules (i→ü, a→ä, o→ö, u→ü) to _spell_fix_token for German text — fixes "iberqueren" → "überqueren" etc. - Add _detect_bold() using OpenCV stroke-width analysis on cell crops - Integrate bold detection in both narrow (cell-crop) and broad (word-lookup) paths - Add is_bold field to GridCell TypeScript interface - Render bold text in StepGroundTruth reconstruction view Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 12:06:57 +01:00
Benjamin Admin	1cc69d6b5e	feat: OCR pipeline step 8 — validation view with image detection & generation CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 2m4s Details CI / test-python-agent-core (push) Successful in 19s Details CI / test-nodejs-website (push) Successful in 19s Details Replaces the stub StepGroundTruth with a full side-by-side Original vs Reconstruction view. Adds VLM-based image region detection (qwen2.5vl), mflux image generation proxy, sync scroll/zoom, manual region drawing, and score/notes persistence. New backend endpoints: detect-images, generate-image, validate, get validation. New standalone mflux-service (scripts/mflux-service.py) for Metal GPU generation. Dockerfile.base: adds fonts-liberation (Apache-2.0). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 10:40:37 +01:00
Benjamin Admin	293e7914d8	feat: improved OCR pipeline session manager with categories, thumbnails, pipeline logging CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 39s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m48s Details CI / test-python-agent-core (push) Successful in 17s Details CI / test-nodejs-website (push) Successful in 20s Details - Add document_category (10 types) and pipeline_log JSONB columns - Session list: thumbnails, copyable IDs, category/doc_type badges - Inline category dropdown, bulk delete, pipeline step logging - New endpoints: thumbnail, delete-all, pipeline-log, categories - Cleared all 22 old test sessions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 09:44:38 +01:00
Benjamin Admin	a58dfca1d8	fix: move char-confusion fix to correction step, add spell + page-ref corrections CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 29s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m55s Details CI / test-python-agent-core (push) Successful in 30s Details CI / test-nodejs-website (push) Successful in 20s Details CI / nodejs-lint (push) Failing after 10m5s Details - Remove _fix_character_confusion() from words endpoint (now only in Phase 0) - Extend spell checker to find real OCR errors via spell.correction() - Add field-aware dictionary selection (EN/DE) for spell corrections - Add _normalize_page_ref() for page_ref column (p-60 → p.60) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 00:26:13 +01:00
Benjamin Admin	fd99d4f875	cleanup: remove sheet-specific code, reduce logging, document constants CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m59s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 17s Details Genericity audit findings: - Remove German prefixes from _GRAMMAR_BRACKET_WORDS (only English field is processed, German prefixes were unreachable dead code) - Move _IPA_CHARS and _MIN_WORD_CONF to module-level constants - Document _NARROW_COL_THRESHOLD_PCT with empirical rationale - Document _PAD=3 with DPI context - Document _PHONETIC_BRACKET_RE intentional mixed-bracket matching - Reduce all diagnostic logger.info() to logger.debug() in: _ocr_cell_crop, _replace_phonetics_in_text, _fix_phonetic_brackets - Keep only summary-level info logging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 00:04:02 +01:00
Benjamin Admin	1e0c6bb4b5	feat: hybrid OCR — full-page for broad columns, cell-crop for narrow Fundamentally rearchitect build_cell_grid_v2 to combine the best of both approaches: - Broad columns (>15% image width): Use full-page Tesseract word assignment. Handles IPA brackets, punctuation, sentence flow, and ellipsis correctly. No garbled phonetics. - Narrow columns (<15% image width): Use isolated cell-crop OCR to prevent neighbour bleeding from adjacent broad columns. This eliminates the need for complex phonetic bracket replacement on broad columns since full-page Tesseract reads them correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:38:44 +01:00
Benjamin Admin	e6dc3fcdd7	fix: only replace phonetics in english field, fix grammar detection - Only process 'english' field for IPA replacement. German and example fields contain meaningful parenthetical content like (gefrorenes Wasser), (sich beschweren) that must never be replaced. - Simplify _is_grammar_bracket_content: only known grammar particles (with, about/of, sth, etc.) are preserved. Removes the >= 4 chars heuristic that incorrectly preserved garbled IPA like [breik], [maus]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:19:03 +01:00
Benjamin Admin	edbdac3203	fix: improve phonetic bracket replacement logic - Replace _is_meaningful_bracket_content with _is_grammar_bracket_content that uses a whitelist of grammar particles (with, about/of, auf, etc.) - Check IPA dictionary FIRST: if word has IPA, treat brackets as phonetic - Strip orphan brackets (no word before them) that are garbled IPA - Preserve correct IPA (contains Unicode IPA chars) and grammar info - Fix variable name bug (result → text) Fixes: break [breik] now correctly replaced, cross (with) preserved, orphan [mais] and {'mani setva] stripped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:13:34 +01:00
Benjamin Admin	99573a46ef	debug: add phonetic bracket replacement logging	2026-03-04 23:01:01 +01:00
Benjamin Admin	6ad4b84584	fix: broaden phonetic bracket regex to catch Tesseract-garbled IPA CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 26s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 1m52s Details CI / test-python-agent-core (push) Successful in 16s Details CI / test-nodejs-website (push) Successful in 16s Details Tesseract mangles IPA square brackets into curly braces or parentheses (e.g. China [ˈtʃaɪnə] → China {'tfatno]). The previous regex only matched [...], missing all garbled variants. - Match any bracket type: [...], {...}, (...) including mixed pairs - Add _is_meaningful_bracket_content() to preserve legitimate German prefixes like (zer)brechen and Tanz(veranstaltung) - Trigger IPA replacement on any bracket character, not just [ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 22:53:50 +01:00

1 2 3 4

172 Commits