Commit Graph

11 Commits

Author SHA1 Message Date
Benjamin Admin
685d135be5 fix: downscale large images before PaddleOCR (Traefik 60s limit)
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Bilder > 1500px werden vor dem Upload verkleinert. Koordinaten
werden zurueckskaliert. JPEG statt PNG fuer schnelleren Upload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:28:58 +01:00
Benjamin Admin
a6069631cc feat: PaddleOCR Remote-Engine (PP-OCRv5 Latin auf Hetzner x86_64)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m7s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 21s
PaddleOCR als neue engine=paddle Option in der OCR-Pipeline.
Microservice auf Hetzner (paddleocr-service/), async HTTP-Client
(paddleocr_remote.py), Frontend-Dropdown, automatisch words_first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:31:22 +01:00
Benjamin Admin
2e21a4b6d0 fix: IPA nur einfügen wenn word_boxes Gap >80px zeigen (kein falsches IPA)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 55s
CI / test-go-edu-search (push) Successful in 48s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 26s
_has_ipa_gap() prüft ob Tesseract eine IPA-Klammer übersehen hat anhand
des physischen Abstands zwischen Headword und nächstem Wort. Ohne Gap
(z.B. "be good at sth.", "Focus on language") wird kein IPA eingefügt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:40:18 +01:00
Benjamin Admin
d98dba9098 fix: Headword-IPA auch in langen column_text Zeilen einfuegen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 53s
CI / test-go-edu-search (push) Successful in 49s
CI / test-python-klausur (push) Failing after 2m14s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 23s
_insert_missing_ipa ueberspringe Texte mit >6 Woertern oder Klammern.
Neue _insert_headword_ipa fuer column_text: prueft nur das erste Wort
der Zeile, unabhaengig von Textlaenge oder vorhandenen Klammern.

Ausserdem _sync_word_boxes_after_ipa_insert gefixt: Token-Vergleich
nutzt jetzt paralleles Durchlaufen statt zip (verschobene Positionen).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:25:38 +01:00
Benjamin Admin
cd13eca290 fix: IPA-Einfuegung fuer column_text mit word_boxes Synchronisation
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 32s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 20s
Fuer column_text werden fehlende IPA-Lautschriften (challenge, profit,
film, badge) wieder eingefuegt, aber gleichzeitig eine synthetische
word_box erzeugt, damit die 1:1 Token-zu-Box Zuordnung im Overlay
erhalten bleibt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:15:26 +01:00
Benjamin Admin
aa7db43f02 fix: column_text nur garbled IPA ersetzen, keine Einfuegung/Entfernung
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
Fuer column_text (Full-Page Overlay mit gemischtem EN+DE Text):
- Kein IPA einfuegen (wuerde Token-Count aendern, Overlay-Positionen brechen)
- Keine orphan brackets entfernen (sind oft deutsche Bedeutungen wie (probieren))
- Nur garbled IPA ersetzen (z.B. [teıst] -> [tˈeɪst])

column_en behaelt volle Verarbeitung (replace + strip + insert).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:05:37 +01:00
Benjamin Admin
4afd5bd8e8 fix: Klammerwörter wie (probieren), (Profit) nicht mehr als garbled IPA entfernen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 50s
CI / test-go-edu-search (push) Successful in 45s
CI / test-python-klausur (push) Failing after 2m12s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 27s
_strip_orphan_bracket entfernte deutsche Bedeutungsangaben in Klammern,
weil sie weder als Grammar-Partikel noch als IPA erkannt wurden.
Fix: Klammerinhalte mit echten Wörtern (>=4 Buchstaben) werden behalten.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 22:47:01 +01:00
Benjamin Admin
2f51ac617f feat: IPA-Lautschrift in Cell-Texte einfuegen (fuer Overlay-Modus)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 22s
fix_cell_phonetics() ersetzt fehlerhafte IPA-Klammern UND fuegt fehlende
Lautschrift fuer englische Woerter ein (z.B. badge, film, challenge, profit).
Wird auf alle Zellen mit col_type column_en/column_text angewandt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 15:47:26 +01:00
Benjamin Admin
23b7840ea7 feat: Full-Row OCR mit Spacing fuer Box-Sub-Sessions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 40s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m16s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 22s
Sub-Sessions ueberspringen Spaltenerkennung und nutzen stattdessen eine
Pseudo-Spalte ueber die volle Breite. Text wird mit proportionalem
Spacing aus Wort-Positionen rekonstruiert, um raeumliches Layout zu erhalten.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 08:28:29 +01:00
Benjamin Admin
cf9dde9876 fix: _group_words_into_lines nach cv_ocr_engines.py verschieben
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m4s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 21s
Funktion war nur in cv_review.py definiert, wurde aber auch in
cv_ocr_engines.py und cv_layout.py benutzt — NameError zur Laufzeit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 15:24:56 +01:00
Benjamin Admin
9a5a35bff1 refactor: cv_vocab_pipeline.py in 6 Module aufteilen (8163 → 6 + Fassade)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Monolithische 8163-Zeilen-Datei aufgeteilt in fokussierte Module:
- cv_vocab_types.py (156 Z.): Dataklassen, Konstanten, IPA, Feature-Flags
- cv_preprocessing.py (1166 Z.): Bild-I/O, Orientierung, Deskew, Dewarp
- cv_layout.py (3036 Z.): Dokumenttyp, Spalten, Zeilen, Klassifikation
- cv_ocr_engines.py (1282 Z.): OCR-Engines, Vocab-Postprocessing, Text-Cleaning
- cv_cell_grid.py (1510 Z.): Cell-Grid v2+Legacy, Vocab-Konvertierung
- cv_review.py (1184 Z.): LLM/Spell Review, Pipeline-Orchestrierung

cv_vocab_pipeline.py ist jetzt eine Re-Export-Fassade (35 Z.) —
alle bestehenden Imports bleiben unveraendert.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 23:46:47 +01:00