Commit Graph

305 Commits

Author SHA1 Message Date
Benjamin Admin
d335a7bbf3 fix: use OCR word_box coordinates directly instead of fuzzy matching
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m6s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 25s
The slide positioning hook was re-matching cell.text tokens against
word_boxes via fuzzy text similarity, which broke positioning for
special characters (!, bullet points, IPA). Now uses word_box
coordinates directly — exact OCR positions without re-interpretation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:54:37 +01:00
Benjamin Admin
1f527fcd49 fix: split PaddleOCR boxes at leading ! for overlay word positioning
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
When PaddleOCR returns "!Betonung" as a single word box, the overlay
positions text starting at the "!" instead of the actual word. Split
such boxes into ["!", "Betonung"] with proportional position splitting,
matching the existing IPA bracket splitting logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:46:17 +01:00
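The proportional split described in commit 1f527fcd49 can be sketched in Python as follows. This is a minimal illustration, not the repository's code: the function name split_leading_bang and the box keys (text, left, top, width, height) are assumptions, and the width is simply divided evenly per character.

```python
def split_leading_bang(box):
    """Split a box such as "!Betonung" into ["!", "Betonung"] (hypothetical helper)."""
    text = box["text"]
    if len(text) < 2 or not text.startswith("!"):
        return [box]
    # Assumption: every character occupies the same share of the box width.
    per_char = box["width"] / len(text)
    bang = {**box, "text": "!", "width": per_char}
    rest = {
        **box,
        "text": text[1:],
        "left": box["left"] + per_char,
        "width": box["width"] - per_char,
    }
    return [bang, rest]


parts = split_leading_bang(
    {"text": "!Betonung", "left": 100, "top": 10, "width": 90, "height": 20}
)
```

With this split the overlay can anchor the word at the second box instead of at the "!".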
Benjamin Admin
8349c28f54 fix: paddle_direct reuses build_grid_from_words for correct overlay
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 35s
CI / test-python-klausur (push) Failing after 2m22s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 23s
Replaces custom _paddle_words_to_grid_cells with the proven
build_grid_from_words from cv_words_first.py — same function the
regular pipeline uses with PaddleOCR. Handles phrase splitting,
column clustering, and produces cells with word_boxes that the
slide/cluster positioning hooks expect.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:19:52 +01:00
Benjamin Admin
71a1b5f058 fix: paddle_direct groups words per row (matching _build_cells format)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 34s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 24s
One cell per row with all words as word_boxes instead of one cell per
word. Gives OverlayReconstruction a row-spanning bbox_pct for correct
font sizing and per-word positions for slide/cluster placement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:10:10 +01:00
Benjamin Admin
c743a38eaf fix: Paddle Direct keeps preprocessing (orient/deskew/dewarp/crop)
Some checks failed
CI / nodejs-lint (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Uses the cropped/dewarped image instead of the original so the overlay
shows the correctly oriented page. 5 steps instead of 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:56:18 +01:00
Benjamin Admin
90c1efd9b0 feat: Paddle Direct — 1-click OCR without deskew/dewarp/crop
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing
7-step pipeline. Backend endpoint runs PaddleOCR on the original image
and clusters words into rows/cells directly. Frontend adds a mode
toggle and PaddleDirectStep component.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:41:55 +01:00
Benjamin Admin
06d63d18f9 fix: generic fuzzy text matching for overlay word-box positioning
Some checks failed
CI / test-go-edu-search (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Replace sequential 1:1 token-to-box mapping with fuzzy text matching.
Each token from cell.text finds its best matching word_box by text
similarity (normalized prefix match + substring bonus). Handles:
- Reordered boxes (different sort between text and boxes)
- IPA corrections changing token boundaries
- Token/box count mismatches
Unmatched tokens get interpolated positions from matched neighbors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:19:19 +01:00
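A rough Python sketch of the matching idea in commit 06d63d18f9 (the actual hook lives in the frontend and may differ). The scoring weights, the threshold, and the names normalize, similarity and match_tokens are assumptions chosen for illustration; interpolation of unmatched tokens is left out.

```python
import re


def normalize(s):
    # Lowercase and drop punctuation so "Badge," and "badge" compare equal.
    return re.sub(r"[\W_]", "", s.lower())


def similarity(token, box_text):
    a, b = normalize(token), normalize(box_text)
    if not a or not b:
        return 0.0
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        prefix += 1
    score = prefix / max(len(a), len(b))  # normalized prefix match
    if a in b or b in a:
        score += 0.5                      # substring bonus (assumed weight)
    return score


def match_tokens(tokens, box_texts, threshold=0.3):
    """Return, per token, the index of its best unused box or None."""
    used = set()
    result = []
    for tok in tokens:
        best, best_score = None, threshold
        for i, bt in enumerate(box_texts):
            if i in used:
                continue
            s = similarity(tok, bt)
            if s > best_score:
                best, best_score = i, s
        if best is not None:
            used.add(best)
        result.append(best)
    return result
```

Tokens that come back as None are the ones the commit message says get interpolated positions from their matched neighbors.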
Benjamin Admin
3e65b14b83 fix: split PaddleOCR boxes at IPA brackets for overlay positioning
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer
produces "badge [bˈædʒ]" with space, creating a token count mismatch
between cell.text and word_boxes. Now also split at "[" boundaries
so each IPA bracket gets its own sub-box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:08:17 +01:00
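The token count mismatch from commit 3e65b14b83 can be reproduced with a few lines of Python. The helper name split_text_at_bracket is hypothetical, and only the text is split here; the sub-box geometry is not modelled.

```python
def split_text_at_bracket(text):
    """Split "badge[bxd3]" into ["badge", "[bxd3]"] (hypothetical helper)."""
    idx = text.find("[")
    if idx <= 0:  # no bracket, or the bracket already starts the text
        return [text]
    return [text[:idx], text[idx:]]


cell_tokens = "badge [bˈædʒ]".split()  # text after the IPA fixer: 2 tokens
raw_boxes = ["badge[bxd3]"]            # text of the OCR boxes: 1 box
split_boxes = [p for t in raw_boxes for p in split_text_at_bracket(t)]
```

After the split, the number of boxes matches the number of tokens in cell.text again.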
Benjamin Admin
40ac593d28 fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide
Some checks failed
CI / test-nodejs-website (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
PaddleOCR returns phrase-level bounding boxes (e.g. "competition
[kompa'tifn]" as one box) but the overlay slide mechanism expects
one box per word for accurate positioning. Multi-word boxes are now
split proportionally by character count with small gaps between words.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:00:06 +01:00
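A minimal Python sketch of the character-proportional split from commit 40ac593d28. The function name split_phrase_box, the box keys and the gap_px default are assumptions; the commit only states that the split is proportional to character count with small gaps.

```python
def split_phrase_box(box, gap_px=2.0):
    """Split a multi-word box into one box per word (hypothetical helper)."""
    words = box["text"].split()
    if len(words) <= 1:
        return [box]
    total_chars = sum(len(w) for w in words)
    usable = box["width"] - gap_px * (len(words) - 1)
    if usable <= 0:
        return [box]  # too narrow to split meaningfully
    result = []
    x = float(box["left"])
    for w in words:
        width = usable * len(w) / total_chars  # share by character count
        result.append({**box, "text": w, "left": x, "width": width})
        x += width + gap_px
    return result


boxes = split_phrase_box(
    {"text": "ab cdef", "left": 0, "top": 0, "width": 62, "height": 20}
)
```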
Benjamin Admin
ea69239e06 fix: word_boxes in words_first use absolute pixels (consistent with v2 grid)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 39s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m21s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 33s
words_first was storing word_boxes in percent coordinates while
cv_cell_grid.py uses absolute pixel coordinates. The overlay slide
mechanism divides by imgW to get percentages, so percent-in-percent
caused positions near zero. Now both grid builders use the same format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:04:04 +01:00
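The "percent-in-percent" effect from commit ea69239e06 is plain arithmetic. The sketch below uses made-up numbers (a 2000px wide image and a word at 500px) and a hypothetical helper name.

```python
def left_percent(left_px, img_w):
    # What the overlay does: convert an absolute pixel offset to percent.
    return left_px / img_w * 100.0


img_w = 2000
correct = left_percent(500, img_w)     # box stored in pixels
wrong = left_percent(correct, img_w)   # box already stored in percent
```

Feeding an already-converted percentage through the same conversion shrinks it by another factor of img_w / 100, which is why the words ended up near the left edge.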
Benjamin Admin
bb90d1ba94 fix: PaddleOCR engine forces words_first in frontend to match backend
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
When engine=paddle is selected, the backend overrides grid_method to
words_first and returns plain JSON (no SSE streaming). The frontend
was not aware of this override — it sent stream=true and tried to parse
SSE events from a JSON response, resulting in the "Keine Daten" ("no data") message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:52:18 +01:00
Benjamin Admin
685d135be5 fix: downscale large images before PaddleOCR (Traefik 60s limit)
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Images larger than 1500px are downscaled before upload. Coordinates
are scaled back afterwards. JPEG instead of PNG for faster upload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:28:58 +01:00
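The downscale and scale-back step from commit 685d135be5 might look roughly like this in Python. It is a sketch under assumptions: the 1500px limit is applied to the longest side, the function names are hypothetical, and actual image resizing and JPEG encoding are omitted.

```python
def downscale_factor(width, height, max_side=1500):
    """Factor to apply before upload; 1.0 means no resizing."""
    longest = max(width, height)
    return 1.0 if longest <= max_side else max_side / longest


def scale_box_back(box, factor):
    """Map a box found on the downscaled image back to original pixels."""
    geometry = ("left", "top", "width", "height")
    return {k: (v / factor if k in geometry else v) for k, v in box.items()}


factor = downscale_factor(3000, 2000)
original_box = scale_box_back(
    {"text": "a", "left": 100, "top": 50, "width": 40, "height": 10}, factor
)
```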
Benjamin Admin
e2c2acdf86 fix: increase PaddleOCR remote timeout to 120s for large scans
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m14s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 24s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 13:41:39 +01:00
Benjamin Admin
3cc496f7f3 feat(rag): Update consumer protection (Verbraucherschutz) docs + chunk counts + Landkarte
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Failing after 14s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 22s
- Update chunk counts for 8 successfully ingested DE laws (Phase H1)
- Add 6 new BGB-Teile entries (AGB, Fernabsatz, Kaufrecht, Widerruf, Digital)
- Add EGBGB Widerrufsbelehrung entry
- Update COLLECTION_TOTALS: gesetze 58304→63567 (+5263 Phase H chunks)
- Add Verbraucherschutz thematic group to Landkarte
- Extend ecommerce industry map with consumer protection regulations
- Update date to March 2026

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:54:20 +01:00
Benjamin Admin
a6069631cc feat: PaddleOCR remote engine (PP-OCRv5 Latin on Hetzner x86_64)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m7s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 21s
PaddleOCR as a new engine=paddle option in the OCR pipeline.
Microservice on Hetzner (paddleocr-service/), async HTTP client
(paddleocr_remote.py), frontend dropdown, automatically uses words_first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:31:22 +01:00
Benjamin Admin
ced5bb3dd3 feat: Words-First Grid Builder (bottom-up alternative to cell_grid_v2)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 54s
CI / test-go-edu-search (push) Successful in 47s
CI / test-python-klausur (push) Failing after 2m31s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 32s
New algorithm in cv_words_first.py: clusters Tesseract word_boxes
directly into columns (X-gap) and rows (Y-proximity), and builds cells at
their intersections. No separate column/row detection needed.

- cv_words_first.py: _cluster_columns, _cluster_rows, _build_cells, build_grid_from_words
- ocr_pipeline_api.py: grid_method parameter (v2|words_first) in the /words endpoint
- StepWordRecognition.tsx: dropdown toggle for the grid method
- OCR-Pipeline.md: docs v4.3.0 with the Words-First algorithm
- 15 unit tests for cv_words_first

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 06:46:05 +01:00
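The bottom-up idea behind commit ced5bb3dd3 (columns from X-gaps, rows from Y-proximity, cells at the intersections) can be sketched in Python as below. This is not the code in cv_words_first.py: the function names, the thresholds min_gap and y_tol, and the box keys are all assumptions.

```python
def center_y(w):
    return w["top"] + w["height"] / 2


def sketch_cluster_columns(words, min_gap):
    """Merge horizontal extents; a gap of at least min_gap starts a new column."""
    cols = []  # list of [left, right]
    for w in sorted(words, key=lambda w: w["left"]):
        left, right = w["left"], w["left"] + w["width"]
        if cols and left - cols[-1][1] < min_gap:
            cols[-1][1] = max(cols[-1][1], right)
        else:
            cols.append([left, right])
    return cols


def sketch_cluster_rows(words, y_tol):
    """Group words whose vertical centers lie within y_tol of the previous word."""
    rows = []
    for w in sorted(words, key=center_y):
        if rows and center_y(w) - center_y(rows[-1][-1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def sketch_build_grid(words, min_gap=100, y_tol=10):
    """Return {(row, col): text} for every non-empty intersection."""
    cols = sketch_cluster_columns(words, min_gap)
    grid = {}
    for r, row in enumerate(sketch_cluster_rows(words, y_tol)):
        for w in sorted(row, key=lambda w: w["left"]):
            c = next(i for i, (lo, hi) in enumerate(cols) if lo <= w["left"] <= hi)
            grid.setdefault((r, c), []).append(w["text"])
    return {key: " ".join(texts) for key, texts in grid.items()}


words = [
    {"text": "apple", "left": 10, "top": 10, "width": 50, "height": 20},
    {"text": "Apfel", "left": 300, "top": 12, "width": 50, "height": 20},
    {"text": "pear", "left": 10, "top": 60, "width": 50, "height": 20},
    {"text": "Birne", "left": 300, "top": 61, "width": 50, "height": 20},
]
grid = sketch_build_grid(words)
```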
Benjamin Admin
2fdf3ff868 feat(rag): Register consumer protection (Verbraucherschutz) laws + EU directives in RAG constants
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 46s
CI / test-go-edu-search (push) Successful in 33s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
Add 15 new regulations from Phase H ingestion:
- DE: PAngV, VSBG, ProdHaftG, VerpackG, ElektroG, BattDG, BFSG, UWG, GewO
- EU: Warenkauf-RL, Klausel-RL, UGP-RL, Preisangaben-RL, Omnibus-RL, BattVO

Chunk counts set to 0 (will be updated after successful ingestion).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 06:43:19 +01:00
Benjamin Admin
2e21a4b6d0 fix: insert IPA only when word_boxes show a gap >80px (no false IPA)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 55s
CI / test-go-edu-search (push) Successful in 48s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 26s
_has_ipa_gap() checks whether Tesseract missed an IPA bracket, based on
the physical distance between the headword and the next word. Without a gap
(e.g. "be good at sth.", "Focus on language") no IPA is inserted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:40:18 +01:00
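A gap check in the spirit of commit 2e21a4b6d0 could look like the following Python sketch. The real _has_ipa_gap() may differ; the function name has_ipa_gap_sketch, the box keys, and the assumption that the headword is the first box in reading order are mine.

```python
def has_ipa_gap_sketch(word_boxes, min_gap_px=80):
    """True if the space after the headword is wide enough to hide a missed IPA bracket."""
    if len(word_boxes) < 2:
        return False
    boxes = sorted(word_boxes, key=lambda b: b["left"])
    head, nxt = boxes[0], boxes[1]
    gap = nxt["left"] - (head["left"] + head["width"])
    return gap > min_gap_px


with_gap = [
    {"text": "badge", "left": 10, "width": 60},
    {"text": "Abzeichen", "left": 200, "width": 90},
]
without_gap = [
    {"text": "be", "left": 10, "width": 20},
    {"text": "good", "left": 38, "width": 40},
]
```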
Benjamin Admin
d98dba9098 fix: insert headword IPA in long column_text lines as well
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 53s
CI / test-go-edu-search (push) Successful in 49s
CI / test-python-klausur (push) Failing after 2m14s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 23s
_insert_missing_ipa skips texts with more than 6 words or with brackets.
New _insert_headword_ipa for column_text: checks only the first word
of the line, regardless of text length or existing brackets.

Also fixed _sync_word_boxes_after_ipa_insert: the token comparison
now uses a parallel traversal instead of zip (shifted positions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:25:38 +01:00
Benjamin Admin
cd13eca290 fix: IPA insertion for column_text with word_boxes synchronization
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 32s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 20s
For column_text, missing IPA transcriptions (challenge, profit,
film, badge) are inserted again, but a synthetic word_box is created
at the same time so that the 1:1 token-to-box mapping in the overlay
is preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:15:26 +01:00
Benjamin Admin
aa7db43f02 fix: column_text only replaces garbled IPA, no insertion/removal
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
For column_text (full-page overlay with mixed EN+DE text):
- Do not insert IPA (would change the token count and break overlay positions)
- Do not remove orphan brackets (they are often German meanings such as (probieren))
- Only replace garbled IPA (e.g. [teıst] -> [tˈeɪst])

column_en keeps full processing (replace + strip + insert).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:05:37 +01:00
Benjamin Admin
4afd5bd8e8 fix: no longer remove bracketed words such as (probieren), (Profit) as garbled IPA
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 50s
CI / test-go-edu-search (push) Successful in 45s
CI / test-python-klausur (push) Failing after 2m12s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 27s
_strip_orphan_bracket removed German meanings given in brackets
because they were recognized neither as grammar particles nor as IPA.
Fix: bracket contents with real words (>=4 letters) are kept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 22:47:01 +01:00
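One way to approximate the "real word with at least 4 letters" rule from commit 4afd5bd8e8 is a plain letter-run check, sketched here in Python. How the repository decides what counts as a real word is not stated; the function name and the restriction to the German alphabet are assumptions made for this example.

```python
import re

# Assumption: a "real word" is a run of at least 4 German-alphabet letters.
REAL_WORD = re.compile(r"[A-Za-zÄÖÜäöüß]{4,}")


def contains_real_word(bracket_content):
    """True if the bracket content should be kept rather than stripped."""
    return REAL_WORD.search(bracket_content) is not None
```

Under this assumption, contents such as "probieren" or "Profit" are kept, while garbled phonetics containing non-alphabet symbols (for example "teıst" with a dotless ı) do not qualify.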
Benjamin Admin
7d19145edb fix: store word_boxes for wide columns (full-page OCR) as well
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 21s
word_boxes were only set in the cell-crop path (narrow columns),
but not in the full-page word-assignment path (broad columns).
The Tesseract word coordinates are now stored in both paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:41:29 +01:00
Benjamin Admin
35f2706098 fix: slide mode uses cell.text tokens instead of word_boxes text (no words lost)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 22s
TEXT comes from cell.text (cleaned, IPA-corrected).
POSITIONS come from word_boxes (exact OCR coordinates).
Tokens are assigned 1:1 in reading order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:01:57 +01:00
Benjamin Admin
0ee92e7210 feat: OCR word_boxes for pixel-accurate overlay positioning
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 32s
CI / test-python-klausur (push) Failing after 2m10s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 20s
Backend: _ocr_cell_crop now stores word_boxes with exact
Tesseract/RapidOCR word coordinates (left, top, width, height)
in the cell result. Absolute image coordinates, already mapped back.

Frontend: the slide hook uses word_boxes directly when present, so
each word is placed exactly at its OCR position. No
pixel scanning needed. Falls back to the old slide when there are no boxes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 19:39:49 +01:00
Benjamin Admin
4949863bd7 revert: back to single-word slide with fontRatio=1.0 fix
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Group sliding did not push far enough to the right. Back to the
original single-word slide, but with these fixes:
- fontRatio=1.0 (font size consistent with the fallback)
- Token widths from medianCh * 0.7 / refFontSize (instead of totalInk)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 19:15:52 +01:00
Benjamin Admin
efbe15f895 fix: switch slide mode to group-based sliding
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 23s
Before: split(/\s+/) broke everything into single words and lost the
column structure (3+ spaces between groups). Words piled up on the left.

Now: split(/\s{3,}/) preserves groups as in cluster mode. Each group
is slid as a unit from left to right until ink is found.
Width = max(measured text width, actual ink width).
fontRatio=1.0, no word is lost.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:31:17 +01:00
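The difference between the two split patterns in commit efbe15f895 is easy to see on a sample line. The commit's code is JavaScript; the Python equivalent below uses re.split with the same patterns, and the sample text is made up.

```python
import re

line = "be good at    gut sein in"  # 4 spaces separate the two column groups

single_words = re.split(r"\s+", line)       # splits on every whitespace run
column_groups = re.split(r"\s{3,}", line)   # splits only on runs of 3+ spaces
```

The first pattern flattens the line into six independent words, while the second keeps "be good at" and "gut sein in" as two units that can be slid separately.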
Benjamin Admin
c3da131129 fix: slide fontRatio=1.0 and token width from rendered font size
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
fontRatio was 0.65 (35% smaller than fallback rendering). Now 1.0,
as in the fallback. Token widths are computed from measureText scaled
to the actually rendered font size (medianCh * 0.7).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 17:59:31 +01:00
Benjamin Admin
b81baa1d16 fix: slide mode uses a global font size instead of per-token scale
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 25s
The font size is now computed GLOBALLY from the median cell height
(65% of the cell height as target font). All tokens get the same
consistent size. The slide logic now only determines the x-position.

Before: scale per cell from ink span / text width -> inconsistent sizes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 16:51:55 +01:00
Benjamin Admin
2010cab894 fix: slide mode scale calculation based on ink span instead of ink count
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 36s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 24s
CI / test-nodejs-website (push) Successful in 31s
totalInk counted only dark pixel columns (strokes) and ignored
gaps between letters. The scale was therefore far too small and
the text unreadable. Now the ink span (first to last
dark pixel) is used as the reference for the text width.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 16:41:38 +01:00
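The distinction between ink count and ink span in commit 2010cab894 can be shown on a toy column projection. This Python sketch uses hypothetical function names and a made-up projection in which each entry is the number of dark pixels in one pixel column.

```python
def ink_count(projection):
    """Number of pixel columns containing any ink (what totalInk measured)."""
    return sum(1 for v in projection if v > 0)


def ink_span(projection):
    """Width from the first to the last inked column, gaps included."""
    inked = [i for i, v in enumerate(projection) if v > 0]
    return inked[-1] - inked[0] + 1 if inked else 0


projection = [0, 3, 0, 0, 2, 0, 4, 0]
```

Here only 3 columns carry ink, but the text actually occupies 6 columns, so a scale derived from the count would come out too small.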
Benjamin Admin
bc13978bc1 feat: slide mode as alternative word positioning in the overlay
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 24s
New hook useSlideWordPositions: slides all recognized words from left
to right across the pixel projection until each word snaps onto its
ink. No word is lost, no cluster-matching rules needed.

Toggle button (Slide/Cluster) in the overlay toolbar for switching.
The existing cluster algorithm remains available as an alternative.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 16:13:31 +01:00
Benjamin Admin
2f51ac617f feat: insert IPA transcription into cell texts (for overlay mode)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 22s
fix_cell_phonetics() replaces faulty IPA brackets AND inserts missing
transcriptions for English words (e.g. badge, film, challenge, profit).
Applied to all cells with col_type column_en/column_text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 15:47:26 +01:00
Benjamin Admin
8a5f2aa188 fix: cluster assignment by width proportionality instead of position
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 36s
CI / test-go-edu-search (push) Successful in 36s
CI / test-python-klausur (push) Failing after 2m20s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 29s
Two main improvements:

1. Multi-group: groups are assigned to clusters by best-fit width
   instead of naively left to right. For example, "Kokosnuss" is now
   assigned to the DE column cluster instead of the
   wider box cluster.

2. Single-group fallback: uses the WIDEST cluster instead of the
   first-to-last span. This prevents stray pixels from neighboring
   page areas from pulling the text to the left.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 15:39:54 +01:00
Benjamin Admin
d182d87f26 fix: merge OCR artifacts (|, >) before cluster matching
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m23s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 22s
Box borders are recognized by the OCR as single symbols such as "|" or ">"
and treated as separate text groups. This distorts the
cluster assignment because these artifacts either produce no cluster
of their own or get assigned the wrong cluster.

Fix: groups of at most 2 characters without letters/digits are merged
with the neighboring group before the cluster assignment
runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 15:03:37 +01:00
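The artifact merge from commit d182d87f26 might be approximated in Python like this. The frontend code may differ; the function names are hypothetical, and the choice to attach an artifact to the previous group (or to the next one when it comes first) is an assumption, since the commit only says "neighboring group".

```python
def is_artifact(group):
    """At most 2 characters and no letters or digits, e.g. "|" or ">"."""
    return len(group) <= 2 and not any(ch.isalnum() for ch in group)


def merge_artifacts(groups):
    merged = []
    pending = []  # leading artifacts waiting for the first real group
    for g in groups:
        if not is_artifact(g):
            merged.append(" ".join(pending + [g]))
            pending = []
        elif merged:
            merged[-1] = merged[-1] + " " + g
        else:
            pending.append(g)
    if pending:  # the input consisted of artifacts only
        merged.append(" ".join(pending))
    return merged


result = merge_artifacts(["|", "coconut", ">", "Kokosnuss"])
```

After merging, the number of text groups reflects the real columns rather than stray border symbols.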
Benjamin Admin
87efc1b4ba fix: choose the N widest clusters when there are surplus clusters
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 20s
When there are more pixel clusters than text groups (e.g. because of
box border lines), the N widest clusters are now selected
instead of naively mapping clusters[i]→groups[i]. Text clusters are
wider than border-line clusters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 14:34:58 +01:00
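Selecting the N widest clusters, as described in commit 87efc1b4ba, can be sketched in Python as follows. The cluster representation as (start, end) pixel tuples and the function name are assumptions; restoring left-to-right order after the selection is also an assumption about how groups are then paired with clusters.

```python
def pick_widest_clusters(clusters, n):
    """Keep the n widest (start, end) clusters, returned in left-to-right order."""
    widest = sorted(clusters, key=lambda c: c[1] - c[0], reverse=True)[:n]
    return sorted(widest)


clusters = [(0, 2), (10, 60), (62, 64), (100, 150)]  # two thin border lines
selected = pick_widest_clusters(clusters, 2)
```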
Benjamin Admin
dd7087cd6d fix: no longer skip pixel analysis when clusters < groups
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 20s
Before: when the text had more word groups than pixel clusters were
found (e.g. box borders merging clusters together), the
cell was skipped entirely → fallback at x=0%.

Now: falls back to single-span positioning (first→last cluster)
instead of skipping. The text is thus always placed correctly horizontally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 10:14:58 +01:00
Benjamin Admin
7282a220d6 fix: move useMemo before early returns (Rules of Hooks)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 28s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 09:46:25 +01:00
Benjamin Admin
b5d5371f72 fix: uniform font size + border cluster filter in the overlay
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m24s
CI / test-python-agent-core (push) Successful in 25s
CI / test-nodejs-website (push) Successful in 25s
1. Font size is now based on the median row height instead of the
   individual cell height, so there are no size jumps in box areas
2. Very narrow pixel clusters (< 0.5% of cell width) are filtered out
   so that box borders are not detected as text positions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 09:34:41 +01:00
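Both changes in commit b5d5371f72 are small computations, sketched here in Python with hypothetical names. Only the median and the 0.5% threshold come from the commit message; how the median height is turned into a font size is not specified there and is left out.

```python
from statistics import median


def reference_row_height(row_heights):
    """One height for the whole page, robust against a few tall box rows."""
    return median(row_heights)


def drop_border_clusters(clusters, cell_width, min_fraction=0.005):
    """Discard (start, end) clusters narrower than 0.5% of the cell width."""
    min_width = min_fraction * cell_width
    return [c for c in clusters if (c[1] - c[0]) >= min_width]


height = reference_row_height([20, 22, 80])          # one unusually tall row
kept = drop_border_clusters([(0, 2), (10, 60)], cell_width=1000)
```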
Benjamin Admin
41e47baf13 fix: pass skip_heal_gaps parameter through to the stream generator
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m6s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 28s
Fixed NameError: skip_heal_gaps was not in scope in the
_word_batch_stream_generator function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 09:11:16 +01:00
Benjamin Admin
8a60f4bf30 fix: position overlay cells without _heal_row_gaps (skip_heal_gaps)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 36s
CI / test-go-edu-search (push) Successful in 35s
CI / test-python-klausur (push) Failing after 2m12s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 21s
_heal_row_gaps shifts cell positions after artifact rows are removed,
which causes a visible offset in the overlay (e.g. 23px for "badge").
A new skip_heal_gaps parameter in build_cell_grid_v2 and the words endpoint
preserves the exact row positions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 08:59:50 +01:00
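The effect of the skip_heal_gaps flag from 8a60f4bf30 can be illustrated with a small sketch. The real _heal_row_gaps and build_cell_grid_v2 are not shown in this log, so `heal_row_gaps_standin` and `build_rows` below are hypothetical stand-ins with an assumed row format (`y`, `h`, optional `artifact`); they only show how closing gaps moves rows away from their measured positions.

```python
def heal_row_gaps_standin(rows):
    """Hypothetical stand-in: close vertical gaps by stacking rows contiguously."""
    healed = []
    y = rows[0]["y"] if rows else 0
    for row in rows:
        healed.append({**row, "y": y})
        y += row["h"]
    return healed


def build_rows(rows, skip_heal_gaps=False):
    """Drop artifact rows; optionally keep the remaining rows at their exact positions."""
    kept = [row for row in rows if not row.get("artifact")]
    if skip_heal_gaps:
        return kept  # exact measured positions, as an overlay needs
    return heal_row_gaps_standin(kept)


rows = [
    {"y": 0, "h": 10},
    {"y": 10, "h": 20, "artifact": True},
    {"y": 30, "h": 10},
]
print(build_rows(rows)[1]["y"], build_rows(rows, skip_heal_gaps=True)[1]["y"])
```

In this toy input the second remaining row moves up by the height of the removed artifact row unless the flag is set.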
Benjamin Admin
e3ee1de790 Revert "fix: skip row regularization in overlay (generic for mixed content)"
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m2s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 24s
This reverts commit b91f799ccf.
2026-03-11 08:44:07 +01:00
Benjamin Admin
b91f799ccf fix: skip row regularization in overlay (generic for mixed content)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 49s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m21s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 26s
Pages with info boxes (different row height) cause _regularize_row_grid
to distort the row positions. A new skip_regularize parameter instead uses
the gap-based rows, which follow the actual page geometry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 08:29:06 +01:00
Benjamin Admin
2df2a01a8b feat: true overlay (text directly on top of the original image)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 33s
CI / test-go-edu-search (push) Successful in 36s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 25s
CI / test-nodejs-website (push) Successful in 26s
Instead of side-by-side, the recognized text is now laid directly over the
original image. Text color (red/blue/black) and opacity are adjustable
via slider for easy visual debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 00:25:11 +01:00
Benjamin Admin
e2ad93fd57 fix: allow word detection without columns (full-page pseudo-column)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m14s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 22s
If column_result is missing (e.g. in the OCR overlay pipeline), a single
full-page pseudo-column is created automatically instead of raising an error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 00:16:31 +01:00
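The fallback described in e2ad93fd57 might look roughly like the sketch below. Only the name column_result comes from the commit message; the function name `resolve_columns`, the dictionary layout, and the `type` label are assumptions for illustration.

```python
def resolve_columns(column_result, page_width, page_height):
    """Return detected columns, or one full-page pseudo-column if none exist.

    The column dictionary layout here is assumed, not taken from the project.
    """
    if column_result and column_result.get("columns"):
        return column_result["columns"]
    # No column detection ran (or it found nothing): treat the page as one column.
    return [{
        "x": 0,
        "y": 0,
        "width": page_width,
        "height": page_height,
        "type": "pseudo_full_page",
    }]
```

Downstream word detection can then iterate over columns unconditionally, whether or not a column step was part of the pipeline.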
Benjamin Admin
2cbdfc56f3 feat: OCR overlay (full-page reconstruction without column detection)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m6s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 28s
New route /ai/ocr-overlay with a simplified 7-step pipeline
(orientation, deskewing, dewarping, cropping, rows, words, overlay).
Reuses existing step components and skips columns/LLM review/ground truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 00:08:05 +01:00
Benjamin Admin
840918df2a fix: do not rotate the original image again in the overlay (orientation already applied in cropped image)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m15s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 22s
The cropped image is already orientation-corrected. The additional
180° rotation via imageRotation turned the image the wrong way round.
imageRotation is still used for pixel matching, but no longer
for displaying the image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 23:25:20 +01:00
Benjamin Admin
eb3fc05cdc fix: decide box-zone clamping by box middle instead of cell center
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 34s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
Euro/badge rows had their center inside the box zone, so the clamping
did not take effect. The box middle now decides whether a cell belongs
above the box (clamp height) or below it (push y).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 23:10:51 +01:00
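One possible reading of the clamping rule from eb3fc05cdc, sketched under assumptions: cells are dictionaries with `y` and `h`, the box zone is a vertical range, and a cell that overlaps the zone is assigned to the side of the box middle on which its own center lies. The function name `clamp_cell_to_box` is hypothetical, and the original logic lives in the overlay frontend rather than in Python.

```python
def clamp_cell_to_box(cell, box_top, box_bottom):
    """Return a copy of `cell` that no longer overlaps the box zone."""
    top = cell["y"]
    bottom = cell["y"] + cell["h"]
    if bottom <= box_top or top >= box_bottom:
        return dict(cell)  # no overlap, nothing to do

    box_middle = (box_top + box_bottom) / 2
    cell_center = (top + bottom) / 2
    if cell_center < box_middle:
        # Cell belongs above the box: limit its height to end at the box top.
        return {**cell, "h": max(0, box_top - top)}
    # Cell belongs below the box: push it down to start at the box bottom.
    return {**cell, "y": box_bottom}
```

A cell whose center lies just inside the zone is still resolved here, because the comparison is against the box middle and not against the zone boundaries.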
Benjamin Admin
9dbb5fa708 fix: move useMemo before early returns (Rules of Hooks)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m10s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 25s
The boxZonesPct useMemo was placed after conditional returns, which
violates React's Rules of Hooks and triggers a client-side crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 22:57:25 +01:00
Benjamin Admin
f468c30112 fix: clamp cells to box zone in overlay mode (no overlap)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m15s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 23s
Cells above the box are limited in height; cells below it
are shifted down. Sub-session cells remain unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 22:52:08 +01:00
Benjamin Admin
618c82ef42 fix: no longer cut off rows at the box boundary (border_thickness margin)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 35s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 25s
- detect_rows: content strips now use box_ranges_inner (shrunk by
  border_thickness, min 5px) instead of the full box range
- detect_words: the _row_in_box filter also uses the inner range
- As a result, the last row above a box is no longer wrongly
  assigned to the box and excluded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 17:44:02 +01:00
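The inner-range idea from 618c82ef42 can be sketched as below. This is an illustration under assumptions: "min 5px" is read as a minimum margin of 5 pixels, and both helper names and the center-based membership test are hypothetical, since the actual detect_rows and _row_in_box code is not part of this log.

```python
def inner_box_range(box_top, box_bottom, border_thickness, min_margin=5):
    """Shrink a vertical box range on both sides by the border thickness (at least min_margin)."""
    margin = max(border_thickness, min_margin)
    inner_top = box_top + margin
    inner_bottom = box_bottom - margin
    if inner_top >= inner_bottom:
        return (box_top, box_bottom)  # box too small to shrink safely
    return (inner_top, inner_bottom)


def row_center_in_range(row_top, row_bottom, box_range):
    """Assumed membership test: the row's vertical center lies inside the range."""
    center = (row_top + row_bottom) / 2
    return box_range[0] <= center <= box_range[1]


full = (100, 300)
inner = inner_box_range(*full, border_thickness=3)
# A row ending slightly inside the box border: counted as "in box" only with the full range.
print(row_center_in_range(80, 128, full), row_center_in_range(80, 128, inner))
```

With the shrunken range, a row that merely touches the border region stays outside the box and is therefore kept.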