Commit Graph

332 Commits

Author SHA1 Message Date
Benjamin Admin
3aa4a63257 fix: move Struktur step after OCR so word boxes are available for exclusion
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m2s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
Graphic detection needs word positions to exclude text from the ink mask.
Previously Struktur ran before OCR, causing every word to be detected as
a graphic element. Now:
- Pipeline: Struktur at index 7 (after Wörter)
- Kombi: Struktur at index 5 (after PP-OCRv5+Tesseract, before Tabelle)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 13:38:58 +01:00
Benjamin Admin
6b9b280ba3 feat: integrate graphic element detection into structure step
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Add cv_graphic_detect.py for detecting non-text visual elements (arrows,
circles, lines, exclamation marks, icons, illustrations). Draw detected
graphics on structure overlay image and display them in the frontend
StepStructureDetection component with shape counts and individual listings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 13:21:55 +01:00
Benjamin Admin
1d34785e2b feat: add Structure step to Kombi mode in OCR Overlay page
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 19s
Insert the Struktur detection step between Zuschneiden and
PP-OCRv5+Tesseract in the Kombi pipeline on /ai/ocr-overlay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 12:59:05 +01:00
Benjamin Admin
5b5213c2b9 feat: add Structure Detection step to OCR pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
New pipeline step between Crop and Columns that visualizes detected
document structure: boxes (line-based + shading), page zones, and
color regions. Shows original image on the left, annotated overlay
on the right.

Backend: POST /detect-structure endpoint + /image/structure-overlay
Frontend: StepStructureDetection component with zone/box/color details

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 12:31:09 +01:00
Benjamin Admin
fbbec6cf5e feat: run shading-based box detection alongside line detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 18s
Previously color/shading detection only ran as fallback when no line-based
boxes were found. Now both methods run in parallel with result merging,
so smaller shaded boxes (like "German leihen") get detected even when
larger bordered boxes are already found. Uses median-blur background
analysis that works for both colored and grayscale/B&W scans.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 08:12:52 +01:00
Benjamin Admin
a6951940b9 fix: use median hue, Otsu threshold, and background subtraction for colors
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 36s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
- Median hue instead of mean (robust to background contamination)
- Otsu threshold instead of fixed 180 (adapts to colored backgrounds)
- Background sampling from border pixels with hue-distance filter
- Higher sat_threshold (70) + min_sat_ratio (25%) to reduce false positives
- Classify using saturated pixels only for cleaner hue signal

Fixes: borrow/lend misdetected as orange (actually red, median_H=5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 07:44:03 +01:00
Benjamin Admin
4a8d43fd71 feat: display detected text colors in grid editor UI
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
- Add color/color_name/recovered fields to OcrWordBox type
- GridTable: show colored text + left-edge color indicator strip
- GridEditor: show color stats and recovered count in summary bar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 01:03:09 +01:00
Benjamin Admin
bcd55e12d7 fix: run color annotation on final cell word_boxes, not pre-grid words
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 16s
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:53:04 +01:00
Benjamin Admin
2bd63ec402 feat: add color detection for OCR word boxes
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
  (e.g. red ! markers) using HSV masks + contour detection

Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:50:09 +01:00
Benjamin Admin
39a4d8564c chore: add per-cluster debug logging for column alignment detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:18:28 +01:00
Benjamin Admin
1162eac7b4 fix: use group-start positions for column detection, not all word left-edges
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:10:29 +01:00
Benjamin Admin
28352f5bab feat: replace gap-based column detection with left-edge alignment algorithm
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 17s
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:03:58 +01:00
Benjamin Admin
c3f1547e32 feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.

Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:41:03 +01:00
Benjamin Admin
4a15d46dfd refactor: rename PaddleOCR → PP-OCRv5 in frontend, remove Kombi-Vergleich tab
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Since ocr_region_paddle() now runs RapidOCR locally (same PP-OCRv5 models),
the "PaddleOCR (Hetzner)" labels were misleading. Renamed to "PP-OCRv5 (lokal)".
Removed the Kombi-Vergleich tab since both sides would produce identical results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 09:11:26 +01:00
Benjamin Admin
b83b38e7f2 feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts
from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 08:26:04 +01:00
Benjamin Admin
a994ddee83 feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 33s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 21s
Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge,
KombiCompareStep component for parallel execution and side-by-side overlay,
and wordResultOverride prop on OverlayReconstruction for direct data injection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 07:59:06 +01:00
Benjamin Admin
c2c082d4b4 docs+tests: update OCR Pipeline docs and add overlay position tests
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
MkDocs: document row-based merge algorithm, spatial overlap dedup,
and per-word yPct/hPct rendering in OCR Pipeline docs.

Tests: add 9 vitest tests for useSlideWordPositions covering
word-box path, fallback path, and yPct/hPct contract.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 21:03:00 +01:00
Benjamin Admin
d6f51e4418 fix: deduplicate overlapping OCR words and use per-word Y positions in overlay
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 24s
Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge
so words at the same position are deduplicated even when OCR text differs.

Frontend: Add yPct/hPct to WordPosition so each word renders at its actual
vertical position instead of all words collapsing to the cell center Y.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 20:27:08 +01:00
Benjamin Admin
703e110bab fix: split PaddleOCR multi-word boxes before merge
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 32s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s
PaddleOCR returns entire phrases as single boxes (e.g. "More than 200
singers took part in the"). The merge algorithm compared word-by-word
but Paddle had multi-word boxes vs Tesseract's individual words, so
nothing matched and all Tesseract words were added as "extras" causing
duplicates. Now splits Paddle boxes into individual words before merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 10:39:10 +01:00
Benjamin Admin
41ff7671cd fix: update PaddleOCR init for v3.4+ API (lang=en, ocr_version=PP-OCRv5)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
PaddleOCR 3.4.0 removed 'latin' language support. Use 'en' with
explicit ocr_version='PP-OCRv5' instead, with fallback for older API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 09:39:33 +01:00
Benjamin Admin
8e42e36ee4 fix: replace deprecated libgl1-mesa-glx with libgl1 in paddleocr Dockerfile
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m6s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s
Package was removed in Debian Trixie.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 09:11:12 +01:00
Benjamin Admin
24e1e93b5b fix: save raw paddle/tesseract words in kombi session for debugging
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m12s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 09:03:01 +01:00
Benjamin Admin
846292f632 fix: rewrite Kombi merge with row-based sequence alignment
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Replaces position-based word matching with row-based sequence alignment
to fix doubled words and cross-line averaging in Kombi-Modus.

New algorithm:
1. Group words into rows by Y-position clustering
2. Match rows between engines by vertical center proximity
3. Within each row: walk both sequences left-to-right, deduplicating
4. Unmatched rows kept as-is

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 08:45:03 +01:00
Benjamin Admin
4280298e02 fix: add _deduplicate_words safety net to Kombi merge
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s
Even after multi-criteria matching, near-duplicate words can slip through
(same text, centers within 30px horizontal / 15px vertical). The new
_deduplicate_words() removes these, keeping the higher-confidence copy.

Regression test with real session data (row 2 with 145 near-dupes)
confirms no duplicates remain after merge + deduplication.

Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 08:27:45 +01:00
Benjamin Admin
4f2fb0e94c fix: Kombi-Modus merge now deduplicates same words from both engines
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 42s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m13s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 22s
The merge algorithm now uses 3 criteria instead of just IoU > 0.3:
1. IoU > 0.15 (relaxed threshold)
2. Center proximity < word height AND same row
3. Text similarity > 0.7 AND same row

This prevents doubled overlapping words when both PaddleOCR and
Tesseract find the same word at similar positions. Unique words
from either engine (e.g. bullets from Tesseract) are still added.

Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity,
_words_match tests + deduplication regression test).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 08:11:31 +01:00
Benjamin Admin
61c8169f9e docs+test: add Kombi-Modus tests (19 passing) and MkDocs documentation
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 32s
CI / test-python-klausur (push) Failing after 2m33s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 24s
- test_paddle_kombi.py: 6 IoU tests, 10 merge tests, 2 bullet-point tests
- OCR-Pipeline.md: new "OCR Overlay" section with Paddle Direct/Kombi docs,
  merge algorithm flowchart, dateistruktur update, changelog v4.5.0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:18:46 +01:00
Benjamin Admin
e9ccd1e35c feat: add Kombi-Modus (PaddleOCR + Tesseract) for OCR Overlay
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m20s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 41s
Runs both OCR engines on the preprocessed image and merges results:
word boxes matched by IoU, coordinates averaged by confidence weight.
Unmatched Tesseract words (bullets, symbols) are added for better coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:05:50 +01:00
Benjamin Admin
d335a7bbf3 fix: use OCR word_box coordinates directly instead of fuzzy matching
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m6s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 25s
The slide positioning hook was re-matching cell.text tokens against
word_boxes via fuzzy text similarity, which broke positioning for
special characters (!, bullet points, IPA). Now uses word_box
coordinates directly — exact OCR positions without re-interpretation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:54:37 +01:00
Benjamin Admin
1f527fcd49 fix: split PaddleOCR boxes at leading ! for overlay word positioning
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
When PaddleOCR returns "!Betonung" as a single word box, the overlay
positions text starting at the "!" instead of the actual word. Split
such boxes into ["!", "Betonung"] with proportional position splitting,
matching the existing IPA bracket splitting logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:46:17 +01:00
Benjamin Admin
8349c28f54 fix: paddle_direct reuses build_grid_from_words for correct overlay
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 35s
CI / test-python-klausur (push) Failing after 2m22s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 23s
Replaces custom _paddle_words_to_grid_cells with the proven
build_grid_from_words from cv_words_first.py — same function the
regular pipeline uses with PaddleOCR. Handles phrase splitting,
column clustering, and produces cells with word_boxes that the
slide/cluster positioning hooks expect.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:19:52 +01:00
Benjamin Admin
71a1b5f058 fix: paddle_direct groups words per row (matching _build_cells format)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 34s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 24s
One cell per row with all words as word_boxes instead of one cell per
word. Gives OverlayReconstruction a row-spanning bbox_pct for correct
font sizing and per-word positions for slide/cluster placement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:10:10 +01:00
Benjamin Admin
c743a38eaf fix: Paddle Direct keeps preprocessing (orient/deskew/dewarp/crop)
Some checks failed
CI / nodejs-lint (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Uses the cropped/dewarped image instead of the original so the overlay
shows the correctly oriented page. 5 steps instead of 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:56:18 +01:00
Benjamin Admin
90c1efd9b0 feat: Paddle Direct — 1-click OCR without deskew/dewarp/crop
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing
7-step pipeline. Backend endpoint runs PaddleOCR on the original image
and clusters words into rows/cells directly. Frontend adds a mode
toggle and PaddleDirectStep component.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:41:55 +01:00
Benjamin Admin
06d63d18f9 fix: generic fuzzy text matching for overlay word-box positioning
Some checks failed
CI / test-go-edu-search (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Replace sequential 1:1 token-to-box mapping with fuzzy text matching.
Each token from cell.text finds its best matching word_box by text
similarity (normalized prefix match + substring bonus). Handles:
- Reordered boxes (different sort between text and boxes)
- IPA corrections changing token boundaries
- Token/box count mismatches
Unmatched tokens get interpolated positions from matched neighbors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:19:19 +01:00
Benjamin Admin
3e65b14b83 fix: split PaddleOCR boxes at IPA brackets for overlay positioning
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer
produces "badge [bˈædʒ]" with space, creating a token count mismatch
between cell.text and word_boxes. Now also split at "[" boundaries
so each IPA bracket gets its own sub-box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:08:17 +01:00
Benjamin Admin
40ac593d28 fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide
Some checks failed
CI / test-nodejs-website (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
PaddleOCR returns phrase-level bounding boxes (e.g. "competition
[kompa'tifn]" as one box) but the overlay slide mechanism expects
one box per word for accurate positioning. Multi-word boxes are now
split proportionally by character count with small gaps between words.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:00:06 +01:00
Benjamin Admin
ea69239e06 fix: word_boxes in words_first use absolute pixels (consistent with v2 grid)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 39s
CI / test-go-edu-search (push) Successful in 33s
CI / test-python-klausur (push) Failing after 2m21s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 33s
words_first was storing word_boxes in percent coordinates while
cv_cell_grid.py uses absolute pixel coordinates. The overlay slide
mechanism divides by imgW to get percentages, so percent-in-percent
caused positions near zero. Now both grid builders use the same format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:04:04 +01:00
Benjamin Admin
bb90d1ba94 fix: PaddleOCR engine forces words_first in frontend to match backend
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
When engine=paddle is selected, the backend overrides grid_method to
words_first and returns plain JSON (no SSE streaming). The frontend
was not aware of this override — it sent stream=true and tried to parse
SSE events from a JSON response, resulting in "Keine Daten".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:52:18 +01:00
Benjamin Admin
685d135be5 fix: downscale large images before PaddleOCR (Traefik 60s limit)
Some checks failed
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-school (push) Has been cancelled
CI / test-go-edu-search (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-nodejs-website (push) Has been cancelled
Bilder > 1500px werden vor dem Upload verkleinert. Koordinaten
werden zurueckskaliert. JPEG statt PNG fuer schnelleren Upload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:28:58 +01:00
Benjamin Admin
e2c2acdf86 fix: increase PaddleOCR remote timeout to 120s for large scans
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m14s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 24s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 13:41:39 +01:00
Benjamin Admin
3cc496f7f3 feat(rag): Update Verbraucherschutz docs + chunk counts + Landkarte
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Failing after 14s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 22s
- Update chunk counts for 8 successfully ingested DE laws (Phase H1)
- Add 6 new BGB-Teile entries (AGB, Fernabsatz, Kaufrecht, Widerruf, Digital)
- Add EGBGB Widerrufsbelehrung entry
- Update COLLECTION_TOTALS: gesetze 58304→63567 (+5263 Phase H chunks)
- Add Verbraucherschutz thematic group to Landkarte
- Extend ecommerce industry map with consumer protection regulations
- Update date to March 2026

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:54:20 +01:00
Benjamin Admin
a6069631cc feat: PaddleOCR Remote-Engine (PP-OCRv5 Latin auf Hetzner x86_64)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m7s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 21s
PaddleOCR als neue engine=paddle Option in der OCR-Pipeline.
Microservice auf Hetzner (paddleocr-service/), async HTTP-Client
(paddleocr_remote.py), Frontend-Dropdown, automatisch words_first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:31:22 +01:00
Benjamin Admin
ced5bb3dd3 feat: Words-First Grid Builder (bottom-up alternative zu cell_grid_v2)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 54s
CI / test-go-edu-search (push) Successful in 47s
CI / test-python-klausur (push) Failing after 2m31s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 32s
Neuer Algorithmus in cv_words_first.py: Clustert Tesseract word_boxes
direkt zu Spalten (X-Gap) und Zeilen (Y-Proximity), baut Zellen an
Schnittpunkten. Kein Spalten-/Zeilenerkennung noetig.

- cv_words_first.py: _cluster_columns, _cluster_rows, _build_cells, build_grid_from_words
- ocr_pipeline_api.py: grid_method Parameter (v2|words_first) im /words Endpoint
- StepWordRecognition.tsx: Dropdown Toggle fuer Grid-Methode
- OCR-Pipeline.md: Doku v4.3.0 mit Words-First Algorithmus
- 15 Unit-Tests fuer cv_words_first

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 06:46:05 +01:00
Benjamin Admin
2fdf3ff868 feat(rag): Register Verbraucherschutz laws + EU directives in RAG constants
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 46s
CI / test-go-edu-search (push) Successful in 33s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
CI / test-python-klausur (push) Has been cancelled
Add 15 new regulations from Phase H ingestion:
- DE: PAngV, VSBG, ProdHaftG, VerpackG, ElektroG, BattDG, BFSG, UWG, GewO
- EU: Warenkauf-RL, Klausel-RL, UGP-RL, Preisangaben-RL, Omnibus-RL, BattVO

Chunk counts set to 0 (will be updated after successful ingestion).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 06:43:19 +01:00
Benjamin Admin
2e21a4b6d0 fix: IPA nur einfügen wenn word_boxes Gap >80px zeigen (kein falsches IPA)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 55s
CI / test-go-edu-search (push) Successful in 48s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 26s
_has_ipa_gap() prüft ob Tesseract eine IPA-Klammer übersehen hat anhand
des physischen Abstands zwischen Headword und nächstem Wort. Ohne Gap
(z.B. "be good at sth.", "Focus on language") wird kein IPA eingefügt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:40:18 +01:00
Benjamin Admin
d98dba9098 fix: Headword-IPA auch in langen column_text Zeilen einfuegen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 53s
CI / test-go-edu-search (push) Successful in 49s
CI / test-python-klausur (push) Failing after 2m14s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 23s
_insert_missing_ipa ueberspringe Texte mit >6 Woertern oder Klammern.
Neue _insert_headword_ipa fuer column_text: prueft nur das erste Wort
der Zeile, unabhaengig von Textlaenge oder vorhandenen Klammern.

Ausserdem _sync_word_boxes_after_ipa_insert gefixt: Token-Vergleich
nutzt jetzt paralleles Durchlaufen statt zip (verschobene Positionen).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:25:38 +01:00
Benjamin Admin
cd13eca290 fix: IPA-Einfuegung fuer column_text mit word_boxes Synchronisation
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 32s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 20s
Fuer column_text werden fehlende IPA-Lautschriften (challenge, profit,
film, badge) wieder eingefuegt, aber gleichzeitig eine synthetische
word_box erzeugt, damit die 1:1 Token-zu-Box Zuordnung im Overlay
erhalten bleibt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:15:26 +01:00
Benjamin Admin
aa7db43f02 fix: column_text nur garbled IPA ersetzen, keine Einfuegung/Entfernung
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
Fuer column_text (Full-Page Overlay mit gemischtem EN+DE Text):
- Kein IPA einfuegen (wuerde Token-Count aendern, Overlay-Positionen brechen)
- Keine orphan brackets entfernen (sind oft deutsche Bedeutungen wie (probieren))
- Nur garbled IPA ersetzen (z.B. [teıst] -> [tˈeɪst])

column_en behaelt volle Verarbeitung (replace + strip + insert).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 23:05:37 +01:00
Benjamin Admin
4afd5bd8e8 fix: Klammerwörter wie (probieren), (Profit) nicht mehr als garbled IPA entfernen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 50s
CI / test-go-edu-search (push) Successful in 45s
CI / test-python-klausur (push) Failing after 2m12s
CI / test-python-agent-core (push) Successful in 23s
CI / test-nodejs-website (push) Successful in 27s
_strip_orphan_bracket entfernte deutsche Bedeutungsangaben in Klammern,
weil sie weder als Grammar-Partikel noch als IPA erkannt wurden.
Fix: Klammerinhalte mit echten Wörtern (>=4 Buchstaben) werden behalten.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 22:47:01 +01:00
Benjamin Admin
7d19145edb fix: word_boxes auch fuer breite Spalten (Full-Page OCR) speichern
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 21s
word_boxes wurden nur im Cell-Crop-Pfad (narrow columns) gesetzt,
aber nicht im Full-Page Word-Assignment-Pfad (broad columns).
Jetzt werden die Tesseract-Wort-Koordinaten in beiden Pfaden gespeichert.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:41:29 +01:00