-
aae8a96aa2
fix: sort word_boxes in reading order (Y-grouped, then X-sorted)
Benjamin Admin
2026-03-17 10:41:30 +01:00
-
2b73d9beec
fix: increase color recovery occupancy padding to prevent gap artifacts
Benjamin Admin
2026-03-17 10:28:56 +01:00
-
324f39a9cc
fix: merge inline marker columns + improve ghost edge detection
Benjamin Admin
2026-03-17 10:10:07 +01:00
-
febd0a2f84
fix: border ghost filter + row overlap fix for box zones
Benjamin Admin
2026-03-17 09:54:50 +01:00
-
43b1f8be58
diag: increase zone logging threshold to 60 words
Benjamin Admin
2026-03-17 09:49:19 +01:00
-
43dec5dd91
diag: add row-clustering logging for small/box zones
Benjamin Admin
2026-03-17 09:45:29 +01:00
-
dfce8415d7
fix: show per-word colors in grid table instead of whole-cell coloring
Benjamin Admin
2026-03-17 08:55:43 +01:00
-
92a52a3199
fix: apply column union when total_cols >= max (not just >)
Benjamin Admin
2026-03-17 00:14:59 +01:00
-
427fecdce0
fix: union column detection across all content zones
Benjamin Admin
2026-03-16 23:02:33 +01:00
-
9fb3229270
fix: lower tertiary gap threshold for narrow margin column detection
Benjamin Admin
2026-03-16 22:56:03 +01:00
-
91625a2646
fix: add tertiary tier for narrow margin columns (page refs, markers)
Benjamin Admin
2026-03-16 22:40:40 +01:00
-
02ae6249ca
fix: propagate columns from largest content zone instead of global detection
Benjamin Admin
2026-03-16 22:30:15 +01:00
-
cf995f2d52
fix: global column detection across content zones in Kombi grid builder
Benjamin Admin
2026-03-16 22:04:17 +01:00
-
0340204c1f
feat: box-aware column detection — exclude box content from global columns
Benjamin Admin
2026-03-16 18:42:46 +01:00
-
729ebff63c
feat: add border ghost filter + graphic detection tests + structure overlay
Benjamin Admin
2026-03-16 18:28:53 +01:00
-
6668661895
feat: region-based graphic detection with word-overlap filtering
Benjamin Admin
2026-03-16 14:49:15 +01:00
-
eeee61108a
fix: remove morph close that merged balloons into giant blob
Benjamin Admin
2026-03-16 14:42:51 +01:00
-
1653e7cff4
feat: two-pass graphic detection (color channel + ink)
Benjamin Admin
2026-03-16 14:30:33 +01:00
-
86ae71fd65
fix: only detect circles and illustrations, drop arrow/icon/line
Benjamin Admin
2026-03-16 14:20:17 +01:00
-
ba513968c5
fix: relax graphic detection for small circles/balloons
Benjamin Admin
2026-03-16 14:00:09 +01:00
-
f717e1c0df
debug: use INFO level for skip-reason logs
Benjamin Admin
2026-03-16 13:57:08 +01:00
-
934b5648a2
debug: add detailed skip-reason logging to graphic detection
Benjamin Admin
2026-03-16 13:56:12 +01:00
-
fe7339c7a1
fix: suppress text fragments in graphic detection
Benjamin Admin
2026-03-16 13:51:02 +01:00
-
3aa4a63257
fix: move Structure step after OCR so word boxes are available for exclusion
Benjamin Admin
2026-03-16 13:38:58 +01:00
-
6b9b280ba3
feat: integrate graphic element detection into structure step
Benjamin Admin
2026-03-16 13:21:55 +01:00
-
1d34785e2b
feat: add Structure step to Kombi mode in OCR Overlay page
Benjamin Admin
2026-03-16 12:59:05 +01:00
-
5b5213c2b9
feat: add Structure Detection step to OCR pipeline
Benjamin Admin
2026-03-16 12:31:09 +01:00
-
fbbec6cf5e
feat: run shading-based box detection alongside line detection
Benjamin Admin
2026-03-16 08:12:52 +01:00
-
a6951940b9
fix: use median hue, Otsu threshold, and background subtraction for colors
Benjamin Admin
2026-03-16 07:44:03 +01:00
-
4a8d43fd71
feat: display detected text colors in grid editor UI
Benjamin Admin
2026-03-15 01:03:09 +01:00
-
bcd55e12d7
fix: run color annotation on final cell word_boxes, not pre-grid words
Benjamin Admin
2026-03-15 00:53:04 +01:00
-
2bd63ec402
feat: add color detection for OCR word boxes
Benjamin Admin
2026-03-15 00:50:09 +01:00
-
39a4d8564c
chore: add per-cluster debug logging for column alignment detection
Benjamin Admin
2026-03-15 00:18:28 +01:00
-
1162eac7b4
fix: use group-start positions for column detection, not all word left-edges
Benjamin Admin
2026-03-15 00:10:29 +01:00
-
28352f5bab
feat: replace gap-based column detection with left-edge alignment algorithm
Benjamin Admin
2026-03-15 00:03:58 +01:00
-
c3f1547e32
feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)
Benjamin Admin
2026-03-14 23:41:03 +01:00
-
4a15d46dfd
refactor: rename PaddleOCR → PP-OCRv5 in frontend, remove Kombi-Vergleich tab
Benjamin Admin
2026-03-14 09:11:26 +01:00
-
b83b38e7f2
feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback
Benjamin Admin
2026-03-14 08:26:04 +01:00
-
a994ddee83
feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison
Benjamin Admin
2026-03-14 07:59:06 +01:00
-
c2c082d4b4
docs+tests: update OCR Pipeline docs and add overlay position tests
Benjamin Admin
2026-03-13 21:03:00 +01:00
-
d6f51e4418
fix: deduplicate overlapping OCR words and use per-word Y positions in overlay
Benjamin Admin
2026-03-13 20:27:08 +01:00
-
703e110bab
fix: split PaddleOCR multi-word boxes before merge
Benjamin Admin
2026-03-13 10:39:10 +01:00
-
41ff7671cd
fix: update PaddleOCR init for v3.4+ API (lang=en, ocr_version=PP-OCRv5)
Benjamin Admin
2026-03-13 09:39:33 +01:00
-
8e42e36ee4
fix: replace deprecated libgl1-mesa-glx with libgl1 in paddleocr Dockerfile
Benjamin Admin
2026-03-13 09:11:12 +01:00
-
24e1e93b5b
fix: save raw paddle/tesseract words in kombi session for debugging
Benjamin Admin
2026-03-13 09:03:01 +01:00
-
846292f632
fix: rewrite Kombi merge with row-based sequence alignment
Benjamin Admin
2026-03-13 08:45:03 +01:00
-
4280298e02
fix: add _deduplicate_words safety net to Kombi merge
Benjamin Admin
2026-03-13 08:27:45 +01:00
-
4f2fb0e94c
fix: Kombi mode merge now deduplicates same words from both engines
Benjamin Admin
2026-03-13 08:11:31 +01:00
-
61c8169f9e
docs+test: add Kombi mode tests (19 passing) and MkDocs documentation
Benjamin Admin
2026-03-12 20:18:46 +01:00
-
e9ccd1e35c
feat: add Kombi mode (PaddleOCR + Tesseract) for OCR Overlay
Benjamin Admin
2026-03-12 20:05:50 +01:00
-
d335a7bbf3
fix: use OCR word_box coordinates directly instead of fuzzy matching
Benjamin Admin
2026-03-12 18:54:37 +01:00
-
1f527fcd49
fix: split PaddleOCR boxes at leading ! for overlay word positioning
Benjamin Admin
2026-03-12 17:46:17 +01:00
-
8349c28f54
fix: paddle_direct reuses build_grid_from_words for correct overlay
Benjamin Admin
2026-03-12 17:19:52 +01:00
-
71a1b5f058
fix: paddle_direct groups words per row (matching _build_cells format)
Benjamin Admin
2026-03-12 17:10:10 +01:00
-
c743a38eaf
fix: Paddle Direct keeps preprocessing (orient/deskew/dewarp/crop)
Benjamin Admin
2026-03-12 16:56:18 +01:00
-
90c1efd9b0
feat: Paddle Direct — 1-click OCR without deskew/dewarp/crop
Benjamin Admin
2026-03-12 16:41:55 +01:00
-
06d63d18f9
fix: generic fuzzy text matching for overlay word-box positioning
Benjamin Admin
2026-03-12 16:19:19 +01:00
-
3e65b14b83
fix: split PaddleOCR boxes at IPA brackets for overlay positioning
Benjamin Admin
2026-03-12 16:08:17 +01:00
-
40ac593d28
fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide
Benjamin Admin
2026-03-12 16:00:06 +01:00
-
ea69239e06
fix: word_boxes in words_first use absolute pixels (consistent with v2 grid)
Benjamin Admin
2026-03-12 15:04:04 +01:00
-
bb90d1ba94
fix: PaddleOCR engine forces words_first in frontend to match backend
Benjamin Admin
2026-03-12 14:52:18 +01:00
-
685d135be5
fix: downscale large images before PaddleOCR (Traefik 60s limit)
Benjamin Admin
2026-03-12 14:28:58 +01:00
-
e2c2acdf86
fix: increase PaddleOCR remote timeout to 120s for large scans
Benjamin Admin
2026-03-12 13:41:39 +01:00
-
3cc496f7f3
feat(rag): Update consumer protection (Verbraucherschutz) docs + chunk counts + Landkarte
Benjamin Admin
2026-03-12 09:54:20 +01:00
-
a6069631cc
feat: PaddleOCR remote engine (PP-OCRv5 Latin on Hetzner x86_64)
Benjamin Admin
2026-03-12 09:31:22 +01:00
-
ced5bb3dd3
feat: Words-First Grid Builder (bottom-up alternative to cell_grid_v2)
Benjamin Admin
2026-03-12 06:46:05 +01:00
-
2fdf3ff868
feat(rag): Register consumer protection (Verbraucherschutz) laws + EU directives in RAG constants
Benjamin Admin
2026-03-12 06:43:19 +01:00
-
2e21a4b6d0
fix: insert IPA only when word_boxes show a gap >80px (no false IPA)
Benjamin Admin
2026-03-11 23:40:18 +01:00
-
d98dba9098
fix: also insert headword IPA in long column_text lines
Benjamin Admin
2026-03-11 23:25:38 +01:00
-
cd13eca290
fix: IPA insertion for column_text with word_boxes synchronization
Benjamin Admin
2026-03-11 23:15:26 +01:00
-
aa7db43f02
fix: column_text only replaces garbled IPA, no insertion/removal
Benjamin Admin
2026-03-11 23:05:37 +01:00
-
4afd5bd8e8
fix: no longer remove parenthesized words such as (probieren), (Profit) as garbled IPA
Benjamin Admin
2026-03-11 22:47:01 +01:00
-
7d19145edb
fix: also store word_boxes for wide columns (full-page OCR)
Benjamin Admin
2026-03-11 20:41:29 +01:00
-
35f2706098
fix: slide mode uses cell.text tokens instead of word_boxes text (no words lost)
Benjamin Admin
2026-03-11 20:01:57 +01:00
-
0ee92e7210
feat: OCR word_boxes for pixel-accurate overlay positioning
Benjamin Admin
2026-03-11 19:39:49 +01:00
-
4949863bd7
revert: back to single-word slide with fontRatio=1.0 fix
Benjamin Admin
2026-03-11 19:15:52 +01:00
-
efbe15f895
fix: switch slide mode to group-based sliding
Benjamin Admin
2026-03-11 18:31:17 +01:00
-
c3da131129
fix: slide fontRatio=1.0 and token width from rendered font size
Benjamin Admin
2026-03-11 17:59:31 +01:00
-
b81baa1d16
fix: slide mode uses global font size instead of per-token scale
Benjamin Admin
2026-03-11 16:51:55 +01:00
-
2010cab894
fix: slide mode scale calculation based on ink span instead of ink count
Benjamin Admin
2026-03-11 16:41:38 +01:00
-
bc13978bc1
feat: slide mode as alternative word positioning in overlay
Benjamin Admin
2026-03-11 16:13:31 +01:00
-
2f51ac617f
feat: insert IPA phonetic transcription into cell texts (for overlay mode)
Benjamin Admin
2026-03-11 15:47:26 +01:00
-
8a5f2aa188
fix: assign clusters by width proportionality instead of position
Benjamin Admin
2026-03-11 15:39:54 +01:00
-
d182d87f26
fix: merge OCR artifacts (|, >) before cluster matching
Benjamin Admin
2026-03-11 15:03:37 +01:00
-
87efc1b4ba
fix: pick the N widest clusters when there are surplus clusters
Benjamin Admin
2026-03-11 14:34:58 +01:00
-
dd7087cd6d
fix: no longer skip pixel analysis when clusters < groups
Benjamin Admin
2026-03-11 10:14:58 +01:00
-
7282a220d6
fix: move useMemo before early returns (Rules of Hooks)
Benjamin Admin
2026-03-11 09:46:25 +01:00
-
b5d5371f72
fix: uniform font size + border cluster filter in overlay
Benjamin Admin
2026-03-11 09:34:41 +01:00
-
41e47baf13
fix: pass skip_heal_gaps parameter through to stream generator
Benjamin Admin
2026-03-11 09:11:16 +01:00
-
8a60f4bf30
fix: position overlay cells without _heal_row_gaps (skip_heal_gaps)
Benjamin Admin
2026-03-11 08:59:50 +01:00
-
e3ee1de790
Revert "fix: skip row regularization in overlay (generic for mixed content)"
Benjamin Admin
2026-03-11 08:44:07 +01:00
-
b91f799ccf
fix: skip row regularization in overlay (generic for mixed content)
Benjamin Admin
2026-03-11 08:29:06 +01:00
-
2df2a01a8b
feat: true overlay, text directly on top of the original image
Benjamin Admin
2026-03-11 00:25:11 +01:00
-
e2ad93fd57
fix: enable word detection without columns (full-page pseudo-column)
Benjamin Admin
2026-03-11 00:16:31 +01:00
-
2cbdfc56f3
feat: OCR Overlay, full-page reconstruction without column detection
Benjamin Admin
2026-03-11 00:08:05 +01:00
-
840918df2a
fix: do not rotate the original image again in overlay (orientation already applied in cropped image)
Benjamin Admin
2026-03-10 23:25:20 +01:00
-
eb3fc05cdc
fix: decide box-zone clamping by box center instead of cell center
Benjamin Admin
2026-03-10 23:10:51 +01:00
-
9dbb5fa708
fix: move useMemo before early returns (Rules of Hooks)
Benjamin Admin
2026-03-10 22:57:25 +01:00
-
f468c30112
fix: clamp cells to box zone in overlay mode (no overlap)
Benjamin Admin
2026-03-10 22:52:08 +01:00
-
618c82ef42
fix: no longer cut off rows at box border (border_thickness margin)
Benjamin Admin
2026-03-10 17:44:02 +01:00