Commit Graph

  • aae8a96aa2 fix: sort word_boxes in reading order (Y-grouped, then X-sorted) Benjamin Admin 2026-03-17 10:41:30 +01:00
  • 2b73d9beec fix: increase color recovery occupancy padding to prevent gap artifacts Benjamin Admin 2026-03-17 10:28:56 +01:00
  • 324f39a9cc fix: merge inline marker columns + improve ghost edge detection Benjamin Admin 2026-03-17 10:10:07 +01:00
  • febd0a2f84 fix: border ghost filter + row overlap fix for box zones Benjamin Admin 2026-03-17 09:54:50 +01:00
  • 43b1f8be58 diag: increase zone logging threshold to 60 words Benjamin Admin 2026-03-17 09:49:19 +01:00
  • 43dec5dd91 diag: add row-clustering logging for small/box zones Benjamin Admin 2026-03-17 09:45:29 +01:00
  • dfce8415d7 fix: show per-word colors in grid table instead of whole-cell coloring Benjamin Admin 2026-03-17 08:55:43 +01:00
  • 92a52a3199 fix: apply column union when total_cols >= max (not just >) Benjamin Admin 2026-03-17 00:14:59 +01:00
  • 427fecdce0 fix: union column detection across all content zones Benjamin Admin 2026-03-16 23:02:33 +01:00
  • 9fb3229270 fix: lower tertiary gap threshold for narrow margin column detection Benjamin Admin 2026-03-16 22:56:03 +01:00
  • 91625a2646 fix: add tertiary tier for narrow margin columns (page refs, markers) Benjamin Admin 2026-03-16 22:40:40 +01:00
  • 02ae6249ca fix: propagate columns from largest content zone instead of global detection Benjamin Admin 2026-03-16 22:30:15 +01:00
  • cf995f2d52 fix: global column detection across content zones in Kombi grid builder Benjamin Admin 2026-03-16 22:04:17 +01:00
  • 0340204c1f feat: box-aware column detection — exclude box content from global columns Benjamin Admin 2026-03-16 18:42:46 +01:00
  • 729ebff63c feat: add border ghost filter + graphic detection tests + structure overlay Benjamin Admin 2026-03-16 18:28:53 +01:00
  • 6668661895 feat: region-based graphic detection with word-overlap filtering Benjamin Admin 2026-03-16 14:49:15 +01:00
  • eeee61108a fix: remove morph close that merged balloons into giant blob Benjamin Admin 2026-03-16 14:42:51 +01:00
  • 1653e7cff4 feat: two-pass graphic detection (color channel + ink) Benjamin Admin 2026-03-16 14:30:33 +01:00
  • 86ae71fd65 fix: only detect circles and illustrations, drop arrow/icon/line Benjamin Admin 2026-03-16 14:20:17 +01:00
  • ba513968c5 fix: relax graphic detection for small circles/balloons Benjamin Admin 2026-03-16 14:00:09 +01:00
  • f717e1c0df debug: use INFO level for skip-reason logs Benjamin Admin 2026-03-16 13:57:08 +01:00
  • 934b5648a2 debug: add detailed skip-reason logging to graphic detection Benjamin Admin 2026-03-16 13:56:12 +01:00
  • fe7339c7a1 fix: suppress text fragments in graphic detection Benjamin Admin 2026-03-16 13:51:02 +01:00
  • 3aa4a63257 fix: move Struktur step after OCR so word boxes are available for exclusion Benjamin Admin 2026-03-16 13:38:58 +01:00
  • 6b9b280ba3 feat: integrate graphic element detection into structure step Benjamin Admin 2026-03-16 13:21:55 +01:00
  • 1d34785e2b feat: add Structure step to Kombi mode in OCR Overlay page Benjamin Admin 2026-03-16 12:59:05 +01:00
  • 5b5213c2b9 feat: add Structure Detection step to OCR pipeline Benjamin Admin 2026-03-16 12:31:09 +01:00
  • fbbec6cf5e feat: run shading-based box detection alongside line detection Benjamin Admin 2026-03-16 08:12:52 +01:00
  • a6951940b9 fix: use median hue, Otsu threshold, and background subtraction for colors Benjamin Admin 2026-03-16 07:44:03 +01:00
  • 4a8d43fd71 feat: display detected text colors in grid editor UI Benjamin Admin 2026-03-15 01:03:09 +01:00
  • bcd55e12d7 fix: run color annotation on final cell word_boxes, not pre-grid words Benjamin Admin 2026-03-15 00:53:04 +01:00
  • 2bd63ec402 feat: add color detection for OCR word boxes Benjamin Admin 2026-03-15 00:50:09 +01:00
  • 39a4d8564c chore: add per-cluster debug logging for column alignment detection Benjamin Admin 2026-03-15 00:18:28 +01:00
  • 1162eac7b4 fix: use group-start positions for column detection, not all word left-edges Benjamin Admin 2026-03-15 00:10:29 +01:00
  • 28352f5bab feat: replace gap-based column detection with left-edge alignment algorithm Benjamin Admin 2026-03-15 00:03:58 +01:00
  • c3f1547e32 feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6) Benjamin Admin 2026-03-14 23:41:03 +01:00
  • 4a15d46dfd refactor: rename PaddleOCR → PP-OCRv5 in frontend, remove Kombi-Vergleich tab Benjamin Admin 2026-03-14 09:11:26 +01:00
  • b83b38e7f2 feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback Benjamin Admin 2026-03-14 08:26:04 +01:00
  • a994ddee83 feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison Benjamin Admin 2026-03-14 07:59:06 +01:00
  • c2c082d4b4 docs+tests: update OCR Pipeline docs and add overlay position tests Benjamin Admin 2026-03-13 21:03:00 +01:00
  • d6f51e4418 fix: deduplicate overlapping OCR words and use per-word Y positions in overlay Benjamin Admin 2026-03-13 20:27:08 +01:00
  • 703e110bab fix: split PaddleOCR multi-word boxes before merge Benjamin Admin 2026-03-13 10:39:10 +01:00
  • 41ff7671cd fix: update PaddleOCR init for v3.4+ API (lang=en, ocr_version=PP-OCRv5) Benjamin Admin 2026-03-13 09:39:33 +01:00
  • 8e42e36ee4 fix: replace deprecated libgl1-mesa-glx with libgl1 in paddleocr Dockerfile Benjamin Admin 2026-03-13 09:11:12 +01:00
  • 24e1e93b5b fix: save raw paddle/tesseract words in kombi session for debugging Benjamin Admin 2026-03-13 09:03:01 +01:00
  • 846292f632 fix: rewrite Kombi merge with row-based sequence alignment Benjamin Admin 2026-03-13 08:45:03 +01:00
  • 4280298e02 fix: add _deduplicate_words safety net to Kombi merge Benjamin Admin 2026-03-13 08:27:45 +01:00
  • 4f2fb0e94c fix: Kombi-Modus merge now deduplicates same words from both engines Benjamin Admin 2026-03-13 08:11:31 +01:00
  • 61c8169f9e docs+test: add Kombi-Modus tests (19 passing) and MkDocs documentation Benjamin Admin 2026-03-12 20:18:46 +01:00
  • e9ccd1e35c feat: add Kombi-Modus (PaddleOCR + Tesseract) for OCR Overlay Benjamin Admin 2026-03-12 20:05:50 +01:00
  • d335a7bbf3 fix: use OCR word_box coordinates directly instead of fuzzy matching Benjamin Admin 2026-03-12 18:54:37 +01:00
  • 1f527fcd49 fix: split PaddleOCR boxes at leading ! for overlay word positioning Benjamin Admin 2026-03-12 17:46:17 +01:00
  • 8349c28f54 fix: paddle_direct reuses build_grid_from_words for correct overlay Benjamin Admin 2026-03-12 17:19:52 +01:00
  • 71a1b5f058 fix: paddle_direct groups words per row (matching _build_cells format) Benjamin Admin 2026-03-12 17:10:10 +01:00
  • c743a38eaf fix: Paddle Direct keeps preprocessing (orient/deskew/dewarp/crop) Benjamin Admin 2026-03-12 16:56:18 +01:00
  • 90c1efd9b0 feat: Paddle Direct — 1-click OCR without deskew/dewarp/crop Benjamin Admin 2026-03-12 16:41:55 +01:00
  • 06d63d18f9 fix: generic fuzzy text matching for overlay word-box positioning Benjamin Admin 2026-03-12 16:19:19 +01:00
  • 3e65b14b83 fix: split PaddleOCR boxes at IPA brackets for overlay positioning Benjamin Admin 2026-03-12 16:08:17 +01:00
  • 40ac593d28 fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide Benjamin Admin 2026-03-12 16:00:06 +01:00
  • ea69239e06 fix: word_boxes in words_first use absolute pixels (consistent with v2 grid) Benjamin Admin 2026-03-12 15:04:04 +01:00
  • bb90d1ba94 fix: PaddleOCR engine forces words_first in frontend to match backend Benjamin Admin 2026-03-12 14:52:18 +01:00
  • 685d135be5 fix: downscale large images before PaddleOCR (Traefik 60s limit) Benjamin Admin 2026-03-12 14:28:58 +01:00
  • e2c2acdf86 fix: increase PaddleOCR remote timeout to 120s for large scans Benjamin Admin 2026-03-12 13:41:39 +01:00
  • 3cc496f7f3 feat(rag): Update consumer protection (Verbraucherschutz) docs + chunk counts + map (Landkarte) Benjamin Admin 2026-03-12 09:54:20 +01:00
  • a6069631cc feat: PaddleOCR remote engine (PP-OCRv5 Latin on Hetzner x86_64) Benjamin Admin 2026-03-12 09:31:22 +01:00
  • ced5bb3dd3 feat: Words-First Grid Builder (bottom-up alternative to cell_grid_v2) Benjamin Admin 2026-03-12 06:46:05 +01:00
  • 2fdf3ff868 feat(rag): Register consumer protection (Verbraucherschutz) laws + EU directives in RAG constants Benjamin Admin 2026-03-12 06:43:19 +01:00
  • 2e21a4b6d0 fix: insert IPA only when word_boxes show a gap >80px (no false IPA) Benjamin Admin 2026-03-11 23:40:18 +01:00
  • d98dba9098 fix: also insert headword IPA in long column_text rows Benjamin Admin 2026-03-11 23:25:38 +01:00
  • cd13eca290 fix: IPA insertion for column_text with word_boxes synchronization Benjamin Admin 2026-03-11 23:15:26 +01:00
  • aa7db43f02 fix: column_text only replaces garbled IPA, no insertion/removal Benjamin Admin 2026-03-11 23:05:37 +01:00
  • 4afd5bd8e8 fix: no longer remove parenthesized words such as (probieren), (Profit) as garbled IPA Benjamin Admin 2026-03-11 22:47:01 +01:00
  • 7d19145edb fix: also store word_boxes for wide columns (full-page OCR) Benjamin Admin 2026-03-11 20:41:29 +01:00
  • 35f2706098 fix: slide mode uses cell.text tokens instead of word_boxes text (no words lost) Benjamin Admin 2026-03-11 20:01:57 +01:00
  • 0ee92e7210 feat: OCR word_boxes for pixel-accurate overlay positioning Benjamin Admin 2026-03-11 19:39:49 +01:00
  • 4949863bd7 revert: back to single-word slide with fontRatio=1.0 fix Benjamin Admin 2026-03-11 19:15:52 +01:00
  • efbe15f895 fix: switch slide mode to group-based sliding Benjamin Admin 2026-03-11 18:31:17 +01:00
  • c3da131129 fix: slide fontRatio=1.0 and token width from rendered font size Benjamin Admin 2026-03-11 17:59:31 +01:00
  • b81baa1d16 fix: slide mode uses global font size instead of per-token scale Benjamin Admin 2026-03-11 16:51:55 +01:00
  • 2010cab894 fix: slide mode scale calculation based on ink span instead of ink count Benjamin Admin 2026-03-11 16:41:38 +01:00
  • bc13978bc1 feat: slide mode as alternative word positioning in overlay Benjamin Admin 2026-03-11 16:13:31 +01:00
  • 2f51ac617f feat: insert IPA phonetic transcription into cell texts (for overlay mode) Benjamin Admin 2026-03-11 15:47:26 +01:00
  • 8a5f2aa188 fix: assign clusters by width proportionality instead of position Benjamin Admin 2026-03-11 15:39:54 +01:00
  • d182d87f26 fix: merge OCR artifacts (|, >) before cluster matching Benjamin Admin 2026-03-11 15:03:37 +01:00
  • 87efc1b4ba fix: when there are surplus clusters, pick the N widest clusters Benjamin Admin 2026-03-11 14:34:58 +01:00
  • dd7087cd6d fix: no longer skip pixel analysis when clusters < groups Benjamin Admin 2026-03-11 10:14:58 +01:00
  • 7282a220d6 fix: move useMemo before early returns (Rules of Hooks) Benjamin Admin 2026-03-11 09:46:25 +01:00
  • b5d5371f72 fix: uniform font size + border cluster filter in overlay Benjamin Admin 2026-03-11 09:34:41 +01:00
  • 41e47baf13 fix: pass skip_heal_gaps parameter through to stream generator Benjamin Admin 2026-03-11 09:11:16 +01:00
  • 8a60f4bf30 fix: position overlay cells without _heal_row_gaps (skip_heal_gaps) Benjamin Admin 2026-03-11 08:59:50 +01:00
  • e3ee1de790 Revert "fix: skip row regularization in overlay (generic for mixed content)" Benjamin Admin 2026-03-11 08:44:07 +01:00
  • b91f799ccf fix: skip row regularization in overlay (generic for mixed content) Benjamin Admin 2026-03-11 08:29:06 +01:00
  • 2df2a01a8b feat: true overlay, text directly over the original image Benjamin Admin 2026-03-11 00:25:11 +01:00
  • e2ad93fd57 fix: enable word detection without columns (full-page pseudo-column) Benjamin Admin 2026-03-11 00:16:31 +01:00
  • 2cbdfc56f3 feat: OCR Overlay, full-page reconstruction without column detection Benjamin Admin 2026-03-11 00:08:05 +01:00
  • 840918df2a fix: do not rotate the original image again in overlay (orientation already applied in cropped image) Benjamin Admin 2026-03-10 23:25:20 +01:00
  • eb3fc05cdc fix: decide box-zone clamping by box center instead of cell center Benjamin Admin 2026-03-10 23:10:51 +01:00
  • 9dbb5fa708 fix: move useMemo before early returns (Rules of Hooks) Benjamin Admin 2026-03-10 22:57:25 +01:00
  • f468c30112 fix: clamp cells to box zone in overlay mode (no overlap) Benjamin Admin 2026-03-10 22:52:08 +01:00
  • 618c82ef42 fix: no longer cut off rows at box boundary (border_thickness margin) Benjamin Admin 2026-03-10 17:44:02 +01:00