Commit Graph

  • 5f89913a9a Fix IPA continuation to check all columns, not just en_col_type Benjamin Admin 2026-03-19 23:34:41 +01:00
  • 3c7fc43f43 Fix test expectation: valid IPA in brackets also triggers detection Benjamin Admin 2026-03-19 23:30:24 +01:00
  • 6bfa9eed86 Fix garbled IPA detection for bracket-notation like [n, nn] and [1uedtX,1] Benjamin Admin 2026-03-19 23:28:00 +01:00
  • 7750b2a05f Fix ghost filter for borderless boxes + remove oversized graphic artifacts Benjamin Admin 2026-03-19 23:04:00 +01:00
  • e3395ae8cf Fix overlay word leak, ghost filter false positive, merged zone header Benjamin Admin 2026-03-19 13:56:04 +01:00
  • df30d4eae3 Add zone merging across images + heading detection by color/height Benjamin Admin 2026-03-19 12:22:11 +01:00
  • 2e6ab3a646 Fix IPA marker split: walk back max 3 chars for onset cluster Benjamin Admin 2026-03-19 10:57:15 +01:00
  • cc5ee74921 Use OCR-recognized IPA when word not in dictionary Benjamin Admin 2026-03-19 10:55:36 +01:00
  • 21d37b5da1 Fix prefix matching: use alpha-only chars, min 4-char prefix Benjamin Admin 2026-03-19 10:40:37 +01:00
  • 19cbbf310a Improve garbled IPA cleanup: trailing strip, prefix match, broader guard Benjamin Admin 2026-03-19 10:36:25 +01:00
  • fc0ab84e40 Fix garbled IPA in continuation rows using headword lookup Benjamin Admin 2026-03-19 10:28:14 +01:00
  • 050d410ba0 Preserve IPA continuation rows in grid output Benjamin Admin 2026-03-19 10:22:58 +01:00
  • 038eaf783c Only insert IPA when garbled phonetics exist in OCR text Benjamin Admin 2026-03-19 09:59:21 +01:00
  • 432eee3694 Auto-filter decorative margin strips and header junk Benjamin Admin 2026-03-19 09:38:24 +01:00
  • 8e4cbd84c2 Invalidate grid_editor_result when exclude regions change Benjamin Admin 2026-03-19 09:19:09 +01:00
  • f9d71d50d1 Add exclude region marking in Structure step Benjamin Admin 2026-03-19 09:08:30 +01:00
  • c09838e91c Fix spine shadow false positives: require dark valley, brightness rise, trim convolution edges Benjamin Admin 2026-03-19 08:23:50 +01:00
  • 3fd6523872 Cut at spine center (darkest point) instead of shadow edge Benjamin Admin 2026-03-19 07:54:33 +01:00
  • e56391b0c3 Add right-edge spine shadow detection for book scans Benjamin Admin 2026-03-19 07:41:13 +01:00
  • a3e2a7f994 Add GT button to OCR overlay, prominent category picker, track pipeline Benjamin Admin 2026-03-18 14:49:02 +01:00
  • f655db30e4 Add Ground Truth regression test system for OCR pipeline Benjamin Admin 2026-03-18 13:46:48 +01:00
  • c894a0feeb Improve IPA continuation row detection with phonetic heuristics Benjamin Admin 2026-03-18 12:08:21 +01:00
  • 8ef4c089cf Remove IPA continuation rows and support hyphenated word lookup Benjamin Admin 2026-03-18 12:05:38 +01:00
  • 821e5481c2 Only apply IPA correction on vocabulary tables (≥3 columns) Benjamin Admin 2026-03-18 11:50:03 +01:00
  • b98ea33a3a Strip garbled OCR phonetics after IPA insertion Benjamin Admin 2026-03-18 11:15:14 +01:00
  • f139d0903e Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid Benjamin Admin 2026-03-18 11:08:23 +01:00
  • 962bbbe9f6 Remove scattered debris rows and disable spanning header detection Benjamin Admin 2026-03-18 10:47:17 +01:00
  • 9da45c2a59 Fix false header detection and add decorative margin/footer filters Benjamin Admin 2026-03-18 10:38:20 +01:00
  • 64447ad352 Raise color sat_threshold from 50 to 55 to avoid scanner blue artifacts Benjamin Admin 2026-03-18 09:13:09 +01:00
  • 00cbf266cb Add oversized-stub filter for large page numbers/marks in grid rows Benjamin Admin 2026-03-18 09:05:07 +01:00
  • f9bad7beaa Filter phantom rows from recovered color artifacts and low-conf OCR noise Benjamin Admin 2026-03-18 09:00:43 +01:00
  • 143e41ec76 add: ocr_pipeline_overlays.py for overlay rendering functions Benjamin Admin 2026-03-18 08:46:49 +01:00
  • ec287fd12e refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules Benjamin Admin 2026-03-18 08:42:00 +01:00
  • 98f7f7d7d5 fix: NameError in paddle_kombi/rapid_kombi cache update Benjamin Admin 2026-03-18 08:12:01 +01:00
  • a19bca6060 fix: lower color sat_threshold from 70 to 50 for green text detection Benjamin Admin 2026-03-18 08:00:35 +01:00
  • 7a76697f95 fix: always re-run structure detection instead of using cached result Benjamin Admin 2026-03-18 07:43:44 +01:00
  • 5359a4cc2b fix: cache word_result in paddle_kombi/rapid_kombi for detect-structure Benjamin Admin 2026-03-18 07:29:02 +01:00
  • a25214126d fix: merge overlapping OCR words with different text (Stick/Stück) Benjamin Admin 2026-03-18 07:00:57 +01:00
  • fd79d5e4fa fix: prevent grid table overflow when union columns exceed zone bbox Benjamin Admin 2026-03-17 19:43:00 +01:00
  • 19b93f7762 fix: conservative column detection + smart graphic word filter Benjamin Admin 2026-03-17 18:19:25 +01:00
  • a079ffe8e9 fix: robust colored-text detection in graphic filter Benjamin Admin 2026-03-17 18:09:16 +01:00
  • 6e1d715d0d fix: prevent colored text from being falsely detected as graphics Benjamin Admin 2026-03-17 17:30:35 +01:00
  • d66efdecf5 fix: NameError in detect_page_splits — 'gaps' var removed in rewrite Benjamin Admin 2026-03-17 17:01:34 +01:00
  • d36972b464 fix: detect spine by brightness, not ink density Benjamin Admin 2026-03-17 16:52:29 +01:00
  • f30e526917 fix: merge nearby spine gaps + handle multi-page crop in frontend Benjamin Admin 2026-03-17 16:44:32 +01:00
  • 438a4495c7 fix: swap 90°/270° rotation direction in orientation detection Benjamin Admin 2026-03-17 16:39:15 +01:00
  • 902de027f4 feat: auto-detect multi-page spreads and split into sub-sessions Benjamin Admin 2026-03-17 16:34:06 +01:00
  • b1cdb2531c feat: CSS Grid editor with OCR-measured column widths and row heights Benjamin Admin 2026-03-17 13:48:47 +01:00
  • ab30e8b17a feat: apply IPA phonetic correction in build-grid combo mode Benjamin Admin 2026-03-17 12:53:58 +01:00
  • b0e1fbc8d6 feat: box zone artifact filter, spanning headers, parenthesis fix Benjamin Admin 2026-03-17 11:31:55 +01:00
  • 872b47f691 fix: filter words and color recoveries inside graphic/image regions Benjamin Admin 2026-03-17 11:20:07 +01:00
  • bbf0a5720e fix: require both horizontal AND vertical overlap for word dedup Benjamin Admin 2026-03-17 10:57:44 +01:00
  • 29d3c1caf5 fix: deduplicate overlapping words after Paddle+Tesseract merge Benjamin Admin 2026-03-17 10:47:42 +01:00
  • aae8a96aa2 fix: sort word_boxes in reading order (Y-grouped, then X-sorted) Benjamin Admin 2026-03-17 10:41:30 +01:00
  • 2b73d9beec fix: increase color recovery occupancy padding to prevent gap artifacts Benjamin Admin 2026-03-17 10:28:56 +01:00
  • 324f39a9cc fix: merge inline marker columns + improve ghost edge detection Benjamin Admin 2026-03-17 10:10:07 +01:00
  • febd0a2f84 fix: border ghost filter + row overlap fix for box zones Benjamin Admin 2026-03-17 09:54:50 +01:00
  • 43b1f8be58 diag: increase zone logging threshold to 60 words Benjamin Admin 2026-03-17 09:49:19 +01:00
  • 43dec5dd91 diag: add row-clustering logging for small/box zones Benjamin Admin 2026-03-17 09:45:29 +01:00
  • dfce8415d7 fix: show per-word colors in grid table instead of whole-cell coloring Benjamin Admin 2026-03-17 08:55:43 +01:00
  • 92a52a3199 fix: apply column union when total_cols >= max (not just >) Benjamin Admin 2026-03-17 00:14:59 +01:00
  • 427fecdce0 fix: union column detection across all content zones Benjamin Admin 2026-03-16 23:02:33 +01:00
  • 9fb3229270 fix: lower tertiary gap threshold for narrow margin column detection Benjamin Admin 2026-03-16 22:56:03 +01:00
  • 91625a2646 fix: add tertiary tier for narrow margin columns (page refs, markers) Benjamin Admin 2026-03-16 22:40:40 +01:00
  • 02ae6249ca fix: propagate columns from largest content zone instead of global detection Benjamin Admin 2026-03-16 22:30:15 +01:00
  • cf995f2d52 fix: global column detection across content zones in Kombi grid builder Benjamin Admin 2026-03-16 22:04:17 +01:00
  • 0340204c1f feat: box-aware column detection — exclude box content from global columns Benjamin Admin 2026-03-16 18:42:46 +01:00
  • 729ebff63c feat: add border ghost filter + graphic detection tests + structure overlay Benjamin Admin 2026-03-16 18:28:53 +01:00
  • 6668661895 feat: region-based graphic detection with word-overlap filtering Benjamin Admin 2026-03-16 14:49:15 +01:00
  • eeee61108a fix: remove morph close that merged balloons into giant blob Benjamin Admin 2026-03-16 14:42:51 +01:00
  • 1653e7cff4 feat: two-pass graphic detection (color channel + ink) Benjamin Admin 2026-03-16 14:30:33 +01:00
  • 86ae71fd65 fix: only detect circles and illustrations, drop arrow/icon/line Benjamin Admin 2026-03-16 14:20:17 +01:00
  • ba513968c5 fix: relax graphic detection for small circles/balloons Benjamin Admin 2026-03-16 14:00:09 +01:00
  • f717e1c0df debug: use INFO level for skip-reason logs Benjamin Admin 2026-03-16 13:57:08 +01:00
  • 934b5648a2 debug: add detailed skip-reason logging to graphic detection Benjamin Admin 2026-03-16 13:56:12 +01:00
  • fe7339c7a1 fix: suppress text fragments in graphic detection Benjamin Admin 2026-03-16 13:51:02 +01:00
  • 3aa4a63257 fix: move Struktur step after OCR so word boxes are available for exclusion Benjamin Admin 2026-03-16 13:38:58 +01:00
  • 6b9b280ba3 feat: integrate graphic element detection into structure step Benjamin Admin 2026-03-16 13:21:55 +01:00
  • 1d34785e2b feat: add Structure step to Kombi mode in OCR Overlay page Benjamin Admin 2026-03-16 12:59:05 +01:00
  • 5b5213c2b9 feat: add Structure Detection step to OCR pipeline Benjamin Admin 2026-03-16 12:31:09 +01:00
  • fbbec6cf5e feat: run shading-based box detection alongside line detection Benjamin Admin 2026-03-16 08:12:52 +01:00
  • a6951940b9 fix: use median hue, Otsu threshold, and background subtraction for colors Benjamin Admin 2026-03-16 07:44:03 +01:00
  • 4a8d43fd71 feat: display detected text colors in grid editor UI Benjamin Admin 2026-03-15 01:03:09 +01:00
  • bcd55e12d7 fix: run color annotation on final cell word_boxes, not pre-grid words Benjamin Admin 2026-03-15 00:53:04 +01:00
  • 2bd63ec402 feat: add color detection for OCR word boxes Benjamin Admin 2026-03-15 00:50:09 +01:00
  • 39a4d8564c chore: add per-cluster debug logging for column alignment detection Benjamin Admin 2026-03-15 00:18:28 +01:00
  • 1162eac7b4 fix: use group-start positions for column detection, not all word left-edges Benjamin Admin 2026-03-15 00:10:29 +01:00
  • 28352f5bab feat: replace gap-based column detection with left-edge alignment algorithm Benjamin Admin 2026-03-15 00:03:58 +01:00
  • c3f1547e32 feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6) Benjamin Admin 2026-03-14 23:41:03 +01:00
  • 4a15d46dfd refactor: rename PaddleOCR → PP-OCRv5 in frontend, remove Kombi-Vergleich tab Benjamin Admin 2026-03-14 09:11:26 +01:00
  • b83b38e7f2 feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback Benjamin Admin 2026-03-14 08:26:04 +01:00
  • a994ddee83 feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison Benjamin Admin 2026-03-14 07:59:06 +01:00
  • c2c082d4b4 docs+tests: update OCR Pipeline docs and add overlay position tests Benjamin Admin 2026-03-13 21:03:00 +01:00
  • d6f51e4418 fix: deduplicate overlapping OCR words and use per-word Y positions in overlay Benjamin Admin 2026-03-13 20:27:08 +01:00
  • 703e110bab fix: split PaddleOCR multi-word boxes before merge Benjamin Admin 2026-03-13 10:39:10 +01:00
  • 41ff7671cd fix: update PaddleOCR init for v3.4+ API (lang=en, ocr_version=PP-OCRv5) Benjamin Admin 2026-03-13 09:39:33 +01:00
  • 8e42e36ee4 fix: replace deprecated libgl1-mesa-glx with libgl1 in paddleocr Dockerfile Benjamin Admin 2026-03-13 09:11:12 +01:00
  • 24e1e93b5b fix: save raw paddle/tesseract words in kombi session for debugging Benjamin Admin 2026-03-13 09:03:01 +01:00
  • 846292f632 fix: rewrite Kombi merge with row-based sequence alignment Benjamin Admin 2026-03-13 08:45:03 +01:00
  • 4280298e02 fix: add _deduplicate_words safety net to Kombi merge Benjamin Admin 2026-03-13 08:27:45 +01:00
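
Commit aae8a96aa2 above describes sorting word_boxes in reading order (Y-grouped, then X-sorted). The sketch below is one way such a sort could look, not the repository's actual code. The names `WordBox`, `sort_reading_order` and `y_tolerance`, and the field layout, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    # Hypothetical word-box shape; the real pipeline's fields may differ.
    text: str
    left: int
    top: int
    width: int
    height: int


def _center_y(word):
    """Vertical center of a word box."""
    return word.top + word.height / 2


def sort_reading_order(words, y_tolerance=0.5):
    """Group words into rows by vertical center, then sort each row by x.

    y_tolerance is an assumed parameter: the fraction of the current row's
    mean height within which a word still counts as part of that row.
    """
    if not words:
        return []
    # Visit words top to bottom so rows are built in page order.
    by_y = sorted(words, key=_center_y)
    rows = [[by_y[0]]]
    for word in by_y[1:]:
        row = rows[-1]
        row_center = sum(_center_y(w) for w in row) / len(row)
        row_height = sum(w.height for w in row) / len(row)
        if abs(_center_y(word) - row_center) <= y_tolerance * row_height:
            row.append(word)    # same text line
        else:
            rows.append([word])  # start a new text line
    # Within each row, order words left to right.
    return [w for row in rows for w in sorted(row, key=lambda w: w.left)]
```

Grouping by row before sorting by x avoids the problem of a plain `(top, left)` sort, where a word sitting one or two pixels higher than its left neighbour would be placed ahead of it.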