-
5f89913a9a
Fix IPA continuation to check all columns, not just en_col_type
Benjamin Admin
2026-03-19 23:34:41 +01:00
-
3c7fc43f43
Fix test expectation: valid IPA in brackets also triggers detection
Benjamin Admin
2026-03-19 23:30:24 +01:00
-
6bfa9eed86
Fix garbled IPA detection for bracket-notation like [n, nn] and [1uedtX,1]
Benjamin Admin
2026-03-19 23:28:00 +01:00
-
7750b2a05f
Fix ghost filter for borderless boxes + remove oversized graphic artifacts
Benjamin Admin
2026-03-19 23:04:00 +01:00
-
e3395ae8cf
Fix overlay word leak, ghost filter false positive, merged zone header
Benjamin Admin
2026-03-19 13:56:04 +01:00
-
df30d4eae3
Add zone merging across images + heading detection by color/height
Benjamin Admin
2026-03-19 12:22:11 +01:00
-
2e6ab3a646
Fix IPA marker split: walk back max 3 chars for onset cluster
Benjamin Admin
2026-03-19 10:57:15 +01:00
-
cc5ee74921
Use OCR-recognized IPA when word not in dictionary
Benjamin Admin
2026-03-19 10:55:36 +01:00
-
21d37b5da1
Fix prefix matching: use alpha-only chars, min 4-char prefix
Benjamin Admin
2026-03-19 10:40:37 +01:00
-
19cbbf310a
Improve garbled IPA cleanup: trailing strip, prefix match, broader guard
Benjamin Admin
2026-03-19 10:36:25 +01:00
-
fc0ab84e40
Fix garbled IPA in continuation rows using headword lookup
Benjamin Admin
2026-03-19 10:28:14 +01:00
-
050d410ba0
Preserve IPA continuation rows in grid output
Benjamin Admin
2026-03-19 10:22:58 +01:00
-
038eaf783c
Only insert IPA when garbled phonetics exist in OCR text
Benjamin Admin
2026-03-19 09:59:21 +01:00
-
432eee3694
Auto-filter decorative margin strips and header junk
Benjamin Admin
2026-03-19 09:38:24 +01:00
-
8e4cbd84c2
Invalidate grid_editor_result when exclude regions change
Benjamin Admin
2026-03-19 09:19:09 +01:00
-
f9d71d50d1
Add exclude region marking in Structure step
Benjamin Admin
2026-03-19 09:08:30 +01:00
-
c09838e91c
Fix spine shadow false positives: require dark valley, brightness rise, trim convolution edges
Benjamin Admin
2026-03-19 08:23:50 +01:00
-
3fd6523872
Cut at spine center (darkest point) instead of shadow edge
Benjamin Admin
2026-03-19 07:54:33 +01:00
-
e56391b0c3
Add right-edge spine shadow detection for book scans
Benjamin Admin
2026-03-19 07:41:13 +01:00
-
a3e2a7f994
Add GT button to OCR overlay, prominent category picker, track pipeline
Benjamin Admin
2026-03-18 14:49:02 +01:00
-
f655db30e4
Add Ground Truth regression test system for OCR pipeline
Benjamin Admin
2026-03-18 13:46:48 +01:00
-
c894a0feeb
Improve IPA continuation row detection with phonetic heuristics
Benjamin Admin
2026-03-18 12:08:21 +01:00
-
8ef4c089cf
Remove IPA continuation rows and support hyphenated word lookup
Benjamin Admin
2026-03-18 12:05:38 +01:00
-
821e5481c2
Only apply IPA correction on vocabulary tables (≥3 columns)
Benjamin Admin
2026-03-18 11:50:03 +01:00
-
b98ea33a3a
Strip garbled OCR phonetics after IPA insertion
Benjamin Admin
2026-03-18 11:15:14 +01:00
-
f139d0903e
Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid
Benjamin Admin
2026-03-18 11:08:23 +01:00
-
962bbbe9f6
Remove scattered debris rows and disable spanning header detection
Benjamin Admin
2026-03-18 10:47:17 +01:00
-
9da45c2a59
Fix false header detection and add decorative margin/footer filters
Benjamin Admin
2026-03-18 10:38:20 +01:00
-
64447ad352
Raise color sat_threshold from 50 to 55 to avoid scanner blue artifacts
Benjamin Admin
2026-03-18 09:13:09 +01:00
-
00cbf266cb
Add oversized-stub filter for large page numbers/marks in grid rows
Benjamin Admin
2026-03-18 09:05:07 +01:00
-
f9bad7beaa
Filter phantom rows from recovered color artifacts and low-conf OCR noise
Benjamin Admin
2026-03-18 09:00:43 +01:00
-
143e41ec76
add: ocr_pipeline_overlays.py for overlay rendering functions
Benjamin Admin
2026-03-18 08:46:49 +01:00
-
ec287fd12e
refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules
Benjamin Admin
2026-03-18 08:42:00 +01:00
-
98f7f7d7d5
fix: NameError in paddle_kombi/rapid_kombi cache update
Benjamin Admin
2026-03-18 08:12:01 +01:00
-
a19bca6060
fix: lower color sat_threshold from 70 to 50 for green text detection
Benjamin Admin
2026-03-18 08:00:35 +01:00
-
7a76697f95
fix: always re-run structure detection instead of using cached result
Benjamin Admin
2026-03-18 07:43:44 +01:00
-
5359a4cc2b
fix: cache word_result in paddle_kombi/rapid_kombi for detect-structure
Benjamin Admin
2026-03-18 07:29:02 +01:00
-
a25214126d
fix: merge overlapping OCR words with different text (Stick/Stück)
Benjamin Admin
2026-03-18 07:00:57 +01:00
-
fd79d5e4fa
fix: prevent grid table overflow when union columns exceed zone bbox
Benjamin Admin
2026-03-17 19:43:00 +01:00
-
19b93f7762
fix: conservative column detection + smart graphic word filter
Benjamin Admin
2026-03-17 18:19:25 +01:00
-
a079ffe8e9
fix: robust colored-text detection in graphic filter
Benjamin Admin
2026-03-17 18:09:16 +01:00
-
6e1d715d0d
fix: prevent colored text from being falsely detected as graphics
Benjamin Admin
2026-03-17 17:30:35 +01:00
-
d66efdecf5
fix: NameError in detect_page_splits — 'gaps' var removed in rewrite
Benjamin Admin
2026-03-17 17:01:34 +01:00
-
d36972b464
fix: detect spine by brightness, not ink density
Benjamin Admin
2026-03-17 16:52:29 +01:00
-
f30e526917
fix: merge nearby spine gaps + handle multi-page crop in frontend
Benjamin Admin
2026-03-17 16:44:32 +01:00
-
438a4495c7
fix: swap 90°/270° rotation direction in orientation detection
Benjamin Admin
2026-03-17 16:39:15 +01:00
-
902de027f4
feat: auto-detect multi-page spreads and split into sub-sessions
Benjamin Admin
2026-03-17 16:34:06 +01:00
-
b1cdb2531c
feat: CSS Grid editor with OCR-measured column widths and row heights
Benjamin Admin
2026-03-17 13:48:47 +01:00
-
ab30e8b17a
feat: apply IPA phonetic correction in build-grid combo mode
Benjamin Admin
2026-03-17 12:53:58 +01:00
-
b0e1fbc8d6
feat: box zone artifact filter, spanning headers, parenthesis fix
Benjamin Admin
2026-03-17 11:31:55 +01:00
-
872b47f691
fix: filter words and color recoveries inside graphic/image regions
Benjamin Admin
2026-03-17 11:20:07 +01:00
-
bbf0a5720e
fix: require both horizontal AND vertical overlap for word dedup
Benjamin Admin
2026-03-17 10:57:44 +01:00
-
29d3c1caf5
fix: deduplicate overlapping words after Paddle+Tesseract merge
Benjamin Admin
2026-03-17 10:47:42 +01:00
-
aae8a96aa2
fix: sort word_boxes in reading order (Y-grouped, then X-sorted)
Benjamin Admin
2026-03-17 10:41:30 +01:00
-
2b73d9beec
fix: increase color recovery occupancy padding to prevent gap artifacts
Benjamin Admin
2026-03-17 10:28:56 +01:00
-
324f39a9cc
fix: merge inline marker columns + improve ghost edge detection
Benjamin Admin
2026-03-17 10:10:07 +01:00
-
febd0a2f84
fix: border ghost filter + row overlap fix for box zones
Benjamin Admin
2026-03-17 09:54:50 +01:00
-
43b1f8be58
diag: increase zone logging threshold to 60 words
Benjamin Admin
2026-03-17 09:49:19 +01:00
-
43dec5dd91
diag: add row-clustering logging for small/box zones
Benjamin Admin
2026-03-17 09:45:29 +01:00
-
dfce8415d7
fix: show per-word colors in grid table instead of whole-cell coloring
Benjamin Admin
2026-03-17 08:55:43 +01:00
-
92a52a3199
fix: apply column union when total_cols >= max (not just >)
Benjamin Admin
2026-03-17 00:14:59 +01:00
-
427fecdce0
fix: union column detection across all content zones
Benjamin Admin
2026-03-16 23:02:33 +01:00
-
9fb3229270
fix: lower tertiary gap threshold for narrow margin column detection
Benjamin Admin
2026-03-16 22:56:03 +01:00
-
91625a2646
fix: add tertiary tier for narrow margin columns (page refs, markers)
Benjamin Admin
2026-03-16 22:40:40 +01:00
-
02ae6249ca
fix: propagate columns from largest content zone instead of global detection
Benjamin Admin
2026-03-16 22:30:15 +01:00
-
cf995f2d52
fix: global column detection across content zones in Kombi grid builder
Benjamin Admin
2026-03-16 22:04:17 +01:00
-
0340204c1f
feat: box-aware column detection — exclude box content from global columns
Benjamin Admin
2026-03-16 18:42:46 +01:00
-
729ebff63c
feat: add border ghost filter + graphic detection tests + structure overlay
Benjamin Admin
2026-03-16 18:28:53 +01:00
-
6668661895
feat: region-based graphic detection with word-overlap filtering
Benjamin Admin
2026-03-16 14:49:15 +01:00
-
eeee61108a
fix: remove morph close that merged balloons into giant blob
Benjamin Admin
2026-03-16 14:42:51 +01:00
-
1653e7cff4
feat: two-pass graphic detection (color channel + ink)
Benjamin Admin
2026-03-16 14:30:33 +01:00
-
86ae71fd65
fix: only detect circles and illustrations, drop arrow/icon/line
Benjamin Admin
2026-03-16 14:20:17 +01:00
-
ba513968c5
fix: relax graphic detection for small circles/balloons
Benjamin Admin
2026-03-16 14:00:09 +01:00
-
f717e1c0df
debug: use INFO level for skip-reason logs
Benjamin Admin
2026-03-16 13:57:08 +01:00
-
934b5648a2
debug: add detailed skip-reason logging to graphic detection
Benjamin Admin
2026-03-16 13:56:12 +01:00
-
fe7339c7a1
fix: suppress text fragments in graphic detection
Benjamin Admin
2026-03-16 13:51:02 +01:00
-
3aa4a63257
fix: move Structure step after OCR so word boxes are available for exclusion
Benjamin Admin
2026-03-16 13:38:58 +01:00
-
6b9b280ba3
feat: integrate graphic element detection into structure step
Benjamin Admin
2026-03-16 13:21:55 +01:00
-
1d34785e2b
feat: add Structure step to Kombi mode in OCR Overlay page
Benjamin Admin
2026-03-16 12:59:05 +01:00
-
5b5213c2b9
feat: add Structure Detection step to OCR pipeline
Benjamin Admin
2026-03-16 12:31:09 +01:00
-
fbbec6cf5e
feat: run shading-based box detection alongside line detection
Benjamin Admin
2026-03-16 08:12:52 +01:00
-
a6951940b9
fix: use median hue, Otsu threshold, and background subtraction for colors
Benjamin Admin
2026-03-16 07:44:03 +01:00
-
4a8d43fd71
feat: display detected text colors in grid editor UI
Benjamin Admin
2026-03-15 01:03:09 +01:00
-
bcd55e12d7
fix: run color annotation on final cell word_boxes, not pre-grid words
Benjamin Admin
2026-03-15 00:53:04 +01:00
-
2bd63ec402
feat: add color detection for OCR word boxes
Benjamin Admin
2026-03-15 00:50:09 +01:00
-
39a4d8564c
chore: add per-cluster debug logging for column alignment detection
Benjamin Admin
2026-03-15 00:18:28 +01:00
-
1162eac7b4
fix: use group-start positions for column detection, not all word left-edges
Benjamin Admin
2026-03-15 00:10:29 +01:00
-
28352f5bab
feat: replace gap-based column detection with left-edge alignment algorithm
Benjamin Admin
2026-03-15 00:03:58 +01:00
-
c3f1547e32
feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)
Benjamin Admin
2026-03-14 23:41:03 +01:00
-
4a15d46dfd
refactor: rename PaddleOCR → PP-OCRv5 in frontend, remove Kombi-Vergleich tab
Benjamin Admin
2026-03-14 09:11:26 +01:00
-
b83b38e7f2
feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback
Benjamin Admin
2026-03-14 08:26:04 +01:00
-
a994ddee83
feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison
Benjamin Admin
2026-03-14 07:59:06 +01:00
-
c2c082d4b4
docs+tests: update OCR Pipeline docs and add overlay position tests
Benjamin Admin
2026-03-13 21:03:00 +01:00
-
d6f51e4418
fix: deduplicate overlapping OCR words and use per-word Y positions in overlay
Benjamin Admin
2026-03-13 20:27:08 +01:00
-
703e110bab
fix: split PaddleOCR multi-word boxes before merge
Benjamin Admin
2026-03-13 10:39:10 +01:00
-
41ff7671cd
fix: update PaddleOCR init for v3.4+ API (lang=en, ocr_version=PP-OCRv5)
Benjamin Admin
2026-03-13 09:39:33 +01:00
-
8e42e36ee4
fix: replace deprecated libgl1-mesa-glx with libgl1 in paddleocr Dockerfile
Benjamin Admin
2026-03-13 09:11:12 +01:00
-
24e1e93b5b
fix: save raw paddle/tesseract words in kombi session for debugging
Benjamin Admin
2026-03-13 09:03:01 +01:00
-
846292f632
fix: rewrite Kombi merge with row-based sequence alignment
Benjamin Admin
2026-03-13 08:45:03 +01:00
-
4280298e02
fix: add _deduplicate_words safety net to Kombi merge
Benjamin Admin
2026-03-13 08:27:45 +01:00