breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	729ebff63c	feat: add border ghost filter + graphic detection tests + structure overlay - Add _filter_border_ghost_words() to remove OCR artefacts from box borders (vertical + horizontal edge detection, column cleanup, re-indexing) - Add 20 tests for border ghost filter (basic filtering + column cleanup) - Add 24 tests for cv_graphic_detect (color detection, word overlap, boxes) - Clean up cv_graphic_detect.py logging (per-candidate → DEBUG) - Add structure overlay layer to StepReconstruction (boxes + graphics toggle) - Show border_ghosts_removed badge in StepStructureDetection - Update MkDocs with structure detection documentation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 18:28:53 +01:00
Benjamin Admin	6668661895	feat: region-based graphic detection with word-overlap filtering New approach: dilate color mask heavily (25x25) to merge nearby colored pixels into regions, then check word overlap: - >50% overlap with OCR word boxes → colored text → skip - <50% overlap → colored image/graphic → keep This detects balloon clusters as one "image" region instead of trying to classify individual shapes. Red words like "borrow/lend" are filtered because they overlap with their word boxes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 14:49:15 +01:00
Benjamin Admin	eeee61108a	fix: remove morph close that merged balloons into giant blob The 5x5 MORPH_CLOSE was connecting scattered color pixels into one page-spanning contour that swallowed individual balloons. Fix: - Remove MORPH_CLOSE, keep only MORPH_OPEN for speckle removal - Lower sat threshold 50→40 to catch more colored elements - Filter contours spanning >50% of width OR height (was AND) - Filter contours >10% of image area Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 14:42:51 +01:00
Benjamin Admin	1653e7cff4	feat: two-pass graphic detection (color channel + ink) Pass 1 (color): Detect colored graphics on HSV saturation channel. Black text is invisible on this channel, so no word exclusion needed. Catches colored balloons, arrows, icons reliably. Pass 2 (ink): Detect large black illustrations on dark ink mask minus word exclusion. Only keeps area > 5000 to avoid text fragments. Fixes: all 5 balloons now detectable (previously word exclusion zones were eating colored graphics that overlapped with nearby OCR words). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 14:30:33 +01:00
Benjamin Admin	86ae71fd65	fix: only detect circles and illustrations, drop arrow/icon/line Text fragments after word exclusion are indistinguishable from arrows and icons via contour metrics. Since the goal is detecting graphics, images, boxes and colors (not arrows/icons), simplify to only: - circle/balloon (circularity > 0.55, very reliable) - illustration (area > 3000, clearly non-text) Boxes and colors are handled by cv_box_detect and cv_color_detect. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 14:20:17 +01:00
Benjamin Admin	ba513968c5	fix: relax graphic detection for small circles/balloons - Lower min_area from 200 to 80 (small balloons ~100-300px²) - Lower word_pad from 10 to 5 (10px was eating nearby graphics) - Relax circle detection: circularity>0.55, min_dim>15 (was 0.70/25) - Text fragments still filtered by _classify_shape noise threshold - Add ACCEPT logging for debugging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 14:00:09 +01:00
Benjamin Admin	f717e1c0df	debug: use INFO level for skip-reason logs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:57:08 +01:00
Benjamin Admin	934b5648a2	debug: add detailed skip-reason logging to graphic detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:56:12 +01:00
Benjamin Admin	fe7339c7a1	fix: suppress text fragments in graphic detection - Raise min_area from 30 to 200 (text fragments are small) - Raise word_pad from 3 to 10px (OCR bboxes are tight) - Reduce morph close kernel from 5x5 to 3x3 (avoid reconnecting text) - Tighten arrow detection: min 20px, circularity<0.35, >=2 defects - Add 'noise' category for too-small elements, filter them out - Raise min dimension from 4 to 8px - Add debug logging for word count and exclusion coverage - Raise max_area_ratio to 0.25 (allow larger illustrations) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:51:02 +01:00
Benjamin Admin	3aa4a63257	fix: move Struktur step after OCR so word boxes are available for exclusion Graphic detection needs word positions to exclude text from the ink mask. Previously Struktur ran before OCR, causing every word to be detected as a graphic element. Now: - Pipeline: Struktur at index 7 (after Wörter) - Kombi: Struktur at index 5 (after PP-OCRv5+Tesseract, before Tabelle) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:38:58 +01:00
Benjamin Admin	6b9b280ba3	feat: integrate graphic element detection into structure step Add cv_graphic_detect.py for detecting non-text visual elements (arrows, circles, lines, exclamation marks, icons, illustrations). Draw detected graphics on structure overlay image and display them in the frontend StepStructureDetection component with shape counts and individual listings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:21:55 +01:00
Benjamin Admin	1d34785e2b	feat: add Structure step to Kombi mode in OCR Overlay page Insert the Struktur detection step between Zuschneiden and PP-OCRv5+Tesseract in the Kombi pipeline on /ai/ocr-overlay. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 12:59:05 +01:00
Benjamin Admin	5b5213c2b9	feat: add Structure Detection step to OCR pipeline New pipeline step between Crop and Columns that visualizes detected document structure: boxes (line-based + shading), page zones, and color regions. Shows original image on the left, annotated overlay on the right. Backend: POST /detect-structure endpoint + /image/structure-overlay Frontend: StepStructureDetection component with zone/box/color details Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 12:31:09 +01:00
Benjamin Admin	fbbec6cf5e	feat: run shading-based box detection alongside line detection Previously color/shading detection only ran as fallback when no line-based boxes were found. Now both methods run in parallel with result merging, so smaller shaded boxes (like "German leihen") get detected even when larger bordered boxes are already found. Uses median-blur background analysis that works for both colored and grayscale/B&W scans. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 08:12:52 +01:00
Benjamin Admin	a6951940b9	fix: use median hue, Otsu threshold, and background subtraction for colors - Median hue instead of mean (robust to background contamination) - Otsu threshold instead of fixed 180 (adapts to colored backgrounds) - Background sampling from border pixels with hue-distance filter - Higher sat_threshold (70) + min_sat_ratio (25%) to reduce false positives - Classify using saturated pixels only for cleaner hue signal Fixes: borrow/lend misdetected as orange (actually red, median_H=5) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 07:44:03 +01:00
Benjamin Admin	4a8d43fd71	feat: display detected text colors in grid editor UI - Add color/color_name/recovered fields to OcrWordBox type - GridTable: show colored text + left-edge color indicator strip - GridEditor: show color stats and recovered count in summary bar Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 01:03:09 +01:00
Benjamin Admin	bcd55e12d7	fix: run color annotation on final cell word_boxes, not pre-grid words _build_cells() creates new word_box dicts, so color fields set before grid building were lost. Now detect_word_colors() runs after cells are built, on the final word_boxes. Recovery still runs before grid building so recovered words participate in column/row detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 00:53:04 +01:00
Benjamin Admin	2bd63ec402	feat: add color detection for OCR word boxes New cv_color_detect.py module: - detect_word_colors(): annotates existing words with text color (HSV analysis) - recover_colored_text(): finds colored text regions missed by standard OCR (e.g. red ! markers) using HSV masks + contour detection Integrated into build-grid: words get color/color_name fields, recovered colored regions are merged into the word list before grid building. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 00:50:09 +01:00
Benjamin Admin	39a4d8564c	chore: add per-cluster debug logging for column alignment detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 00:18:28 +01:00
Benjamin Admin	1162eac7b4	fix: use group-start positions for column detection, not all word left-edges Only cluster left-edges of words that begin a new group within their row (first word or preceded by a large gap). This filters out mid-phrase word positions (IPA transcriptions, second words in multi-word entries) that were causing too many false columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 00:10:29 +01:00
Benjamin Admin	28352f5bab	feat: replace gap-based column detection with left-edge alignment algorithm Column detection now clusters word left-edges by X-proximity and filters by row coverage (Y-coverage), matching the proven approach from cv_layout.py but using precise OCR word positions instead of ink-based estimates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 00:03:58 +01:00
Benjamin Admin	c3f1547e32	feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6) Backend: new grid_editor_api.py with build-grid endpoint that detects bordered boxes, splits page into zones, clusters columns/rows per zone from Kombi word positions. New DB column grid_editor_result JSONB. Frontend: GridEditor component with editable HTML tables per zone, column bold toggle, header row toggle, undo/redo, keyboard navigation (Tab/Enter/Arrow), image overlay verification, and save/load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:41:03 +01:00
Benjamin Admin	4a15d46dfd	refactor: rename PaddleOCR → PP-OCRv5 in frontend, remove Kombi-Vergleich tab Since ocr_region_paddle() now runs RapidOCR locally (same PP-OCRv5 models), the "PaddleOCR (Hetzner)" labels were misleading. Renamed to "PP-OCRv5 (lokal)". Removed the Kombi-Vergleich tab since both sides would produce identical results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 09:11:26 +01:00
Benjamin Admin	b83b38e7f2	feat: use local RapidOCR as default in ocr_region_paddle(), remote as fallback RapidOCR uses the same PP-OCRv5 ONNX models locally, avoiding 504 timeouts from remote PaddleOCR on large images. Set FORCE_REMOTE_PADDLE=1 to bypass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 08:26:04 +01:00
Benjamin Admin	a994ddee83	feat: add Kombi-Vergleich mode for side-by-side Paddle vs RapidOCR comparison Add /rapid-kombi backend endpoint using local RapidOCR + Tesseract merge, KombiCompareStep component for parallel execution and side-by-side overlay, and wordResultOverride prop on OverlayReconstruction for direct data injection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 07:59:06 +01:00
Benjamin Admin	c2c082d4b4	docs+tests: update OCR Pipeline docs and add overlay position tests MkDocs: document row-based merge algorithm, spatial overlap dedup, and per-word yPct/hPct rendering in OCR Pipeline docs. Tests: add 9 vitest tests for useSlideWordPositions covering word-box path, fallback path, and yPct/hPct contract. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 21:03:00 +01:00
Benjamin Admin	d6f51e4418	fix: deduplicate overlapping OCR words and use per-word Y positions in overlay Backend: Add spatial overlap check (>=50% horizontal IoU) to Kombi merge so words at the same position are deduplicated even when OCR text differs. Frontend: Add yPct/hPct to WordPosition so each word renders at its actual vertical position instead of all words collapsing to the cell center Y. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 20:27:08 +01:00
Benjamin Admin	703e110bab	fix: split PaddleOCR multi-word boxes before merge PaddleOCR returns entire phrases as single boxes (e.g. "More than 200 singers took part in the"). The merge algorithm compared word-by-word but Paddle had multi-word boxes vs Tesseract's individual words, so nothing matched and all Tesseract words were added as "extras" causing duplicates. Now splits Paddle boxes into individual words before merge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 10:39:10 +01:00
Benjamin Admin	41ff7671cd	fix: update PaddleOCR init for v3.4+ API (lang=en, ocr_version=PP-OCRv5) PaddleOCR 3.4.0 removed 'latin' language support. Use 'en' with explicit ocr_version='PP-OCRv5' instead, with fallback for older API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 09:39:33 +01:00
Benjamin Admin	8e42e36ee4	fix: replace deprecated libgl1-mesa-glx with libgl1 in paddleocr Dockerfile Package was removed in Debian Trixie. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 09:11:12 +01:00
Benjamin Admin	24e1e93b5b	fix: save raw paddle/tesseract words in kombi session for debugging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 09:03:01 +01:00
Benjamin Admin	846292f632	fix: rewrite Kombi merge with row-based sequence alignment Replaces position-based word matching with row-based sequence alignment to fix doubled words and cross-line averaging in Kombi-Modus. New algorithm: 1. Group words into rows by Y-position clustering 2. Match rows between engines by vertical center proximity 3. Within each row: walk both sequences left-to-right, deduplicating 4. Unmatched rows kept as-is Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 08:45:03 +01:00
Benjamin Admin	4280298e02	fix: add _deduplicate_words safety net to Kombi merge Even after multi-criteria matching, near-duplicate words can slip through (same text, centers within 30px horizontal / 15px vertical). The new _deduplicate_words() removes these, keeping the higher-confidence copy. Regression test with real session data (row 2 with 145 near-dupes) confirms no duplicates remain after merge + deduplication. Tests: 37 → 45 (added TestDeduplicateWords, TestMergeRealWorldRegression). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 08:27:45 +01:00
Benjamin Admin	4f2fb0e94c	fix: Kombi-Modus merge now deduplicates same words from both engines The merge algorithm now uses 3 criteria instead of just IoU > 0.3: 1. IoU > 0.15 (relaxed threshold) 2. Center proximity < word height AND same row 3. Text similarity > 0.7 AND same row This prevents doubled overlapping words when both PaddleOCR and Tesseract find the same word at similar positions. Unique words from either engine (e.g. bullets from Tesseract) are still added. Tests expanded: 19 → 37 (added _box_center_dist, _text_similarity, _words_match tests + deduplication regression test). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 08:11:31 +01:00
Benjamin Admin	61c8169f9e	docs+test: add Kombi-Modus tests (19 passing) and MkDocs documentation - test_paddle_kombi.py: 6 IoU tests, 10 merge tests, 2 bullet-point tests - OCR-Pipeline.md: new "OCR Overlay" section with Paddle Direct/Kombi docs, merge algorithm flowchart, file structure update, changelog v4.5.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:18:46 +01:00
Benjamin Admin	e9ccd1e35c	feat: add Kombi-Modus (PaddleOCR + Tesseract) for OCR Overlay Runs both OCR engines on the preprocessed image and merges results: word boxes matched by IoU, coordinates averaged by confidence weight. Unmatched Tesseract words (bullets, symbols) are added for better coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:05:50 +01:00
Benjamin Admin	d335a7bbf3	fix: use OCR word_box coordinates directly instead of fuzzy matching The slide positioning hook was re-matching cell.text tokens against word_boxes via fuzzy text similarity, which broke positioning for special characters (!, bullet points, IPA). Now uses word_box coordinates directly: exact OCR positions without re-interpretation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 18:54:37 +01:00
Benjamin Admin	1f527fcd49	fix: split PaddleOCR boxes at leading ! for overlay word positioning When PaddleOCR returns "!Betonung" as a single word box, the overlay positions text starting at the "!" instead of the actual word. Split such boxes into ["!", "Betonung"] with proportional position splitting, matching the existing IPA bracket splitting logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:46:17 +01:00
Benjamin Admin	8349c28f54	fix: paddle_direct reuses build_grid_from_words for correct overlay Replaces custom _paddle_words_to_grid_cells with the proven build_grid_from_words from cv_words_first.py, the same function the regular pipeline uses with PaddleOCR. Handles phrase splitting, column clustering, and produces cells with word_boxes that the slide/cluster positioning hooks expect. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:19:52 +01:00
Benjamin Admin	71a1b5f058	fix: paddle_direct groups words per row (matching _build_cells format) One cell per row with all words as word_boxes instead of one cell per word. Gives OverlayReconstruction a row-spanning bbox_pct for correct font sizing and per-word positions for slide/cluster placement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:10:10 +01:00
Benjamin Admin	c743a38eaf	fix: Paddle Direct keeps preprocessing (orient/deskew/dewarp/crop) Uses the cropped/dewarped image instead of the original so the overlay shows the correctly oriented page. 5 steps instead of 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:56:18 +01:00
Benjamin Admin	90c1efd9b0	feat: Paddle Direct, 1-click OCR without deskew/dewarp/crop New 2-step mode (Upload → PaddleOCR+Overlay) alongside the existing 7-step pipeline. Backend endpoint runs PaddleOCR on the original image and clusters words into rows/cells directly. Frontend adds a mode toggle and PaddleDirectStep component. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:41:55 +01:00
Benjamin Admin	06d63d18f9	fix: generic fuzzy text matching for overlay word-box positioning Replace sequential 1:1 token-to-box mapping with fuzzy text matching. Each token from cell.text finds its best matching word_box by text similarity (normalized prefix match + substring bonus). Handles: - Reordered boxes (different sort between text and boxes) - IPA corrections changing token boundaries - Token/box count mismatches Unmatched tokens get interpolated positions from matched neighbors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:19:19 +01:00
Benjamin Admin	3e65b14b83	fix: split PaddleOCR boxes at IPA brackets for overlay positioning PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer produces "badge [bˈædʒ]" with space, creating a token count mismatch between cell.text and word_boxes. Now also split at "[" boundaries so each IPA bracket gets its own sub-box. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:08:17 +01:00
Benjamin Admin	40ac593d28	fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide PaddleOCR returns phrase-level bounding boxes (e.g. "competition [kompa'tifn]" as one box) but the overlay slide mechanism expects one box per word for accurate positioning. Multi-word boxes are now split proportionally by character count with small gaps between words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:00:06 +01:00
Benjamin Admin	ea69239e06	fix: word_boxes in words_first use absolute pixels (consistent with v2 grid) words_first was storing word_boxes in percent coordinates while cv_cell_grid.py uses absolute pixel coordinates. The overlay slide mechanism divides by imgW to get percentages, so percent-in-percent caused positions near zero. Now both grid builders use the same format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:04:04 +01:00
Benjamin Admin	bb90d1ba94	fix: PaddleOCR engine forces words_first in frontend to match backend When engine=paddle is selected, the backend overrides grid_method to words_first and returns plain JSON (no SSE streaming). The frontend was not aware of this override: it sent stream=true and tried to parse SSE events from a JSON response, resulting in "Keine Daten". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:52:18 +01:00
Benjamin Admin	685d135be5	fix: downscale large images before PaddleOCR (Traefik 60s limit) Images larger than 1500px are downscaled before upload. Coordinates are scaled back afterwards. JPEG instead of PNG for faster upload. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:28:58 +01:00
Benjamin Admin	e2c2acdf86	fix: increase PaddleOCR remote timeout to 120s for large scans Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 13:41:39 +01:00
Benjamin Admin	3cc496f7f3	feat(rag): Update Verbraucherschutz docs + chunk counts + Landkarte - Update chunk counts for 8 successfully ingested DE laws (Phase H1) - Add 6 new BGB-Teile entries (AGB, Fernabsatz, Kaufrecht, Widerruf, Digital) - Add EGBGB Widerrufsbelehrung entry - Update COLLECTION_TOTALS: gesetze 58304→63567 (+5263 Phase H chunks) - Add Verbraucherschutz thematic group to Landkarte - Extend ecommerce industry map with consumer protection regulations - Update date to March 2026 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 09:54:20 +01:00
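
The Kombi-Modus commit (e9ccd1e35c) describes its merge step only in prose: word boxes matched by IoU, coordinates averaged by confidence weight, and unmatched Tesseract words appended. A minimal sketch of that idea might look like the following. The names `WordBox`, `iou`, `merge_words` and the `iou_threshold` value of 0.3 are assumptions made for illustration and are not taken from the repository, whose actual implementation may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    """Hypothetical word box: pixel coordinates plus an OCR confidence."""
    text: str
    left: float
    top: float
    width: float
    height: float
    conf: float


def iou(a: WordBox, b: WordBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    x1 = max(a.left, b.left)
    y1 = max(a.top, b.top)
    x2 = min(a.left + a.width, b.left + b.width)
    y2 = min(a.top + a.height, b.top + b.height)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def merge_words(paddle: List[WordBox], tesseract: List[WordBox],
                iou_threshold: float = 0.3) -> List[WordBox]:
    """Greedy merge: each Paddle box takes its best-overlapping Tesseract box."""
    merged: List[WordBox] = []
    used = set()  # indices of Tesseract boxes already matched
    for p in paddle:
        best_j, best_score = None, 0.0
        for j, t in enumerate(tesseract):
            if j in used:
                continue
            score = iou(p, t)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None or best_score < iou_threshold:
            merged.append(p)  # no sufficiently overlapping partner: keep as-is
            continue
        t = tesseract[best_j]
        used.add(best_j)
        total = p.conf + t.conf
        wp = p.conf / total if total > 0 else 0.5  # confidence weight of Paddle
        wt = 1.0 - wp
        merged.append(WordBox(
            text=p.text if p.conf >= t.conf else t.text,  # assumed tie-break rule
            left=wp * p.left + wt * t.left,
            top=wp * p.top + wt * t.top,
            width=wp * p.width + wt * t.width,
            height=wp * p.height + wt * t.height,
            conf=max(p.conf, t.conf),
        ))
    # Unmatched Tesseract words (e.g. bullets, symbols) are added for coverage.
    merged.extend(t for j, t in enumerate(tesseract) if j not in used)
    return merged
```

In this sketch the greedy matching is order-dependent; a global assignment would be more robust on dense pages, but the commit message does not say which strategy the project uses.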

341 Commits